top | item 19790426

savagedata | 6 years ago

There are always potential issues when a machine learning algorithm is applied over time.

Example #1: Let's say that cancer rates are increasing over time and cameras are improving over time. You might end up with a weird artifact in your model where higher-resolution images are taken to indicate cancer.
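A toy simulation (all numbers made up, not real incidence figures) shows how the time confound in Example #1 manufactures that artifact: resolution has no causal link to cancer here, yet the two end up correlated because both rise with the year the image was collected.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical setup: each image has a collection year. Over the years,
# both camera resolution and the base cancer rate rise, independently.
year = rng.integers(0, 20, size=n)                    # years since study start
resolution = 1.0 + 0.5 * year + rng.normal(0, 2, n)   # megapixels, improving
p_cancer = 0.05 + 0.01 * year                         # incidence also rising
cancer = rng.random(n) < p_cancer

# Resolution has no causal effect on cancer in this setup, but a model
# fed the resolution feature will still pick up a positive association,
# because both variables track the passage of time (a confounder).
corr = np.corrcoef(resolution, cancer)[0, 1]
print(f"corr(resolution, cancer) = {corr:.3f}")
```

Any model trained on the pooled data will happily use resolution as a proxy for "recent image", and therefore for the higher recent cancer rate.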

Example #2: Let's say that cancer-detecting algorithms are wildly successful, and so someone makes an app that lets you upload images of skin and tells you the probability that you have cancer. Suddenly a model that was trained on suspicious lesions is being used on normal freckles that people uploaded for fun. You end up with a lot of false positives. Maybe you try to combat that by including images uploaded to the app (that you somehow obtain labels for). But now you have a model that predicts that photos taken in brightly lit medical offices are likely to be cancer and blurry images taken in bathroom mirrors are not cancer.
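The first half of Example #2 is just a shift between the training distribution and the scoring distribution. A minimal sketch (one invented image feature, a fixed decision threshold, all distributions assumed) makes the false-positive blowup concrete:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training population: dermatologist-referred lesions,
# summarized by a single made-up image feature.
benign_train = rng.normal(4.0, 1.0, 10_000)
malignant_train = rng.normal(7.0, 1.0, 10_000)

# A threshold tuned on the training data separates the groups well.
threshold = 5.5
train_fpr = (benign_train > threshold).mean()
print(f"training false-positive rate: {train_fpr:.1%}")

# App population: ordinary freckles uploaded for fun -- a shifted,
# wider feature distribution that is essentially all benign.
freckles = rng.normal(5.0, 1.5, 10_000)
app_fpr = (freckles > threshold).mean()
print(f"app false-positive rate:      {app_fpr:.1%}")
```

The model itself hasn't changed at all; only the population it scores has, and the false-positive rate jumps accordingly.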

You could argue that Example #2 is more about the difference between training data and the data to be scored, but the fact remains that outside of tightly controlled scenarios, the way data is collected nearly always changes over time and ends up affecting model performance in unexpected ways.
