leelin | 6 years ago
That is, will medical images of diseases we diagnose in the next 20 years look a lot like the ones from the past 20 years, or is there a danger of over-fitting on an evolving data set? Could either the technology or the biology of the disease evolve?
In a prior life I was a quant trader, and financial market data is notorious for non-stationarity. On top of market rules and structures changing all the time, once someone discovers a profitable trading idea, their own trading changes what the data looks like for everyone else from that point forward.
savagedata | 6 years ago
Example #1: Let's say that cancer rates are increasing over time and cameras are improving over time. You might end up with a weird artifact in your model that higher resolution images are more likely to indicate cancer.
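This confound is easy to reproduce with made-up numbers. The sketch below (synthetic data, hypothetical prevalence and resolution trends, not a real dataset) shows a feature with no causal link to cancer, image resolution, picking up predictive power purely because both it and the cancer rate drift upward over time:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Year of acquisition: both cancer prevalence and camera resolution rise with it.
year = rng.integers(0, 20, size=n)
cancer = rng.random(n) < (0.05 + 0.01 * year)       # prevalence drifts 5% -> 24%
resolution = 2 + 0.5 * year + rng.normal(0, 1, n)   # "megapixels" drift upward

# Resolution carries no causal signal, yet it correlates with the label...
corr = np.corrcoef(resolution, cancer.astype(float))[0, 1]

# ...so a "classifier" that just thresholds on resolution beats chance.
pred = resolution > np.median(resolution)
lift = cancer[pred].mean() / cancer.mean()
print(f"corr = {corr:.2f}, cancer rate among high-res images = {lift:.2f}x baseline")
```

A model trained on this data would happily use resolution as a shortcut, and would then misfire on any future batch of uniformly high-resolution images.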
Example #2: Let's say that cancer-detecting algorithms are widely successful and so someone makes an app that lets you upload images of skin and the app tells you the probability of you having cancer. Suddenly a model that was trained on suspicious lesions is being used on normal freckles that people uploaded for fun. You end up with a lot of false positives. Maybe you try to combat that by including images uploaded to the app (that you somehow obtain labels for). But now you have a model that predicts that photos taken in brightly lit medical offices are likely to be cancer and blurry images taken in bathroom mirrors are not cancer.
You could argue that Example #2 is more about the difference between training data and data to be scored, but the fact remains that outside of tightly controlled scenarios, the way data is collected nearly always changes in time and ends up affecting model performance in unexpected ways.
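A cheap guard against the Example #2 failure is to monitor whether incoming data still resembles the training data at all. A minimal sketch, assuming a single hand-picked feature (image brightness) and entirely synthetic distributions for the clinic and phone-app images:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical brightness feature: clinic photos are bright and consistent,
# phone-app uploads are darker and far more variable.
clinic_train = rng.normal(200, 10, size=5_000)   # training distribution
app_uploads = rng.normal(120, 40, size=5_000)    # deployment distribution

def drift_score(train, live):
    """Crude covariate-shift check: distance between the means,
    measured in units of the training standard deviation."""
    return abs(live.mean() - train.mean()) / train.std()

score = drift_score(clinic_train, app_uploads)
print(f"drift score = {score:.1f} training sigmas")
```

A score many sigmas from zero says the live data no longer looks like the training data, so the model's probabilities should not be taken at face value. Real monitoring would use richer statistics across many features, but the principle is the same.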
1wheel | 6 years ago
https://twitter.com/IAmSamFin/status/1122271463170564100
Another example of change over time:
> One difficulty in such a comparison is that Gleason grading standards have shifted over time, so that scores below six are now rarely assigned, and assigning a higher grade has become more common
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3775342/
ska | 6 years ago
However, you have hit on a very real problem. Imaging systems have gotten better over time, and image quality on nominally the same system can differ from site to site. Image coverage can change with both policy and system capabilities, etc.
It's worse the more sophisticated the imaging systems are. Consider MRI, which is perhaps better thought of as equipment for performing physics experiments than as an imaging device. There, nominally equivalent scans from different vendors (even different generations from the same vendor) can have significantly different characteristics. And there is a ton of processing going on; there is no such thing as "raw" data here - even the vendors themselves may no longer be able to really (or at least easily) characterize what is being done.
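A common first-pass mitigation for these scanner-to-scanner intensity differences is per-scan normalization. The sketch below is synthetic (made-up vendor offsets and gains, not real MRI data) and only handles the simplest case, where two scanners differ by an affine intensity transform:

```python
import numpy as np

rng = np.random.default_rng(2)

# Same underlying tissue signal, but two vendors report it on
# different (hypothetical) intensity scales.
tissue = rng.normal(0, 1, size=10_000)
vendor_a = 1000 + 120 * tissue   # vendor A: large offset, high gain
vendor_b = 400 + 35 * tissue     # vendor B: smaller offset and gain

def zscore(img):
    """Per-scan z-normalization: removes each scanner's arbitrary
    offset and gain before any learning happens."""
    return (img - img.mean()) / img.std()

a, b = zscore(vendor_a), zscore(vendor_b)
# After normalization the two scanners' outputs agree almost exactly.
print(f"max abs difference after z-scoring: {np.abs(a - b).max():.2e}")
```

Real inter-scanner differences are of course not purely affine, which is why harmonization in practice goes well beyond this, but even this baseline removes the most obvious vendor fingerprint a model could latch onto.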
So yes, in any machine learning applied to these data sets, you have a very real risk of learning odd characteristics of the sample data and hurting your generalization.
Biology itself isn't as likely to be a problem, I think, but biological responses to changing treatment protocols, sure.