
panabee | 7 months ago

This is long overdue for biomedicine.

Even Google DeepMind's relabeled MedQA dataset, created for MedGemini in 2024, has flaws.

Many healthcare datasets/benchmarks contain dirty data because accuracy incentives are absent and few annotators are qualified.

We had to pay Stanford MDs to annotate 900 new questions to evaluate frontier models and will release these as open source on Hugging Face for anyone to use. They cover VQA and specialties like neurology, pediatrics, and psychiatry.

If labs want early access, please reach out. (Info in profile.) We are finalizing the dataset format.

Unlike general LLMs, where noise is tolerable and sometimes even desirable, training on incorrect/outdated information may cause clinical errors, misfolded proteins, or drugs with off-target effects.

Complicating matters, shifting medical facts may invalidate training data and model knowledge. What was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force reversed its longstanding advice and now urges biennial mammograms starting at age 40 -- down from the previous benchmark of 50 -- for average-risk women, citing rising breast-cancer incidence in younger patients.


matusp|7 months ago

This is true for every subfield I have worked on for the past 10 years. The dirty secret of ML research is that Sturgeon's law applies to datasets as well - 90% of the data out there is crap. I have seen NLP datasets with hundreds of citations that were obviously worthless as soon as you put in the "effort" and actually looked at the samples.

panabee|7 months ago

100% agreed. I also advise you not to read many cancer papers, particularly ones investigating viruses and cancer. You would be horrified.

(To clarify: this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth. Hug cancer researchers. They have accomplished an incredible amount while being handcuffed and tasked with decoding the most complex operating system ever designed.)

JumpCrisscross|7 months ago

> This is true for every subfield I have been working on for the past 10 years

Hasn’t data labelling been the bulk of the work in every research endeavour since forever?

PaulHoule|7 months ago

If you download datasets for classification from Kaggle or CIFAR, or for search ranking from TREC, it is the same. Typically 1-2% of the judgements in that kind of dataset are just wrong, so if you are aiming for the last few points of AUC you have to confront that.
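To make that ceiling concrete, here is a rough sketch (synthetic data, all numbers assumed) of how a ~2% label-error rate in the evaluation set caps measurable AUC even for a near-perfect model:

```python
import random

def auc(labels, scores):
    # Rank-based AUC: probability a random positive outranks a random negative.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
n = 2000
true_labels = [random.randint(0, 1) for _ in range(n)]
# A near-perfect model: scores cluster tightly around the true label.
scores = [l + random.gauss(0, 0.1) for l in true_labels]

# Flip ~2% of the evaluation labels to mimic annotation errors.
noisy_labels = [1 - l if random.random() < 0.02 else l for l in true_labels]

print(f"AUC vs clean labels: {auc(true_labels, scores):.3f}")
print(f"AUC vs noisy labels: {auc(noisy_labels, scores):.3f}")
```

Against the noisy labels the same model scores roughly 0.98 instead of ~1.0, so the last couple of AUC points are simply unreachable without cleaning the judgements.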

morkalork|7 months ago

I still want to jump off a bridge whenever someone thinks they can use the Twitter post and movie review datasets to train sentiment models for use in completely different contexts.

panabee|7 months ago

To elaborate, errors go beyond data and reach into model design. Two simple examples:

1. Nucleotides are a form of tokenization and encode bias. They're not as raw as people assume. For example, classic FASTA treats modified and canonical C as identical. Differences may alter gene expression -- akin to "polish" vs. "Polish".

2. Sickle-cell anemia and other diseases are linked to nucleotide differences. These single nucleotide polymorphisms (SNPs) mean hard attention for DNA matters and single-base resolution is non-negotiable for certain healthcare applications. Latent models have thrived in text-to-image and language, but researchers cannot blindly carry these assumptions into healthcare.
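A toy sketch of both points (sequences and flags are made up for illustration, not real genomic data):

```python
# 1. Plain FASTA collapses modified and canonical bases: 5-methylcytosine
#    (5mC) is written as plain "C", so methylation state is dropped
#    unless it is tracked out-of-band.
sequence = "ATCGATCG"
methylation = [0, 0, 1, 0, 0, 0, 1, 0]  # assumed per-base 5mC flags
fasta_record = sequence  # the flags have nowhere to go in classic FASTA
assert fasta_record == "ATCGATCG"  # identical with or without the flags

# 2. A single-nucleotide polymorphism (SNP) changes meaning at one base.
#    Sickle-cell disease traces to an A->T substitution in HBB
#    (codon GAG -> GTG, Glu -> Val).
normal, variant = "GAG", "GTG"
snp_positions = [i for i, (a, b) in enumerate(zip(normal, variant))
                 if a != b]
print(snp_positions)  # -> [1]: exactly one base differs
```

Any tokenizer or latent representation that pools away that one position pools away the disease.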

There are so many open questions in biomedical AI. In our experience, confronting them has prompted (pun intended) better inductive biases when designing other types of models.

We need way more people thinking about biomedical AI.

arbot360|7 months ago

> What was true last year may be false today. For instance, ...

Good example of a medical QA dataset shifting but not a good example of a medical "fact" since it is an opinion. Another way to think about shifting medical targets over time would be things like environmental or behavioral risk factors changing.

Anyway, thank you for putting this dataset together; we certainly need more third-party benchmarks with careful annotation. I think it would be wise to segregate tasks between factual observations of data, population-scale opinions (guidelines/recommendations), and individual-scale opinions (prognosis/diagnosis). Ideally there would eventually be some formal taxonomy for this, like OMOP CDM -- maybe there already is in some dusty corner of PubMed.

bjourne|7 months ago

What if there is significant disagreement within the medical profession itself? For example, isotretinoin is prescribed for acne in many countries, but in other countries the drug is banned or access-restricted due to adverse side effects.

jacobr1|7 months ago

Wouldn't one approach be to just ensure the system has all the data -- relevant usage, side effects, and legal constraints? Then when making a recommendation it can account for all factors, not just prior use cases.

panabee|7 months ago

If you agree that ML starts with philosophy, not statistics, this is but one example highlighting how biomedicine helps model development, LLMs included.

Every fact is born an opinion.

This challenge exists in most, if not all, spheres of life.

K0balt|7 months ago

I think an often overlooked aspect of training data curation is the value of accurate but oblique data. Much of the "emergent capabilities" of LLMs comes from data embedded in the data: implied or inferred semantic information that is not readily obvious. Extracting this highly useful information, in contrast to specific factoids, requires a lot of off-axis images of the problem space, like a CT scan of the field of interest. The value of adjacent oblique datasets should not be underestimated.

TZubiri|7 months ago

I noticed this when adding citations to wikipedia.

You may find a definition of what a "skyscraper" is by some hyperfocused association, but you'll get a bias towards a definite measurement like "skyscrapers are buildings between 700m and 3500m tall", which might be useful for some data-mining project but is not at all what people mean by it.

The actual definition is not in any specific source but in the way the word is used across other sources, like "the Manhattan skyscraper is one of the most iconic skyscrapers". In the aggregate you learn what it is, but that isn't very citable on its own, which gives WP that pedantic bias.

ethan_smith|7 months ago

Synthetic data generation techniques are increasingly being paired with expert validation to scale high-quality biomedical datasets while reducing annotation burden - especially useful for rare conditions where real-world examples are limited.
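A minimal sketch of such a synthesize-then-validate loop (every function here is a hypothetical stand-in, not a real API): a generator drafts candidates, an automatic check filters the obvious cases, and only the uncertain remainder goes to experts.

```python
def generate_candidates(condition, n):
    # Stand-in for an LLM or template-based generator.
    return [f"Q{i}: typical presentation of {condition}?" for i in range(n)]

def auto_confidence(example):
    # Stand-in for a model-based quality score in [0, 1].
    return 0.9 if example.endswith("?") else 0.2

def expert_review(example):
    # Stand-in for a clinician's accept/reject judgement.
    return True

dataset, review_queue = [], []
for ex in generate_candidates("sickle-cell anemia", 5):
    score = auto_confidence(ex)
    if score >= 0.8:
        dataset.append(ex)       # high confidence: keep automatically
    elif score >= 0.5:
        review_queue.append(ex)  # uncertain: route to an expert
    # below 0.5: discard outright

dataset += [ex for ex in review_queue if expert_review(ex)]
print(len(dataset))  # prints 5
```

The annotation burden scales with the size of the review queue, not the size of the generated pool, which is the whole point for rare conditions.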

TZubiri|7 months ago

Isn't labelling medical data for AI illegal as unlicensed medical practice?

Same thing with law data

iwontberude|7 months ago

Paralegals and medical assistants don’t need licenses

mh-|7 months ago

No.