It's really surprising that with all the statistical tools we have, the signal for the link between a common virus and a disease is unclear. Even if the road to a proven intervention is long, you'd think at least the link would be clear.
Once I started learning more about biology, I realized that everything is just so complex. The body repurposes chemicals a lot, so you have things like serotonin being a key neurotransmitter in the brain, but also in the gut. And you have enzymes that are coded in genes, but then there are also networks of genes that are up- or down-regulated by hundreds of other genes, and sometimes only in certain types of cells or certain physiological environments. And then of course there are epigenetic and immune-modulated effects at the genome, gene network, and individual gene levels. Not to even mention all the feedback mechanisms and meta-feedback mechanisms (the drive toward homeostasis is POWERFUL), and the effects of countless chemicals in our environment.
There are certainly clear-cut cause-effect relationships in biological systems, but even they will have edge cases and random chance to muddle the picture.
I would posit that the human body is far more complex than even the largest codebase, not least because it was jury-rigged together with no architect or style guide.
Also, in general, the more common the exposure, the harder it is to find a link; try finding a control group of people who have never been exposed to PTFEs, or HSV, and who also aren't, like, hunter-gatherers.
The problem is simply observational. We don't even have reliable DNA and RNA sequencing of our own bodies. And we cannot reliably observe things in a host without knowing, to some extent, what we're looking for first. Even that space is so large, it's very hard to ascertain accurately. Biology is always suffering for lack of clear observations.
Adding to the complexity is the difficulty, or even literal impossibility, of observing the direct interactions of elements of the system: they operate at a quantum scale, so you disturb them whenever you attempt to observe them.
"Everything in biology is more complicated than it looks."
DNA is where we get our physical attributes (modulo environment).
No, a lot of DNA is "junk," i.e. we don't yet understand what it does.
No, a lot of functional DNA is turned on or off by the epigenome.
No, a lot of our metabolism is affected by our biome: thousands of species of bacteria that turn up or down various reactions, or produce other chemicals that we need...
It's not that we haven't thought up the statistical tools. The core theoretical tools you need are there. It's that gathering the data you need is extremely difficult and time-consuming.
If you gather EHR or medical claims data for vaccines, for example, you have to take very seriously the biases and the impact of missingness inherent in the data. Is the person you have no evidence of disease for truly not diseased, or do they just have missing data? Is it missing because they didn't go to the doctor because they're healthy enough to kick the disease on their own, or because they're so financially unstable that they can't afford to consistently see their primary care doctor? Is the missingness itself actually more correlated with the disease than the vaccination you are looking at?
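To make the missingness point concrete, here's a rough simulation sketch. Every number in it (the 10% disease risk, the visit and vaccination probabilities, the `simulate_missingness` name) is a made-up assumption for illustration, not an estimate from any real dataset. The vaccine has no effect at all here, yet the recorded disease rates differ between groups purely because financially unstable people leave sparser records.

```python
import random

def simulate_missingness(n=100_000, seed=0):
    """Compare true vs. recorded disease rates when a vaccine has NO real effect.

    All probabilities below are illustrative assumptions.
    """
    rng = random.Random(seed)
    totals = {True: 0, False: 0}          # keyed by vaccinated?
    true_cases = {True: 0, False: 0}
    recorded_cases = {True: 0, False: 0}

    for _ in range(n):
        stable = rng.random() < 0.5                            # financially stable?
        vaccinated = rng.random() < (0.7 if stable else 0.3)   # stable people vaccinate more
        diseased = rng.random() < 0.10                         # same risk for everyone
        sees_doctor = rng.random() < (0.9 if stable else 0.4)  # unstable people have gaps

        totals[vaccinated] += 1
        true_cases[vaccinated] += diseased
        # A case only reaches the records if the person actually saw a doctor.
        recorded_cases[vaccinated] += diseased and sees_doctor

    true_rate = {v: true_cases[v] / totals[v] for v in totals}
    recorded_rate = {v: recorded_cases[v] / totals[v] for v in totals}
    return true_rate, recorded_rate

true_rate, recorded_rate = simulate_missingness()
print("true disease rate by vaccination status:    ", true_rate)
print("recorded disease rate by vaccination status:", recorded_rate)
```

Under these assumptions the unvaccinated group looks healthier in the records only because more of its cases were never recorded, so a naive analysis would make the vaccine look harmful.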
Example: if your outcome is dementia, then you may be using cognitive tests whose high variability is due more to social class, education, and test-taking ability. Is receiving a fancy vaccine more likely in an affluent area? That correlation by itself might completely explain away the positive effect the vaccine appears to have on cognitive test scores.
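A toy version of that affluence example, as a sketch only: the score distributions, the vaccination probabilities, and the helper names `simulate_confounding` and `score_gap` are all invented for illustration. The vaccine never enters the score formula, but the naive comparison still shows a gap, which mostly disappears once you compare within affluence strata.

```python
import random
from statistics import mean

def simulate_confounding(n=50_000, seed=1):
    """Generate (affluent, vaccinated, score) rows where the vaccine has zero effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        affluent = rng.random() < 0.5
        vaccinated = rng.random() < (0.8 if affluent else 0.2)  # affluence drives uptake
        score = rng.gauss(60 if affluent else 50, 10)           # affluence drives scores
        rows.append((affluent, vaccinated, score))
    return rows

def score_gap(rows):
    """Mean test score of vaccinated minus mean test score of unvaccinated."""
    vacc = [s for _, v, s in rows if v]
    unvacc = [s for _, v, s in rows if not v]
    return mean(vacc) - mean(unvacc)

rows = simulate_confounding()
naive_gap = score_gap(rows)
# Stratify: compare vaccinated vs. unvaccinated only within the same affluence level.
stratified_gaps = {
    affluent: score_gap([r for r in rows if r[0] == affluent])
    for affluent in (True, False)
}
print(f"naive gap: {naive_gap:.2f}")
print("gap within each affluence stratum:", stratified_gaps)
```

Stratifying only works here because affluence is observed and is the only confounder by construction; in real EHR data you rarely get either guarantee.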
In Alzheimer's you're often trying to correlate things that happen in early life with long-term damage that only surfaces many, many years later. Retrospective studies, where you go back and ask sick or healthy people, have recall bias: the sick ones remember more early issues than the healthy ones do, even with the same early-life issues.
Not trying to say epi is perfect or that there isn't room for improvement in tools (there absolutely is). But, as often happens when crossing over into the biological sciences, the problems are a lot stickier than people outside the field realize.
Right, the data quality is usually crap. Beyond the issues you mentioned, patients often switch providers or health plans and their data doesn't get migrated. In the USA at least there is no centralized national repository for that data so the further back you try to go the more likely the data will just be missing (or incorrectly coded). In theory there are interoperability APIs and national networks to solve this problem but in practice a lot of systems still aren't properly connected.
For vaccinations specifically the CDC Immunization Gateway can be a good place to start. Most states also maintain their own immunization registries that can be queried through standard HL7 V2 Messaging and/or FHIR APIs if you have the appropriate permissions.
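For a sense of what a FHIR query looks like, here's a minimal sketch that only builds the search URL for a patient's `Immunization` resources. The base URL and patient ID are placeholders, and `immunization_search_url` is a hypothetical helper; `patient` and `date` are standard FHIR search parameters for that resource, but which parameters a given registry actually supports, and how you authenticate, will vary.

```python
from urllib.parse import urlencode

def immunization_search_url(base_url, patient_id, since=None):
    """Build a FHIR search URL for one patient's Immunization resources.

    base_url and patient_id are placeholders; a real registry needs
    authorization and may support a different set of search parameters.
    """
    params = {"patient": patient_id}
    if since:
        # FHIR date searches take a comparison prefix; "ge" means on or after.
        params["date"] = f"ge{since}"
    return f"{base_url.rstrip('/')}/Immunization?{urlencode(params)}"

# Hypothetical endpoint and patient ID, for illustration only.
url = immunization_search_url("https://fhir.example.org/r4", "12345", since="2010-01-01")
print(url)  # https://fhir.example.org/r4/Immunization?patient=12345&date=ge2010-01-01
```

A real client would send this as an authorized GET and page through the returned Bundle, which is where the missing and miscoded records show up.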
The issue is that we don’t have the primary data. This stuff is messy and the systems at play are extremely complicated. Often one of the most challenging parts of bio sciences is figuring out a test that will cleanly show a result that is true.
Without directly testing for a connection it’s extremely rare to get unexpected data that confirms an alternate hypothesis.
Even if the statistical tools are there, they can’t make up numbers that we haven’t collected yet.
> Authors sometimes share those with researchers conducting similar work, although they usually ignore such requests, according to recent studies of datasharing practices.
If the research was in any way paid for with federal dollars, all this data should be public. Not only that, if true it is a waste of federal dollars.
It's probable that the widening mistrust in science is due to this sort of behavior and the resulting administration.
Waste due to inefficiencies is one thing; waste due to fraud, data hiding, and misdirection is something else.
I think the link has shown up in the statistics for a long time. The article mentions Ruth Itzhaki being on it for 40 years. But things seem delayed by something along the lines of politics/corruption, or by the complexity of the situation with HSV1 not being the only cause. It can become a mess https://www.nytimes.com/2025/01/24/opinion/alzheimers-fraud-...
I'm hoping that AI helps sort this stuff out. It can read the papers and say hypothesis A is most likely even if professor Y had built an empire on it being hypothesis B.
HSV1 is estimated to affect more than 80% of the population, but less than 80% have dementia. This seems to imply there are other factors at play. Maybe it requires other factors like genetics or immune issues for it to progress.
The article clearly explains that the link isn't clear at all.
It's that certain damaging proteins are a line of defense against the HSV1 virus, that something sometimes sends those proteins into overdrive, that this is influenced by genetics broadly, further influenced by a particular gene, and that it's a second infection with shingles that can reactivate the proteins, worsening it.
Given that this is the interplay of something like at least 5 factors, and there may be more, it's not surprising it's taken this long to put together, even with all our statistical tools.
The part that might not be clear could be due to other factors, such as a genetic or lifestyle component that causes this to progress to disease only in some individuals.
bkfunk|11 months ago
senkora|11 months ago
> It’s more like a vibrating causal cloud than a chain of causality.
https://news.ycombinator.com/item?id=38898335
inciampati|11 months ago
D-Coder|11 months ago
epidemiology|11 months ago
nradov|11 months ago
https://www.cdc.gov/iis/iz-gateway/index.html
bognition|11 months ago
readthenotes1|11 months ago
E.g., https://www.science.org/content/article/potential-fabricatio...
https://www.science.org/content/article/research-misconduct-...
https://arstechnica.com/science/2024/07/alzheimers-scientist...
https://stanforddaily.com/2023/02/17/internal-review-found-f...
lttlrck|11 months ago
tim333|11 months ago
giantg2|11 months ago
crazygringo|11 months ago
giantg2|11 months ago