roadnottaken | 4 years ago

This has been going on for decades. It's called GWAS [1], and it has had a few successes, but it basically hasn't worked as well as everyone hoped it would in the 90s. The reason it doesn't work that well is that the human genome has ~3 billion letters and human physiology is complex, so establishing statistically significant correlations between genome variations and human physiology is hard and requires more than Excel. In fact, the computational tools that have been applied to this are incredibly sophisticated and are not the limiting factor. The limiting factor is that you probably need millions or billions of genomes to make it work, and we don't have that yet. Also, people are beginning to realize that many disease-relevant traits are caused by rare variants (rather than by common variants with obvious statistically significant correlations), and rare variants are quite hard to detect this way.

So... anyway, you're right that this is a natural way to approach the question of understanding the genetic basis of disease and physiology. But it's been beaten to death and found to yield fewer insights than were hoped.

[1] https://en.wikipedia.org/wiki/Genome-wide_association_study

disgruntledphd2 | 4 years ago

Lasso (L1 regularised) regression was actually invented to solve these kinds of problems. To the best of my knowledge, this has not yet appeared in Excel.
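
For anyone curious what that looks like in practice, here is a rough sketch of lasso on a toy "more variants than people" problem. Lasso adds an L1 penalty on the coefficients, which pushes most of them to exactly zero. Everything here is an assumption for illustration: the genotypes are simulated (0/1/2 minor-allele counts), the three causal variants and their effect size are made up, and the function names (`simulate`, `lasso_coordinate_descent`) are hypothetical. It uses plain coordinate descent in standard-library Python rather than any real GWAS tooling.

```python
import random

CAUSAL = (5, 50, 100)  # hypothetical causal variant indices (made up)


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def simulate(n_people=60, n_variants=120, effect=3.0, seed=0):
    """Simulated data: centered genotype columns and a centered trait."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_variants):
        col = [rng.choice((0, 1, 2)) for _ in range(n_people)]
        mean = sum(col) / n_people
        columns.append([g - mean for g in col])
    # Trait = sum of causal variant effects plus Gaussian noise.
    y = [
        sum(effect * columns[j][i] for j in CAUSAL) + rng.gauss(0.0, 0.5)
        for i in range(n_people)
    ]
    mean_y = sum(y) / n_people
    return columns, [v - mean_y for v in y]


def lasso_coordinate_descent(columns, y, alpha, sweeps=50):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 one coordinate at a time."""
    n = len(y)
    weights = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw, and w starts at zero
    col_scale = [sum(v * v for v in col) / n for col in columns]
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            if col_scale[j] == 0.0:
                continue  # constant column carries no information
            # Correlation of column j with the residual that excludes j's own fit.
            rho = sum(x * r for x, r in zip(col, residual)) / n
            rho += col_scale[j] * weights[j]
            new_w = soft_threshold(rho, alpha) / col_scale[j]
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - delta * x for r, x in zip(residual, col)]
                weights[j] = new_w
    return weights


columns, y = simulate()
weights = lasso_coordinate_descent(columns, y, alpha=0.5)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("variants with non-zero coefficients:", selected)
```

The point of the toy is only that 120 predictors and 60 samples would be hopeless for ordinary least squares, while the L1 penalty keeps a sparse subset. Real genomes have correlated variants, tiny effect sizes, and vastly more columns, so this says nothing about how well lasso does on actual GWAS data.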