ninjha01 | 2 years ago
[0] https://evo.nitro.bio/
timy2shoes | 2 years ago
jashephe | 2 years ago
I'm a little disappointed that their linked preprint doesn't appear to include any molecular biology; i.e., they don't actually synthesize any of their predicted sequences and test for function. It wouldn't be an outrageous synthesis task to make some of the CRISPR-Cas sequences they generated.
Also interesting that AlphaMissense is omitted from Figure 2B; it substantially outperforms the ESM-based ESM1b in our hands. But I suppose the idea is that this is a general-purpose DNA language model, whereas AlphaMissense is domain-specific for variant effect prediction?
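For reference, ESM1b-style variant effect scores are typically computed as a masked-marginal log-odds of the mutant versus the wild-type residue (Meier et al., 2021). A minimal sketch, assuming the fair-esm package; the sequence, variant, and helper name are invented for illustration:

    import torch
    import esm

    # Load ESM-1b and its tokenizer (downloads weights on first use).
    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    model.eval()
    batch_converter = alphabet.get_batch_converter()

    def masked_marginal_score(sequence: str, pos: int, wt: str, mut: str) -> float:
        """Log-odds of mutant vs. wild-type residue at `pos` (0-based),
        with that position masked in the input."""
        assert sequence[pos] == wt
        _, _, tokens = batch_converter([("query", sequence)])
        tokens[0, pos + 1] = alphabet.mask_idx  # +1 skips the BOS token
        with torch.no_grad():
            logits = model(tokens)["logits"]
        log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
        return (log_probs[alphabet.get_idx(mut)]
                - log_probs[alphabet.get_idx(wt)]).item()

    # Toy example: score an I->V substitution at 0-based position 5.
    # More negative scores suggest a more deleterious variant.
    print(masked_marginal_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 5, "I", "V"))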
bnprks | 2 years ago
Strong second for wishing they had physically tested some model output. A "model that makes outputs AlphaFold thinks look like Cas" is a very different thing from a "model that makes functional Cas variants".
For design tasks like the ones in this paper, I think computational models have a big hill to climb to compete with physical high-throughput screening. Most of the time the goal is to get a small number of hits (<10) out of a pool of millions of candidates. At those levels, you need to work in the >99.9% precision regime to have any hope of finding significant hits after multiple-hypothesis correction. I don't think they showed anything near that accuracy in the paper.
Maybe we'll get there eventually, but the high-throughput techniques in molecular biology are also getting better at the same time.
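To put numbers on that precision argument, here is a back-of-the-envelope Bayes calculation; the pool size and hit count are hypothetical, chosen to match the framing above:

    # How precise must a screen-replacing classifier be when ~10 true hits
    # hide in a pool of 1,000,000 candidates?
    pool, true_hits = 1_000_000, 10
    prior = true_hits / pool  # probability a random candidate is a real hit

    def precision(sensitivity: float, fpr: float, prior: float) -> float:
        """Positive predictive value via Bayes' rule."""
        tp = sensitivity * prior
        fp = fpr * (1 - prior)
        return tp / (tp + fp)

    for fpr in (1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
        print(f"FPR {fpr:.0e}: ~{fpr * pool:>9,.0f} false positives, "
              f"precision = {precision(1.0, fpr, prior):6.1%}")

Even with perfect sensitivity, a 1-in-10,000 false-positive rate buries the ten real hits under roughly a hundred false ones; the false-positive rate has to reach about 1 in 100,000 before half of the picks are real.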
ackbar03 | 2 years ago
This should really be a requirement for generative methods in biology rather than a nice-to-have. A very high percentage of compounds generated by generative-AI methods have been shown not to work as intended. Anything without wet-lab validation should be taken with a large grain of salt.
rdmirza | 2 years ago
Your model makes predictions. Prove they're worth their salt.
d_silin | 2 years ago
As you progress along the chain genomics -> proteomics -> interactomics -> metabolomics, our understanding becomes blurrier and the challenges become harder.
unknown | 2 years ago
[deleted]
pfisherman | 2 years ago
https://www.biorxiv.org/content/10.1101/2024.02.29.582810v1
Tl;dr: DNA is NOT all you need.
jhbadger | 2 years ago
I think you are missing what the Evo project is trying to do -- create a new prokaryotic genome with a generative model. This would work much like the earlier hand-made synthetic genomes such as Synthia (Gibson et al., 2010).
In such a system you would take an existing bacterial cell and replace its genome with the newly synthesized version. The proteins and other molecules from the existing cell would remain (before eventually being replaced) and serve to "boot" the new genome.
samuell | 2 years ago
I tend to agree (the cell being in control, all the 4D interactions, epigenetic mechanisms, etc.), but out of curiosity, what would you say we also need?
t_serpico | 2 years ago
While this is potentially interesting work, it is very shortsighted and premature to call it a "GPT" moment in biology. ML people in bio need to think hard not only about what they are doing, but about why they are doing it (other than "this is cool and will lead to a nice Nature publication"). Their basic premise -- that learning from DNA is the next grand challenge in biology -- is shaky. Imo, the grand challenge in biology is determining what the grand challenge is, and that is a deep scientific/philosophical question.
dekhn | 2 years ago
unknown | 2 years ago
[deleted]
visarga | 2 years ago
unknown | 2 years ago
[deleted]