top | item 45390814

(no title)

hashta | 5 months ago

To people outside the field, the title/abstract can make it sound like folding is just inherently simple now, but this model wouldn’t exist without the large synthetic dataset produced by the more complex AF. The "simple" architecture is still using the complex model indirectly through distillation. We didn’t really extract new tricks to design a simpler model from scratch, we shifted the complexity from the model space into the data space (think GPT-5 => GPT-5-mini, there’s no GPT-5-mini without GPT-5)

discuss

godelski|5 months ago

  > To people outside the field

So what?

It's a research paper. That's not how you communicate to a general audience. Just because the paper is accessible in terms of literal access doesn't mean you're the intended audience. Papers are how scientists communicate to other scientists. More specifically, it is how communication happens between peers. They shouldn't even be writing for just other scientists. They shouldn't be writing for even the full set of machine learning researchers nor the full set of biologists. Their intended audience is people researching computational systems that solve protein folding problems.

I'm sorry, but where do you want scientists to be able to talk directly to their peers? Behind closed doors? I just honestly don't understand these types of arguments.

Besides, anyone conflating "Simpler than You Think" as "Simple" is far from qualified from being able to read such a paper. They'll misread whatever the authors say. Conflating those two is something we'd expect from an Elementary School level reader who is unable to process comparative statements.

I don't think we should be making that the bar...

hashta|5 months ago

It’s literally called "SimpleFold". But that’s not really my point, from your earlier comment (".. go through all the complexities first to find the generalized and simpler formulations"), I got the impression you thought the simplicity came purely from architectural insights. My point was just that to compare apples to apples, a model claiming "simpler but just as good" should ideally train on the same kind of data as AF or at least acknowledge very clearly that substantial amount of its training data comes from AF.

I’m not trying to knock the work, I think it’s genuinely cool and a great engineering result. I just wanted to flag that nuance for readers who might not have the time or background to spot it, and I get that part of the "simple/simpler" messaging is also about attracting attention which clearly worked!

stavros|5 months ago

But this is just a detail, right? If we went and painstakingly catalogued millions of proteins, we'd be able to use the simple model without needing a complex model to generated data, no?

connorbrinton|5 months ago

Technically yes. But it can take months to years to experimentally obtain the structure for a single protein, and that assumes that it's possible to crystallize (X-ray), prepare grids (cryo-EM) or highly concentrate (NMR) the protein at all.

On the other hand, validating a predicted protein structure to a good level of accuracy is much easier (solvent accessibility, mutagenesis, etc.). So having a complex model that can be trained on a small dataset drastically expands the set of accurate protein structure samples available to future models, both through direct predictions and validated protein structures.

So technically yes, this dataset could have been collected solely experimentally, but in practice, AlphaFold is now part of the experimental process. Without it, the world would have less protein structure data, in terms of both directly predicted and experimentally verified protein structures

littlestymaar|5 months ago

> but this model wouldn’t exist without the large synthetic dataset produced by the more complex AF

This model could also have existed from natural data if we had access to enough of it.

inkysigma|5 months ago

Maybe, but then this seems more like an exercise in distillation rather than solving the original problem which is what the title "Folding proteins is simpler..." suggested to me at least. Part of the problem with any ML task is that data is usually limited and presumably far more limited than the amount of synthetic data you can generate.