denhaus | 8 months ago

Short answer: It’s a way to generate structured databases for (most) scientific topics. Why? So you can apply data-driven methods to those databases. So what? It’s a powerful way to ask and investigate scientific questions and trends that are otherwise hidden inside a million scientific papers.

Example: Consider what the PDB has done for our understanding of protein folding, as well as the ML/computational techniques it has enabled (e.g., AlphaFold). Most scientific questions and properties are not as data-rich as protein folding. What if they could be?

Longer answer: The last 15 years of computational/ML work in science have shown that structured databases open up entirely new frontiers in discovery (e.g., Protein Data Bank, Materials Project). But most scientific topics/properties are NOT in structured DBs; they’re scattered across millions of papers. It’s an especially big problem for some topics in materials science. It’s not that these problems are data-scarce, but that it’s hard to actually collate their data into a structured format. You literally cannot use most ML methods because the structured DBs do not exist.

This paper presents a way to generate massive structured databases of specialized, intricate, and hierarchical knowledge graphs from the scientific literature. Fine-tuning works; prompt engineering does not (at least at the time; perhaps this has changed). Once you have a database, you can analyze an entire subfield or topic in science with ML or stats methods.
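
To make that last step concrete, here is a rough sketch of what "analyze it with stats methods" could look like once the extraction has been done. This is not the paper's code or schema: the record layout, the field names (`doi`, `material`, `properties`), the helpers `flatten` and `mean_by_formula`, and every value are placeholders invented for illustration. The idea is just that hierarchical records flatten into rows, and rows are something ordinary tools can work with.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical output of an extraction step: one nested record per paper.
# Field names and numbers are placeholders, not real measurements.
records = [
    {"doi": "paper-1", "material": {"formula": "compound_A", "properties": [
        {"name": "prop_x", "value": 1.0}]}},
    {"doi": "paper-2", "material": {"formula": "compound_A", "properties": [
        {"name": "prop_x", "value": 3.0}]}},
    {"doi": "paper-3", "material": {"formula": "compound_B", "properties": [
        {"name": "prop_x", "value": 5.0}]}},
]


def flatten(records):
    """Turn nested records into flat (doi, formula, property, value) rows."""
    for rec in records:
        material = rec["material"]
        for prop in material["properties"]:
            yield (rec["doi"], material["formula"], prop["name"], prop["value"])


def mean_by_formula(rows, prop_name):
    """Average one property per material across all papers reporting it."""
    grouped = defaultdict(list)
    for _doi, formula, name, value in rows:
        if name == prop_name:
            grouped[formula].append(value)
    return {formula: mean(values) for formula, values in grouped.items()}


rows = list(flatten(records))
summary = mean_by_formula(rows, "prop_x")
print(summary)
```

With real extracted data the same flat rows could go into a dataframe or an ML model instead of a simple mean; the hard part the comment is describing is getting the records in the first place.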
