top | item 41237248

(no title)

Potential concerns with their self-eval:

They evaluate their automated reviewer by comparing against human evaluations on human-written research papers, and then seem to extrapolate that their automated reviewer would align with human reviewers on AI-written research papers. It seems like there are a few major pitfalls with this.

First, if their systems aren't multimodal and the AI-generated figures are lower-quality than human-created ones (which they explicitly list as a limitation), the automated reviewer, having access only to the text, would be biased in favor of AI-generated papers. This is an obvious one, but I think there could easily be other aspects of papers where the AI and human reviewers align on human-written papers but not on AI-written ones.

Additionally, they note:

> Furthermore, the False Negative Rate (FNR) is much lower than the human baseline (0.39 vs. 0.52). Hence, the LLM-based review agent rejects fewer high-quality papers. The False Positive Rate (FNR [sic]), on the other hand, is higher (0.31 vs. 0.17)

It seems like the false positive rate is the more important metric here. If a paper is truly high-quality, it is likely to succeed with a rebuttal or to get accepted at another conference. On the other hand, if this system leads to more low-quality submissions or acceptances via a high FPR, we're going to have more AI slop and increased load on human reviewers.
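To make the FPR concern concrete, here is a rough sketch of how the quoted rates translate into counts. The pool size (1000 papers) and the share of high-quality papers (25%) are made-up numbers for illustration only, not figures from the paper; only the FNR and FPR values come from the quote above.

```python
def review_errors(n_papers, frac_high_quality, fnr, fpr):
    """Expected reviewer mistakes for a pool of submissions.

    n_papers and frac_high_quality are hypothetical inputs;
    fnr and fpr are the rates quoted from the paper.
    """
    n_good = n_papers * frac_high_quality
    n_bad = n_papers - n_good
    return {
        "false_rejects": n_good * fnr,  # high-quality papers rejected
        "false_accepts": n_bad * fpr,   # low-quality papers accepted
    }

# Hypothetical pool: 1000 submissions, 25% of them high-quality.
llm = review_errors(1000, 0.25, fnr=0.39, fpr=0.31)
human = review_errors(1000, 0.25, fnr=0.52, fpr=0.17)

print("LLM reviewer:  ", llm)
print("Human reviewer:", human)
```

Under these assumed numbers, the lower FNR saves a few dozen good papers while the higher FPR lets through roughly a hundred more weak ones. The gap grows as the share of low-quality submissions grows.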

I admit I didn't thoroughly read all 185 pages, so maybe these concerns are misplaced.


happypumpkin|1 year ago

Also a concern about the paper generation process itself:

> In a similar vein to idea generation, The AI Scientist is allowed 20 rounds to poll the Semantic Scholar API looking for the most relevant sources to compare and contrast the near-completed paper against for the related work section. This process also allows The AI Scientist to select any papers it would like to discuss and additionally fill in any citations that are missing from other sections of the paper.

So... they don't look for related work until the paper is "near-completed." Seems a bit backwards to me.

jalman|1 year ago

Great point. I think The AI Scientist is already a winner. If the likelihood of a false outcome is FNR + FPR, then the machine would fail 0.70 of the time and humans 0.69. Humans do win nominally. In terms of costs, humans lose. For every additional 0.31 - 0.17 = 0.14 of FPR you incur, you gain 0.52 - 0.39 = 0.13 of FNR. The discrepancy in paper production costs is at least a factor of 100. The value of even the least useful research typically yields a benefit of a factor of two or more compared to production and validation costs. So the final balance is 0.014 to 0.36 -> x25 gain in favor of AI.
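The sums and differences above can be checked with a few lines of Python. This is only a sketch of the arithmetic on the four quoted rates; it does not attempt to reproduce the cost-weighted 0.014 vs. 0.36 figures, whose derivation isn't spelled out. Note also that FNR and FPR have different denominators (high-quality vs. low-quality papers), so their sum is a rough score rather than an overall error probability.

```python
# Rates quoted from the paper in the parent comment.
rates = {
    "llm": {"fnr": 0.39, "fpr": 0.31},
    "human": {"fnr": 0.52, "fpr": 0.17},
}

def combined_error(r):
    # Simple FNR + FPR sum, rounded to avoid float noise.
    return round(r["fnr"] + r["fpr"], 2)

llm_total = combined_error(rates["llm"])
human_total = combined_error(rates["human"])

# Extra false positives the LLM reviewer incurs vs. humans,
# and false negatives it saves.
extra_fpr = round(rates["llm"]["fpr"] - rates["human"]["fpr"], 2)
saved_fnr = round(rates["human"]["fnr"] - rates["llm"]["fnr"], 2)

print(llm_total, human_total, extra_fpr, saved_fnr)
```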