rjakob | 9 months ago
You're right: In the current version each "agent" essentially loads the whole paper, applies a specialized prompt, and calls the OpenAI API. The specialization lies in how each prompt targets a specific dimension of peer review (e.g., methodological soundness, novelty, citation quality). While it’s not specialization via architecture yet (i.e., different models), it’s prompt-driven specialization, essentially simulating a review committee, where each member is focused on a distinct concern. We’re currently using a long-context, cost-efficient model (GPT-4.1-nano style) for these specialized agents to keep it viable for now. Think of it as an army of reviewers flagging areas for potential improvement.
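To make the "prompt-driven specialization" concrete, here's a minimal sketch of how such a committee might be wired up. The dimension names, prompt text, and `build_agent_request` helper are all hypothetical, not the project's actual code; the only parts taken from the comment are the idea that every agent receives the full paper and a dimension-specific prompt, and that a long-context, cost-efficient model is used:

```python
# Hypothetical sketch: every "agent" shares the same model and full paper
# text; only the system prompt differs per review dimension.
REVIEW_DIMENSIONS = {
    "methodology": "You review only the methodological soundness of the paper.",
    "novelty": "You assess only the novelty of the contribution.",
    "citations": "You check only citation quality and coverage.",
}

def build_agent_request(dimension: str, paper_text: str) -> dict:
    """Assemble a chat-completion payload for one specialized reviewer."""
    return {
        "model": "gpt-4.1-nano",  # long-context, cost-efficient, per the comment
        "messages": [
            {"role": "system", "content": REVIEW_DIMENSIONS[dimension]},
            {"role": "user", "content": paper_text},
        ],
    }

# One request per dimension -- the "army of reviewers".
requests = [build_agent_request(d, "<full paper text>") for d in REVIEW_DIMENSIONS]
```

Each payload would then be sent through the OpenAI API independently, so the reviewers can run in parallel.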
To synthesize and refine feedback, we also run Quality Control agents (acting like an associate editor), which review all prior outputs from the individual agents to reduce redundancy, surface the most constructive insights, and filter out less relevant feedback.
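A toy version of that synthesis step might look like the following. The `synthesize` function and the exact-match dedup are my own simplification (the real pipeline presumably does semantic filtering via the model itself); only the "associate editor merges reviewer notes" framing comes from the comment:

```python
def synthesize(agent_outputs: list[str]) -> str:
    """Dedupe reviewer notes, then wrap them in a QC/editor prompt.

    Hypothetical sketch: exact-duplicate removal stands in for whatever
    semantic redundancy filtering the actual QC agent performs.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for item in agent_outputs:
        key = item.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(item.strip())
    return (
        "You are an associate editor. Merge the following reviewer notes, "
        "remove redundancy, and keep only the most constructive points:\n- "
        + "\n- ".join(unique)
    )
```

The returned string would be the input to one final model call, whose output is the consolidated review.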
On your point about nitpicking: we’ve tested the system on several well-regarded, peer-reviewed papers. The output is generally reasonable and we haven't seen it invent issues yet, though there are occasional instances of misaligned feedback. We're convinced we can eliminate most of that noise in future iterations (community feedback is super important to achieve this).
On the code side: 100% agree. This is very much an MVP focused on testing potential value to researchers, and the repeated agent classes were helpful for fast iteration. However, your suggestion of switching to template-based prompt loading and dynamic agent registration is great and would improve maintainability and scalability. We'll 100% consider it in the next version.
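For anyone curious what "template-based prompt loading and dynamic agent registration" could replace the repeated agent classes with, here's one possible shape. Everything here (the registry dict, the one-template-per-file convention, the `{paper}` placeholder) is an assumption for illustration, not a commitment to a design:

```python
from pathlib import Path

# Hypothetical registry: agent name -> prompt template with a {paper} slot.
AGENT_REGISTRY: dict[str, str] = {}

def register_agent(name: str, template: str) -> None:
    """Register one specialized reviewer by name."""
    AGENT_REGISTRY[name] = template

def load_agents_from_dir(prompt_dir: Path) -> None:
    """Load every *.txt file in a directory as an agent.

    Convention (assumed): filename stem is the agent name, file body is
    its prompt template. Adding an agent then means adding a file, not a class.
    """
    for path in prompt_dir.glob("*.txt"):
        register_agent(path.stem, path.read_text())

def render_prompt(name: str, paper_text: str) -> str:
    """Fill a registered template with the paper under review."""
    return AGENT_REGISTRY[name].format(paper=paper_text)
```

With this, the N near-identical agent classes collapse into one code path plus a directory of prompt files.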
The _determine_research_type method is indeed a stub. Good catch. Also, lol @ the JS comment hashes, touché.
If you're open to contributing or reviewing, we’d love to collaborate!