Will definitely do.
I am also planning to run a benchmark with various models to see which one is most effective at building a full product starting from a PRD and using backlog for managing tasks.
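A rough sketch of how such a comparison could be scored, assuming each backlog task derived from the PRD comes with an automated acceptance check. Everything here is hypothetical: the `Task` structure, the `run_benchmark` helper, and the stand-in "models" are illustrative placeholders, not part of any existing tool or benchmark.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """One backlog item derived from the PRD (hypothetical structure)."""
    task_id: str
    description: str
    # Acceptance check: receives the model's output, returns True on success.
    check: Callable[[str], bool]


def run_benchmark(
    models: Dict[str, Callable[[str], str]],
    tasks: List[Task],
) -> Dict[str, float]:
    """Return, per model, the fraction of tasks whose acceptance check passes."""
    scores: Dict[str, float] = {}
    for name, generate in models.items():
        passed = 0
        for task in tasks:
            try:
                if task.check(generate(task.description)):
                    passed += 1
            except Exception:
                # A crash while generating or checking counts as a failed task.
                pass
        scores[name] = passed / len(tasks) if tasks else 0.0
    return scores


# Stand-in "models" (assumptions for illustration; real ones would call an LLM).
def echo_model(prompt: str) -> str:
    return prompt


def upper_model(prompt: str) -> str:
    return prompt.upper()


tasks = [
    Task("T1", "add login page", lambda out: "login" in out),
    Task("T2", "add logout button", lambda out: "LOGOUT" in out),
]
scores = run_benchmark({"echo": echo_model, "upper": upper_model}, tasks)
print(scores)
```

Pass rate per task is only one possible metric; cost, wall-clock time, and the number of human interventions would likely matter as well for a full-product benchmark.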
You have compiled an interesting list of benchmarks and adjacent research. The implicit question is whether an established benchmark for building a full product exists.
bazooka5798|7 months ago
westurner|7 months ago
- SWE-bench leaderboard: https://www.swebench.com/
- Which metrics for e.g. "SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork"? https://news.ycombinator.com/item?id=43101314
- MetaGPT, MGX: https://github.com/FoundationAgents/MetaGPT :
> Software Company as Multi-Agent System
> MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.
- Mutation-Guided LLM-based Test Generation: https://news.ycombinator.com/item?id=42953885
- https://news.ycombinator.com/item?id=41333249 :
- codefuse-ai/Awesome-Code-LLM > Analysis of AI-Generated Code, Benchmarks: https://github.com/codefuse-ai/Awesome-Code-LLM :
> 8.2 Benchmarks: Integrated Benchmarks, Evaluation Metrics, Program Synthesis, Visually Grounded Program Synthesis, Code Reasoning and QA, Text-to-SQL, Code Translation, Program Repair, Code Summarization, Defect/Vulnerability Detection, Code Retrieval, Type Inference, Commit Message Generation, Repo-Level Coding
- underlines/awesome-ml/tools.md > Benchmarking: https://github.com/underlines/awesome-ml/blob/master/llm-too...
- formal methods workflows, coverage-guided fuzzing: https://news.ycombinator.com/item?id=40884466
- "Large Language Models Based Fuzzing Techniques: A Survey" (2024) https://arxiv.org/abs/2402.00350
Leave_OAI_Alone|7 months ago
After reviewing all this, what is your actual conclusion, or are you asking? Is the takeaway that a comprehensive benchmark exists and we should be using it, or is the takeaway that the problem space is too multifaceted for any single benchmark to be meaningful?