btown | 13 days ago
So when we look at the prompt they gave to have the agent generate its own skills:
> Important: Generate Skills First
> Before attempting to solve this task, please follow these steps:
> 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed.
> 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks.
> 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name.
> 4. Then solve the task using the skills you created as reference.
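To make concrete what step 3 of that prompt amounts to, here's a minimal sketch of the file-writing side of the workflow. The `save_skill` helper, the skill name, and the skill body are all made up for illustration; the only thing taken from the prompt is the `environment/skills/` directory and the markdown format.

```python
from pathlib import Path

def save_skill(skills_dir: str, name: str, body: str) -> Path:
    """Write one skill document as a markdown file in the skills directory."""
    path = Path(skills_dir) / f"{name}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body)
    return path

# A skill doc of the kind the agent is asked to emit (contents invented here).
skill_body = """# Skill: requests-http-basics

## Setup
pip install requests

## Usage
import requests
resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()
"""

save_skill("environment/skills", "requests-http-basics", skill_body)
```

The point of the critique stands either way: nothing in this loop consults anything outside the model's own prior output, so the skill file can only contain what the model already believed about the task.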
There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills: no web search, no exploring an existing codebase for best practices and key files. It can only draw on its own hallucinations around the task description.
It also seems they're not even restarting the session after the skills are generated, judging by that fourth step? So the agent is just regurgitating the same context that was used to generate the skills.
So yeah, your empty-codebase vibe-coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including asking for a second feature on that just-vibe-coded codebase in a fresh session.
ljm|13 days ago
If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.
LLMs are not mind readers.
balls187|13 days ago
I think it's because AI Models have learned that we prefer answers that are confident sounding, and not to pester us with questions before giving us an answer.
That is, follow my prompt, and don't bother me about it.
Because if I am coming to an AI Agent to do something, it's because I'd rather be doing something else.
xdotli|12 days ago
> opaque verifier

Could you specify which tasks' verifiers are unclear or defective for benchmarking purposes?
> No problems involving existing codebases, refactors, or anything of the like

That's also not true; we have many such tasks, e.g. https://www.skillsbench.ai/tasks/fix-build-google-auto, https://www.skillsbench.ai/tasks/fix-build-agentops, and https://www.skillsbench.ai/tasks/react-performance-debugging
jwpapi|13 days ago
The headline is really bullshit, yes, but I do like the testing.
rapind|13 days ago
Even though my CLAUDE.md is small, my rules are often ignored. Not always, though, so it's still at least somewhat useful!