pamelafox | 13 days ago

This is why I only add information to AGENTS.md when the agent has failed at a task. Then, once I've added the information, I revert the desired changes, re-run the task, and see if the output has improved. That way, I can have more confidence that AGENTS.md has actually improved coding agent success, at least with the given model and agent harness.

I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.

viraptor|13 days ago

You can also save time/tokens if you notice that every request starts by looking for the same information. You can front-load it.

sebazzz|13 days ago

Also take the randomness out of it. Sometimes the agent executes tests one way, sometimes another.

NicoJuicy|13 days ago

Don't forget to update it regularly, then.

imiric|13 days ago

That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt. You can't really be certain that the output difference is due to isolating any single variable.

pamelafox|13 days ago

So true! I've also set up automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure results. I only use that when I want even more confidence, typically when I want to compare models more precisely. I do find that the results have been fairly similar across runs for the same model/prompt/settings, even though we cannot set a seed for most models/agents.
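
For illustration, here is a minimal sketch of a repeated-run comparison of this kind. It is not the setup described above and does not use the GitHub Copilot SDK. The names `pass_rate`, `compare`, and `make_simulated_agent` are hypothetical, and the simulated agent with its made-up success probabilities only stands in for a real agent run that would return whether the task succeeded. The point is that a single run is a noisy signal, so each condition is repeated and the success rates are compared.

```python
import random
from typing import Callable, Dict


def pass_rate(run_task: Callable[[bool], bool], with_agents_md: bool, trials: int) -> float:
    """Run the task `trials` times and return the fraction that succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    successes = sum(1 for _ in range(trials) if run_task(with_agents_md))
    return successes / trials


def compare(run_task: Callable[[bool], bool], trials: int = 20) -> Dict[str, float]:
    """Compare success rates without and with the AGENTS.md addition."""
    without = pass_rate(run_task, False, trials)
    with_file = pass_rate(run_task, True, trials)
    return {"without": without, "with": with_file, "delta": with_file - without}


def make_simulated_agent(seed: int = 0) -> Callable[[bool], bool]:
    """Hypothetical stand-in for a real agent run (assumption, not a real API).

    A real version would reset the repo, run the agent on the same prompt,
    and return True if the task's checks passed.
    """
    rng = random.Random(seed)

    def simulated_run(with_agents_md: bool) -> bool:
        # Made-up success probabilities, for demonstration only.
        p = 0.8 if with_agents_md else 0.5
        return rng.random() < p

    return simulated_run


if __name__ == "__main__":
    print(compare(make_simulated_agent(seed=0), trials=20))
```

With a small number of trials, a modest delta can still be noise, so more trials (or a proper significance test) would be needed before reading much into it.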

ChrisGreenHeur|13 days ago

Same with people: no matter what info you give a person, you can't be sure they will follow it the same way every time.

averrous|13 days ago

Agreed. I've also found that a rule-discovery approach like this performs better. It's like teaching a student: they probably already perform well on some tasks, and if we feed in an extra rule for something they're already well versed in, it can hinder their creativity.