In my own tests I have found Opus to be very good at writing plans but terrible at executing them. It typically ignores half of the constraints.
https://x.com/xundecidability/status/2019794391338987906?s=2...
https://x.com/xundecidability/status/2024210197959627048?s=2...
Sammi|7 days ago
2. Have the agent review whether it followed the plan and the relevant skills accurately.
irthomasthomas|7 days ago
Here is another one, which had only about 200 tokens, and Opus still decided to change the model name I requested.
https://x.com/xundecidability/status/2005647216741105962?s=2...
Opus is bad at instruction following now.