(no title)
visarga | 4 days ago
I am actively thinking about task.md like a new programming language, a markdown Turing machine we can program as we see fit, including enforcement of review at various stages and self-reflection (am I even implementing the right thing?) kind of activity.
I tested it to reliably execute 300+ gates in a single run. That is why I am sending judges on it, to refine it. For difficult cases I judge 3-4 times before working, each judge iteration surfaces new issues. We manually decide judge convergence on a task, I am in the loop.
The judge might propose bad ideas about 20% of the time, sometimes the planner agent catches them, other times I do. Efficient triage hierarchy: judge surfaces -> planner filters -> I adjudicate the hard cases.
No comments yet.