eesmith | 5 days ago
1) the development isn't actually using red/green TDD, and
2) the result doesn't show "really good results", and it doesn't follow a very well-defined specification,
so it doesn't work as a concrete example of what you describe the second chapter as being about.
Perhaps you could show the process of refining it more, so it actually is spec compliant and tests all the implemented features?
What's the outcome difference between this approach and something that isn't TDD, like test-after with full branch coverage or mutation testing? Those at least are more automatable than manual inspection, so a better fit for agentic coding, yes?
(Of course regular branch coverage doesn't test all the regexp branches, which makes regexp use tricky to test.)
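To illustrate the point about regexp branches (a minimal sketch; `is_heading_marker` and its pattern are made up for this comment, not taken from the book): a single passing test can give 100% line and branch coverage of the Python code while one regex alternative is never exercised, because the alternation lives inside the pattern, invisible to the coverage tool.

```python
import re

# Two alternatives: ATX-style "#" headings, or a setext underline of "=".
# Python branch coverage sees only one branch in is_heading_marker itself.
HEADING_RE = re.compile(r"^(#{1,6}) |^(=+)$")

def is_heading_marker(line: str) -> bool:
    # One return statement, no visible branching: a single test input
    # like "# title" fully "covers" this function without ever matching
    # the (=+) alternative.
    return HEADING_RE.match(line) is not None
```

Mutation testing or property-based inputs would catch the untested alternative; plain branch coverage will not.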
simonw | 5 days ago
eesmith | 5 days ago
I've seen entirely too many examples of how to use TDD which give under-specified toy problems, where the solution is annoyingly incomplete for something more realistic.
And I've seen TDD projects which didn't follow the spec, but instead implemented the developers' misconceptions about the spec.
That's exactly what we see here with Markdown, where there's a spec, along with a lot of non-conformant examples in the training set by people who didn't read the spec but instead based their implementations on their experience of using Markdown.
The code generated by ChatGPT is almost correct. Seeing the process of how to get from that to a valid and well-tested solution would make for a good demonstration of the full process.
I'll again add that showing how to integrate something like branch coverage or hypothesis testing for automatic test suite generation would be really useful.
eesmith | 5 days ago
As it currently says:
> A significant risk with coding agents is that they might write code that doesn't work, or build code that is unnecessary and never gets used, or both.
> Test-first development helps protect against both of these common mistakes, and also ensures a robust automated test suite that protects against future regressions.
while the ChatGPT-generated code contains bugs, contains unnecessary code that never gets used, and the ChatGPT-generated test suite is not robust.
(As an example of unnecessary code which never gets used, _FENCE_RE contains "(?P<info>.*)$" but neither the group name nor the group are used, and the pattern is unneeded -- and all of the tests pass without it.)
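To make the dead-weight claim checkable (a hedged reconstruction: the full `_FENCE_RE` from the post isn't reproduced here, only the quoted `(?P<info>.*)$` suffix; the rest of the pattern is my guess at a plausible shape): since `.*$` matches any remainder including the empty string, stripping the suffix cannot change whether a line matches, only what is captured.

```python
import re

# Plausible fence pattern with the quoted, unused (?P<info>.*)$ suffix.
WITH_INFO = re.compile(r"^ {0,3}(?P<fence>`{3,})(?P<info>.*)$")
# The same pattern with the suffix removed.
WITHOUT_INFO = re.compile(r"^ {0,3}(?P<fence>`{3,})")

lines = ["```", "```python", "   ```text x", "not a fence"]
# Match success is identical with and without the suffix, because
# (?P<info>.*)$ always matches whatever is left on the line.
assert all(
    (WITH_INFO.match(line) is None) == (WITHOUT_INFO.match(line) is None)
    for line in lines
)
```

So if nothing ever reads the `info` group, the suffix is code that is written, reviewed, and carried forward without ever being used, which is exactly the failure mode the quoted passage warns about.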
Your writings are widely read and influential. I think it's important that you let readers know the results produced in your experiment are not actually a complete example of a "fantastic fit" of Red/Green TDD for coding agents, and to highlight their limitations.