glyph | 8 months ago

I haven't personally tried the specific tool that you have, but I have tried a variety of other tools and have had pretty negative experiences with them. I have received a lot of feedback telling me that if I tried out an agentic tool (or a different model, or etc. etc. etc.; as I covered in the post, the goalposts are endlessly moving) I would like it, because the workflow is different.

I was deliberately vague about my direct experiences because I didn't want anyone to do… well, basically this exact reply, "but you didn't try my preferred XYZ workflow, if you did, you'd like it".

What I saw reflected in your repo history was the same unpleasantness that I'd experienced previously, scaled up into a production workflow to be even more unpleasant than I would have predicted. I'd assumed that the "agentic" stuff I keep hearing about would have reduced this sort of "no you screwed up" back-and-forth. What made it particularly jarring was that it came from someone for whom I have a lot of respect (I was a BIG fan of Sandstorm, and really appreciated the design aesthetic of Cap'n Proto, although I've never used it).

As a brutally ironic coda about the capacity of these tools for automated self-delusion at scale, I believed the line "Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.", and in the post, I accepted the premise that it worked. You're not a novice here, you're on a short list of folks with world-class appsec chops that I would select for a dream team in that area. And yet, as others pointed out to me post-publication, CVE-2025-4143 and CVE-2025-4144 call into question the efficacy of "thorough review" as a mechanism to spot the sort of common errors likely to be generated by this sort of workflow, that 0xabad1dea called out 4 years ago now: https://gist.github.com/0xabad1dea/be18e11beb2e12433d93475d7...

Having hand-crafted a few embarrassing CVEs myself with no help from an LLM, I want to be sure to contextualize the degree to which this is a "gotcha" that proves anything. The main thrust of the post is that it is grindingly tedious to conclusively prove anything at all in this field right now. And even experts make dumb mistakes; this is why the CVE system exists. But it does very little to disprove my general model of the likely consequences of scaled-up LLM use for coding, either.

kentonv | 8 months ago

I do feel that the agentic thing is what made all the difference to me. The stuff I tried before that seemed pretty lame. Sorry, I know you were trying to avoid that exact comment, but it is true in my case. To be clear, I am not saying that I think you will like it. Many people don't, and that's fine. I am just saying that I didn't think I would like it, and I turned out to be wrong. So it might be worth trying.

The CVE is indeed embarrassing, particularly because the specific bug was on my list of things to check for... and somehow I didn't. I don't know what happened. And now it's undermining the whole story. Sigh.

glyph | 8 months ago

I appreciate your commitment to being open to the possibility of being surprised. And I do wish I _could_ find a context in which I could be comfortable doing this type of personal experiment. But I do remain confident in my own particular course of action chosen in the face of incomplete information.

Again, it's tough to talk about this while constantly emphasizing that the CVE is at best a tiny little data point, not anywhere close to a confirmation bullseye, but my model of this process would account for it. And the way it accounts for it is in what I guess I need to coin a term for, "vigilance decay". Sort of like alert fatigue, except there are no alerts, or hedonic adaptation, but for when you're not actually happy. You need to keep doing the same kinds of checks, over and over, at the same level of intensity, forever, to use one of these tools, and humans are super bad at that; so, at some point in your list, you developed the learned behavior "hey, this thing is actually getting most of this stuff right, I am going to be a little less careful". Resisting this is nigh impossible. The reason it's less of a problem with human code review is that when the human seems to be getting better at not making the mistakes you've spotted before, they actually are getting better at not making those mistakes, so your relaxed vigilance is warranted.