(no title)
nyanpasu64 | 8 months ago
I don't know if this was referring to Zopfli's sorter or sorting in general, but I have heard of a subtle sorting bug in Timsort: https://web.archive.org/web/20150316113638/http://envisage-p...
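From memory, the bug described at that link was roughly this: Timsort's `merge_collapse` is supposed to keep an invariant over the whole stack of pending runs, but the implementation only re-checked it near the top of the stack, so on certain huge inputs the invariant could silently break deeper down and overflow the fixed-size run stack. A toy checker for the intended invariant (names and shape are my own, purely illustrative):

```python
def invariant_holds(runs):
    """Check Timsort's intended run-stack invariant *everywhere*:
    for every i, runs[i] > runs[i+1] + runs[i+2], and runs[i] > runs[i+1].
    The buggy merge_collapse only enforced this near the top of the
    stack, which is how the violation could hide deeper down."""
    for i in range(len(runs) - 2):
        if runs[i] <= runs[i + 1] + runs[i + 2]:
            return False
    for i in range(len(runs) - 1):
        if runs[i] <= runs[i + 1]:
            return False
    return True
```

The fixed OpenJDK version effectively strengthens the check so the invariant provably holds for the whole stack, not just the runs it happens to look at.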
klabb3 | 8 months ago
This just rings of famous last words to me. There are many errors that pass this test. Edge cases in arbitrary code are not easy.
Makes me wonder how fuzzers do it. Just random data? How guided is it?
_flux | 8 months ago
One of the better known "new gen fuzzers" is AFL. Wikipedia has a high-level overview of its fuzzing algorithm https://en.wikipedia.org/wiki/American_Fuzzy_Lop_(software)#...
With AFL you can start from a JPEG decoder and have the fuzzer come up with a "valid" JPEG picture, i.e. one the decoder accepts: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...
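To answer the "how guided is it?" question above: AFL's core loop is roughly "mutate inputs, keep any mutant that reaches code you haven't seen before." Here's a toy, self-contained sketch of that loop; the target program and its "coverage" signal are invented for illustration (a real fuzzer gets coverage from compile-time instrumentation, and uses far smarter mutations):

```python
import random

def coverage_of(data: bytes) -> set:
    """Toy program under test. Returns the set of branches it executed;
    a real fuzzer derives this signal from instrumented binaries."""
    hits = {"entry"}
    if data:
        hits.add("nonempty")
        if data[0] == ord("P"):
            hits.add("first-byte-P")
            if len(data) >= 2 and data[1] == ord("K"):
                hits.add("magic-PK")
    return hits

def fuzz(seed: bytes, rounds: int = 20000) -> set:
    """AFL-flavoured loop: mutate corpus members, keep any mutant
    that produces coverage we have not seen before."""
    rng = random.Random(0)              # deterministic for the example
    corpus = [seed]
    seen = set(coverage_of(seed))
    for _ in range(rounds):
        parent = bytearray(rng.choice(corpus))
        if parent and rng.random() < 0.5:
            parent[rng.randrange(len(parent))] = rng.randrange(256)  # flip a byte
        else:
            parent.insert(rng.randrange(len(parent) + 1), rng.randrange(256))  # insert a byte
        cov = coverage_of(bytes(parent))
        if not cov <= seen:             # new coverage -> keep this input
            seen |= cov
            corpus.append(bytes(parent))
    return seen
```

Starting from the seed `b"P"`, blind random bytes would almost never stumble on the full `"PK"` magic, but the coverage feedback lets the fuzzer keep the partial match and build on it; that's the mechanism behind the pulled-out-of-thin-air JPEGs in the post above.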
rjpower9000 | 8 months ago
Indeed, this is exactly the type of subtle case you'd worry about when porting. Fuzzing would be unlikely to discover a bug that only occurs on giant inputs or needs a special configuration of lists.
In practice I think it works out okay because most of the time the LLM has written correct code, and when it doesn't it's introduced a dumb bug that's quickly fixed.
Of course, if the LLM introduces subtle bugs, that's even harder to deal with...
hansvm | 8 months ago
What domain do you work in?
I hope I'm just misusing the tool, but I don't think so (math+ML+AI background, able to make LLMs perform in other domains, able to make LLMs sing and dance for certain coding tasks, have seen other people struggle in the same ways I do trying to use LLMs for most coding tasks, haven't seen evidence of anyone doing better yet).

On almost any problem where I'd be faster letting an LLM attempt it rather than just banging out a solution myself, it only comes close to being correct with intensive, lengthy prompting -- after much more effort than just typing the right thing in the first place. When it's wrong, the bugs often take more work to spot than to just write the right thing, since you have to carefully scrutinize each line anyway while simultaneously reverse engineering the rationale for each decision. For example:

- The API is structured and named such that you expect pagination to be handled automatically, but that's actually an additional requirement the caller must handle, leading to incomplete reads which look correct in prod ... till they aren't.

- When moving code from point A to point B it removes a critical safety check, but the git diff is next to useless, so you have to hand-review that sort of tedium and actually analyze every line instead of trusting the author when they say that a certain passage is a copy-paste job.

- It can't automatically pick up on the local style (even when explicitly prompted as to that style's purpose) and requires a hand-curated set of examples to figure out what a given comptime template should actually be doing, violating all sorts of invariants in the generated code, like running blocking syscalls inside an event loop implementation but using APIs which make doing so _look_ innocuous.
I've shipped a lot of (curated, modified) LLM code to prod, but I haven't yet seen a single model or wrapper around such models capable of generating nearly-correct code "most" of the time.
I don't doubt that's what you've actually observed though, so I'm passionately curious where the disconnect lies.
awesome_dude | 8 months ago
I have a concern about people's overconfidence in fuzz testing.
It's a great tool, sure, but all it does is select (and try) inputs at random from the set of all possible inputs that can be generated for the API.
For a strongly typed system that means randomly selecting ints from all the possible ints for an API that only accepts ints.
If the API accepts any group of bytes possible, fuzz testing is going to randomly generate groups of bytes to try.
The main advantage this has over other forms of testing is that it's not constrained by people thinking "oh, these are the likely inputs we'll have to deal with."