top | item 45336463

This is a key question, in my opinion. It's one of the things that make benchmarking the SWE capabilities of LLMs difficult. It's usually impossible to know whether the LLM has seen a problem before, and coming up with new, representative problem sets is time-consuming.

CuriouslyC|5 months ago

You can just fuzz names and switch to a whitespace-compact representation.
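A rough sketch of what that could look like for Python source, assuming the problem is a self-contained function with no imports, attribute access, or keyword arguments (none of which this handles). `NameFuzzer` and `fuzz_and_compact` are made-up names for illustration, not part of any benchmark tooling, and it needs Python 3.9+ for `ast.unparse`.

```python
import ast
import builtins
import hashlib


class NameFuzzer(ast.NodeTransformer):
    """Consistently rename user-defined identifiers; builtins are left alone."""

    def __init__(self, salt: str = "demo"):
        self.salt = salt
        self.mapping: dict[str, str] = {}

    def _fuzz(self, name: str) -> str:
        if hasattr(builtins, name):
            return name  # keep len, range, print, ... callable
        if name not in self.mapping:
            digest = hashlib.sha1((self.salt + name).encode()).hexdigest()[:8]
            self.mapping[name] = f"v_{digest}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._fuzz(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._fuzz(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._fuzz(node.name)
        self.generic_visit(node)  # also rename parameters and body
        return node


def fuzz_and_compact(source: str, salt: str = "demo") -> str:
    tree = NameFuzzer(salt).visit(ast.parse(source))
    lines = []
    for line in ast.unparse(tree).splitlines():
        body = line.lstrip(" ")
        if not body:
            continue  # drop blank lines
        # ast.unparse indents by 4 spaces; use 1 space per level instead.
        # Caveat: this would also alter multi-line docstrings.
        depth = (len(line) - len(body)) // 4
        lines.append(" " * depth + body)
    return "\n".join(lines)


SAMPLE = """
def total(prices, tax):
    result = 0
    for price in prices:
        result += price * (1 + tax)
    return result
"""

if __name__ == "__main__":
    print(fuzz_and_compact(SAMPLE))
```

The renaming is deterministic for a given salt, so the same problem can be re-fuzzed differently per run by changing the salt. Whether the result still counts as the same test is a separate question.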

Uehreka|5 months ago

If you fuzz the names they won’t mean the same thing anymore, and then it’s no longer the same test. If you remove the whitespace the LLM will just run a formatter on the code. It’s not like the LLM just loads in all the code and then starts appending its changes.

flare_blitz|5 months ago

And your basis for saying this is...?