This is a key question, in my opinion. It's one of the things that make benchmarking the SWE capabilities of LLMs difficult. It's usually impossible to know whether the LLM has already seen a problem in its training data, and coming up with new, representative problem sets is time-consuming.
CuriouslyC|5 months ago
Uehreka|5 months ago
flare_blitz|5 months ago