sigtstp | 10 months ago

I feel this makes some fundamental conceptual mistakes and is just riding the LLM wave.

"Semantics" is literally behavior under execution. This is syntactic analysis by a stochastic language model. I know the NLP literature uses "semantics" to talk about representations, but that usage is itself contested [1].

Coming back to testing, this implicitly relies on the strong assumption that the LLM correctly associates the code (syntax) with assertions about properties under execution (semantic properties). That is a very risky assumption: once again, these models are stochastic in nature and cannot guarantee even syntactic correctness, let alone semantic correctness. Even being generous about the former, there is a track record of the latter failing and producing subtle bugs [2][3][4][5]. On top of that, LLMs have been observed to be biased toward "agreeing" with the premise presented to them.
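
To make the syntax-versus-execution gap concrete, here is a minimal Python sketch. It is my own illustration, not taken from the article or the cited papers; `round_all`, `plausible_test`, `executed_test` and `holds` are hypothetical names. The first assertion is the kind that reads as obviously right ("halves round up"), yet only running it shows that it is false, because Python's built-in `round()` rounds halves to the nearest even integer.

```python
def round_all(values):
    """Round each value with Python's built-in round()."""
    return [round(v) for v in values]


def plausible_test():
    # Hypothetical assertion that "looks right" when read as text:
    # it assumes halves always round up.
    assert round_all([0.5, 1.5, 2.5]) == [1, 2, 3]


def executed_test():
    # What the interpreter actually does: round-half-to-even.
    assert round_all([0.5, 1.5, 2.5]) == [0, 2, 2]


def holds(test):
    """Run a zero-argument test and report whether its assertion holds."""
    try:
        test()
        return True
    except AssertionError:
        return False


results = {
    "plausible": holds(plausible_test),
    "executed": holds(executed_test),
}
print(results)
```

Nothing in the text of `plausible_test` marks it as wrong; the property is only settled by execution, which is the sense of "semantics" I mean above.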

It also kind of misses the point of testing, which is the engineering (not automation) task of reasoning about code and doing QC. The tests may well be run automatically later; I'm talking about their conception. I feel it's a dangerous, albeit tempting, decision to relegate that to an LLM. Fuzzing, sure. But not assertions about program behavior.

[1] A Primer in BERTology: What we know about how BERT works https://arxiv.org/abs/2002.12327 (Layers encode a mix of syntactic and semantic aspects of natural language, and it's problem-specific.)

[2] Large Language Models of Code Fail at Completing Code with Potential Bugs https://arxiv.org/abs/2306.03438

[3] SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? https://arxiv.org/abs/2502.12115 (best models unable to solve the majority of coding problems)

[4] Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT https://arxiv.org/abs/2304.10778

[5] Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions https://arxiv.org/abs/2308.02312v4

EDIT: Added references
