top | item 40862969

jungsteven | 1 year ago

Nice question! The paper acknowledges that LLMs generate mutations more slowly and at higher cost than traditional mutation testing tools like PIT and Major, and it does report metrics such as cost per 1K mutations. However, the researchers focused on the effectiveness and quality of the LLM-generated mutations. For instance, GPT-3.5 achieves a 96.7% real bug detectability rate compared to Major’s 91.6% (not to mention that GPT-4 outperformed all of them). All in all, LLMs produced fewer equivalent mutants, mutants with higher fault detection potential, as well as higher coupling and semantic similarity with real faults.
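
To make the terminology concrete, here is a minimal Python sketch. It is not taken from the paper, and all function names are hypothetical. It shows one mutant that a test can kill, one equivalent mutant that no test can kill, and how a simple mutation score falls out of that.

```python
def max_of(a, b):
    """Original function under test."""
    return a if a > b else b


def mutant_killable(a, b):
    """Mutant: '>' replaced by '<'. Behaves differently, so a test can kill it."""
    return a if a < b else b


def mutant_equivalent(a, b):
    """Mutant: '>' replaced by '>='. For integers it returns the same value
    as the original on every input, so no test can tell them apart."""
    return a if a >= b else b


def is_killed(mutant, cases):
    """A mutant is killed if some test input makes it disagree with the original."""
    return any(mutant(a, b) != max_of(a, b) for a, b in cases)


# Illustrative test inputs (assumed, not from any real suite).
cases = [(1, 2), (2, 1), (3, 3)]

killed = [is_killed(m, cases) for m in (mutant_killable, mutant_equivalent)]
mutation_score = sum(killed) / len(killed)  # fraction of mutants killed
```

An equivalent mutant can never be killed, so it lowers the score without pointing at a real gap in the tests. That is why generating fewer of them counts as a quality win.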

vlovich123|1 year ago

> All in all, LLMs produced fewer equivalent mutants, mutants with higher fault detection potential, as well as higher coupling and semantic similarity with real faults.

The problem with PIT and Major is that they don’t do profile-guided mutation testing [0], which in theory would raise the detectability rate without a meaningful cost increase. Other work explores the use of GANs [1], which would probably be cheaper and likely as effective, but not as sexy as LLMs.

[0] https://arxiv.org/pdf/2102.11378

[1] https://ar5iv.labs.arxiv.org/html/2303.07546

jungsteven|1 year ago

Thanks for sharing the papers! I remember reading the first one from Google and can’t wait to dive into the new one. Appreciate the insights!