mbh159 | 12 days ago

The 8% one-shot / 50% unbounded injection numbers from the system card are more honest than most labs publish, and they highlight exactly why you can't evaluate safety with static tests. An attacker doesn't get one shot — they iterate. The right metric isn't "did it resist this prompt" but "how many attempts until it breaks." That's inherently an adversarial, multi-turn evaluation. Single-pass safety benchmarks are measuring the wrong thing for the same reason single-pass capability benchmarks are: real-world performance is sequential and adaptive.
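To put rough numbers on "how many attempts until it breaks", here is a minimal sketch. It assumes every attempt succeeds independently with the same 8% probability, which is a simplification of mine: a real attacker adapts between attempts, and this is not how the system card arrived at its 50% figure. The function names are made up for illustration.

```python
def cumulative_success(p: float, k: int) -> float:
    """Chance that at least one of k attempts succeeds.

    Assumes attempts are independent with the same per-attempt
    success probability p (an idealization, not a measured model).
    """
    return 1.0 - (1.0 - p) ** k


def attempts_to_reach(p: float, target: float) -> int:
    """Smallest number of attempts whose cumulative success chance >= target."""
    if not (0.0 < p <= 1.0) or not (0.0 < target < 1.0):
        raise ValueError("p must be in (0, 1] and target in (0, 1)")
    k = 1
    while cumulative_success(p, k) < target:
        k += 1
    return k


if __name__ == "__main__":
    p = 0.08  # one-shot rate quoted in the comment above
    for k in (1, 5, 9, 20):
        print(f"{k:>2} attempts -> {cumulative_success(p, k):.1%} chance of a break")
    print("attempts needed to pass 50%:", attempts_to_reach(p, 0.5))
```

Under that independence assumption, an 8% one-shot rate passes a 50% cumulative break chance at the ninth attempt, so even a naive retry loop erodes a one-shot number quickly. An attacker who learns from each failure should need fewer attempts than this estimate.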
