(no title)
mbh159
|
12 days ago
The 8% one-shot / 50% unbounded injection numbers from the system card are more honest than most labs publish, and they highlight exactly why you can't evaluate safety with static tests. An attacker doesn't get one shot — they iterate. The right metric isn't "did it resist this prompt" but "how many attempts until it breaks." That's inherently an adversarial, multi-turn evaluation. Single-pass safety benchmarks are measuring the wrong thing for the same reason single-pass capability benchmarks are: real-world performance is sequential and adaptive.
No comments yet.