zlatkov | 3 months ago

That's true. Even small API or model version updates can shift evaluation behavior. G-Eval helps reduce that variance, but it doesn’t eliminate it completely. I think long-term stability will probably require some combination of fixed reference models and calibration datasets.
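To make the calibration-dataset idea concrete, here is a rough sketch of how one might check an evaluator for drift: score a fixed set of examples once with a pinned reference setup, store those scores, and re-score the same examples after any API or model update. The function names, the example IDs, the scores, and the 0.25 tolerance are all made up for illustration and are not part of G-Eval or any library.

```python
from statistics import mean


def calibration_drift(reference, current):
    """Compare two score runs over the same calibration items.

    `reference` and `current` map a (hypothetical) example ID to the
    score the evaluator gave it in each run.
    """
    if reference.keys() != current.keys():
        raise ValueError("both runs must score the same calibration items")
    diffs = [current[k] - reference[k] for k in reference]
    return {
        "bias": mean(diffs),  # signed shift: is the judge now harsher or more lenient?
        "mean_abs_shift": mean([abs(d) for d in diffs]),  # overall movement
    }


def has_drifted(reference, current, tolerance=0.25):
    """Flag the evaluator when the average score moves more than `tolerance`."""
    return calibration_drift(reference, current)["mean_abs_shift"] > tolerance


# Illustrative scores only: the stored baseline vs. a run after an update.
reference = {"ex1": 4.0, "ex2": 2.0, "ex3": 5.0}
current = {"ex1": 4.5, "ex2": 2.5, "ex3": 5.0}

report = calibration_drift(reference, current)
print(report, has_drifted(reference, current))
```

If the check trips, the stored bias could be used to correct new scores, or the evaluation could be pinned back to the reference model. Which of those is appropriate depends on the setup.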
