top | item 45286645

(no title)

blndrt | 5 months ago

I think there's a chance we could squeeze a better benchmark score, although there's a risk of overfitting which I wanted to avoid.

The simplest test would be to make previously “unreachable” tasks succeed through obvious prompt tweaks — like reordering instructions or emphasizing key parts.

That said, my methodology intentionally avoided exposing the model to actual tasks. Instead, I focused on the domain as a whole: refining the instructions so a smaller model could understand and act reliably.

discuss

No comments yet.