(no title)

noddybear | 11 months ago

These models were not post-trained - all off-the-shelf.

We can fit about 128 pairs maximum in the context, but this performed the same as 32, which we ultimately decided on (for cost and latency reasons).

Encoding the inputs/outputs to make them shorter degraded performance. It seems that descriptive names are helpful for pretrained models, because the models have an intuition about what those names mean.
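To make the descriptive-names point concrete, here is a minimal sketch of how in-context input/output pairs might be formatted as calls to a named function; the function and parameter names (`build_prompt`, `sort_list`, etc.) are hypothetical illustrations, not the actual prompt format from the experiment.

```python
def build_prompt(pairs, query, fn_name):
    """Format k in-context examples as calls to a descriptively
    named function, then append the query input for the model
    to complete. A meaningful fn_name (e.g. 'sort_list') gives a
    pretrained model a hint about the mapping; an opaque one
    (e.g. 'f') does not."""
    lines = [f"{fn_name}({x!r}) -> {y!r}" for x, y in pairs]
    lines.append(f"{fn_name}({query!r}) ->")
    return "\n".join(lines)

pairs = [([3, 1, 2], [1, 2, 3]), ([5, 4], [4, 5])]
print(build_prompt(pairs, [9, 7], fn_name="sort_list"))
```

The same pairs could be rendered with a meaningless name to test how much of the performance comes from the name itself versus the examples.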


jxjnskkzxxhx | 11 months ago

Follow-up: do you have a hypothesis for why Claude performs much better than the rest at these tasks?

Is it just because Claude is the best at coding and the API is code? (Not very interesting.) Maybe if the API required the LLMs to write in poems, the best LLM at poetry would win...

Or is it because whatever makes Claude good at coding also makes it good at mathematical tasks? This is more interesting, as it would show some transfer learning. It would also suggest that if you're training for a specific task, you would benefit from training on adjacent tasks; e.g. if you're training for maths, you could benefit from training on coding. I believe this is actually true for humans.

And would you know how to check whether any of the above hypotheses is correct?