(no title)
withinboredom | 4 days ago
Claude 4.5: not overfitted too much -- does the right thing 6/10 times.
Claude 4.6: overfitted -- does the right thing 2/10 times.
OpenAI 5.3: overfitted -- does the right thing 3/10 times.
These aren't perfect benchmarks, but it lets me know how much babysitting I need to do.
My point being that older Claude models weren't overfitted nearly as much, so I'm confirming what you're saying.
Kim_Bruning|4 days ago
At any rate, with an assembler, you end up with a lot of random letter-salad mnemonics with odd use cases, so that is very likely to tokenize in interesting ways at the very least.
withinboredom|4 days ago