windsignaling | 1 year ago
Like counting the number of R's in strawberry, many of these are character-counting or character manipulation problems which tokenization is not well-suited for.
I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.
IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. Just as you wouldn't use an LLM to multiply 2 large numbers, you'd use a calculator.
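To make the "use a calculator" point concrete, here's a minimal Software-1.0 sketch of the strawberry question (the `count_char` helper is just illustrative, not from any particular library):

```python
def count_char(text: str, ch: str) -> int:
    """Count occurrences of a character, case-insensitively."""
    return text.lower().count(ch.lower())

print(count_char("strawberry", "r"))  # 3
```

One line of string handling does reliably what tokenized models struggle with, which is the whole argument: hand the model a tool instead of training it to see characters.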
michaelt | 1 year ago
Imagine a model that isn't sure if 9.11 is greater than 9.9 - which is difficult to reason about, because tokens.
Could such a model coach kids in math? Could it proofread a paper, or sense-check a business plan? Could it summarise a long document about carbon emissions? Could it generate a GUI? Could it spot mistakes in an OCRed document? Spot an off-by-one error or divide-by-zero in computer code?
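To be fair, part of the trouble is that "9.11 vs 9.9" is genuinely ambiguous in plain text: as decimals 9.11 is smaller, but as version numbers it's larger. A quick sketch of both readings:

```python
# As decimal numbers: 9.11 < 9.9
assert 9.11 < 9.9

# As version numbers: "9.11" comes after "9.9"
# (hypothetical helper; compares dot-separated components numerically)
def parse_version(v: str) -> tuple:
    return tuple(int(part) for part in v.split("."))

assert parse_version("9.11") > parse_version("9.9")
```

A math tutor has to commit to the decimal reading every time, which is exactly where a token-level model can wobble.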
aprilthird2021 | 1 year ago
In fact, your final point, that these are tasks for software rather than LLMs, is only made clearer to more people by the prominence of these "gotchas".
enum | 1 year ago
- The paper has an example where the model reasons "I'm frustrated" and then produces an answer that it "knows is wrong". You wouldn't know it if you didn't examine the reasoning tokens.
- There are two examples where R1 often gets stuck "thinking forever"
If these failures happen on these questions, where else can they happen? We'll start to find out soon enough.
Workaccount2 | 1 year ago
"Here are a variety of personal documents about John Doe. Fill out the McDonalds job application with information retrieved from the document set."