top | item 42996931

I'm not a fan of these "gotchas" because they don't test for what we really care about.

Like counting the number of R's in "strawberry", many of these are character-counting or character-manipulation problems, which tokenization is not well suited for.

I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.

IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. You wouldn't use an LLM to multiply two large numbers; you'd use a calculator.
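To make the "Software 1.0" point concrete, here is a minimal Python sketch of the kinds of tasks mentioned above. The function names and the two large numbers are made up for illustration; nothing here comes from the paper being discussed.

```python
from collections import Counter


def count_char(word: str, ch: str) -> int:
    """Count case-insensitive occurrences of a single character in a word."""
    return word.lower().count(ch.lower())


def is_permutation(a: str, b: str) -> bool:
    """Return True if b uses exactly the same characters as a, in any order."""
    return Counter(a) == Counter(b)


# Arbitrary illustrative operands; Python integers have arbitrary precision,
# so the product is exact with no special handling.
x = 123456789012345678901234567890
y = 987654321098765432109876543210
product = x * y

print(count_char("strawberry", "r"))       # 3
print(is_permutation("listen", "silent"))  # True
print(product)
```

Code like this works on characters and digits directly, so it never runs into the tokenization issue at all.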

michaelt|1 year ago

The problem is some of the "gotchas" seem rather important in nontrivial applications.

Imagine a model that isn't sure if 9.11 is greater than 9.9 - which is difficult to reason about, because tokens.

Could such a model coach kids in math? Could it proofread a paper, or sense-check a business plan? Could it summarise a long document about carbon emissions? Could it generate a GUI? Could it spot mistakes in an OCRed document? Spot an off-by-one error or divide-by-zero in computer code?
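For reference, the 9.11 versus 9.9 comparison is unambiguous in conventional code once you decide what the strings mean. A rough Python sketch, with hypothetical helper names: read as decimal numbers, 9.11 is smaller; read as dotted version numbers (my assumption about one possible source of confusion, not something stated above), 9.11 is larger.

```python
from decimal import Decimal


def greater_as_decimal(a: str, b: str) -> bool:
    """Compare two strings as exact decimal numbers."""
    return Decimal(a) > Decimal(b)


def greater_as_version(a: str, b: str) -> bool:
    """Compare two strings as dotted version numbers, component by component."""
    parts_a = tuple(int(p) for p in a.split("."))
    parts_b = tuple(int(p) for p in b.split("."))
    return parts_a > parts_b


print(greater_as_decimal("9.11", "9.9"))  # False: 9.11 < 9.90
print(greater_as_version("9.11", "9.9"))  # True: (9, 11) > (9, 9)
```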

aprilthird2021|1 year ago

The gotchas are good to help outline where the risk is when using these models. What you and I care about might change and one day counting letters in strings or solving trivia puzzles may be something we care about. It's nice to know the fuzzy edges of the system we are relying on day to day.

In fact, your final point, that these are tasks for software rather than LLMs, is only proven to more people and made clearer by the prominence of these "gotchas".

enum|1 year ago

The problems are not important, but they illustrate failures that are. For example:

- The paper has an example where the model reasons "I'm frustrated" and then produces an answer that it "knows is wrong". You wouldn't know it if you didn't examine the reasoning tokens.

- There are two examples where R1 often gets stuck "thinking forever".

If these failures happen on these questions, where else can they happen? We'll start to find out soon enough.

Workaccount2|1 year ago

Someone needs to make a data transformation benchmark.

"Here is a variety of personal documents about John Doe. Fill out the McDonald's job application with information retrieved from the document set."