item 42994565

rahidz | 1 year ago

What is so interesting to me is that the reasoning traces for these often have the correct answer, but the model fails to realize it.

Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" – no, too long. Wait, three letters each. Let me think again. Maybe "hay fever" is two words but not three letters each. Maybe "dry eye"? "Dry" and "eye" – both three letters. "Dry eye" is a condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) – no, they don't rhyme. "Eye" is pronounced like "i", while "dry" is "d-rye". Not the same ending."

Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is changed to next letter, maybe the original word is "footnot" (but that's not a word). Alternatively, maybe "foot" + "note", but "note" isn't a body part."


Workaccount2|1 year ago

The other day I watched Gemini's reasoning go in circles for ~30 seconds trying to determine whether 9.11 was greater than 9.9.

----

"Realize a direct comparison of tenths is not enough: Wait, this is not right. In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. This initial comparison is incorrectly stated. Let's re-examine.

Re-examine tenths place comparison (correctly): In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. Wait again! This is still incorrect. I'm misreading the digits. Let's slow down.

Re-examine the numbers digit by digit from left to right, after the decimal:

9.9: 9 in the tenths place, nothing explicitly written in the hundredths place (we can think of it as 0 in the hundredths place, so 9.90).

9.11: 1 in the tenths place, 1 in the hundredths place.

Compare tenths places: In 9.9 (or 9.90), the tenths digit is 9. In 9.11, the tenths digit is 1. Wait... something is still wrong with my comparison logic. I'm getting confused. Let's re-read the numbers again.

Correctly read and compare tenths digits:

9.9 - Tenths digit is 9.

9.11 - Tenths digit is 1. This is still wrong! I am consistently misreading the numbers. Let's go back to the start."

----

This shows steps 5 to 9. It went to step 35 before determining that 9.9 > 9.11. This was after it decided to convert to money: $9.11 vs $9.90.
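For reference, the comparison the model was grinding through is a one-liner with Python's `decimal` module (my own minimal sketch, not part of the original comment):

```python
from decimal import Decimal

# Exact decimal comparison: 9.90 > 9.11
print(Decimal("9.9") > Decimal("9.11"))  # True
```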

hto2i4j234234|1 year ago

I wonder how much of this arithmetic confusion comes from software versioning numbers where the decimal character is actually a tuple-delimiter.

fzzzy|1 year ago

Training a model to reason that it should write and execute code is the solution here. Synthetic datasets showing lots of math questions followed by reasoning that writes and executes the program for the answer.
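A hypothetical sketch of that idea (the structure and names here are invented for illustration, not an actual training setup): the model's output is a short program, and the interpreter supplies the arithmetic instead of the model reasoning about digits token by token.

```python
# Hypothetical tool-use sketch: the model emits a program as its
# "reasoning step", and the runtime executes it for the answer.
emitted_program = (
    "from decimal import Decimal\n"
    "result = Decimal('9.9') > Decimal('9.11')"
)

namespace = {}
exec(emitted_program, namespace)  # execute the model-written code
print(namespace["result"])        # True
```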

Validark|1 year ago

Obviously very stupid reasoning going on, but reasoning nonetheless? It makes me think we're on the right track that it basically seems to know what steps should be taken and how to step through the steps. I don't know why it is getting so incredibly tripped up, maybe it's extremely uncertain about whether it can map "9.9"["tenths place"] => "9". But this is still impressive to me that a machine is doing this.

empath75|1 year ago

This and its struggles with spelling questions are both artifacts of tokenization, not really a failure of reasoning. I think there's probably a simple solution that solves both this and the "how many r's are there in strawberry" problem, though I don't know what it would be.
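The strawberry check itself is trivial once you operate on characters rather than tokens (a one-line illustration, not a proposed fix):

```python
# Count occurrences of "r" at the character level
print("strawberry".count("r"))  # 3
```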

photonthug|1 year ago

This is hilarious, and makes me wonder whether there’s some main place where people are archiving examples of AI fails now. It would be amusing, but also seems like a public service and might help to avoid billions of dollars getting burnt at the altar of hype.

sd9|1 year ago

I wonder if RLHF interfered with 9.11 (which could be interpreted as a date), preventing the model from considering it naturally.

Wonder if the same thing would have happened with 9.12.

What was your original prompt?

armcat|1 year ago

It feels like a lot of the reasoning tokens go to waste on a pure brute-force approach: plugging in numbers, evaluating, and comparing against the answer. "Nope, that didn't work, let's try 4 instead of 6 this time," etc. What if the reward function instead focused on diversity of procedures within a token budget (10k-20k tokens)? I.e., RL rewards the model for trying different methods or generating different hypotheses, rather than brute-forcing its way through and potentially getting stuck in loops.
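One hypothetical shape for such a reward (every name, weight, and threshold here is invented for illustration, not a real RL setup): score correctness, add a small bonus per distinct method attempted, and zero out anything over the token budget.

```python
# Hypothetical diversity-aware reward sketch -- invented for illustration.
def diversity_reward(methods_tried, correct, token_count, budget=20_000):
    if token_count > budget:
        return 0.0                      # over budget: no reward
    unique_methods = len(set(methods_tried))
    base = 1.0 if correct else 0.0      # reward for the right answer
    return base + 0.1 * unique_methods  # bonus for distinct approaches

# Two distinct methods tried, correct answer, within budget:
print(diversity_reward(["algebra", "guess-and-check", "algebra"], True, 12_000))  # 1.2
```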

ANighRaisin|1 year ago

I would say that diversity isn't something that's easy to reinforce, but I do think it will occur as a natural consequence of optimizing for shorter chains of thought across a wide variety of problems. Of course, the nature of the data may lead it to brute-force, but that can be fixed with clever fine-tuning.

enum|1 year ago

The nature of the problems makes it relatively easy to follow along with the models' reasoning and reasoning errors. For example, on this problem (answer "New England"):

> Think of a place in America. Two words, 10 letters altogether. The first five letters read the same forward and backward. The last five letters spell something found in the body. What place is this?

R1 fixates on answers of the form "CITY, STATE" and eventually returns some confidently wrong nonsense. It doesn't try to explore answers that don't fit the "CITY, STATE" template.
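The intended answer's constraints are easy to verify mechanically (my own quick check, not from the thread):

```python
# "New England": two words, 10 letters altogether
letters = "newengland"
first5, last5 = letters[:5], letters[5:]
print(first5 == first5[::-1])  # True: "newen" reads the same both ways
print(last5)                   # gland -- something found in the body
```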

empath75|1 year ago

o1 high got this after I gave it the hint that the first five letters are not a single word.

viraptor|1 year ago

I hope the new models will be trained with better words to continue the thought process. Right now it seems like "wait", "but", "let me think again", etc. are the main ones, which seem to encourage self-doubt too much. They need some good balance instead.

dr_kiszonka|1 year ago

It would be fun to experiment with, e.g., positive self-talk like "you've got this", "you've trained for this," etc.