helmsb | 1 year ago

I did a few tests and asked it some legal questions. 4o gave me the correct answer immediately.

o1-preview gave a much more in-depth but completely wrong answer. It took five follow-ups to get it to recognize that it had hallucinated a non-existent law.

AhtiK|1 year ago

That is very interesting. Would you mind testing the same prompt with Claude 3.5 Sonnet and Opus? If they are not available to you, would you be willing to share the prompt/question? Thank you.

elicksaur|1 year ago

This is interesting since they claim it does well on STEM questions, which I’d assume would be a similar level of reasoning complexity for a human.

abernard1|1 year ago

This is an interesting one because math is doing so much of the heavy lifting. And symbolic math has a far smaller representational space than numerical math.

There is one other wonderful thing about symbolic math: the glorious '=' sign. It's structured everywhere from top to bottom and left to right, which is amenable to the next-token prediction behavior and multi-head attention of transformer-based LLMs.

My guess is that formulating the problem statement as an equation is as difficult a problem for these models as actually running through the equations. However, having taken the Physics GRE, and knowing they try for parity of difficulty between years (even though they normalize it), the problems are fairly standard, and the same problem types recur with permutations from year to year.

This is not to diminish how cool this is, just that standardized tests do have an element of predictability to them. I find this result actually neat though; it's an actual qualitative improvement over non-CoT LLMs, even if things like Mathematica can do the steps more reliably once the problem is formulated. I think that, judiciously used, this is a valuable feature.

waveBidder|1 year ago

A difficult-to-guess fraction of all of these results is training to the test in various forms.

m101|1 year ago

Perhaps the smaller model used in o1 is overtrained on arXiv and code relative to 4o (or undertrained on legal text).