
Qwen2-Math

128 points| limoce | 1 year ago |qwenlm.github.io

38 comments

[+] vessenes|1 year ago|reply
Sample solution for Balkan MO 2023 seems... questionable?

The problem involves players removing stones sequentially and asks which player will win with perfect play; the listed solution definitely doesn't consider all possible types of strategies.

The answer it gives may be right; in fact I bet it is correct (the second player), but does the Qwen team offer the solution as correct, including the logic? And is the solution's logic correct?

[+] sterlind|1 year ago|reply
Using human language models for math problems is a bad idea for exactly this reason: judging correctness of proofs in English is subjective, whereas proofs in Lean are objective and easy to machine-check. Deepmind's AlphaProof was so remarkable precisely because it spoke Lean, so its proofs were verifiable. Why waste time on anything else?
[+] pathsjs|1 year ago|reply
I was going to write the same thing. I checked the first three problems and all solutions are partial at best. Now, don't get me wrong, this is still impressive. But presenting the problems with the implication that Qwen solves them correctly, when it doesn't, does not really inspire trust.
[+] tveita|1 year ago|reply
Same with the Lusophon Mathematical Olympiad 2023 example.

It makes a bunch of specious claims about parity. (Adding 4 to a number changes the parity? Therefore, the parity of each colour always changes for each turn? Therefore, the parity will always be the same as it was initially?) And then it concludes that since the parity is right, 2022 of each colour must be a reachable state.

Which, as you say, is quite possibly the correct answer, but it's really weird to put it out there as an example with no comment on the reasoning.

[+] meroes|1 year ago|reply
This is my experience doing RLHF for math for LLMs. So many times it's a partial answer. It may find the minimal case, for example, but it doesn't say why it's the minimal one, even when the prompt specifically asks for the why.
[+] ivancho|1 year ago|reply
It's complete nonsense. "The total number of moves in the game is equal to the number of stones initially in the pile, which is 5000."

Similarly on the Martian question "If we transform 1 red and 1 green Martian, we get 4 blue Martians. This changes the parity of red and green Martians from even to odd, and the parity of blue Martians from even to odd." - is complete nonsense too.
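The quoted parity claim is easy to falsify mechanically. A minimal Python sanity check (illustrative only, not from the original solution) confirms that adding 4, like adding any even number, never changes a number's parity:

```python
# Check the claim "adding 4 changes the parity": it is false.
# (n + 4) and n always have the same remainder mod 2.
for n in range(100):
    assert (n + 4) % 2 == n % 2, f"parity changed at n={n}"
print("adding 4 preserves parity for all tested n")
```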

"the sum of the digits of a number k modulo 2 is equivalent to k mod 2" - 12?

Basically all "solutions" are regurgitated techniques from math competitions which are used completely incorrectly but with a lot of confidence

[+] tempfile|1 year ago|reply
First solution (IMO 2002) is completely wrong. It shows that 1, 2, or 3 cubes are not sufficient via an obstruction that doesn't rule out 4 cubes, but it does not prove that there actually are 4 cubes summing to the given number. That is much harder (and I don't know the true answer).
[+] kevinventullo|1 year ago|reply
I think it’s about the same level of difficulty as showing you need at least four, but yeah I agree that what is written is insufficient. One solution is that you can write

2002 = 10^3 + 10^3 + 1^3 + 1^3

Then, multiply through by 2002^2001, which is itself a cube since 2001 is divisible by 3.
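The construction above can be verified directly: multiplying through by 2002^2001 = (2002^667)^3 turns each of the four cubes into a cube, and the sum becomes 2002^2002. A quick sketch (Python's arbitrary-precision integers make the full check feasible):

```python
# 2002 = 10^3 + 10^3 + 1^3 + 1^3.
assert 10**3 + 10**3 + 1**3 + 1**3 == 2002

# 2002^2001 is a perfect cube since 2001 = 3 * 667.
a = 2002 ** 667  # so a^3 == 2002^2001

# Multiplying the decomposition through by a^3 gives four cubes
# summing to 2002^2002.
total = (10 * a) ** 3 + (10 * a) ** 3 + a ** 3 + a ** 3
assert total == 2002 ** 2002
```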

[+] ipnon|1 year ago|reply
These solutions aren't perfect, but imagine how many more people can become mathematicians now that the price of an elite IMO medal winning tutor can be quantified as Hugging Face hosting costs!
[+] Strix97|1 year ago|reply
I don't think this is a useful tool for becoming a mathematician. Becoming one does not necessarily involve solving these kinds of puzzle exercises.

It's useful for honing your creative thinking and learning how to approach a mathematical problem, but it won't make you a mathematician.

[+] meroes|1 year ago|reply
A mathematician tutor that doesn't know how to write proofs, though. Or, at its most charitable, one that doesn't write proofs the way human mathematicians do. I'm not talking about using Lean or similar, but about the common parlance and rigor outside of theorem provers.
[+] ziofill|1 year ago|reply
I see that they do some decontamination of the datasets, in the hope that the models won't just recite answers from the training data. But in the recent interview with Subbarao Kambhampati on MLST (https://www.youtube.com/watch?v=y1WnHpedi2A) they explain that models fail as soon as one slightly rephrases the test problems (indicating that they are indeed mostly reciting). I expect this to be the case with this model too.
[+] refulgentis|1 year ago|reply
There's a lot of bunko 3rd-time-rephrased "wisdom" about AI, and enough interest in it that if you have the right title, you can get away with repeating it.

I'm a bit surprised that an HN reader, who presumably has first-hand experience with them, isn't sure whether it's just a hashmap lookup.

[+] eightysixfour|1 year ago|reply
This is just not true. There are plenty of private tests with problems that are not in the training set. GPQA is an excellent example.
[+] beyondCritics|1 year ago|reply
It is obvious that all of these problems are still way too hard for it, although it sometimes has ideas. It flawlessly demonstrates how to simplify (2002^2002) mod 9. I recall that there was once a scandalous university exam for future math teachers in Germany which asked for tasks like that, and all failed the test. With Qwen2-Math at hand this might not have happened.
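For reference, the simplification the model reproduces goes through digit sums and the cycle of powers of 4 mod 9; a short sketch (my own, not the model's output):

```python
# 2002 ≡ 4 (mod 9), since its digit sum is 2 + 0 + 0 + 2 = 4.
assert 2002 % 9 == 4

# Powers of 4 mod 9 cycle with period 3: 4, 7, 1, 4, 7, 1, ...
assert [pow(4, e, 9) for e in (1, 2, 3)] == [4, 7, 1]

# 2002 ≡ 1 (mod 3), so 4^2002 ≡ 4^1 ≡ 4 (mod 9).
assert pow(2002, 2002, 9) == 4
```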
[+] next_xibalba|1 year ago|reply
Kind of surprised this was released in English first given it was produced by a Chinese group (Alibaba Cloud). I wonder why that is.
[+] m3kw9|1 year ago|reply
Maybe the training material is also mostly English.
[+] karmasimida|1 year ago|reply
I mean, I think 90% of AI researchers in China use English as the first choice for publication... so, not surprised?
[+] qrian|1 year ago|reply
The solution for IMO 2022 is barely a 1/7 solution. It just says "... might not satisfy the inequality for all y" without a proof. That was the point of the question.
[+] azinman2|1 year ago|reply
> This model mainly supports English. We will release bilingual (English and Chinese) math models soon

The irony.

[+] allanren|1 year ago|reply
Qwen2 has been quite good, but it still can't compare 9.9 and 9.11.
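One plausible source of this famous confusion (my speculation, illustrated below): as decimals, 9.9 > 9.11, but compared segment-wise like software version numbers, the order flips.

```python
# As real numbers: 9.9 is greater than 9.11.
assert 9.9 > 9.11

# Compared segment-wise like version strings ("9.11" = major 9, minor 11),
# the order flips -- which may be the pattern the model falls into.
assert (9, 11) > (9, 9)
```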