top | item 37405341

cheeko1234 | 2 years ago

It failed my basic benchmark that only GPT4 passes:

Q: I have a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?

Falcon180B: First, fill the 6-liter jug completely. Then, pour it into the 12-liter jug. Fill the 6-liter jug again, and pour it into the 12-liter jug until the 12-liter jug is full. At this point, there will be exactly 6 liters of water in the 6-liter jug, which you can then use.

Q: Using only a 6-liter jug, how can I get exactly 6 liters of water?

Falcon180B: You cannot get exactly 6 liters of water in a 6-liter jug because it would already be full. However, if you had another container that could hold at least 6 liters of water, you could follow these steps:

1. Fill the 6-liter jug with water.

2. Pour the water from the 6-liter jug into the other container.

3. Refill the 6-liter jug with water and pour it into the other container.

4. Repeat step 3 until you have poured 6 liters of water into the other container.

This way, you can get exactly 6 liters of water by combining the capacity of the 6-liter jug and the other container.

Roark66|2 years ago

Probably every single thread comparing GPT (3.5, or especially 4) needs a copy-paste caveat: it's not really fair to compare a "bare" model like Falcon (not even its instruct version) with the ChatGPT _service_. The service part is crucial, because it includes advanced features such as "secret sauce" output sampling (while on Hugging Face all models default to basic greedy sampling), a preprompt that the bare model lacks, possibly varied depending on subject detection, and so on. For a fair comparison we would need the exact same prompt and the exact same output sampling, and that's not going to happen with (not at all open) OpenAI.
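To make the greedy-vs-sampled distinction concrete, here's a minimal sketch of the difference. This is an assumed illustration, not OpenAI's or Hugging Face's actual code; the real ChatGPT sampling settings are not public, and `sampleTopP` just shows one common strategy (temperature plus nucleus/top-p sampling) applied to a single logit vector:

```javascript
// Turn raw logits into a probability distribution.
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - m));
  const s = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / s);
}

// Greedy decoding: always pick the single highest-scoring token.
function greedy(logits) {
  return logits.indexOf(Math.max(...logits));
}

// Temperature + top-p ("nucleus") sampling: keep the smallest set of
// tokens whose cumulative probability reaches p, then sample from it.
function sampleTopP(logits, { p = 0.9, temperature = 0.8, rand = Math.random } = {}) {
  const probs = softmax(logits.map(x => x / temperature));
  const order = probs.map((q, i) => [q, i]).sort((a, b) => b[0] - a[0]);
  const nucleus = [];
  let cum = 0;
  for (const [q, i] of order) {
    nucleus.push([q, i]);
    cum += q;
    if (cum >= p) break;
  }
  let r = rand() * cum;
  for (const [q, i] of nucleus) {
    r -= q;
    if (r <= 0) return i;
  }
  return nucleus[nucleus.length - 1][1];
}
```

With the same logits, `greedy` always returns the same token, while `sampleTopP` can return any token in the nucleus, which is part of why identical prompts can produce different (sometimes better) completions from a service.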

swyx|2 years ago

> "secret sauce" output sampling

are you referring to beam search? something else?
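For readers unfamiliar with the term: beam search is a decoding strategy that tracks several high-scoring partial sequences at once instead of committing to one token at a time. A toy sketch (assumed for illustration; `stepFn` is a stand-in for a real model's next-token scorer):

```javascript
// Toy beam search: keep the `width` highest-scoring partial sequences.
// `stepFn(seq)` returns [token, logProb] pairs for possible continuations.
function beamSearch(stepFn, start, width = 2, steps = 3) {
  let beams = [{ score: 0, seq: [start] }];
  for (let s = 0; s < steps; s++) {
    const candidates = [];
    for (const { score, seq } of beams) {
      for (const [tok, lp] of stepFn(seq)) {
        // Sequence score is the sum of token log-probabilities.
        candidates.push({ score: score + lp, seq: [...seq, tok] });
      }
    }
    candidates.sort((a, b) => b.score - a.score);
    beams = candidates.slice(0, width);
  }
  return beams;
}
```

Unlike greedy decoding, a locally worse token can survive in the beam if it leads to a better overall sequence.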

M4v3R|2 years ago

I just asked GPT-4 your first question and got this in response. It's still a correct result, but with unnecessary steps:

> Fill the 12-liter jug and pour water into the 6-liter jug until it's full. Empty the 6-liter jug and pour the remaining water from the 12-liter jug into it. You'll have 6 liters in the 6-liter jug.

ggsp|2 years ago

My exchange:

Me:

  I have a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?
ChatGPT:

  Using the 12-liter jug and the 6-liter jug, you can get exactly 6 liters of water in the following way:

  1. Fill the 12-liter jug to its maximum capacity.

  2. Pour the water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. This will leave you with 6 liters of water in the 12-liter jug.

  Now you have 6 liters of water in the 12-liter jug.
Me:

  Can you think of a simpler way to achieve this?
ChatGPT:

  Yes, a simpler way to achieve this is:

  1. Fill the 6-liter jug to its maximum capacity.

  That's it. You now have 6 liters of water in the 6-liter jug!
I wish I understood why prompting it a second time gets a better answer.

saberience|2 years ago

Isn't the right answer just fill the 6-liter jug? I don't get it. Is it supposed to be a trick question?

glitchc|2 years ago

What about the ketchup test? Ask it to tell you how many times the letter e appears in the word ketchup. Llama always tells me it's two.

aqme28|2 years ago

Spelling challenges are always going to be inherently difficult for a token-based LM. It doesn't actually "see" letters. It's not a good test for performance (unless this is actually the kind of question you're going to ask it regularly).

gsuuon|2 years ago

I've found it's more reliable to ask it to write some javascript that returns how many letters are in a word. Works even with Llama 7b with some nudging.
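The kind of helper the model is being asked to produce could look like this. This is a hypothetical reconstruction in the spirit of the comment above, not the commenter's actual prompt or the model's actual output:

```javascript
// Count how many times a given letter appears in a word
// (case-insensitive). Code sees characters directly, so it
// sidesteps the tokenization problem the LLM has.
function countLetter(word, letter) {
  return [...word.toLowerCase()].filter(ch => ch === letter.toLowerCase()).length;
}

console.log(countLetter("ketchup", "e")); // 1
```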

ttul|2 years ago

Falcon fails. GPT-3.5 also fails this test. GPT-4 gets it right. I suspect that GPT-4 is just large enough to have developed a concept of counting, whereas the others are not. Alternatively, it's possible that GPT-4 has memorized the answer from its more extensive training set.

neel8986|2 years ago

Bard can also give the correct result

saberience|2 years ago

Is this supposed to be a trick question? How can it be a good question for testing an AI if a human cannot understand it either?

I think if you ask this question on different websites (to humans) you will get many different and confused answers. So why bother asking an AI? I don't even know what the right answer is.

SkyPuncher|2 years ago

I don’t think this is a particularly useful benchmark.

It’s well known that LLMs are bad at math. Token-based weighting can’t properly account for numbers, which can vary wildly. Numbers are effectively wildcards in the LLM world.

ben_w|2 years ago

Surely this is a "didn't read the question properly" problem rather than a "didn't maths right" problem?

And that (understanding a natural language question) is the USP for LLMs.

TylerE|2 years ago

I don't buy it. In any common usage, "6 liter jug" means a jug capable of holding 6 liters, not one with a volume of 6 liters including the walls.

sp332|2 years ago

I don't understand your comment. Falcon said that it's impossible to measure 6 liters of water with a 6 liter jug.

ProjectArcturis|2 years ago

Surely the reason LLMs fail here is because this is an adaptation of a common word problem, except your version has been tweaked so that there is a trivial answer.

sp332|2 years ago

Yes, that's the point of the question. We want to know if it's actually doing some reasoning, or if it has just memorized an answer.

rmbyrro|2 years ago

This does not look like a good benchmark test for an LLM capability.

saberience|2 years ago

I, a human, have no idea how to answer this weird question, why do you suppose an AI would do better?

I can’t work out if it’s a joke question or a serious question.