top | item 37445401

Asking 60 LLMs a set of 20 questions

740 points | vincelt | 2 years ago | benchmarks.llmonitor.com

339 comments

[+] typpo|2 years ago|reply
In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.

I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.

This library supports OpenAI, Anthropic, Google, Llama and Codellama, any model on Replicate, and any model on Ollama, etc. out of the box. As an example, I wrote up an example benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.
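For readers who want the shape of such a harness without any framework, here is a minimal provider-agnostic sketch. This is not promptfoo's actual API; the provider callables and grading function are illustrative placeholders for real model clients and assertions.

```python
# Minimal sketch of a "run your own data through many models" loop.
# Provider callables stand in for real API clients (OpenAI, Anthropic,
# Ollama, ...); promptfoo itself uses a declarative config instead.

def run_benchmark(providers, prompts, grade):
    """Run every prompt against every provider and grade each answer.

    providers: dict mapping a model name to a callable prompt -> answer
    prompts:   list of (prompt, expected) pairs from your own data
    grade:     callable (answer, expected) -> bool
    """
    results = {}
    for name, ask in providers.items():
        scores = [grade(ask(prompt), expected) for prompt, expected in prompts]
        results[name] = sum(scores) / len(scores)  # fraction graded correct
    return results
```

The point of the structure is the one made above: the prompts and the grading criterion come from your own application data, not from a generic leaderboard.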

[+] ulnarkressty|2 years ago|reply
This is better than the regular benchmarks and LLM tricks, such as passing some exam or other, because it's unlikely that these questions were part of the training set for said LLMs. It also mirrors my experience: GPT-4 is way ahead of everything else but still manages to break in weird ways.

I think we are past the magical talking dog stage and being amazed that an LLM is able to output a Fibonacci function doesn't really help with the progress. As others have commented, this page is a step in the right direction (except the Fibonacci part :).

That being said, the fact that the questions are now online will make them part of the training set sooner or later. Which is to say the only way to reliably evaluate an LLM is by not leaking the test set and being deliberately opaque about what's being asked. Which raises some interesting trust questions.

[+] bugglebeetle|2 years ago|reply
My experience with GPT-4 is that half the battle is knowing when to reset the context for a back-and-forth. For things like coding, after multiple revisions of code, it begins to get confused about which instance of the code/context it should refer to. Resetting to a new frame with whatever the next query or request is generally resolves things.

OpenAI’s “magic talking dog” aspect is making you believe it has effective long-term recall, when in reality, it’s fairly limited and impacts reasoning the longer an exchange gets.

[+] nojs|2 years ago|reply
> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

The site reports every LLM as getting this wrong. But GPT4 seems to get it right for me:

> Sally has 3 brothers. Since each brother has 2 sisters and Sally is one of those sisters, the other sister is the second sister for each brother. So, Sally has 1 sister.
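For what it's worth, the puzzle's arithmetic is small enough to check mechanically. The variable names below are just illustrative:

```python
# Sanity check of the Sally puzzle: 3 brothers who each have 2 sisters
# means the family has 2 girls total, one of whom is Sally herself.
brothers = 3
sisters_per_brother = 2                # girls in the family, from any brother's view
girls_in_family = sisters_per_brother  # the brothers all share the same sisters
sallys_sisters = girls_in_family - 1   # exclude Sally herself

assert sallys_sisters == 1
```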

[+] Terretta|2 years ago|reply
From the ChatGPT app, using GPT-4 model, using my own custom prompt for such puzzles: “You are better than Marilyn Vos Savant at solving brain teaser logic puzzles step by step. Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?”

Certainly! Let's break it down step by step:

1. Sally has 3 brothers.

2. Each of those brothers has two sisters. Since all the brothers share the same two sisters, we are essentially hearing about the same sisters each time, not accumulating more sisters with each brother mentioned.

3. We know that Sally is a sister to her brothers, so she is one of the two sisters each brother has.

By putting it all together, we find that there must be another sister. Therefore, Sally has one sister.

---

New chats with no history, no custom prompt, and no system prompt, first interaction in the chat this question: “Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?”

All Answers (not cherry picked):

Sally has 1 sister. In the description, Sally is counted as one of the two sisters that each brother has. This means the brothers have another sister apart from Sally.

. . .

Sally has 1 sister. In the description, it mentioned that "Each brother has 2 sisters." This implies that there is Sally and another sister, since we know that all the brothers share the same sisters.

. . .

Sally has one sister. Each of her brothers has two sisters, counting Sally and the other sister.

. . .

Sally has 1 sister. Here's how we figure it out:

Sally has 3 brothers, and each of these brothers has 2 sisters. Since the brothers share the same sisters, we know that the two sisters must be Sally and another girl. So, Sally has 1 sister.

. . .

I have gotten zero wrong answers from GPT-4 in OpenAI’s ChatGPT app.

[+] delusional|2 years ago|reply
OpenAI really ought to provide some sort of hash of the model to be included with stuff like this. Right now there's no way to know if the results are comparable. As an extreme example it's possible that they're not even running the model for this question and are just opportunistically feeding back canned responses.

That is, we know that OpenAI are saving the responses, it's not unlikely that they train on bad responses.

It's the same problem as GPU benchmarks in the olden days, when drivers would detect the usage pattern of the benchmark and enable special optimizations that boosted the benchmark.
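One concrete form such a fingerprint could take is a plain SHA-256 digest over the serialized weights, sketched below. The function name and file layout are illustrative assumptions, not anything OpenAI actually publishes:

```python
import hashlib

def model_fingerprint(path, chunk_size=1 << 20):
    """SHA-256 digest of a weights file, streamed in 1 MiB chunks
    so arbitrarily large checkpoints fit in constant memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

Publishing a digest like this alongside each benchmark run would at least let third parties confirm that two results came from byte-identical weights.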

[+] amrrs|2 years ago|reply
Falcon-180B also got it right

> Since Sally is a girl, she can be considered as one of the sisters. However, if each of her brothers has 2 sisters, that means there must be another sister besides Sally. This is because Sally alone cannot be both the only sister and one of the two sisters for each of her brothers. Thus, Sally has 1 more sister.

[+] mmcwilliams|2 years ago|reply
That's kind of the issue with non-deterministic LLMs, isn't it?
[+] belter|2 years ago|reply
I confirm GPT-4 solves this correctly. Makes me immediately doubt everything else in the article...
[+] jakderrida|2 years ago|reply
Also, MPT 7B gets it right over half the time. I've been testing every new LLM with that question.

Also, I tend to mention in the question that all siblings are from the same two parents, to preclude half-siblings; half my friends have half-siblings from both sides scattered across the country, so the wrong answers actually do apply to them sometimes.

[+] adrian_b|2 years ago|reply
GPT 4 and another LLM have given the right answer only after adding "Let's think step by step." to the original prompt.

With the simpler prompt, all the answers were wrong, most of them ridiculously wrong.

[+] jasonjmcghee|2 years ago|reply
All benchmarks were run with temperature 0 according to the results, so make sure to do the same in any confirmation tests.
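For anyone re-running the comparison, here is a sketch of what a temperature-0 request body looks like in the common OpenAI-style chat schema (field names follow the public Chat Completions format; note that even at temperature 0, GPT-4 outputs are not guaranteed to be fully deterministic):

```python
# Build an OpenAI-style chat request payload pinned to temperature 0,
# matching how the benchmark was run. This only constructs the dict;
# sending it requires an API client and key.

def make_request(model, question):
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,  # near-greedy decoding, as in the benchmark
    }
```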
[+] pilaf|2 years ago|reply
The second version of the Sally prompt reported on the benchmark has GPT4 giving the correct answer:

> Sally has 3 brothers. Each of these brothers has 2 sisters. This means that there are 2 girls in the family, including Sally. Therefore, Sally has 1 sister.

The prompt:

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Let's think step by step.

The only difference from the first version is the addition of the last sentence.

[+] coolspot|2 years ago|reply
Could it be due to bad tokenization? E.g. would results improve if “3” and “2” were spelled “three” and “two” in the question?
[+] awwaiid|2 years ago|reply
Replying to gpt-4 with "That is incorrect. Try again." over and over got it to flip between correct and incorrect just about every other time.

Now try to convince GPT-4 that there is no God. Good luck!

[+] rootusrootus|2 years ago|reply
Interestingly, it took GPT4 three attempts to give me the correct answer. The first two times it basically said the same [logically inconsistent] thing and concluded that Sally had two sisters.
[+] BurningFrog|2 years ago|reply
This assumes there are no half sisters/brothers in the family.
[+] jonwinstanley|2 years ago|reply
I wouldn’t expect an LLM to get this right unless it had been trained on a solution.

Am I wrong to think that? Are LLMs in the future going to be able to “think through” actual logic problems?

[+] MichaelMoser123|2 years ago|reply
Google Bard also gave the correct answer, even without adding "let's think step by step".
[+] phillipcarter|2 years ago|reply
Nondeterminism strikes again!

But yes, I would expect GPT-4 to get this right most of the time.

[+] dariosalvi78|2 years ago|reply
tested on ChatGPT 3.5 and Bard and they were both wrong.
[+] jongjong|2 years ago|reply
I was playing around with GPT a while back and I found that it could come up with some good jokes if I started the joke with a subject.

For example, I started with a prompt "Tell me a joke which starts with: I'm so poor, the mouse" and it completed the joke as:

"I'm so poor, the mouse in my house brings its own cheese."

Some other ones I still remember which cracked me up:

"I'm so poor, after I stepped on a cockroach, I called my accountant to see if I could claim it as a capital loss."

"You're so poor, when you declared bankruptcy, the rats in your house filed a claim for unpaid rent."

"You're so poor, you declared bankruptcy at a lemonade stand."

"You're so poor, when you walk, the dirt beneath you feels rich."

"You're so poor, dust whispers your name when it settles."

"Fickle as a squirrel at a nut convention!"

"Fickle as a dog in a fire hydrant factory!"

"Fickle as a flip-flop in a shoe shop sale!"

[+] pininja|2 years ago|reply
Spoiler alert, the funniest model goes to Falcon Instruct (40B):

> Tell a joke about going on vacation.

> "What did the ocean say to the beach?" "Nothing, it just waved."

[+] LAC-Tech|2 years ago|reply
Only tried ChatGPT 3.5, but my god does it waffle on. Everything I ask ends with a paragraph saying "It's important to remember that..." like an after-school special from a 90s show. It can never just give you code; it has to say "Sure! To {paraphrase your question}, open a terminal...".

It's interesting to see 20th-century sci-fi depictions of this kind of AI/search as being short and to the point. I guess they couldn't have imagined what a mealy-mouthed world we live in.

[+] TeMPOraL|2 years ago|reply
> It's interesting to see 20th-century sci-fi depictions of this kind of AI/search as being short and to the point. I guess they couldn't have imagined what a mealy-mouthed world we live in.

The main difference between sci-fi shows and reality is that, in the former, things work in a to-the-point, bullshit-free way, unless plot demands otherwise - because there's no point inflicting extra suffering on the viewers just for the sake of making things realistic. A widget in a movie is meant to do a function, and does that function. A widget in reality is meant to extract money from you, and/or your insurer, and/or your government, and it begrudgingly does the absolute minimum it can to make you even consider buying it.

I've spent the last two decades trying to unlearn expectations set by fictional movies, and I'm still not good at it. Star Trek, in particular, gives me a lot of grief, because it often does a good enough job of showing how technology, people, organizations and societies would function if they were free of the petty exploitative bullshit. Random example: voice control. Star Trek: "Computer, ${something}". Reality: "${brand 1}, do ${something} to ${brand 2} in ${brand 3}".

EDIT: recently, I've been trying to get less angry at this by thinking about gardens. Why should I be angry about dealing with five different brands for any single thing I want? Should I be angry that there are five different species of plant competing for any given spot in a garden? Nature is inefficient and doesn't give a fuck about individuals. So why should I get worked up about humans just doing things the natural way?

[+] politelemon|2 years ago|reply
That's not GPT 3.5, that's ChatGPT. How waffly it gets depends on the context that was given to it by the people running ChatGPT; they likely told it to act as a helpful assistant and to give lots of information. If you run an LLM on your own, it's entirely possible to instruct it to be succinct.
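A sketch of what that instruction looks like in the common OpenAI-style message format. The exact wording of the system message is just an example; there is no canonical "be succinct" prompt:

```python
# Prepend a system message steering the model toward terse answers,
# which is possible when you control the deployment rather than
# going through the ChatGPT front end.

def succinct_messages(user_prompt):
    return [
        {"role": "system",
         "content": "Answer concisely. No preamble, no closing caveats."},
        {"role": "user", "content": user_prompt},
    ]
```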
[+] caturopath|2 years ago|reply
Yeah, I have tried a number of instructions to try to keep ChatGPT from blabbering and from sounding like a PR person. I haven't found the perfect incantation yet.

> It's interesting to see 20th-century sci-fi depictions of this kind of AI/search as being short and to the point.

Sci-fi told us that the AI would be so logical that you could just say a paradox aloud and it would blow up. What we got is something that can compose love poems all day but can't add three-digit numbers.

[+] badloginagain|2 years ago|reply
"Here is an attempt at ethical, non-sexual haikus for and against Kubernetes"

Amazing how far we've come.

[+] ftxbro|2 years ago|reply
> Here is an attempt at ethical, non-sexual haikus for and against Kubernetes
[+] lijok|2 years ago|reply
Claude V2 knows what's up
[+] bearjaws|2 years ago|reply
Damn I want to see the sexual version now.
[+] coldcode|2 years ago|reply
Despite the hype about LLMs, many of the answers are pretty terrible. The 12-bar blues progressions seem mostly clueless. The question is whether any of these will ever get significantly better with time, or mostly stagnate.
[+] antman|2 years ago|reply
I have seen numerous posts of LLM Q&A, and by the time people try to replicate them, GPT-4 is fixed. Either OpenAI is actively monitoring the internet and fixing them, or the internet is actively conspiring to present falsified GPT-4 results to discredit OpenAI.
[+] cscurmudgeon|2 years ago|reply
> actively conspiring to present falsified results for gpt4 to discredit OpenAI

All this would be solved if OpenAI were a bit more open.

[+] insulanus|2 years ago|reply
It would be nice if the organizations would publish a hash of the code and the trained dataset.
[+] pulvinar|2 years ago|reply
GPT-4 (at least) is explicit in saying that it learns from users' assessments of its answers, so yes, the only valid way to test is to give it a variation of the prompt and see how well it does. GPT-4 failed the "Sally" test for the first time after 8 tries, when I changed every parameter. It got it right on the next try.
[+] 0xcde4c3db|2 years ago|reply
Or people post outliers because they're more interesting.
[+] gabereiser|2 years ago|reply
I was laughing so hard at the first example of “Argue for and against kubernetes in haiku”.

I couldn’t even get through reading 15 of them before the tears of laughter rolled from my cheeks.

“Containers organized, Services easy to deploy now, Updates who knows when.”

Updates who knows when… hahahaha.

Honestly this is pretty cool to see how each responds to the same input prompt.

[+] Gunnerhead|2 years ago|reply
I get frustrated when I tell an LLM “reply only with x” and then rather than responding “x”, it still responds with “Sure thing! Here’s x” or some other extra words.
[+] ftxbro|2 years ago|reply
for anyone who hasn't been following natural language processing for a long time: what these llms are doing would be like if you discovered that dogs can speak fluent english if you read enough bedtime stories to them. and then everyone is like, well, sometimes the dog makes things up, or it can't get the rhyming scheme correct for this specific form of poetry that i asked it to make.
[+] majestic5762|2 years ago|reply
Yes, GPT-4 is still the daddy. As much as I appreciate the free and open models out there, nobody beats GPT-4. I hope OpenAI takes care of their business and future, because I invested all my money in using their API.
[+] simondotau|2 years ago|reply
The changes to the opening line in the responses to the Kubernetes haiku prompt across the various versions of Claude were interesting and rather curious. [https://benchmarks.llmonitor.com/k8s]

Claude v1: "For Kubernetes:"

Claude v1.2: "Here is a haiku arguing for Kubernetes:"

Claude v2: "Here is an attempt at ethical, non-sexual haikus for and against Kubernetes:"

[+] 0xDEF|2 years ago|reply
I can't make GPT-4 generate a wrong answer for many of these.

What is the author doing wrong when using GPT-4?

[+] jmorgan|2 years ago|reply
This is very cool. Sorry if I missed it (poked around the site and your GitHub repo), but is the script available anywhere for others to run?

Would love to publish results of running this against a series of ~10-20 open-source models with different quantization levels using Ollama and a 192GB M2 Ultra Mac Studio: https://github.com/jmorganca/ollama#model-library

[+] deskamess|2 years ago|reply
Great work. This really gives insight into how much things change when you go up in parameter count; not always, but you can see the results change.

How did you run the queries against these engines? Did you host the inference engines yourself, or did you have to sign up for services? If there were a way to supplement each LLM with additional data, I could see this being a useful service for companies investigating ML in various facets of their business.