top | item 43788069


mobilejdral | 10 months ago

I have several complex genetics problems that I give to LLMs to see how well they do. They have to reason through them to solve them. Last September they started getting close, and November was the first time an LLM was able to solve one. These are not problems that can be solved in one shot, but (so far) require long reasoning. Not sharing them, because yeah, this is something I keep off the internet as it is too good of a test.

But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.


tlb|10 months ago

There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/.

Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.

TZubiri|10 months ago

Recursive challenges are probably the ones where the difficulty isn't representative of real challenges.

Could you answer a question of the type "what would you answer if I asked you this question?"

What I'm getting at is that you might find questions that are impossible to resolve.

That said, if the only unanswerable questions you can find are recursive, that's a signal the AI is smarter than you?

mopierotti|10 months ago

The recursive one that I have actually been really liking recently, and I think is a real enough challenge is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'".

I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read.

latentsea|10 months ago

> what would you answer if I asked you this question?

I don't know.

namaria|10 months ago

If you have been giving the LLMs these problems, there is a non-zero chance that they have already been used in training.

rovr138|10 months ago

This depends heavily on how you use them and how you have things configured: whether you're using the API or the web UI, and which plan you're on. Training on your data is disabled by default for Team and Enterprise plans, and can be disabled on personal plans.

Here's openai and anthropic,

https://help.openai.com/en/articles/5722486-how-your-data-is...

https://privacy.anthropic.com/en/articles/10023580-is-my-dat...

https://privacy.anthropic.com/en/articles/7996868-is-my-data...

and obviously, that doesn't include self-hosted models.

golergka|10 months ago

What area is this problem from? What areas in general did you find useful for creating such benchmarks?

Maybe instead of sharing (and leaking) these prompts, we could share methods for creating them.

mobilejdral|10 months ago

Think questions where there is a ton of existing medical research, but no clear answer yet. There are a dozen Alzheimer's questions you could ask, for example, that would require it to pull a half dozen contradictory sources into a plausible hypothesis. If you have studied Alzheimer's extensively, it is trivial to evaluate the responses. One question about Alzheimer's is one of my go-to questions. I am testing its ability to reason.

henryway|10 months ago

Can God create something so heavy that he can’t lift it?