top | item 46905735

ck_one | 24 days ago

Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.

All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 books (~733K tokens).

Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).

Freaking impressive!

grey-area|23 days ago

Surely the corpus Opus 4.6 ingested would include whatever reference you used to check the spells were there. I mean, there are probably dozens of pages on the internet like this:

https://www.wizardemporium.com/blog/complete-list-of-harry-p...

Why is this impressive?

Do you think it's actually ingesting the books and only using those as a reference? Is that how LLMs work at all? It seems more likely it's predicting these spell names from all the other references it has found on the internet, including lists of spells.

sigmoid10|23 days ago

Most people still don't realize that general public world knowledge is not really a test for a model that was trained on general public world knowledge. I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data, despite what publishers and authors may think of that.

As a matter of fact, with all the special deals these companies make with publishers, it is getting harder and harder for normal users to come up with validation data that only they have seen. At least for human-written text, this kind of data is more or less reserved for specialist industries and higher academia by now. If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.

MarcellusDrum|23 days ago

So a good test would be replacing the spell names in the books with made-up ones. If a "real" spell name still showed up in the answer, that would also reveal whether it "cheated".

vercaemert|23 days ago

It's impressive, even if the books and the posts you're talking about were both key parts of the training data.

There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).

This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.

(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)

fastasucan|22 days ago

Since it got 49 of 50 right, it's worse than what you would get from a simple Google search. People would immediately disregard a conventional source that only listed 49 out of 50.

ehatr|23 days ago

The poster you're replying to works in AI. The marketing strategy is to always have a cute pelican or Harry Potter comment as the top comment, for positive associations.

The poster knows all of that, this is plain marketing.

rlt|23 days ago

They should try the same thing but replace the original spell names with something else.

zaphirplane|23 days ago

Why don't you ask it and find out ;)

hereonout2|23 days ago

I was playing about with ChatGPT the other day, uploading screenshots of sheet music and asking it to convert them to ABC notation so I could make a MIDI file.

The results seemed impressive until I noticed some of the "Thinking" statements in the UI.

One made it apparent that the model/agent/whatever had read the title from the screenshot and was off searching for existing ABC transcripts of the piece, "Ode to Joy".

So the whole thing was far less impressive after that: it wasn't reading the score at all, just reading the title and using the internet to answer my query.

nobodywillobsrv|23 days ago

Yes, I have found that Grok, for example, suddenly becomes quite sane when you tell it to stop querying the internet and just rethink the conversation data and answer the question.

It's weird, it's like many agents are now in a phase of constantly getting more information and never just thinking with what they've got.

anomaly_|23 days ago

Sounds pretty human-like! Always searching for a shortcut.

kouunji|23 days ago

For structured outputs like that wouldn’t it be better to get the LLM to create a script to repeatably make the translation?

xiomrze|24 days ago

Honest question, how do you know if it's pulling from context vs from memory?

If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.

ozim|24 days ago

Exactly. There was a study where they tried to make an LLM reproduce an HP book word for word, giving it the first sentences and letting it cook.

Basically, with some tricks they managed to get 99% of it word for word. The tricks were needed to bypass safety measures that are in place for exactly this reason: to stop people from retrieving training material.

petercooper|24 days ago

One possible trick could be to search and replace them all with nonsense alternatives, then see if it extracts those.

ck_one|24 days ago

When I tried it without web search (so only internal knowledge), it missed ~15 spells.

golfer|24 days ago

There are lots of websites that list the spells; it's well documented. Could Claude simply be regurgitating knowledge from the web? Example:

https://harrypotter.fandom.com/wiki/List_of_spells

ck_one|24 days ago

It didn't use web search, but for sure it already has some internal knowledge. It's not a perfect needle-in-a-haystack problem, but Gemini Flash was much worse when I tested it last time.

qwertytyyuu|23 days ago

Hmm… maybe he could switch out all the spell names for slightly different ones and see how that goes.

meroes|24 days ago

What is this supposed to show, exactly? Those books have been fed into LLMs for years, and there's likely even specific RLHF on extracting spells from HP.

muzani|24 days ago

There was a time when I put the EA-Nasir text into base64 and asked an AI to convert it. Remarkably, it identified the correct text, but it pulled the most popular translation rather than the one I gave it.
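The mechanics of that test are trivial to reproduce. A minimal sketch (the passage here is an illustrative paraphrase, not the actual tablet translation):

```python
# Base64-encode a passage and ask a model to decode it. If it returns the
# canonical published translation instead of your exact wording, it is
# pattern-matching to memorized text rather than actually decoding.
import base64

# Illustrative paraphrase, not the actual tablet translation.
passage = "Tell Ea-nasir: the copper ingots you delivered were not of good quality."

encoded = base64.b64encode(passage.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)             # what you would paste into the chat
assert decoded == passage  # base64 itself round-trips losslessly
```

If the model's "decode" differs from your original wording in specific word choices, it was recalling a memorized translation.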

rvz|24 days ago

> What is this supposed to show exactly?

Nothing.

You can be sure that this was already in the training data of PDFs, books, and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like the one the OP just ran are meaningless.

Such "benchmarks" are performative, aimed at VCs, who never ask why the research and testing isn't done independently but is almost always done by the companies' own in-house researchers.

matt_lo|23 days ago

Use AI to rewrite all the spells from all the books, then see if AI can detect the rewritten ones. This will ensure it's not pulling from its training data set.

gbalduzzi|23 days ago

Neat idea, but why should I use AI for a find and replace?

It feels like shooting a fly with a bazooka
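For what it's worth, the plain find-and-replace version is a few lines. A sketch, with an illustrative (not complete) spell list and made-up replacements:

```python
# Replace every known spell name with a nonsense alternative before handing
# the text to a model. The spell list and replacements here are illustrative.
import re

SPELL_MAP = {
    "Expelliarmus": "Florbus Maxima",
    "Wingardium Leviosa": "Gravitas Nullo",
    "Expecto Patronum": "Lumen Custodis",
}

def replace_spells(text: str, mapping: dict) -> str:
    for original, fake in mapping.items():
        # \b word boundaries keep "Expelliarmus!" matching without clipping words
        text = re.sub(rf"\b{re.escape(original)}\b", fake, text, flags=re.IGNORECASE)
    return text

sample = 'Harry bellowed "Expelliarmus!" and the wand flew away.'
print(replace_spells(sample, SPELL_MAP))
# Harry bellowed "Florbus Maxima!" and the wand flew away.
```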

LeoPanthera|23 days ago

That won't help. The AI replacing them will probably miss the same ones as the AI finding them.

zamadatix|24 days ago

To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:

> The smug look on Malfoy’s face flickered.

> “No one asked your opinion, you filthy little Mudblood,” he spat.

> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.

> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.

> “Ron! Ron! Are you all right?” squealed Hermione.

> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.

sobjornstad|24 days ago

I have a vague recollection that it might come up named as such in Half-Blood Prince, written in Snape's old potions textbook?

In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.

ck_one|24 days ago

Then it's fair that it didn't find it.

guluarte|24 days ago

You can get the same result by just asking Opus/GPT; it's probably internalized knowledge from Reddit or similar sites.

ck_one|24 days ago

If you just ask it, you don't get the same result. Around 13 spells were missing when I just prompted Opus 4.6 without the books as context.

irishcoffee|24 days ago

The top comment is about finding basterized latin words from childrens books. The future is here.

Geste|24 days ago

I'll have some of that coffee too. It's quite a sad time we're living in, when this counts as a proper use of our limited resources.

mhink|23 days ago

> basterized

And yet, it's still somewhat better than the Hacker News comment using bastardized English words.

kmacdough|23 days ago

What are we testing here?

It feels like a very odd test, because it's such an unreasonable way to answer this question with an LLM. Nothing about the task requires more than a very localized understanding. It's not like a codebase or corporate documentation, where there's a lot of interconnectedness and context that matters. It also doesn't seem to probe the gap between human and AI intelligence.

Why are people excited? What am I missing?

dwa3592|24 days ago

Have another LLM (Gemini, ChatGPT) make up 50 new spells, insert those, test, and maybe report back here :)

bartman|24 days ago

Have you by any chance tried this with GPT 4.1 too (also 1M context)?

dom96|24 days ago

I often wonder how much of the Harry Potter books were used in the training. How long before some LLM is able to regurgitate full HP books without access to the internet?

big-chungus4|20 days ago

Now edit the books and replace all spell names with different ones, and try again

matt-p|22 days ago

Now try it without giving it the books as context. It probably already knows all 49.

ActionHank|23 days ago

The books were likely in the training data; I don't know that it's that impressive.

SebastianSosa|23 days ago

Now, thanks to this post (and the infra providers' inclination to appeal to Hacker News), we will never know whether the model actually discovered the 50 spells or memorized them, since it will be trained on this. :( But what can you do; this is interesting.

TheRealPomax|24 days ago

That doesn't seem like a super useful test for a model that's optimized for programming?

kylehotchkiss|23 days ago

I love the fun metric.

My hope is that locally run models can pass this test in the next year or two!

LanceJones|24 days ago

Assuming this experiment involved isolating the LLM from its training set?

grey-area|23 days ago

Of course it didn't. I'm not sure you really can do that: LLMs are a collection of weights derived from the training set; take away the training set and they don't really exist. You'd have to train one from scratch, somehow excluding these books and all excerpts and articles about them, which would be very expensive, and I'm pretty sure the OP didn't do that.

So this seems like a nonsensical test to me.

siwatanejo|24 days ago

> All 7 books come to ~1.75M tokens

How do you know? Each word is one token?

koakuma-chan|24 days ago

You can download the books and run them through a tokenizer. I did that half a year ago and got ~2M.
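A stdlib-only sanity check of the order of magnitude (the ~4 chars/token ratio is a rough heuristic for English prose under BPE tokenizers; for exact counts you'd run the files through the vendor's actual tokenizer):

```python
# Back-of-envelope token estimate: English prose averages roughly 4 characters
# per token under common BPE tokenizers. The series is often cited at about
# 1.08M words, or roughly 6M characters, so ~1.5M tokens lands in the same
# ballpark as the ~1.75M and ~2M figures quoted in this thread.
def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    return int(len(text) / chars_per_token)

approx_series_chars = 6_000_000  # rough character count for all seven books
print(rough_token_count("x" * approx_series_chars))  # ~1.5M
```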

psychoslave|23 days ago

Ah, and no one has thrown TOAC at it yet?

polynomial|23 days ago

You need to publish this tbh

dr_dshiv|23 days ago

Comparison to another model?

dudewhocodes|23 days ago

There are websites with the spells listed... which makes this a search problem. Why is an LLM used here?

bilekas|23 days ago

It's just a benchmark exercise.

hansmayer|23 days ago

> Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.

Clearly a very useful, grounded and helpful everyday use case of LLMs. I guess in the absence of real-world use cases, we'll have to do AI boosting with such "impressive" feats.

Btw, a well-crafted regex could have achieved the same (pointless) result with ~0.0000005% of the resources the LLM machine used.
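To make the regex claim concrete, a crude heuristic would be matching capitalized pseudo-Latin phrases shouted in dialogue, along these lines:

```python
# A crude pass at the "well crafted regex" approach: incantations in the books
# are typically one or two capitalized words ending in "!" inside quotes.
# Note the false positive ("Run"): getting to 49/50 would still need a curated
# spell whitelist, which is rather the point of the debate in this thread.
import re

spell_pattern = re.compile(r'"([A-Z][a-z]+(?: [A-Z][a-z]+)?)!"')

sample = (
    'Harry raised his wand. "Expelliarmus!" he shouted. '
    '"Wingardium Leviosa!" said Hermione. "Run!" yelled Ron.'
)

print(sorted(set(spell_pattern.findall(sample))))
# ['Expelliarmus', 'Run', 'Wingardium Leviosa']
```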