It didn't use web search, but it certainly has some internal knowledge already. It's not a perfect needle-in-a-haystack problem, but Gemini Flash was much worse when I tested it last time.
Being that it has the books memorized (huh, just learned another US/UK spelling quirk), I would suppose feeding it the books with altered spells would get you a confused mishmash of data in the context and data in the weights.
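For anyone who wants to try the altered-spells idea, here is a rough sketch of how it could be set up. The replacement names and both helper functions are invented for illustration, and scoring by simple substring matching is an assumption, not a validated method.

```python
import re

# Hypothetical mapping from canonical spell names to invented replacements.
SPELL_SWAPS = {
    "Expelliarmus": "Vorpentis",
    "Lumos": "Candrelle",
}


def alter_spells(text, swaps=SPELL_SWAPS):
    """Replace each whole-word canonical spell name with its invented counterpart."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, swaps)) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(1)], text)


def classify_answer(answer, swaps=SPELL_SWAPS):
    """Count how many altered names (context) and canonical names (weights) appear."""
    from_context = sum(1 for new in swaps.values() if new in answer)
    from_weights = sum(1 for old in swaps if old in answer)
    return {"from_context": from_context, "from_weights": from_weights}
```

If the model's answer scores high on `from_weights`, it is reciting what it memorized rather than reading the altered text you gave it.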
This underestimates how much of the Internet is actually compressed into and is an integral part of the model's weights. Gemini 2.5 can recite the first Harry Potter book verbatim for over 75% of the book.
I'm not sure what your knowledge level of the inner workings of LLMs is, but a model doesn't need search or even an internet connection to "know" the information if it's in its training dataset. In your example, it's almost guaranteed that the LLM isn't searching books - it's just referencing one of the hundreds of lists of those spells in its training data.
This is the LLM's magic trick that has everyone fooled into thinking they're intelligent - it can very convincingly cosplay an intelligent being by parroting an intelligent being's output. This is equivalent to making a recording of Elvis, playing it back, and believing that Elvis is actually alive inside of the playback device. And let's face it, if a time traveler brought a modern music playback device back hundreds of years and showed it to everyone, they WOULD think that. Why? Because they have not become accustomed to the technology and have no concept of how it could work. The same is true of LLMs - the technology was thrust on society so quickly that there was no time for people to adjust and understand its inner workings, so most people think it's actually doing something akin to intelligence. The truth is it's just as far from intelligence as your music playback device is from having Elvis inside of it.
>The truth is it's just as far from intelligence as your music playback device is from having Elvis inside of it.
A music playback device's purpose is to allow you to hear Elvis' voice. A good device does it well: you hear Elvis' voice (maybe with some imperfections). Whether a real Elvis is inside of it or not doesn't matter - its purpose is fulfilled regardless. By your analogy, an LLM simply reproduces what an intelligent person would say on the matter. If it does its job more or less, it doesn't matter either whether it's "truly intelligent" or not; its output is already useful. I think it's completely irrelevant in both cases to the question "how well does it do X?" If you think about it, 95% of what we know we learned from school/environment/parents. We didn't discover it ourselves via some kind of scientific method; we mostly just parrot what other intelligent people said before us. Maybe human "intelligence" itself is 95% parroting/basic pattern matching from training data? (18 years of training during childhood!)
Do the same experiment in the Claude web UI. And explicitly turn web searches off. It got almost all of them for me over a couple of prompts. That stuff is already in its training data.
The only worthwhile version of this test involves previously unseen data that could not have been in the training set. Otherwise the results could be inaccurate to the point of being harmful.
> But for sure it has some internal knowledge already.
Pretty sure the books had to be included in its training material in full text. It's one of the most popular book series ever created, of course they would train on it. So "some" is an understatement in this case.
Honestly? My advice would be to cook something custom up! You don't need to do all the text yourself. Maybe have AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here or there.
Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes sentences, and outputs them in a random order with the secrets. Kind of like a "Where's Waldo?", but for text.
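Something like this would probably do as a starting point. It's only a sketch: the function names, the naive regex sentence splitter, and the substring-based scoring are all assumptions for illustration, and a real test would want a better splitter and fuzzier matching.

```python
import random
import re


def hide_secrets(text, secrets, seed=0):
    """Shuffle the sentences of `text` and insert each secret at a random position."""
    rng = random.Random(seed)  # seeded so the same haystack can be regenerated
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    rng.shuffle(sentences)
    for secret in secrets:
        # randint is inclusive, so the secret can also land at the very end.
        sentences.insert(rng.randint(0, len(sentences)), secret)
    return " ".join(sentences)


def recall(answer, secrets):
    """Fraction of secrets that appear verbatim (case-insensitive) in the answer."""
    if not secrets:
        return 0.0
    found = sum(1 for s in secrets if s.lower() in answer.lower())
    return found / len(secrets)
```

Feed the output of `hide_secrets` to each model, ask it to list the hidden phrases, and compare the `recall` scores. Using text the model can't have seen keeps memorization out of the picture.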
Just a few casual thoughts.
I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, but some of the recent work I've done has really shown huge weak points in every model I've run them on.
Having AI spew it might suffer from the fact that the spew itself is influenced by the AI's weights. I think your best bet would be to use a new human-authored work that was released after the model's training cutoff.
viraptor|24 days ago
Otherwise, LLMs have most of the books memorised anyway: https://arstechnica.com/features/2025/06/study-metas-llama-3...
jazzyjackson|23 days ago
ribosometronome|24 days ago
joshmlewis|24 days ago
unknown|24 days ago
[deleted]
obirunda|24 days ago
NiloCK|24 days ago
IAmGraydon|23 days ago
kgeist|23 days ago
Trasmatta|24 days ago
soulofmischief|24 days ago
altmanaltman|23 days ago
eek2121|24 days ago
clhodapp|24 days ago