ck_one|24 days ago
All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 books (~733K tokens).
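(For anyone sanity-checking those numbers, here's a rough back-of-the-envelope estimate. It assumes the common ~3.5 characters-per-token rule of thumb for English prose and a ~6.1M-character figure for the full series; both are approximations, since Anthropic's actual tokenizer isn't public.)

```python
# Rough token estimate for a corpus, assuming ~3.5 chars/token
# for English prose. The real count depends on the model's
# tokenizer, which Anthropic doesn't publish.
def estimate_tokens(text: str, chars_per_token: float = 3.5) -> int:
    return round(len(text) / chars_per_token)

# The seven-book series is very roughly 6.1M characters, which at
# 3.5 chars/token lands in the same ballpark as the ~1.75M figure:
print(estimate_tokens("x" * 6_100_000))  # ~1.74M
```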
Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).
Freaking impressive!
grey-area|23 days ago
https://www.wizardemporium.com/blog/complete-list-of-harry-p...
Why is this impressive?
Do you think it's actually ingesting the books and only using those as a reference? Is that how LLMs work at all? It seems more likely it's predicting these spell names from all the other references it has found on the internet, including lists of spells.
sigmoid10|23 days ago
MarcellusDrum|23 days ago
vercaemert|23 days ago
There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).
This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.
(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)
fastasucan|22 days ago
ehatr|23 days ago
The poster knows all of that; this is plain marketing.
rlt|23 days ago
zaphirplane|23 days ago
hereonout2|23 days ago
The results seemed impressive until I noticed some of the "Thinking" statements in the UI.
One made it apparent the model / agent / whatever had read the title from the screenshot and was off searching for existing ABC transcripts of the piece Ode to Joy.
So the whole thing was far less impressive after that, it wasn't reading the score anymore, just reading the title and using the internet to answer my query.
nobodywillobsrv|23 days ago
It's weird; it's like many agents are now in a phase of constantly fetching more information and never just thinking with what they've already got.
anomaly_|23 days ago
kouunji|23 days ago
xiomrze|24 days ago
If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.
ozim|24 days ago
Basically, with some tricks they managed to reproduce 99% of it word for word - the tricks were needed to bypass the safety measures that are in place for exactly this reason: to stop people from retrieving training material.
petercooper|24 days ago
ck_one|24 days ago
clanker_fluffer|24 days ago
golfer|24 days ago
https://harrypotter.fandom.com/wiki/List_of_spells
ck_one|24 days ago
qwertytyyuu|23 days ago
meroes|24 days ago
muzani|24 days ago
rvz|24 days ago
Nothing.
You can be sure this was already present in the training data - the PDFs, books, and websites that Anthropic scraped to train Claude; hence 'documented'. This is why tests like the one the OP just ran are meaningless.
Such "benchmarks" are performative displays for VCs, who never ask why the research and testing isn't done independently rather than, as almost always, by the companies' own in-house researchers.
matt_lo|23 days ago
gbalduzzi|23 days ago
It feels like shooting a fly with a bazooka
LeoPanthera|23 days ago
zamadatix|24 days ago
> The smug look on Malfoy’s face flickered.
> “No one asked your opinion, you filthy little Mudblood,” he spat.
> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.
> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.
> “Ron! Ron! Are you all right?” squealed Hermione.
> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.
sobjornstad|24 days ago
In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.
ck_one|24 days ago
NeroVanbierv|19 days ago
> ChatGPT: "Generate a two page short story like Harry Potter, but don't mention anything Harry Potter related. Make up 4 unique spells in the story that are used"
Response see https://chatgpt.com/share/698af9cd-f628-800d-9250-b260f1478c...
> Claude: "What unique wizarding spells can you find in this story? [story]"
Response = https://i.imgur.com/Jzzs3PC.png
guluarte|24 days ago
ck_one|24 days ago
irishcoffee|24 days ago
Geste|24 days ago
mhink|23 days ago
And yet, it's still somewhat better than the Hacker News comment using bastardized English words.
kmacdough|23 days ago
It feels like a very odd test because it's such an unreasonable way to answer this with an LLM. Nothing about the task requires more than a very localized understanding. It's not like a codebase or corporate documentation, where there's a lot of interconnectedness and context that's important. It also doesn't seem to poke at the gap between human and AI intelligence.
Why are people excited? What am I missing?
muzani|24 days ago
I guess they have to add more questions as these context windows get bigger.
kybernetikos|24 days ago
My standard test for that was "Who ends up with Bilbo's buttons?"
dwa3592|24 days ago
bartman|24 days ago
dom96|24 days ago
unknown|23 days ago
[deleted]
big-chungus4|20 days ago
big-chungus4|20 days ago
matt-p|22 days ago
ActionHank|23 days ago
SebastianSosa|23 days ago
TheRealPomax|24 days ago
kylehotchkiss|23 days ago
My hope is that locally run models can pass this test in the next year or two!
LanceJones|24 days ago
grey-area|23 days ago
So it seems like a nonsensical test to me.
siwatanejo|24 days ago
How do you know? Is each word one token?
koakuma-chan|24 days ago
psychoslave|23 days ago
polynomial|23 days ago
unknown|23 days ago
[deleted]
huangmeng|23 days ago
unknown|24 days ago
[deleted]
unknown|22 days ago
[deleted]
unknown|22 days ago
[deleted]
dr_dshiv|23 days ago
adarsh2321|24 days ago
[deleted]
IhateAI|24 days ago
[deleted]
dudewhocodes|23 days ago
bilekas|23 days ago
hansmayer|23 days ago
Clearly a very useful, grounded and helpful everyday use case of LLMs. I guess in the absence of real-world use cases, we'll have to do AI boosting with such "impressive" feats.
Btw - a well-crafted regex could have achieved the same (pointless) result with ~0.0000005% of the resources the LLM machine used.
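For what it's worth, here's a sketch of what such a regex might look like - purely illustrative, assuming spells appear as one or two capitalized pseudo-Latin words shouted in quoted dialogue. The pattern and sample text are made up; a real extraction would still need the canonical spell list to separate incantations from ordinary exclamations.

```python
import re

# Naive pattern: one or two Capitalized words followed by "!"
# inside curly-quoted dialogue, the way spells are typically cast.
SPELL_RE = re.compile(r'“([A-Z][a-z]+(?: [A-Z][a-z]+)?)!”')

sample = ('Ron bellowed, “Expelliarmus!” just as Hermione '
          'cried, “Wingardium Leviosa!” across the hall.')

print(SPELL_RE.findall(sample))  # ['Expelliarmus', 'Wingardium Leviosa']
```

Note it over-matches: a plain shout like “Silence!” is flagged too, which is exactly why you'd need a curated spell list (or a model's judgment) on top of the pattern.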