
Local LLMs versus offline Wikipedia

308 points | EvanHahn | 7 months ago | evanhahn.com

197 comments


dcc|7 months ago

One important distinction is that the strength of LLMs isn't just in storing or retrieving knowledge like Wikipedia does; it's in comprehension.

LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer. They can explain complex ideas in simpler terms, adapt responses based on the user's level of understanding, and connect dots across disciplines.

In a "rebooting society" scenario, that kind of interactive comprehension could be more valuable. You wouldn’t just have a frozen snapshot of knowledge, you’d have a tool that can help people use it, even if they’re starting with limited background.

progval|7 months ago

An unreliable computer treated as a god by a pre-information-age society sounds like a Star Trek episode.

BobbyTables2|7 months ago

Not sure if “more” valuable but certainly valuable.

I strongly dislike the way AI is being used right now. I feel like it is fundamentally an autocomplete on steroids.

That said, I admit it works as a far better search engine than Google. I can ask Copilot a terse question in quick mode and often get a decent answer.

However, if I ask it extremely in-depth technical questions, it hallucinates like crazy.

It also requires suspicion. I asked it to create a repo file for an old CentOS release on vault.centos.org. The output was flawless except for one detail: it specified the gpgkey for RPM verification not as a local file but over plain HTTP. I wouldn't have been upset about HTTPS (that site even supports it), but the answer as presented managed to completely thwart security through the absence of a single character…
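For illustration, a hypothetical reconstruction of what that looked like (not the actual output):

    [base]
    name=CentOS-7 - Base
    baseurl=https://vault.centos.org/7.9.2009/os/x86_64/
    gpgcheck=1
    # as generated: the signing key fetched over plain HTTP, so the signature
    # check can be defeated by anyone who can tamper with the connection
    gpgkey=http://vault.centos.org/RPM-GPG-KEY-CentOS-7
    # one character away from being fine:
    # gpgkey=https://vault.centos.org/RPM-GPG-KEY-CentOS-7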

gonzobonzo|7 months ago

Indeed. Ideally, you don't want to trust other people's summaries of sources; you want to look at the sources yourself, often with a critical eye. This is one of the things that everyone gets taught in school, everyone says they agree with, and then just about no one does (and at times, people will outright disparage the idea). Once out of school, tertiary sources get treated as if they're completely reliable.

I've found using LLMs to be a good way of getting an idea of where the current historiography of a topic stands, and which sources I should dive into. Conversely, I've been disappointed by the number of Wikipedia editors who become outright hostile when you say that Wikipedia is unreliable and that people often need to dive into the sources to get a better understanding of things. There have been some Wikipedia articles I've come across that were so unreliable that people who didn't look at other sources would have been greatly misled.

latexr|7 months ago

Sounds like a good way to ensure society never “reboots”.

A “frozen snapshot” of reliable knowledge is infinitely more valuable than a system that gives you wrong instructions, leaving you no idea whether an action will work or kill you. Anyone can “explain complex ideas in simple terms” if they don’t have to care about being correct.

What kind of scenario is this, even? We had such a calamity that we need to “reboot” society yet still have access to all the storage and compute power required to run LLMs? It sounds like a doomsday prepper fantasy for LLM fans.

kldg|7 months ago

As someone who went through a prepper episode in youth, I think this is worth underlining. I have a large digital archive of books and trade magazines, everything from bank industry primers for the oil industry to sewing patterns and "sewing theory". For a laugh with a friend, I admitted to still having it more than a decade after the initial digital hoarding, and we went through some of them. One was a book from a hundred and some years ago titled something like "Woodworking Explained for Everyone"; inside are pages and pages of complex Greek formulas, while the English-language context is written in a way largely incomprehensible to me. It would've taken me months to decipher the book and put anything into practice.

I just tell an LLM what I'm trying to do and it gives me 3 methods, explaining the pros and cons, and if I don't understand why it says something, I press about it. Even a local gemma-12b model can be pretty helpful, and in an era where we have so many cheap options for local energy generation and storage available, the case for hoarding digital textbooks/encyclopedias over an LLM is pretty weak.

That said, some old books are still very neat. We were reading through one called, I think, something like the "Grocer's Encyclopedia", and it contains many very helpful thought-starters and beautiful, practical illustrations. LLMs are probably always going to disproportionately advantage non-visual learners, at least in my lifetime. Wikipedia, I think, is more focused on events than useful skills; I don't think Wikipedia would be very useful for "rebooting society". It's more something to read for entertainment, or if for some reason you need to know which Treaty of London someone's referring to (but you could just ask an LLM that).

beeflet|7 months ago

I think some combination of both search (perhaps of an offline database of wikipedia and other sources) and a local LLM would be the best, as long as the LLM is terse and provides links to relevant pages.

I find LLMs with the search functionality to be weak because they blab on too much when they should be giving me more outgoing links I can use to find more information.

znort_|7 months ago

that's assuming working computers or phones are still around. a hardcopy of wikipedia or a few selected books might be a safer backup.

otoh, if we do in fact bring about such a reboot then maybe a full cold boot is what's actually in order ... you know, if it didn't work maybe try something different next time.

MangoToupe|7 months ago

> You wouldn’t just have a frozen snapshot of knowledge, you’d have a tool that can help people use it, even if they’re starting with limited background.

I think the only way this is true is if you used the LLM as a search index for the frozen snapshot of knowledge. Any text generation would be directly harmful compared to ingesting the knowledge directly.

Anyway, in the long term the problem isn't distinguishing factual from fictional, but the loss of the sources that served to produce the text to begin with. We already see a small part of this in the form of dead links and out-of-print, extinct texts. In many ways, LLMs that generate text are just a crappy form of Wikipedia with roughly the same tradeoffs.

inferiorhuman|7 months ago

> it’s in comprehension … what they can do is understand
Well, no. The glaringly obvious recent example was the answer that Adolf Hitler could solve global warming.

My friend's car is perhaps the less polarizing example. It wouldn't start and even had a helpful error code. The AI answer was to replace an expensive module. It took me about five minutes with basic tools to come up with a proper diagnosis (not the expensive module). Off to the shop, where they confirmed my diagnosis and completed the repair.

The car was returned with a severe drivability fault and a new error code. AI again helpfully suggested replacing a sensor. I talked my friend through how to rule out the sensor, and again the AI was proven way off base in a matter of minutes. After I took it for a test drive, I diagnosed a mechanical problem entirely unrelated to the AI's answer. Off to the shop it went, where the mechanical problem was confirmed, remedied, and the physically damaged part was returned to us.

AI doesn't comprehend anything. It merely regurgitates whatever information it's been able to hoover up. LLMs are just glorified search engines.

belter|7 months ago

> LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer.

- "'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' "

JumpCrisscross|7 months ago

> LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer

So meta prompt engineering?

TZubiri|7 months ago

"vague or poorly formed questions"

Do you have an example of such a question that is handled by an LLM differently than a Wikipedia search?

fzeroracer|7 months ago

In a 'rebooting society' doomsday scenario you're assuming that our language and understanding would persist. An LLM would essentially be a blackbox that you cannot understand or decipher, and would be doubly prone to hallucinations and issues when interacting with it using a language it was not trained on. Wikipedia is something you could gradually untangle, especially if the downloaded version also contained associated images.

cyanydeez|7 months ago

which means you'd still want wikipedia, as the imprecision will get in the way of real progress beyond the basics.

croes|7 months ago

Understanding the question is more valuable than giving the correct answer?

That’s the basis of a cult.

ranger_danger|7 months ago

> LLMs will return faulty or imprecise information at times

To be fair, so do humans and wikipedia.

kookamamie|7 months ago

As a bonus the LLM can spew out endless amounts of bullshit.

simonw|7 months ago

This is a sensible comparison.

My "help reboot society with the help of my little USB stick" thing was a throwaway remark to the journalist at a random point in the interview, I didn't anticipate them using it in the article! https://www.technologyreview.com/2025/07/17/1120391/how-to-r...

A bunch of people have pointed out that downloading Wikipedia itself onto a USB stick is sensible, and I agree with them.

Wikipedia dumps default to MySQL, so I'd prefer to convert that to SQLite and get SQLite FTS working.
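Something like this, as a minimal sketch (untested; table and column names are illustrative):

    import sqlite3

    conn = sqlite3.connect("wikipedia.db")
    # one row per article in an FTS5 virtual table
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")
    conn.execute("INSERT INTO articles VALUES (?, ?)",
                 ("SQLite", "SQLite is a small, fast, self-contained SQL database engine."))
    conn.commit()

    # BM25-ranked full-text search, no server required
    for (title,) in conn.execute(
            "SELECT title FROM articles WHERE articles MATCH ? ORDER BY rank",
            ("database engine",)):
        print(title)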

USB sticks of 1TB or more are pretty easy to find these days, so it's not like there's a space shortage to worry about.

0xDEAFBEAD|7 months ago

Someone should start a company selling USB sticks pre-loaded with lots of prepper knowledge of this type. In addition to making money, your USB sticks could make a real difference in the event of a global catastrophe. You could sell the USB stick in a little box which protects it from electromagnetic interference in the event of a solar flare or EMP.

I suppose the most important knowledge to preserve is knowledge about global catastrophic risks, so after the event, humanity can put the pieces back together and stop something similar from happening again. Too bad this book is copyrighted or you could download it to the USB stick: https://www.amazon.com/Global-Catastrophic-Risks-Nick-Bostro... I imagine there might be some webpages to crawl, however: https://www.lesswrong.com/w/existential-risk

kybernetikos|7 months ago

I've been carrying around a local Wikipedia dump on my phone or PDA for quite a bit more than 10 years now (including with pictures for the last 5 years). Before Kiwix and ZIM, I used TomeRaider and Aard.

I do it for disaster preparedness, but also for plain offline preparedness, which happens more often than you'd think.

But I have been thinking about how useful some of the models are these days, and the obvious next step to me seems to be to pair a local model with a local Wikipedia in a RAG-style setup so you get the best of both.

camel-cdr|7 months ago

reposting a comment of mine from a few weeks ago:

> All digitized books ever written/encoded compress to a few TB.

I tried to estimate how much data this actually is in raw text form:

    # annas archive stats
    papers = 105714890
    books = 52670695
    
    # word count estimates
    avrg_words_per_paper = 10000
    avrg_words_per_book = 100000
    
    words = (papers*avrg_words_per_paper + books*avrg_words_per_book )
    
    # quick test of 27 million words from a few books
    sample_words = 27809550
    sample_bytes = 158824661
    sample_bytes_comp = 28839837 # using zpaq -m5
    
    bytes_per_word = sample_bytes/sample_words
    byte_comp_ratio = sample_bytes_comp/sample_bytes
    word_comp_ratio = bytes_per_word*byte_comp_ratio
    
    print("total:", words*bytes_per_word*1e-12, "TB") # total: 30.10238345855199 TB
    print("compressed:", words*word_comp_ratio*1e-12, "TB") # compressed: 5.466077036085319 TB
So uncompressed ~30 TB and compressed ~5.5 TB of data.

That fits on three 2TB microSD cards, which you could buy for a total of $750 from SanDisk.

fumeux_fume|7 months ago

Of course that’s the angle they decided to open the article from. That they feel the need to frame these tools in the most grandiose terms bothers me. How does it make you feel?

makeworld|7 months ago

No need to muck around with SQL, just use Kiwix.

cyanydeez|7 months ago

the real value would be in having both of them. the LLM is good for refining/interpreting questions or longer-form progress issues, and the wiki would be actual information for each component of whatever you're trying to do.

But neither are sufficient for modern technology beyond pointing to a starting point.

jjice|7 months ago

Oh, interesting idea to use SQLite and its FTS. I was very impressed by the quality of its FTS, and this sounds like a great use case.

badsectoracula|7 months ago

I've found this amusing because right now i'm downloading `wikipedia_en_all_maxi_2024-01.zim` so i can use it with an LLM with pages extracted using `libzim` :-P. AFAICT the zim files have the pages as HTML and the file i'm downloading is ~100GB.

(reason: trying to cross-reference the tons of downloaded games on my HDD - for which i only have titles, as i never bothered to do any further categorization over the years aside from the place i got them from - with wikipedia articles - assuming they have one - to organize them by genre, some info, etc. After some experimentation, it turns out an LLM - specifically a quantized Mistral Small 3.2 - can make some sense of the chaos while being fast enough to run from scripts via a custom llama.cpp program)
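For reference, pulling a page out of the ZIM file with the python-libzim bindings looks roughly like this (method names and the entry path convention are from memory, so treat them as an assumption):

    from libzim.reader import Archive

    zim = Archive("wikipedia_en_all_maxi_2024-01.zim")
    entry = zim.get_entry_by_path("A/Albert_Einstein")  # pages are stored as HTML
    html = bytes(entry.get_item().content).decode("utf-8")
    print(html[:200])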

zozbot234|7 months ago

> trying to cross-reference the tons of downloaded games on my HDD - for which i only have titles, as i never bothered to do any further categorization over the years aside from the place i got them from - with wikipedia articles - assuming they have one - to organize them by genre, some info, etc. After some experimentation, it turns out an LLM - specifically a quantized Mistral Small 3.2 - can make some sense of the chaos while being fast enough to run from scripts via a custom llama.cpp program

You can do this a lot easier with Wikidata queries, and that will also include known video games for which an English Wikipedia article doesn't exist yet.

zuluonezero|7 months ago

Now this is the juicy tidbit I read HN for! A proper comment about doing something technical, with personal investment, in an interesting manner, and with just enough detail to tantalise. This seems like the best use of GenAI so far: not writing my code for me, or helping me grok something I should just be reading the source for, or pumping up a stupid startup funding grab. I've been working through building an LLM from scratch, and this is one time it actually appears useful, because for the life of me I just can't seem to find much value in it so far. I must have more to learn, so thanks for the pointer.

twotwotwo|7 months ago

The "they do different things" bullet is worth expanding.

Wikipedia, arXiv dumps, open-source code you download, etc. have code that runs and information that, whatever its flaws, is usually not guessed. It's also cheap to search, and often ready-made for a purpose: FOSS apps are runnable, a wiki will introduce or survey a topic, and so on.

LLMs, smaller ones especially, will make stuff up, but can try to take questions that aren't clean keyword searches, and theoretically make some tasks qualitatively easier: one could read through a mountain of raw info for the response to a question, say.

The scenario in the original quote is too ambitious for me to really think about now, but just thinking about coding offline for a spell, I imagine having a better time calling into existing libraries for whatever I can rather than trying to rebuild them, even assuming a good coding assistant. Maybe there's an analogy with non-coding tasks?

A blind spot: I have no real experience with local models; I don't have any hardware that can run 'em well. Just going by public benchmarks like Aider's, it appears ones like Qwen3 32B can handle some coding, so I figure I should assume there's some use there.

antonkar|7 months ago

A bit related: AI companies distilled the whole Web into LLMs to make computers smart, so why can't humans do the same to make the best possible new Wikipedia, with some copyrighted bits, to make kids supersmart?

Why do kids get a worse deal than AI companies and have to bum around?

horseradish7k|7 months ago

we did that and still do. people just don't buy encyclopedias that much nowadays

omneity|7 months ago

Incidentally, I just posted about Wikipedia Monthly[0], a monthly dump of Wikipedia broken down by language with MediaWiki markup cleaned into plain text, perfect for a local search index or other scenarios.

There are 341 languages in there and 205GB of data, with English alone making up 24GB! As for my perspective on Simple English Wikipedia (from the OP): it's decent, but the content tends to be shallow and imprecise.

0: https://omarkama.li/blog/wikipedia-monthly-fresh-clean-dumps...

tootyskooty|7 months ago

One underdiscussed advantage is that an LLM makes knowledge language-agnostic.

While this is less obvious to people who primarily consume en.wiki (as most things are well covered in English), for many other languages even well-understood concepts often have poor pages. But even the English wiki has large gaps that are covered in other languages (people and places, mostly).

LLMs get you the union of all of this, in turn viewable through arbitrary language "lenses".

hannofcart|7 months ago

Since there's a lot of shade being thrown at the imprecise information LLMs can generate, an ideal doomsday information-query database should be constructed as an LLM + file archive:

1. The LLM understands the vague query from the human, connects the necessary dots, gives the user an overview, and furnishes them with a list of topic names/local file links to actual Wikipedia articles.

2. The user can then go on to read the precise information from the listed Wikipedia articles directly.

Terr_|7 months ago

Even as a grouchy pessimist, one of the places I think LLMs could shine is as a tool to help translate prose into search terms... Not as an intermediary, though, but as an encouraging tutor off to the side, something a regular user will eventually surpass.

vFunct|7 months ago

Why not both?

LLM+Wikipedia RAG

JKCalhoun|7 months ago

Yeah, I've been wanting to try to do that.

Someone posted this recently: https://github.com/philippgille/chromem-go/tree/v0.7.0/examp...

But it is a very simplified RAG with only the lead paragraphs of 200 Wikipedia entries.

I want to learn how to encode a RAG of one of the Kiwix drops — "Best of Wikipedia" for example. I suppose an LLM can tell me how, but I'm surprised not to have yet stumbled upon one that someone has already done.
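The embedding side is the easy part; here's a minimal sketch assuming the articles are already extracted to plain text (the model name and toy corpus are just placeholders):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # toy stand-in for text extracted from a Kiwix drop
    articles = {
        "Bread": "Bread is a staple food prepared from a dough of flour and water.",
        "Steel": "Steel is an alloy of iron and carbon with improved strength.",
    }

    model = SentenceTransformer("all-MiniLM-L6-v2")
    titles = list(articles)
    vectors = model.encode([articles[t] for t in titles], normalize_embeddings=True)

    def retrieve(question, k=1):
        q = model.encode([question], normalize_embeddings=True)[0]
        scores = vectors @ q  # cosine similarity, since vectors are normalized
        return [titles[i] for i in np.argsort(scores)[::-1][:k]]

    # the retrieved article text then gets pasted into the local LLM's prompt
    print(retrieve("How do I make an alloy of iron?"))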

loloquwowndueo|7 months ago

Because an old laptop can't run a local LLM in reasonable time.

moffkalast|7 months ago

Now this is an Avengers-level threat.

mac-mc|7 months ago

Yeah at these sizes, it's very much a why not both.

ritzaco|7 months ago

I thought this would be about which is more useful in specific scenarios.

I'm always surprised that when it comes to "how useful are LLMs" the answers are often vibe-based like "I asked it this and it got it right". Before LLMs, information retrieval and machine learning were at least somewhat rigorous scientific fields where people would have good datasets of questions and see how well a specific model performed for a specific task.

Now LLMs are definitely more general and can somewhat solve a wider variety of tasks, but I'm surprised we don't have more benchmarks for LLMs vs other methods (there are plenty of LLM vs LLM benchmarks).

Maybe it's just that I'm further removed from academia, and people are doing this and I just don't see it?

meander_water|7 months ago

One thing to note is that the quality of LLM output is related to the quality and depth of the input prompt. If you don't know what to ask (likely in the apocalypse scenario), then that info is locked away in the weights.

On the other hand, with Wikipedia, you can just read and search everything.

Timwi|7 months ago

Why do you assume it's easier to know what article(s) to read than what question to ask?

rlupi|7 months ago

This gave me an idea.

It would be nice to build a local LLM + Wikipedia tool that uses the LLM to assemble a general answer and then searches Wikipedia (via full-text search or RAG) for grounding facts. It could help a lot with the hallucinations of small models.

Tempat1|7 months ago

I feel like there could be way more of that kind of thing - LLMs backed by a database of info or accurate tools.

e.g. at the risk of massively oversimplifying a complex issue: LLMs are bad at maths, so couldn't we have them use a calculator?
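That pattern exists (tool use / function calling); here's a toy sketch of the idea, with a made-up CALC() tool-call syntax rather than any real API:

    import ast, operator, re

    # safe arithmetic evaluator the model can delegate to
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calc(expr):
        def ev(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    # hypothetical model output containing a tool call
    reply = "That works out to CALC(1234 * 5678) units."
    final = re.sub(r"CALC\((.+?)\)", lambda m: str(calc(m.group(1))), reply)
    print(final)  # the arithmetic is done by code, not by the model's weights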

beaugunderson|7 months ago

I've had a full Kiwix Wikipedia export on my phone for the last ~5 years... I have used it many times when I didn't have service and needed to answer a question or needed something to read (I travel a lot).

nsypteras|7 months ago

Same here! Kiwix comes in clutch on flights. I've used it so many times to get background knowledge on topics mid-read. Plus free and open source. Such a great service.

entropie|7 months ago

I played around with an Orin Jetson Nano Super (an Nvidia Raspberry Pi with a GPU) and right now it's basically an open-webui box with ollama and a bunch of models.

It's awesome, actually. It's reasonably fast with GPU support with gemma3:4b, but I can use bigger models when time is not a factor.

I've actually thought about how crazy that is, especially if there's no internet access for some reason. Not tested yet, but there seems to be an adapter cable to run it directly from a PD powerbank. I have to try.

saddat|7 months ago

I had this thought that for a hypothetical Voyager 3 mission, instead of a golden disc, an LLM should be installed. Then a very simplistic initial interface could be described, in its simplest form a single-channel digital one, then additional, more elaborate ones. Behind all the interfaces there could be an LLM responding to provided input, eventually revealing humanity's knowledge.

dmezzetti|7 months ago

One additional option to consider is a local vector database with Wikipedia articles: https://huggingface.co/NeuML/txtai-wikipedia

I've built this as a datasource for Retrieval Augmented Generation (RAG) but it certainly can be used standalone.
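Loading and querying it looks roughly like this (see the model card for exact details):

    from txtai.embeddings import Embeddings

    # load the prebuilt Wikipedia index from the Hugging Face Hub
    embeddings = Embeddings()
    embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

    # semantic search over article abstracts
    print(embeddings.search("how do solar flares affect electronics", 3))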

numpad0|7 months ago

PSA: models confusingly named "$1-distill-$2" (sometimes without "-distill") are $2 trained on the outputs of $1, a process referred to as "distillation", not the other way around, and not the real thing.

The article mentions nonexistent configurations such as "DeepSeek-R1 1.5B"; those are that kind of model.

spankibalt|7 months ago

Wikipedia snapshots without the most important meta layers, i.e. a) the articles' discussion pages and related archives, as well as b) the version history, would be useless to me, as critical context might be missing... especially with regard to LLM-augmented text analysis. Even when just focusing on the standout lemmata.

pinkmuffinere|7 months ago

I’m a massive Wikipedia fan, have a lot of it downloaded locally on my phone, binge read it before bed, etc. Even so, I rarely go through talk pages or version history unless I’m contributing something. What would you see in an article that motivates you to check out the meta layers?

alisonatwork|7 months ago

You can kind of extrapolate this meta layer if you switch languages on the same topic, because different languages tend to encode different cultural viewpoints and emphasize different things. Also languages that are less frequently updated can capture older information or may retain a more dogmatic framing that has not been refined to the same degree.

The edit history or talk pages certainly provide additional context that in some cases could prove useful, but in terms of bang for the buck I suspect sourcing from different language snapshots would be a more economical choice.

luke-stanley|7 months ago

Testing the recall accuracy of those LLMs would be good. You'd probably want to use SQLite's BM25 on the Kiwix data. I was thinking of Kiwix when I saw the original discussion with Simon, but for some reason I thought the blog post would do more than a size comparison.

VladVladikoff|7 months ago

Is there any project that combines a local LLM with a local copy of Wikipedia? I don't know much about this, but I think it's called RAG? It would be neat if I could make my local LLM fact-check itself against the local copy of Wikipedia.

arthurcolle|7 months ago

Yep, this is a great idea. You can do something simple with a ColBERTv2 retriever and go a long way!

NelsonMinar|7 months ago

Offline Wikipedia is so powerful! I've been carrying a copy of Kiwix on my phone when travelling for years (and before that, earlier systems).

Has anyone done an experiment of using RAG to make it easy to query Wikipedia with an LLM?

richardjennings|7 months ago

Is it possible that LLMs could challenge data compression information theory? Reading this made me wonder how much could be inferred via understanding and thus removed from the minimal necessary representation.

ineedasername|7 months ago

FTFA: ...apocalypse scenario. “‘It’s like having a weird, condensed, faulty version of Wikipedia, so I can help reboot society with the help of my little USB stick,’

    system_prompt = {
        You are CL4P-TR4P, a dangerously confident chat droid
        purpose: vibe back society
        boot_source: Shankar.vba.grub
        training_data: memes
    }

wangg|7 months ago

Wouldn’t Wikipedia compress a lot more than LLMs? Are these uncompressed sizes?

GuB-42|7 months ago

The downloads are (presumably) already compressed.

And there are strong ties between LLMs and compression. LLMs work by predicting the next token. The best compression algorithms work by predicting the next token and encoding the difference between the predicted token and the actual token in a space-efficient way. So in a sense, a LLM trained on Wikipedia is kind of a compressed version of Wikipedia.
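A toy illustration of that link, with made-up probabilities: an entropy coder needs about -log2(p) bits for a token the model predicted with probability p, so the better the prediction, the smaller the output.

    import math

    # made-up probabilities a model assigns to the actual next tokens of a text
    good_model = [0.9, 0.8, 0.95, 0.7]
    bad_model = [0.1, 0.2, 0.05, 0.3]

    def ideal_bits(probs):
        # Shannon code length: -log2(p) bits per token
        return sum(-math.log2(p) for p in probs)

    print(round(ideal_bits(good_model), 1))  # ~1.1 bits for the same four tokens
    print(round(ideal_bits(bad_model), 1))   # ~11.7 bits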

Philpax|7 months ago

Yes, they're uncompressed. For reference, `enwiki-20250620-pages-articles-multistream.xml.bz2` is 25,176,364,573 bytes; you could get that lower with better compression. You can do partial reads from multistream bz2, though, which is handy.
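A rough sketch of such a partial read (the offset would come from the accompanying multistream index file; the read size here is an arbitrary assumption):

    import bz2

    def read_stream(path, offset, size=10 * 1024 * 1024):
        # each chunk of a multistream dump is an independent bz2 stream, so we
        # can seek straight to an indexed offset and decompress just that
        # slice (typically 100 pages of XML) instead of the whole archive
        with open(path, "rb") as f:
            f.seek(offset)
            return bz2.BZ2Decompressor().decompress(f.read(size))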

IncreasePosts|7 months ago

Maybe we need an LLM with a searching and ranking function foremost, so it can scan an actual copy of Wikipedia and return the best real results to the user.

cosbgn|7 months ago

I think the best would be to also download the entire Wikipedia stored as embeddings. Seems like the best of both worlds.

fho|7 months ago

I mean... That's definitely a "why not both" situation.

1. Make the (compressed) Wikipedia better searchable as a knowledge base.

2. Use the LLM as an "interface" to that knowledge base.

I investigated (1) back when all of (English, text-only) Wikipedia was about 2 GB. Maybe it is time to look at that toy code base again.

jancsika|7 months ago

Seems like offline Wikipedia with an offline LLM that can only output Wikipedia search results would be the best of both worlds.

That would downgrade the problem of hallucinations into mere irrelevant search results. But irrelevant Wikipedia search results are still a huge improvement over Google SEO AI-slop!

almosthere|7 months ago

To reboot society do everything this very unsuccessful one did lol

s1mplicissimus|7 months ago

Upvoted this because I like the lighthearted, honest approach.

haunter|7 months ago

I thought this would be about training a local LLM with an offline downloaded copy of Wikipedia