One important distinction is that the strength of LLMs isn't just in storing or retrieving knowledge the way Wikipedia does; it's in comprehension.
LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer. They can explain complex ideas in simpler terms, adapt responses based on the user's level of understanding, and connect dots across disciplines.
In a "rebooting society" scenario, that kind of interactive comprehension could be more valuable. You wouldn’t just have a frozen snapshot of knowledge; you’d have a tool that can help people use it, even if they’re starting with limited background.
Not sure if “more” valuable but certainly valuable.
I strongly dislike the way AI is being used right now. I feel like it is fundamentally an autocomplete on steroids.
That said, I admit it works as a far better search engine than Google. I can ask Copilot a terse question in quick mode and get a decent answer often.
If I ask it extremely in-depth technical questions, though, it hallucinates like crazy.
It also requires suspicion. I asked it to create a repo file for an old CentOS release on vault.centos.org. The output was flawless except one detail — it specified the gpgkey for RPM verification not using a local file but using plain HTTP. I wouldn’t be upset about HTTPS (that site even supports it), but the answer presented managed to completely thwart security with the absence of a single character…
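For reference, a vault repo file with the key specified safely looks roughly like this (a sketch assuming CentOS 7.9; the exact paths and key filename should be checked against vault.centos.org):

```ini
[base]
name=CentOS-7.9.2009 - Base
baseurl=https://vault.centos.org/7.9.2009/os/x86_64/
enabled=1
gpgcheck=1
# Use the key already shipped on disk; a remote gpgkey fetched over plain http
# lets a man-in-the-middle swap both the packages and the key that "verifies" them.
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
```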
Indeed. Ideally, you don't want to trust other people's summaries of sources; you want to look at the sources yourself, often with a critical eye. This is one of the things that everyone gets taught in school, that everyone says they agree with, and that just about no one does (and at times, people will outright disparage the idea). Once out of school, tertiary sources get treated as if they're completely reliable.
I've found using LLMs to be a good way of getting an idea of where the current historiography of a topic stands, and which sources I should dive into. Conversely, I've been disappointed by the number of Wikipedia editors who become outright hostile when you say that Wikipedia is unreliable and that people often need to dive into the sources to get a better understanding of things. There have been some Wikipedia articles I've come across that have been so unreliable that people who didn't look at other sources would have been greatly misled.
Sounds like a good way to ensure society never “reboots”.
A “frozen snapshot” of reliable knowledge is infinitely more valuable than a system that gives you wrong instructions, leaving you no idea which action will work and which will kill you. Anyone can “explain complex ideas in simple terms” if they don’t have to care about being correct.
What kind of scenario is this, even? We had such a calamity that we need to “reboot” society yet still have access to all the storage and compute power required to run LLMs? It sounds like a doomsday prepper fantasy for LLM fans.
As someone who went through a prepper episode in my youth, I think this is worth underlining. I have a large digital archive of books and trade magazines, everything from bank industry primers for the oil industry to sewing patterns and "sewing theory". For a laugh with a friend, I admitted to still having this more than a decade after the initial digital hoarding, and we went through some of them. One was a book from a hundred-some years ago titled something like "Woodworking Explained for Everyone", and inside are pages and pages of complex Greek formulas, while the English-language context is written in a way largely incomprehensible to me. It would've taken me months to decipher the book and put anything into practice.
I just tell an LLM what I'm trying to do and it gives me 3 methods, explaining the pros and cons, and if I don't understand why it says something, I press about it. Even a local gemma-12b model can be pretty helpful, and in an era where we have so many cheap options for local energy generation and storage available, the case for hoarding digital textbooks/encyclopedias over an LLM is pretty weak.
That said, some old books are still very neat. We were reading through one called, I think, something like the "Grocer's Encyclopedia", and it contains many very helpful thought-starters and beautiful, practical illustrations. LLMs are probably always going to disproportionately advantage non-visual learners in my lifetime, I think. Wikipedia is more focused on events than useful skills; I don't think Wikipedia would be very useful for "rebooting society". It's more something to read for entertainment, or for when you need to know which Treaty of London someone's referring to (but you could just ask an LLM that).
I think some combination of both search (perhaps of an offline database of wikipedia and other sources) and a local LLM would be the best, as long as the LLM is terse and provides links to relevant pages.
I find LLMs with the search functionality to be weak because they blab on too much when they should be giving me more outgoing links I can use to find more information.
that's assuming working computers or phones are still around. a hardcopy of wikipedia or a few selected books might be a safer backup.
otoh, if we do in fact bring about such a reboot then maybe a full cold boot is what's actually in order ... you know, if it didn't work maybe try something different next time.
> You wouldn’t just have a frozen snapshot of knowledge, you’d have a tool that can help people use it, even if they’re starting with limited background.
I think the only way this is true is if you used the LLM as a search index for the frozen snapshot of knowledge. Any text generation would be directly harmful compared to ingesting the knowledge directly.
Anyway, in the long term the problem isn't distinguishing the factual from the fictional, but the loss of the sources that served to produce the text to begin with. We already see a small part of this in the form of dead links and out-of-print extinct texts. In many ways LLMs that generate text are just a crappy form of wikipedia with roughly the same tradeoffs.
> it’s in comprehension … what they can do is understand
Well, no. The glaringly obvious recent example was the answer that Adolf Hitler could solve global warming.
My friend's car is perhaps the less polarizing example. It wouldn't start and even had a helpful error code. The AI answer was you need to replace an expensive module. Took me about five minutes with basic tools to come up with a proper diagnosis (not the expensive module). Off to the shop where they confirmed my diagnosis and completed the repair.
The car was returned with a severe drivability fault and a new error code. AI again helpfully suggested replacing a sensor. I talked my friend through how to rule out the sensor, and again AI was proven way off base in a matter of minutes. After I took it for a test drive, I diagnosed a mechanical problem entirely unrelated to the AI's answer. Off to the shop it went, where the mechanical problem was confirmed and remedied, and the physically damaged part was returned to us.
AI doesn't comprehend anything. It merely regurgitates whatever information it's been able to hoover up. LLMs are glorified search engines.
> LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer.
- "'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' "
> LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer
In a 'rebooting society' doomsday scenario you're assuming that our language and understanding would persist. An LLM would essentially be a black box that you cannot understand or decipher, and would be doubly prone to hallucinations and other issues when interacted with in a language it was not trained on. Wikipedia is something you could gradually untangle, especially if the downloaded version also contained the associated images.
My "help reboot society with the help of my little USB stick" thing was a throwaway remark to the journalist at a random point in the interview, I didn't anticipate them using it in the article! https://www.technologyreview.com/2025/07/17/1120391/how-to-r...
A bunch of people have pointed out that downloading Wikipedia itself onto a USB stick is sensible, and I agree with them.
Wikipedia dumps default to MySQL, so I'd prefer to convert that to SQLite and get SQLite FTS working.
1TB or more USB sticks are pretty available these days so it's not like there's a space shortage to worry about for that.
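A minimal sketch of the SQLite FTS idea, with two invented rows standing in for the converted dump:

```python
import sqlite3

# Build a full-text index with SQLite's FTS5 module; in a real setup the rows
# would come from the converted MySQL dump rather than these invented examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE articles USING fts5(title, body);
    INSERT INTO articles VALUES
      ('Water purification', 'Boiling, filtration and chlorination remove pathogens.'),
      ('Penicillin', 'An antibiotic originally derived from Penicillium moulds.');
""")

# MATCH uses the inverted index; bm25() ranks hits by relevance.
rows = conn.execute(
    "SELECT title FROM articles WHERE articles MATCH ? ORDER BY bm25(articles)",
    ("filtration",),
).fetchall()
print(rows)  # [('Water purification',)]
```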
Someone should start a company selling USB sticks pre-loaded with lots of prepper knowledge of this type. In addition to making money, your USB sticks could make a real difference in the event of a global catastrophe. You could sell the USB stick in a little box which protects it from electromagnetic interference in the event of a solar flare or EMP.
I suppose the most important knowledge to preserve is knowledge about global catastrophic risks, so after the event, humanity can put the pieces back together and stop something similar from happening again. Too bad this book is copyrighted or you could download it to the USB stick: https://www.amazon.com/Global-Catastrophic-Risks-Nick-Bostro... I imagine there might be some webpages to crawl, however: https://www.lesswrong.com/w/existential-risk
I've been carrying around a local wikipedia dump on my phone or pda for quite a bit more than 10 years now (including with pictures for the last 5 years). Before kiwix and zim, I used tomeraider and aard.
I do it both for disaster preparedness but also off-line preparedness. Happens more often than you'd think.
But I have been thinking about how useful some of the models are these days, and the obvious next step to me seems to be to pair a local model with a local wikipedia in a RAG style set up so you get the best of both.
Of course that’s the angle they decided to open the article from. It bothers me that they feel the need to frame these tools in the most grandiose terms. How does it make you feel?
the real value would be in both of them. the LLM is good for refining/interpreting questions or longer-form progress issues, and the wiki would be actual information for each component of whatever you're trying to do.
But neither is sufficient for modern technology beyond pointing to a starting point.
I've found this amusing because right now i'm downloading `wikipedia_en_all_maxi_2024-01.zim` so i can use it with an LLM with pages extracted using `libzim` :-P. AFAICT the zim files have the pages as HTML and the file i'm downloading is ~100GB.
(reason: trying to cross-reference the tons of downloaded games on my HDD - for which i only have titles, as i never bothered with any further categorization over the years aside from where i got them from - with wikipedia articles - assuming they have one - to organize them by genre, some info, etc. after some experimentation it turns out an LLM - specifically a quantized Mistral Small 3.2 - can make some sense of the chaos while being fast enough to run from scripts via a custom llama.cpp program)
> trying to cross-reference my tons of downloaded games [on] my HDD […] to organize them in genres
You can do this a lot easier with Wikidata queries, and that will also include known video games for which an English Wikipedia article doesn't exist yet.
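For instance, a sketch of the Wikidata approach (the IDs are from memory and worth verifying: P31 "instance of", Q7889 "video game", P136 "genre"; running the query needs network access to query.wikidata.org, so this only constructs the request URL):

```python
import urllib.parse

# SPARQL for video games and their genres against the public Wikidata endpoint.
SPARQL = """
SELECT ?game ?gameLabel ?genreLabel WHERE {
  ?game wdt:P31 wd:Q7889 .              # instance of: video game
  OPTIONAL { ?game wdt:P136 ?genre . }  # genre, where recorded
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

url = ("https://query.wikidata.org/sparql?format=json&query="
       + urllib.parse.quote(SPARQL))
print(url[:80])
```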
Now this is the juicy tidbit I read HN for! A proper comment about doing something technical, personally invested, in an interesting manner, with just enough detail to tantalise. This seems like the best use of GenAI so far: not writing my code for me, or helping me grok something I should just be reading the source for, or pumping up a stupid startup funding grab. I've been working through building an LLM from scratch, and this is one time it actually appears useful, because for the life of me I can't seem to find much value in it so far. I must have more to learn, so thanks for the pointer.
The "they do different things" bullet is worth expanding.
Wikipedia, arXiv dumps, open-source code you download, etc. have code that runs and information that, whatever its flaws, is usually not guessed. It's also cheap to search, and often ready-made for something--FOSS apps are runnable, wiki will introduce or survey a topic, and so on.
LLMs, smaller ones especially, will make stuff up, but can try to take questions that aren't clean keyword searches, and theoretically make some tasks qualitatively easier: one could read through a mountain of raw info for the response to a question, say.
The scenario in the original quote is too ambitious for me to really think about now, but just thinking about coding offline for a spell, I imagine having a better time calling into existing libraries for whatever I can rather than trying to rebuild them, even assuming a good coding assistant. Maybe there's an analogy with non-coding tasks?
A blind spot: I have no real experience with local models; I don't have any hardware that can run 'em well. Just going by public benchmarks like Aider's it appears ones like Qwen3 32B can handle some coding, so figure I should assume there's some use there.
A bit related: AI companies distilled the whole Web into LLMs to make computers smart, so why can't humans do the same and make the best possible new Wikipedia, copyrighted bits included, to make kids supersmart?
Why are kids worth less than AI companies, left to bum around?)
I just posted incidentally about Wikipedia Monthly[0], a monthly dump of Wikipedia broken down by language, with the MediaWiki markup cleaned into plain text, so it's perfect for a local search index or other scenarios.
There are 341 languages in there and 205GB of data, with English alone making up 24GB! As for Simple English Wikipedia (from the OP): it's decent, but the content tends to be shallow and imprecise.
0: https://omarkama.li/blog/wikipedia-monthly-fresh-clean-dumps...
One underdiscussed advantage is that an LLM makes knowledge language agnostic.
While less obvious to people that primarily consume en.wiki (as most things are well covered in English), for many other languages even well-understood concepts often have poor pages. But even the English wiki has large gaps that are otherwise covered in other languages (people and places, mostly).
LLMs get you the union of all of this, in turn viewable through arbitrary language "lenses".
Since there's a lot of shade being thrown about imprecise information that LLMs can generate, an ideal doomsday information query database should be constructed as an LLM + file archive.
1. LLM understands the vague query from human, connects necessary dots, and gives user an overview, and furnishes them with a list of topic names/local file links to actual Wikipedia articles
2. User can then go on to read the precise information from the listed Wikipedia articles directly.
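Step 1's lookup half can be sketched with simple keyword overlap standing in for the LLM's query understanding (the article filenames and text here are invented for illustration):

```python
# Score local articles by how many query terms they share, then return the
# top candidates as links the user can read directly (step 2).
articles = {
    "water_purification.html": "boiling filtration chlorination safe drinking water",
    "smelting.html": "furnace ore charcoal iron smelting metal",
    "first_aid.html": "wound bleeding bandage infection first aid",
}

def suggest(query: str, k: int = 2) -> list[str]:
    terms = set(query.lower().split())
    scored = sorted(
        articles,
        key=lambda f: len(terms & set(articles[f].split())),
        reverse=True,
    )
    return scored[:k]

print(suggest("how do I make water safe to drink"))
```

A real LLM would handle much vaguer queries than this overlap trick, but the output contract (a short list of local article links) is the same.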
Even as a grouchy pessimist, one of the places I think LLMs could shine is as a tool to help translate prose into search terms... Not as an intermediary, though, but as an encouraging tutor off to the side, something a regular user will eventually surpass.
Someone posted this recently: https://github.com/philippgille/chromem-go/tree/v0.7.0/examp... But it is a very simplified RAG with only the lead paragraph to 200 Wikipedia entries.
I want to learn how to encode a RAG of one of the Kiwix drops — "Best of Wikipedia" for example. I suppose an LLM can tell me how but am surprised not to have yet stumbled upon one that someone has already done.
I thought this would be about which is more useful in specific scenarios.
I'm always surprised that when it comes to "how useful are LLMs" the answers are often vibe-based like "I asked it this and it got it right". Before LLMs, information retrieval and machine learning were at least somewhat rigorous scientific fields where people would have good datasets of questions and see how well a specific model performed for a specific task.
Now LLMs are definitely more general and can somewhat solve a wider variety of tasks, but I'm surprised we don't have more benchmarks for LLMs vs other methods (there are plenty of LLM vs LLM benchmarks).
Maybe it's just because I'm further removed from academia, and people are doing this and I don't see?
One thing to note is that the quality of LLM output is related to the quality and depth of the input prompt. If you don't know what to ask (likely in the apocalypse scenario), then that info is locked away in the weights.
On the other hand, with Wikipedia, you can just read and search everything.
It would be nice to build a local LLM + wikipedia tool, that uses the LLM to assemble a general answer and then search wikipedia (via full-text search or rag) for grounding facts. It could help with hallucinations of small models a lot.
I've had a full Kiwix Wikipedia export on my phone for the last ~5 years... I have used it many times when I didn't have service and needed to answer a question or needed something to read (I travel a lot).
Same here! Kiwix comes in clutch on flights. I've used it so many times to get background knowledge on topics mid-read. Plus free and open source. Such a great service.
I played around with a Jetson Orin Nano Super (an Nvidia Raspberry Pi-like board with a GPU) and right now it's basically an open-webui with ollama and a bunch of models.
It's awesome, actually. It's reasonably fast with GPU support with gemma3:4b, but I can use bigger models when time is not a factor.
i've actually thought about how crazy that is, especially if there's no internet access for some reason. Not tested yet, but there seems to be an adapter cable to run it directly from a PD powerbank. I have to try it.
I had this thought that for a hypothetical Voyager 3 mission, instead of a golden disc, an LLM should be installed. Then a very simplistic initial interface could be described, in its simplest form a single digital channel, then additional, more elaborate ones. Behind all the interfaces there could be an LLM responding to provided input, eventually revealing humanity's knowledge.
PSA: models confusingly named "$1-distill-$2" (sometimes without "-distill") are $2 trained on the outputs of $1 (a process referred to as "distillation"), not the other way around, and not the real thing.
The article contains nonexistent configurations such as "Deepseek-R1 1.5B"; those are that thing.
Wikipedia snapshots without the most important meta layers, i.e. a) the articles' discussion pages and related archives, as well as b) the version history, would be useless to me, as critical contexts might be/are missing... especially with regards to LLM-augmented text analysis. Even when just focusing on the standout lemmata.
I’m a massive Wikipedia fan, have a lot of it downloaded locally on my phone, binge read it before bed, etc. Even so, I rarely go through talk pages or version history unless I’m contributing something. What would you see in an article that motivates you to check out the meta layers?
You can kind of extrapolate this meta layer if you switch languages on the same topic, because different languages tend to encode different cultural viewpoints and emphasize different things. Also languages that are less frequently updated can capture older information or may retain a more dogmatic framing that has not been refined to the same degree.
The edit history or talk pages certainly provide additional context that in some cases could prove useful, but in terms of bang for the buck I suspect sourcing from different language snapshots would be a more economical choice.
Testing the recall accuracy of those LLMs would be good. You'd probably want to use SQLite's BM25 on the Kiwix data.
I was thinking of Kiwix when I saw the original discussion with Simon but for some reason I thought the blog post would do more than size comparison.
Is there any project that combines a local LLM with a local copy of Wikipedia? I don't know much about this, but I think it's called RAG? It would be neat if I could make my local LLM fact-check itself against the local copy of Wikipedia.
Is it possible that LLMs could challenge data compression information theory? Reading this made me wonder how much can be inferred via understanding and thus removed from the minimal necessary representation.
Ftfa: ...apocalypse scenario. “‘It’s like having a weird, condensed, faulty version of Wikipedia, so I can help reboot society with the help of my little USB stick,’
system_prompt = {
    "You are CL4P-TR4P, a dangerously confident chat droid"
}
The downloads are (presumably) already compressed.
And there are strong ties between LLMs and compression. LLMs work by predicting the next token. The best compression algorithms work by predicting the next token and encoding the difference between the predicted token and the actual token in a space-efficient way. So in a sense, a LLM trained on Wikipedia is kind of a compressed version of Wikipedia.
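A toy illustration of that link, on a deliberately trivial text so the bigram "model" predicts perfectly (Shannon coding: a symbol with model probability p costs about -log2(p) bits):

```python
import math
from collections import Counter

# The better a model predicts the next symbol, the fewer bits it needs.
text = "abababababababab"
pairs = list(zip(text, text[1:]))

# Baseline: a uniform model over the 2-letter alphabet costs 1 bit per transition.
bits_uniform = len(pairs) * math.log2(2)

# Bigram model: p(next | current) estimated from the text itself.
ctx = Counter(text[:-1])
big = Counter(pairs)
bits_bigram = sum(-math.log2(big[(a, b)] / ctx[a]) for a, b in pairs)

print(bits_uniform, bits_bigram)  # the predictive model needs far fewer bits
```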
Yes, they're uncompressed. For reference, `enwiki-20250620-pages-articles-multistream.xml.bz2` is 25,176,364,573 bytes; you could get that lower with better compression. You can do partial reads from multistream bz2, though, which is handy.
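The partial-read trick works because the dump is a concatenation of independent bz2 streams; a sketch with two tiny invented streams standing in for the dump (in practice the byte offset comes from the dump's index file):

```python
import bz2

# Two independent bz2 streams, concatenated like the multistream Wikipedia dump.
stream_a = bz2.compress(b"<page>Article A</page>")
stream_b = bz2.compress(b"<page>Article B</page>")
multistream = stream_a + stream_b

# Seek to a stream boundary and decompress just that one stream;
# BZ2Decompressor stops at the end of the stream it was given.
offset = len(stream_a)
data = bz2.BZ2Decompressor().decompress(multistream[offset:])
print(data)  # b'<page>Article B</page>'
```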
Maybe we need an LLM with a searching and ranking function foremost, so it can scan an actual copy of Wikipedia and return the best real results to the user.
Seems like offline Wikipedia with an offline LLM that can only output Wikipedia search results would be the best of both worlds.
That would downgrade the problem of hallucinations into mere irrelevant search results. But irrelevant Wikipedia search results are still a huge improvement over Google SEO AI-slop!
JumpCrisscross|7 months ago
So meta prompt engineering?
TZubiri|7 months ago
Do you have an example of such a question that is handled by an llm differently than a wikipedia search?
croes|7 months ago
That’s the basis of a cult.
ranger_danger|7 months ago
To be fair, so do humans and wikipedia.
camel-cdr|7 months ago
> All digitized books ever written/encoded compress to a few TB.
I tried to estimate how much data this actually is in raw text form: roughly ~30 TB uncompressed and ~5.5 TB compressed. That fits on three 2TB microSD cards, which you could buy for a total of $750 from SanDisk.
fumeux_fume|7 months ago
makeworld|7 months ago
cyanydeez|7 months ago
But neither are sufficient for modern technology beyond pointing to a starting point.
jjice|7 months ago
badsectoracula|7 months ago
(reason: trying to cross-reference my tons of downloaded games my HDD - for which i only have titles as i never bothered to do any further categorization over the years aside than the place i got them from - with wikipedia articles - assuming they have one - to organize them in genres, some info, etc and after some experimentation it turns out an LLM - specifically a quantized Mistral Small 3.2 - can make some sense of the chaos while being fast enough to run from scripts via a custom llama.cpp program)
zozbot234|7 months ago
You can do this a lot easier with Wikidata queries, and that will also include known video games for which an English Wikipedia article doesn't exist yet.
zuluonezero|7 months ago
twotwotwo|7 months ago
Wikipedia, arXiv dumps, open-source code you download, etc. have code that runs and information that, whatever its flaws, is usually not guessed. It's also cheap to search, and often ready-made for something--FOSS apps are runnable, wiki will introduce or survey a topic, and so on.
LLMs, smaller ones especially, will make stuff up, but can try to take questions that aren't clean keyword searches, and theoretically make some tasks qualitatively easier: one could read through a mountain of raw info for the response to a question, say.
The scenario in the original quote is too ambitious for me to really think about now, but just thinking about coding offline for a spell, I imagine having a better time calling into existing libraries for whatever I can rather than trying to rebuild them, even assuming a good coding assistant. Maybe there's an analogy with non-coding tasks?
A blind spot: I have no real experience with local models; I don't have any hardware that can run 'em well. Just going by public benchmarks like Aider's it appears ones like Qwen3 32B can handle some coding, so figure I should assume there's some use there.
antonkar|7 months ago
Why kids are worse than AI companies and have to bum around?)
horseradish7k|7 months ago
QuadmasterXLII|7 months ago
omneity|7 months ago
There are 341 languages in there and 205GB of data, with English alone making up 24GB! As for Simple English Wikipedia (from the OP), my perspective is that it's decent, but the content tends to be shallow and imprecise.
0: https://omarkama.li/blog/wikipedia-monthly-fresh-clean-dumps...
tootyskooty|7 months ago
While this is less obvious to people who primarily consume en.wiki (as most things are well covered in English), for many other languages even well-understood concepts often have poor pages. And even the English wiki has large gaps that are covered in other languages (people and places, mostly).
LLMs get you the union of all of this, in turn viewable through arbitrary language "lenses".
hannofcart|7 months ago
1. The LLM understands the vague query from the human, connects the necessary dots, gives the user an overview, and furnishes them with a list of topic names/local file links to actual Wikipedia articles.
2. The user can then go on to read the precise information from the listed Wikipedia articles directly.
Terr_|7 months ago
vFunct|7 months ago
LLM+Wikipedia RAG
JKCalhoun|7 months ago
Someone posted this recently: https://github.com/philippgille/chromem-go/tree/v0.7.0/examp...
But it is a very simplified RAG, with only the lead paragraph of 200 Wikipedia entries.
I want to learn how to build a RAG over one of the Kiwix drops — "Best of Wikipedia" for example. I suppose an LLM can tell me how, but I'm surprised not to have yet stumbled upon one that someone has already done.
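The retrieval half of such a pipeline doesn't strictly need embeddings; here is a minimal, dependency-free sketch using TF-IDF scoring instead of a neural embedder. Extracting article text from the Kiwix ZIM file is assumed to happen elsewhere (e.g. via libzim, not shown), and the two sample "articles" are made up.

```python
# Minimal TF-IDF retriever: index a dict of {title: text}, return the
# top-k titles for a free-text query. A real RAG would feed the retrieved
# article text into the LLM prompt as context.
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

class TfIdfIndex:
    def __init__(self, docs: dict[str, str]):
        self.docs = {name: Counter(tokenize(body)) for name, body in docs.items()}
        self.df = Counter()          # document frequency per term
        for counts in self.docs.values():
            self.df.update(counts.keys())
        self.n = len(docs)

    def _weight(self, term: str, counts: Counter) -> float:
        if term not in counts:
            return 0.0
        idf = math.log((1 + self.n) / (1 + self.df[term])) + 1
        return counts[term] * idf

    def search(self, query: str, k: int = 3) -> list[str]:
        q = Counter(tokenize(query))
        scores = {}
        for name, counts in self.docs.items():
            num = sum(self._weight(t, q) * self._weight(t, counts) for t in q)
            norm = math.sqrt(sum(self._weight(t, counts) ** 2 for t in counts)) or 1.0
            scores[name] = num / norm
        return sorted(scores, key=scores.get, reverse=True)[:k]

index = TfIdfIndex({
    "Photosynthesis": "Plants convert light energy into chemical energy ...",
    "Volcano": "A volcano is a rupture in the crust of a planet ...",
})
print(index.search("how do plants turn sunlight into energy", k=1))
# → ['Photosynthesis']
```

Swapping this for a proper embedding model (as the chromem-go example does) mainly improves recall on queries that share no keywords with the article.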
loloquwowndueo|7 months ago
moffkalast|7 months ago
mac-mc|7 months ago
ritzaco|7 months ago
I'm always surprised that when it comes to "how useful are LLMs" the answers are often vibe-based, like "I asked it this and it got it right". Before LLMs, information retrieval and machine learning were at least somewhat rigorous scientific fields where people would build good datasets of questions and measure how well a specific model performed on a specific task.
Now LLMs are definitely more general and can somewhat solve a wider variety of tasks, but I'm surprised we don't have more benchmarks for LLMs vs other methods (there are plenty of LLM vs LLM benchmarks).
Maybe it's just because I'm further removed from academia, and people are doing this and I don't see?
meander_water|7 months ago
On the other hand, with Wikipedia, you can just read and search everything.
Timwi|7 months ago
rlupi|7 months ago
It would be nice to build a local LLM + Wikipedia tool that uses the LLM to assemble a general answer and then searches Wikipedia (via full-text search or RAG) for grounding facts. It could help a lot with the hallucinations of small models.
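One way to sketch that grounding step: split the model's draft into sentences and check each against locally retrieved article text, flagging sentences with no support. Both the LLM call and the Wikipedia full-text search are stubbed below; the support heuristic (shared content words) is deliberately crude and just for illustration.

```python
# Flag sentences in an LLM draft that have no support in local Wikipedia text.
import re

def search_wikipedia(query: str) -> list[str]:
    # Stub: a real version would query a local Kiwix/SQLite full-text index.
    corpus = {
        "Paris": "Paris is the capital and most populous city of France.",
    }
    return [text for text in corpus.values()
            if any(w in text.lower() for w in query.lower().split())]

def ground(draft: str) -> list[tuple[str, bool]]:
    """Return (sentence, supported?) pairs for a model-generated draft."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", draft) if s.strip()]
    results = []
    for s in sentences:
        hits = search_wikipedia(s)
        # crude support test: some retrieved passage contains most of the words
        supported = any(
            sum(w in hit.lower() for w in s.lower().split()) >= len(s.split()) // 2
            for hit in hits
        )
        results.append((s, supported))
    return results

for sentence, ok in ground("Paris is the capital of France. It was founded on Mars."):
    print(("OK " if ok else "UNGROUNDED ") + sentence)
```

A real version would then either show the flagged sentences to the user or feed the retrieved passages back into the model for a corrected answer.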
Tempat1|7 months ago
e.g. at the risk of massively oversimplifying a complex issue: LLMs are bad at maths; couldn't we have them use a calculator?
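The calculator idea is basically tool use, and a toy version fits in a few lines: prompt the model to emit a marker like CALC(expression) instead of doing arithmetic itself, then substitute the computed value. The CALC convention is invented here for illustration; the evaluator uses Python's `ast` module so arbitrary code can't sneak through.

```python
# Replace CALC(...) markers in model output with safely computed values.
import ast
import operator
import re

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr: str) -> float:
    """Evaluate a pure-arithmetic expression; reject anything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

def expand_calc_calls(llm_output: str) -> str:
    """Replace every CALC(...) marker with its computed value."""
    return re.sub(r"CALC\(([^)]*)\)",
                  lambda m: str(safe_eval(m.group(1))), llm_output)

print(expand_calc_calls("The total is CALC(17 * 23) grams."))
# → The total is 391 grams.
```

This is essentially what production "function calling" APIs formalize: the model decides *when* to call the tool, and deterministic code does the part LLMs are bad at.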
beaugunderson|7 months ago
nsypteras|7 months ago
entropie|7 months ago
It's awesome actually. It's reasonably fast with GPU support with gemma3:4b, but I can use bigger models when time is not a factor.
I've actually thought about how crazy that is, especially if there's no internet access for some reason. Not tested yet, but there seems to be an adapter cable to run it directly from a PD power bank. I have to try it.
saddat|7 months ago
dmezzetti|7 months ago
I've built this as a datasource for Retrieval Augmented Generation (RAG) but it certainly can be used standalone.
numpad0|7 months ago
The article contains nonexistent configurations such as "Deepseek-R1 1.5B"; those are that kind of thing.
spankibalt|7 months ago
pinkmuffinere|7 months ago
alisonatwork|7 months ago
The edit history or talk pages certainly provide additional context that in some cases could prove useful, but in terms of bang for the buck I suspect sourcing from different language snapshots would be a more economical choice.
luke-stanley|7 months ago
VladVladikoff|7 months ago
adsharma|7 months ago
arthurcolle|7 months ago
NelsonMinar|7 months ago
Has anyone done an experiment of using RAG to make it easy to query Wikipedia with an LLM?
richardjennings|7 months ago
ineedasername|7 months ago
system_prompt = {
You are CL4P-TR4P, a dangerously confident chat droid
purpose: vibe back society
boot_source: Shankar.vba.grub
training_data: memes
}
wangg|7 months ago
GuB-42|7 months ago
And there are strong ties between LLMs and compression. LLMs work by predicting the next token. The best compression algorithms work by predicting the next token and encoding the difference between the predicted token and the actual token in a space-efficient way. So in a sense, an LLM trained on Wikipedia is kind of a compressed version of Wikipedia.
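The prediction/compression link can be shown with a toy model: a predictor that assigns probability p to the next symbol can, paired with an entropy coder, store that symbol in about -log2(p) bits. Below, a character bigram model "trained" on a text scores that same text far below the naive 8 bits per character (note it's measuring memorization of its own training data, much as an LLM partially memorizes Wikipedia).

```python
# Ideal code length of a text under a character bigram model, vs. 8 bits/char.
import math
from collections import Counter, defaultdict

text = "the cat sat on the mat. the cat sat on the hat. " * 20

# Train: count which character follows which.
follows = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    follows[a][b] += 1

# "Compress": sum the ideal code length -log2 p(next | current) per character.
bits = 0.0
for a, b in zip(text, text[1:]):
    counts = follows[a]
    p = counts[b] / sum(counts.values())
    bits += -math.log2(p)

print(f"naive: {8 * (len(text) - 1)} bits, "
      f"bigram-predicted: {bits:.0f} bits "
      f"({bits / (len(text) - 1):.2f} bits/char)")
```

Replace the bigram table with an LLM's next-token distribution and add an arithmetic coder, and you have the scheme the best text compressors actually use.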
Philpax|7 months ago
IncreasePosts|7 months ago
cosbgn|7 months ago
fho|7 months ago
1. make the (compressed) Wikipedia better searchable as a knowledge base
2. use the LLM as an "interface" to that knowledge base
I investigated 1. back when all of (English, text-only) Wikipedia was about 2 GB. Maybe it's time to look at that toy code base again.
jancsika|7 months ago
That would downgrade the problem of hallucinations into mere irrelevant search results. But irrelevant Wikipedia search results are still a huge improvement over Google SEO AI-slop!
almosthere|7 months ago
marsven_422|7 months ago
[deleted]
s1mplicissimus|7 months ago
haunter|7 months ago