I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.
It's not clear which files you need, and the site itself is (or at least was, when I tried) "shipped" as gigantic SQL scripts that rebuild the database, with so many lines that the SQL servers I tried gave up reading them, requiring another script to split them into chunks.
Then when you finally do have the database, you still don't have a local copy of Wikipedia. You're missing several more files; for example, category information is in a separate dump. You also need wiki software to use the dump and host the site. After a weekend of fucking around with SQL, this is the point where I gave up and just curled the 200 or so pages I was interested in.
I'm pretty sure they want you to "just" download the database dump and go to town, but it's such a pain in the ass that I can see why someone else would just crawl it.
> I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.
More recently they started putting the data up on Kaggle in a format that's supposed to be easier to ingest.
I think there are engineers working for crawler companies who are paid well enough to figure out how to do this without kneecapping the most well-known noncommercial projects still surviving on the capitalized internet.
a. unpack the first file
b. use the second file to locate specific articles within the first; each index line is `offset:page_id:page_title`, giving the byte offset of the bz2 stream that contains the page
c. use a streaming decoder to process the entire Wiki without ever decompressing it wholly
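The lookup in (b) and (c) can be sketched in a few lines of Python. Filenames here are hypothetical (real dumps are dated), and this assumes the multistream index's `offset:page_id:title` format:

```python
import bz2

# Hypothetical filenames; actual dumps are dated, e.g.
# enwiki-20240601-pages-articles-multistream.xml.bz2 plus its index.
DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
INDEX = "enwiki-latest-pages-articles-multistream-index.txt"

def find_offset(title):
    """Scan the index; each line is 'offset:page_id:title'.
    split(':', 2) keeps titles containing colons intact."""
    with open(INDEX, encoding="utf-8") as f:
        for line in f:
            offset, _, t = line.rstrip("\n").split(":", 2)
            if t == title:
                return int(offset)
    return None

def read_stream(offset):
    """Decompress the single bz2 stream starting at the given byte offset.
    Each stream holds a batch of pages; the wanted page is among them."""
    with open(DUMP, "rb") as f:
        f.seek(offset)
        decomp = bz2.BZ2Decompressor()
        out = bytearray()
        while not decomp.eof:  # stops at the end of this one stream
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            out += decomp.decompress(chunk)
        return out.decode("utf-8")
```

`BZ2Decompressor` stops at the end of the one stream it was seeked to, which is what makes the random access cheap.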
What they need to do is have 'major edits' push out an updated static rendered file, the way old-school publishing pipelines would. Then either host those somewhere as-is, or also in a compressed format (e.g. a compressed weekly snapshot retained for a year?).
Also make a CNAME from bots.wikipedia.org to that site.
This probably is about on-demand search, not about gathering training data.
Crawling is more general + you get to consume it in its reconstituted form instead of deriving it yourself.
Hooking up a data dump for special-cased websites is much more complicated than letting LLM bots do a generalized on-demand web search.
Just think of how that logic would work. LLM wants to do a web search to answer your question. Some Wikimedia site is the top candidate. Instead of just going to the site, it uses this special code path that knows how to use https://{site}/{path} to figure out where {path} is in {site}'s data dump.
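The special code path the comment imagines might look something like this sketch; every name here is hypothetical, and the point is how much site-specific plumbing it adds over a plain fetch:

```python
from urllib.parse import urlparse

# Sites for which a local dump has been ingested (hypothetical).
DUMP_BACKED = {"en.wikipedia.org"}

def fetch(url, local_lookup, http_get):
    """Resolve a URL against a local dump when possible,
    falling back to a generic live fetch otherwise."""
    parsed = urlparse(url)
    if parsed.netloc in DUMP_BACKED:
        # e.g. /wiki/Albert_Einstein -> article title in the dump
        title = parsed.path.removeprefix("/wiki/").replace("_", " ")
        hit = local_lookup(title)
        if hit is not None:
            return hit
    return http_get(url)  # generic path: just crawl the live site
```

And that's before handling redirects, non-article namespaces, other languages, or dump staleness, which is the commenter's point.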
> This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.
Sounds like the problem is not the crawling itself but downloading multimedia files.
The article also explains that these requests are much more likely to request resources that aren't cached, so they generate more expensive traffic.
I need to work with the dump to extract geographic information. Most mirrors are not functioning, take weeks to catch up, block you, or only mirror English Wikipedia. Every other month I find a workaround. It's not easy to work with the full dumps, but I guess/hope it's easier than crawling the Wikipedia website itself.
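For what it's worth, a minimal sketch of that kind of extraction, assuming decimal-degree `{{coord}}` templates in the wikitext (the real template has many more variants than this handles):

```python
import re

# Matches simple {{coord|...}} templates; nested templates and named
# deg/min/sec forms are deliberately out of scope for this sketch.
COORD = re.compile(r"\{\{coord\|([^{}]*)\}\}", re.IGNORECASE)

def _is_num(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def extract_coords(wikitext):
    """Return (lat, lon) pairs for templates whose first two fields
    parse as decimal degrees; skip deg/min/sec forms entirely."""
    found = []
    for m in COORD.finditer(wikitext):
        parts = [p.strip() for p in m.group(1).split("|")]
        if len(parts) >= 2 and _is_num(parts[0]) and _is_num(parts[1]):
            if len(parts) >= 3 and _is_num(parts[2]):
                continue  # third numeric field implies deg/min/sec
            found.append((float(parts[0]), float(parts[1])))
    return found
```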
I don't see an obvious option to download all images from Wikimedia Commons. As the post clearly indicates, the text is not the issue here; it's the images.
It seems like the Wikimedia Foundation has always been protective of the image downloads. So many drunken midnight scripters or new urban undergrad CEOs discover that they can download cool images fairly quickly. AFAIK there has always been some kind of text corpus available in bulk, because that is part of the mission of Wikipedia. But the image gallery is big on disk and big on bandwidth compared to text, and a low-hanging target for the uninformed, the greedy, etc.
StableAlkyne|10 months ago
jsheard|10 months ago
https://enterprise.wikimedia.com/blog/kaggle-dataset/
GuinansEyebrows|10 months ago
neets|10 months ago
https://dumps.wikimedia.org/kiwix/zim/wikipedia/
Philpax|10 months ago
1. Go to https://dumps.wikimedia.org/enwiki/latest/ (or a date of your choice in /enwiki)
2. Download https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page... and https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.... The first file is a bz2-multistream-compressed dump of an XML file containing all of English Wikipedia's text, while the second is an index that makes it easier to find specific articles.
3. You can either:
4. Once you have the XML, getting at the actual text isn't too difficult; you should use a streaming XML decoder to avoid as much allocation as possible when processing this much data. The XML contains pages like this:
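A sketch of that streaming extraction with Python's `xml.etree.ElementTree.iterparse`; the page structure shown in the comment is an abridged approximation of the dump schema, not an exact copy:

```python
import xml.etree.ElementTree as ET

# A <page> element in the dump looks roughly like (abridged):
#   <page>
#     <title>Albert Einstein</title>
#     <ns>0</ns>
#     <id>736</id>
#     <revision>
#       <timestamp>...</timestamp>
#       <text>'''Albert Einstein''' was a theoretical physicist ...</text>
#     </revision>
#   </page>

def iter_texts(xml_stream):
    """Stream (title, wikitext) pairs, clearing each <page> once emitted
    so memory stays flat across a multi-gigabyte dump.
    For the compressed dump, pass e.g. bz2.open(path, "rb")."""
    title = None
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the MediaWiki namespace
        if tag == "title":
            title = elem.text
        elif tag == "page":
            text = None
            for child in elem.iter():
                if child.tag.rsplit("}", 1)[-1] == "text":
                    text = child.text
                    break
            yield title, text
            elem.clear()  # the crucial bit: free the finished subtree
```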
so all you need to do is get at the `text`.
mjevans|10 months ago
hombre_fatal|10 months ago
black_puppydog|10 months ago
unknown|10 months ago
[deleted]
DarkWiiPlayer|10 months ago
mtmail|10 months ago
Ekaros|10 months ago
bombcar|10 months ago
qudat|10 months ago
cubefox|10 months ago
mistrial9|10 months ago