I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.
It's not clear which files you need, and the site itself is (or at least was, when I tried) "shipped" as gigantic SQL scripts that rebuild the database, with so many lines that the SQL servers I tried gave up reading them, requiring another script to split them into chunks.
Then when you finally do have the database, you still don't have a local copy of Wikipedia. You're missing several more files; for example, category information is in a separate dump. You also need wiki software to use the dump and host the site. After a weekend of fucking around with SQL, this is the point where I gave up and just curled the 200 or so pages I was interested in.
I'm pretty sure they want you to "just" download the database dump and go to town, but it's such a pain in the ass that I can see why someone else would just crawl it.
> I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.
More recently they started putting the data up on Kaggle in a format that's supposed to be easier to ingest.
I think there are engineers working for crawler companies who are paid well enough to figure out how to do this without kneecapping the most well-known noncommercial projects still surviving on the capitalized internet.
a. unpack the first file
b. use the second file to locate specific articles within the first; each index line is `offset:page_id:page_title`, giving the byte offset of the bz2 stream that contains the page
c. use a streaming decoder to process the entire Wiki without ever decompressing it wholly
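The lookup in (b) and (c) can be sketched in a few lines of Python. Filenames here are hypothetical (real dumps are dated), and this assumes the multistream index's `offset:page_id:title` format:

```python
import bz2

# Hypothetical filenames; actual dumps are dated, e.g.
# enwiki-20240601-pages-articles-multistream.xml.bz2 plus its index.
DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
INDEX = "enwiki-latest-pages-articles-multistream-index.txt"

def find_offset(title):
    """Scan the index; each line is 'offset:page_id:title'.
    split(':', 2) keeps titles containing colons intact."""
    with open(INDEX, encoding="utf-8") as f:
        for line in f:
            offset, _, t = line.rstrip("\n").split(":", 2)
            if t == title:
                return int(offset)
    return None

def read_stream(offset):
    """Decompress the single bz2 stream starting at the given byte offset.
    Each stream holds a batch of pages; the wanted page is among them."""
    with open(DUMP, "rb") as f:
        f.seek(offset)
        decomp = bz2.BZ2Decompressor()
        out = bytearray()
        while not decomp.eof:  # stops at the end of this one stream
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            out += decomp.decompress(chunk)
        return out.decode("utf-8")
```

`BZ2Decompressor` stops at the end of the one stream it was seeked to, which is what makes the random access cheap.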
What they need to do is have 'major edits' push out an updated static rendered file, the way old-school publishing pipelines would. Then either host those somewhere as-is, or also in a compressed format (e.g. a compressed weekly snapshot retained for a year?).
Also make a CNAME from bots.wikipedia.org to that site.
This probably is about on-demand search, not about gathering training data.
Crawling is more general + you get to consume it in its reconstituted form instead of deriving it yourself.
Hooking up a data dump for special-cased websites is much more complicated than letting LLM bots do a generalized on-demand web search.
Just think of how that logic would work. LLM wants to do a web search to answer your question. Some Wikimedia site is the top candidate. Instead of just going to the site, it uses this special code path that knows how to use https://{site}/{path} to figure out where {path} is in {site}'s data dump.
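The special code path the comment imagines might look something like this sketch; every name here is hypothetical, and the point is how much site-specific plumbing it adds over a plain fetch:

```python
from urllib.parse import urlparse

# Sites for which a local dump has been ingested (hypothetical).
DUMP_BACKED = {"en.wikipedia.org"}

def fetch(url, local_lookup, http_get):
    """Resolve a URL against a local dump when possible,
    falling back to a generic live fetch otherwise."""
    parsed = urlparse(url)
    if parsed.netloc in DUMP_BACKED:
        # e.g. /wiki/Albert_Einstein -> article title in the dump
        title = parsed.path.removeprefix("/wiki/").replace("_", " ")
        hit = local_lookup(title)
        if hit is not None:
            return hit
    return http_get(url)  # generic path: just crawl the live site
```

And that's before handling redirects, non-article namespaces, other languages, or dump staleness, which is the commenter's point.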
> This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.
Sounds like the problem is not the crawling itself but downloading multimedia files.
The article also explains that these requests are much more likely to request resources that aren't cached, so they generate more expensive traffic.
I need to work with the dump to extract geographic information. Most mirrors are not functioning, take weeks to catch up, block you, or only mirror English Wikipedia. Every other month I find a workaround. It's not easy to work with the full dumps, but I guess/hope it's easier than crawling the Wikipedia website itself.
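For what it's worth, a minimal sketch of that kind of extraction, assuming decimal-degree `{{coord}}` templates in the wikitext (the real template has many more variants than this handles):

```python
import re

# Matches simple {{coord|...}} templates; nested templates and named
# deg/min/sec forms are deliberately out of scope for this sketch.
COORD = re.compile(r"\{\{coord\|([^{}]*)\}\}", re.IGNORECASE)

def _is_num(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def extract_coords(wikitext):
    """Return (lat, lon) pairs for templates whose first two fields
    parse as decimal degrees; skip deg/min/sec forms entirely."""
    found = []
    for m in COORD.finditer(wikitext):
        parts = [p.strip() for p in m.group(1).split("|")]
        if len(parts) >= 2 and _is_num(parts[0]) and _is_num(parts[1]):
            if len(parts) >= 3 and _is_num(parts[2]):
                continue  # third numeric field implies deg/min/sec
            found.append((float(parts[0]), float(parts[1])))
    return found
```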
I don't see an obvious option to download all images from Wikimedia Commons. As the post clearly indicates, the text is not the issue here; it's the images.
It seems like the Wikimedia Foundation has always been protective of the image downloads. So many drunken midnight scripters or new urban undergrad CEOs discover that they can download cool images fairly quickly. AFAIK there has always been some kind of text corpus available in bulk, because that is part of the mission of Wikipedia. But the image gallery is big on disk and big on bandwidth compared to text, and a low-hanging target for the uninformed, the greedy, etc.
StableAlkyne|10 months ago
jsheard|10 months ago
https://enterprise.wikimedia.com/blog/kaggle-dataset/
GuinansEyebrows|10 months ago
neets|10 months ago
https://dumps.wikimedia.org/kiwix/zim/wikipedia/
Philpax|10 months ago
1. Go to https://dumps.wikimedia.org/enwiki/latest/ (or a date of your choice in /enwiki)
2. Download https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page... and https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.... The first file is a bz2-multistream-compressed dump of an XML file containing all of English Wikipedia's text, while the second is an index that makes it easier to find specific articles.
3. You can either:
4. Once you have the XML, getting at the actual text isn't too difficult; you should use a streaming XML decoder to avoid as much allocation as possible when processing this much data. The XML contains pages like this:
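A sketch of that streaming extraction with Python's `xml.etree.ElementTree.iterparse`; the page structure shown in the comment is an abridged approximation of the dump schema, not an exact copy:

```python
import xml.etree.ElementTree as ET

# A <page> element in the dump looks roughly like (abridged):
#   <page>
#     <title>Albert Einstein</title>
#     <ns>0</ns>
#     <id>736</id>
#     <revision>
#       <timestamp>...</timestamp>
#       <text>'''Albert Einstein''' was a theoretical physicist ...</text>
#     </revision>
#   </page>

def iter_texts(xml_stream):
    """Stream (title, wikitext) pairs, clearing each <page> once emitted
    so memory stays flat across a multi-gigabyte dump.
    For the compressed dump, pass e.g. bz2.open(path, "rb")."""
    title = None
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the MediaWiki namespace
        if tag == "title":
            title = elem.text
        elif tag == "page":
            text = None
            for child in elem.iter():
                if child.tag.rsplit("}", 1)[-1] == "text":
                    text = child.text
                    break
            yield title, text
            elem.clear()  # the crucial bit: free the finished subtree
```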
so all you need to do is get at the `text`.
mjevans|10 months ago
hombre_fatal|10 months ago
black_puppydog|10 months ago
unknown|10 months ago
[deleted]
DarkWiiPlayer|10 months ago
mtmail|10 months ago
Ekaros|10 months ago
bombcar|10 months ago
qudat|10 months ago
cubefox|10 months ago
mistrial9|10 months ago