Author of the referenced blog (and library) here. This is great!
The full text search engine in SQLite is sadly not really good for this - one reason is that it uses standard B-Trees, another is that it forces storing all token positions if you want BM25 sorting, which is a huge overhead for articles as long as Wikipedia's.
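The position-storage tradeoff is visible in FTS5's `detail` option. A toy sketch (hypothetical two-row corpus; per the SQLite docs, `detail=none` omits the positional data and shrinks the index substantially, at the cost of phrase queries and position-based ranking):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical mini-corpus standing in for Wikipedia articles.
docs = [("SQLite", "SQLite is a small fast SQL database engine"),
        ("Wikipedia", "Wikipedia is a free online encyclopedia")]

# detail=full (the default) stores every token position; detail=none
# stores only doc ids, shrinking the index considerably.
conn.execute("CREATE VIRTUAL TABLE full_idx USING fts5(title, body, detail=full)")
conn.execute("CREATE VIRTUAL TABLE slim_idx USING fts5(title, body, detail=none)")
conn.executemany("INSERT INTO full_idx VALUES (?, ?)", docs)
conn.executemany("INSERT INTO slim_idx VALUES (?, ?)", docs)

# BM25-ordered query against the position-bearing index.
rows = conn.execute(
    "SELECT title FROM full_idx WHERE full_idx MATCH 'database' ORDER BY rank"
).fetchall()
print(rows)  # [('SQLite',)]
```

For Wikipedia-length articles, that per-token position data is exactly the overhead being described here.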
But that doesn't mean full text search isn't possible in a very efficient manner with statically hosted data! I wrote a proof of concept of making the Rust-based tantivy library work in the same way, which has a lot of internal things that can make the index much smaller and more efficient than SQLite's. It's also >10x faster in creating the search index.
I sadly haven't gotten around to working on it more and writing an article about it.
Other people are also working on using this stuff to make Sci-hub and LibGen more available by using this in combination with IPFS for distributed and "uncensorable" hosting which is pretty awesome.
Edit: Just realized that the OP demo only searches in article titles, while mine searches in full article contents by default. You can search in only the titles in my demo by querying `title:harry title:potter`
How does it do for performance if you throw the whole 43 GB into RAM? There are plenty of very affordable, gently used workstation systems out there today with 128 GB in them.
Yeah, I was just thinking that. Why not just a static filesystem? Just mount an ext4 image full of .html files and browse away. It would be remarkably efficient, likely much more efficient than SQLite.
I have been thinking of using FTS5 with SQLite to search emails. Not close to 43 GB, of course, so I would probably not have any major performance problems, but still: is FTS5 any good? Or should I look into other solutions?
Shocked to find out that this is very much a static website, it's "merely" downloading a small portion of the 43GB SQLite file in question with HTTP Range requests, and then it uses a WASM-compiled copy of SQLite to query the requested data. Very impressive.
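The underlying primitive is simple enough to sketch (the function is illustrative; a real host has to answer with 206 Partial Content for this to work):

```python
import urllib.request

def fetch_range(url: str, start: int, length: int) -> bytes:
    """Fetch `length` bytes at offset `start` via an HTTP Range request,
    the same primitive a range-request VFS uses to read database pages."""
    end = start + length - 1
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        assert resp.status == 206  # Partial Content: server honored the range
        return resp.read()

# Semantically, the response is just a byte slice of the remote file:
db_bytes = b"SQLite format 3\x00" + b"\x00" * 100  # fake SQLite file header
print(db_bytes[0:16])  # b'SQLite format 3\x00'
```

The WASM build then plugs a reader like this in as SQLite's VFS, so each B-tree page read becomes one (cached) range request.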
I plowed the 40 GiB database file into sqlite_zstd_vfs [1] which reduced it 75% to 10 GiB. This plug-in also supports HTTP access [2] in the spirit of phiresky's, however, I don't have a WebAssembly build of mine yet, so it's a library for desktop/command-line apps for now. You can try it out on Linux or macOS x86-64:
pip3 install genomicsqlite
genomicsqlite https://f000.backblazeb2.com/file/mlin-public/static.wiki/en.zstd.db "select text from wiki_articles where title = 'SQLite'"
("genomicsqlite" is the CLI for my Genomics Extension [3], which is built around these Zstandard compression & web layers.)
If you want to download the whole en.zstd.db, then please kindly get it from zenodo (which doesn't support HTTP range requests, but is free): https://zenodo.org/record/5149677
Great work, but why not compress with deflate? Since you're serving HTTP requests, you could then copy the database content directly to the wire as gzip-encoded responses.
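The idea, roughly (toy payload; real articles would be rows in the database):

```python
import gzip

article = b"SQLite is a self-contained, serverless SQL database engine. " * 50
blob = gzip.compress(article)  # compress once, at build/import time

# Serve `blob` verbatim with the header `Content-Encoding: gzip`; the
# browser inflates it, so the server never runs a compressor per request.
assert gzip.decompress(blob) == article
print(len(article) > len(blob))  # True
```

One wrinkle: HTTP range requests then address offsets within the compressed representation, so this fits whole-object responses more naturally than the page-sliced database file this demo relies on.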
I'm nervous on your behalf for your S3 bill, have you done napkin maths for how much this will cost to host, especially given the relative inefficiency of the autocomplete queries?
A trick that might help: create a separate SQLite table containing just the article titles to autocomplete against. That might give you queries that use smaller ranges of the file.
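A sketch of that trick with a toy titles table (table and column names are hypothetical): a prefix scan over the primary-key B-tree touches one small contiguous key range, so the HTTP VFS only has to fetch a few adjacent pages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_titles (title TEXT PRIMARY KEY) WITHOUT ROWID")
conn.executemany("INSERT INTO article_titles VALUES (?)",
                 [("Harry Potter",), ("Harry S. Truman",), ("SQLite",)])

# Half-open range [prefix, prefix + U+FFFF) stays inside one contiguous
# stretch of the primary-key B-tree.
rows = conn.execute(
    "SELECT title FROM article_titles WHERE title >= ? AND title < ? ORDER BY title",
    ("Harry", "Harry\uffff"),
).fetchall()
print(rows)  # [('Harry Potter',), ('Harry S. Truman',)]
```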
I built a Wikipedia reader on the iPad years ago. I used two SQLite databases: one for the article titles, using SQLite FTS. I then compressed the articles themselves and merged the compressed chunks into a set of files under 2 GB; the title database held the offset into the compressed blobs. Including only the most popular pages, it came to just under 10 GB.
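That scheme is easy to sketch (toy articles, zlib for the compression; the original may well have used something else):

```python
import sqlite3
import zlib

# Build the "blob store": concatenated zlib-compressed articles, with
# (title, offset, length) recorded in a small SQLite index.
articles = {"Earth": b"Earth is the third planet...",
            "Moon": b"The Moon is Earth's satellite..."}
store = bytearray()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idx (title TEXT PRIMARY KEY, off INTEGER, len INTEGER)")
for title, text in articles.items():
    blob = zlib.compress(text)
    conn.execute("INSERT INTO idx VALUES (?, ?, ?)", (title, len(store), len(blob)))
    store += blob

def read_article(title: str) -> bytes:
    # One index lookup, then one seek into the blob file.
    off, n = conn.execute("SELECT off, len FROM idx WHERE title = ?", (title,)).fetchone()
    return zlib.decompress(bytes(store[off:off + n]))

print(read_article("Moon"))  # b"The Moon is Earth's satellite..."
```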
Great catch, simonw (and everyone else in this thread).
I've reviewed some other static hosts and settled on wasabi for now. When my S3 bill rolls in, I'll reply to this comment with the damage. A scary number will help others avoid my oversight.
Seems to be hosted at Surge.sh, which is hosted at DigitalOcean, so this doesn't seem to be OP's problem.
$ nslookup static.wiki
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
static.wiki canonical name = na-west1.surge.sh.
Name: na-west1.surge.sh
Address: 188.166.132.94
$ whois 188.166.132.94
% This is the RIPE Database query service.
% The objects are in RPSL format.
%
% The RIPE Database is subject to Terms and Conditions.
% See http://www.ripe.net/db/support/db-terms-conditions.pdf
% Note: this output has been filtered.
% To receive output for a database update, use the "-B" flag.
% Information related to '188.166.132.0 - 188.166.135.255'
% Abuse contact for '188.166.132.0 - 188.166.135.255' is '[email protected]'
inetnum: 188.166.132.0 - 188.166.135.255
netname: DIGITALOCEAN
country: NL
admin-c: PT7353-RIPE
tech-c: PT7353-RIPE
status: ASSIGNED PA
mnt-by: digitalocean
created: 2019-04-17T14:02:52Z
last-modified: 2019-04-17T14:02:52Z
source: RIPE
person: Network Operations
address: 101 Ave of the Americas, 10th Floor
address: New York, NY, 10013
address: United States of America
phone: +13478756044
nic-hdl: PT7353-RIPE
mnt-by: digitalocean
created: 2015-03-11T16:37:07Z
last-modified: 2019-04-17T14:37:51Z
source: RIPE # Filtered
org: ORG-DOI2-RIPE
% This query was served by the RIPE Database Query Service version 1.101 (HEREFORD)
But yes, it is wise to run storage/transfer-heavy websites on something that doesn't charge for traffic, the way AWS and most other infrastructure providers do.
Why do people even use S3 if that’s a risk? I really don’t get this, do most just not worry about accidentally wasting money or do they just not understand this risk?
I've been using static copies of Wikipedia for years; they're great to have when traveling. I mostly use the Kiwix app. Their ZIM format database for all English is 83 GB; 43 GB for one without pictures. Compares nicely to the 43 GB here.
I once spent an hour in the Louvre learning about Rubens' astonishing paintings of Marie de' Medici, thanks to Kiwix. Without it I was just like, "huh, nice paintings, I wonder what they are?" They're incredible, and I owe that moment of kismet to Wikipedia, Kiwix, and my mobile phone. https://en.wikipedia.org/wiki/Marie_de%27_Medici_cycle
Does it actually work? I tried, and it would always give up partway through the downloads. (And most of the time when I ask in these threads, it turns out the person recommending it doesn't actually use it; they just think it's a cool idea...)
You should compress the individual article texts with zlib. Right now, looking up the text of a decent-sized article ("Virus") takes 10 round trips: 5 appear to be B-tree traversals, but another 5 (!) are to read in the page content, as chunks of 8K, 8K, 16K, 32K, and 64K. Compressing the article might cut one round trip off that.
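Rough numbers for the zlib suggestion (toy repetitive text, so the ratio here is better than real prose would get):

```python
import zlib

# "Virus"-sized toy text; repetitive, so it compresses unusually well.
text = ("The virus replicates inside the living cells of an organism. " * 200).encode()
packed = zlib.compress(text, level=9)
assert zlib.decompress(packed) == text
print(len(text), len(packed))  # compressed size is a small fraction of the original
```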
Increasing the sqlite page size to a much larger size like 64KB or 128KB may improve average latency too, since fewer btree pages would need to be fetched to find the offset of the desired row, and most page contents could be read with one fetch. This matters more when fetches over the network take ~100ms.
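For reference, the page size has to be set before the database file is created (or applied afterwards with VACUUM):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Must run before the first table is created; on an existing database,
# set it and then VACUUM to rebuild with the new page size.
conn.execute("PRAGMA page_size = 65536")
conn.execute("CREATE TABLE wiki_articles (title TEXT PRIMARY KEY, text TEXT)")
print(conn.execute("PRAGMA page_size").fetchone()[0])  # 65536
```

Note that SQLite caps the page size at 64 KiB, so 128 KB isn't actually an option.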
I installed a reader with a Wikipedia dump (without multimedia) on the original iPhone. At the time (at least in my region) WiFi was quite rare and it really did feel like the Hitchhiker's guide and incredibly powerful.
I only kept it for a few weeks though since it consumed almost all of the space. (And I imagine it was probably my language mutation, the english Wikipedia must have been too big already.)
I'm not trying to kill the fun here, but since this thing relies on a working internet connection, it would probably be pretty useless on a remote island.
I remember having the iPhone 2G back in 2007, and that someone created a way to store a 5GB wikipedia dump (the complete wikipedia at that time, but without photos) on it, so you could have the whole Wikipedia in your pocket. At the time it felt pretty amazing.
Beautiful. I want this, and yet I have problems downloading even a 350 MB file on my carrier, tethered via two different devices. Being able to download something like this should be a requirement to be called an Internet carrier. Non-mobile is not an option for some, e.g. the low-income, unemployed, and homeless. Maybe they've been a software engineer for 20 years and can reach unreachable markets. (No doubt it's more profitable to spy on and steal from the vulnerable, if God allows.)
What if a customer needs their records and cannot afford to get them, and they lose their life or suffer significantly? Who pays the balance?
They didn't need to use the Internet? Okay, well, they did use it, and they do during pandemic times.
There's a bug when the target article is titled the same as the disambiguation page. Search for "Earthbound", and try clicking on the article for the video game. It'll bring you right back to the disambiguation page.
The autocomplete trickiness is fun. Looks (from links in GitHub) like it's using full-text search and currently needs to scan a lot to rank results. Doing just prefix search might be a cheap half-solution to autocomplete, and maybe layering on hacks around stopwords etc. can get you a bit further.
For another angle, enwiki's full compressed list of titles (enwiki-20210720-all-titles-in-ns0.gz) is 86MB. Grabbing the whole dang thing (in the background?) and doing something with it client-side is a different sort of approach from what this page is going for but not unthinkable.
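Once the title list is client-side, prefix autocomplete is just two binary searches over the sorted list (the all-titles dump is already sorted); a sketch with toy data:

```python
import bisect

# Assume the decompressed title list is held in memory, sorted.
titles = sorted(["Hacker News", "Harry Potter", "Harry S. Truman", "Haskell", "SQLite"])

def autocomplete(prefix: str, limit: int = 10) -> list[str]:
    # Two binary searches bound the half-open range [prefix, prefix + U+FFFF).
    lo = bisect.bisect_left(titles, prefix)
    hi = bisect.bisect_left(titles, prefix + "\uffff")
    return titles[lo:hi][:limit]

print(autocomplete("Har"))  # ['Harry Potter', 'Harry S. Truman']
```

No server round trips at all after the initial download, which is the appeal of this approach.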
Each time I post a link to Hacker News that becomes popular, it ends up being at least $100 in bandwidth bills from Google Cloud, and I've got a simple, tiny, unbloated website. This poor guy is going to go bankrupt. The Internet dark age is like Ma Bell all over again, when we'd get those $500 long-distance bills for dialing into BBSs.
Hi, I am just commenting to mention a bug. When I look at the page for The Smiths' self-titled album (http://static.wiki/en/The_Smiths_(album)), the track listing section doesn't populate. Other than that, I love the simplicity and cleanness of the design.
phiresky | 4 years ago:
Here's the demo also for Wikipedia: https://demo.phiresky.xyz/tmp-ytccrzsovkcjoylr/dist/index.ht... I'm not sure if it's more efficient than the SQLite version in this form, but it definitely has more upward potential and is more fun to work with.
And the corresponding draft PR: https://github.com/tantivy-search/tantivy/pull/1067
dnamlin | 4 years ago:
[1] https://github.com/mlin/sqlite_zstd_vfs
[2] https://github.com/mlin/sqlite_web_vfs
[3] https://mlin.github.io/GenomicSQLite
EDIT: I expanded on this comment in this gist: https://gist.github.com/mlin/ee20d7c5156baf9b12518961f36590c...
segfall | 4 years ago:
Some background: https://github.com/segfall/static-wiki
The datasets I generated: https://www.kaggle.com/segfall/markdownlike-wikipedia-dumps-...
The original author of this approach: https://news.ycombinator.com/item?id=28013514
segfall | 4 years ago:
http://static.wiki/en/United_States
http://static.wiki/zh/美国
http://static.wiki/fr/United_States
http://static.wiki/enwikiquote/United_States
http://static.wiki/simple/United_States
ignoramous | 4 years ago:
Another one: build a directed-acyclic-word-graph [0][1] of just the page-titles and download it to browser's localStorage for prefix-based searches.
In our usage, ~5M entries take ~18MB. Wikipedia has ~6.3M articles [2].
[0] https://github.com/smhanov/dawg
[1] http://stevehanov.ca/blog/?id=115
[2] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
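A DAWG is essentially a trie with identical suffix subtrees merged. This minimal trie shows only the lookup side (a real DAWG, like the smhanov library above, also does the merging, which is how ~5M entries fit in ~18 MB):

```python
def build_trie(words):
    """Build a nested-dict trie; '$' marks end-of-word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def has_prefix(trie, prefix):
    """True if any stored word starts with `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_trie(["harry potter", "harry s. truman", "haskell"])
print(has_prefix(trie, "harry"))  # True
print(has_prefix(trie, "sql"))    # False
```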
rycomb | 4 years ago:
(A link to it can be found in the GitHub page with Aard dictionaries)
pronoiac | 4 years ago:
(Oddly, it's crashing at launch on my iPad right now)