Author of the referenced blog (and library) here. This is great!
The full text search engine in SQLite is sadly not really good for this - one reason is that it uses standard B-Trees, another is that it forces storing all token positions if you want BM25 sorting, which is a huge overhead for articles as long as Wikipedia's.
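The position-storage tradeoff is visible in FTS5's `detail` option. A toy sketch (hypothetical two-row corpus; per the SQLite docs, `detail=none` omits the positional data and shrinks the index substantially, at the cost of phrase queries and position-based ranking):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical mini-corpus standing in for Wikipedia articles.
docs = [("SQLite", "SQLite is a small fast SQL database engine"),
        ("Wikipedia", "Wikipedia is a free online encyclopedia")]

# detail=full (the default) stores every token position; detail=none
# stores only doc ids, shrinking the index considerably.
conn.execute("CREATE VIRTUAL TABLE full_idx USING fts5(title, body, detail=full)")
conn.execute("CREATE VIRTUAL TABLE slim_idx USING fts5(title, body, detail=none)")
conn.executemany("INSERT INTO full_idx VALUES (?, ?)", docs)
conn.executemany("INSERT INTO slim_idx VALUES (?, ?)", docs)

# BM25-ordered query against the position-bearing index.
rows = conn.execute(
    "SELECT title FROM full_idx WHERE full_idx MATCH 'database' ORDER BY rank"
).fetchall()
print(rows)  # [('SQLite',)]
```

For Wikipedia-length articles, that per-token position data is exactly the overhead being described here.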
But that doesn't mean full text search isn't possible in a very efficient manner with statically hosted data! I wrote a proof of concept of making the Rust-based tantivy library work in the same way, which has a lot of internal things that can make the index much smaller and more efficient than SQLite's. It's also >10x faster in creating the search index.
I sadly haven't gotten around to working on it more and writing an article about it.
Other people are also working on using this stuff to make Sci-hub and LibGen more available by using this in combination with IPFS for distributed and "uncensorable" hosting which is pretty awesome.
Edit: Just realized that the OP demo only searches in article titles, while mine searches in full article contents by default. You can search in only the titles in my demo by querying `title:harry title:potter`
How does it do for performance if you throw the whole 43 GB into RAM? There are plenty of very affordable, gently used workstation systems out there today with 128 GB in them.
Yeah, I was just thinking that. Why not just a static filesystem? Just mount an ext4 image full of .html files and browse away. It would be remarkably efficient, likely much more efficient than SQLite.
I have been thinking of using FTS5 with SQLite to search emails. Not close to 43 GB, of course, so I would probably not have any major performance problems, but still: is FTS5 any good? Or should I look into other solutions?
Shocked to find out that this is very much a static website, it's "merely" downloading a small portion of the 43GB SQLite file in question with HTTP Range requests, and then it uses a WASM-compiled copy of SQLite to query the requested data. Very impressive.
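The underlying primitive is simple enough to sketch (the function is illustrative; a real host has to answer with 206 Partial Content for this to work):

```python
import urllib.request

def fetch_range(url: str, start: int, length: int) -> bytes:
    """Fetch `length` bytes at offset `start` via an HTTP Range request,
    the same primitive a range-request VFS uses to read database pages."""
    end = start + length - 1
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        assert resp.status == 206  # Partial Content: server honored the range
        return resp.read()

# Semantically, the response is just a byte slice of the remote file:
db_bytes = b"SQLite format 3\x00" + b"\x00" * 100  # fake SQLite file header
print(db_bytes[0:16])  # b'SQLite format 3\x00'
```

The WASM build then plugs a reader like this in as SQLite's VFS, so each B-tree page read becomes one (cached) range request.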
I plowed the 40 GiB database file into sqlite_zstd_vfs [1] which reduced it 75% to 10 GiB. This plug-in also supports HTTP access [2] in the spirit of phiresky's, however, I don't have a WebAssembly build of mine yet, so it's a library for desktop/command-line apps for now. You can try it out on Linux or macOS x86-64:
pip3 install genomicsqlite
genomicsqlite https://f000.backblazeb2.com/file/mlin-public/static.wiki/en.zstd.db "select text from wiki_articles where title = 'SQLite'"
("genomicsqlite" is the CLI for my Genomics Extension [3], which is built around these Zstandard compression & web layers.)
If you want to download the whole en.zstd.db, then please kindly get it from zenodo (which doesn't support HTTP range requests, but is free): https://zenodo.org/record/5149677
Great work, but why not compress with deflate? Since you're serving HTTP requests, you could then copy the database content directly to the wire as gzip-encoded responses.
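The idea, roughly (toy payload; real articles would be rows in the database):

```python
import gzip

article = b"SQLite is a self-contained, serverless SQL database engine. " * 50
blob = gzip.compress(article)  # compress once, at build/import time

# Serve `blob` verbatim with the header `Content-Encoding: gzip`; the
# browser inflates it, so the server never runs a compressor per request.
assert gzip.decompress(blob) == article
print(len(article) > len(blob))  # True
```

One wrinkle: HTTP range requests then address offsets within the compressed representation, so this fits whole-object responses more naturally than the page-sliced database file this demo relies on.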
I'm nervous on your behalf for your S3 bill, have you done napkin maths for how much this will cost to host, especially given the relative inefficiency of the autocomplete queries?
A trick that might help: create a separate SQLite table containing just the article titles to autocomplete against. That might give you queries that use smaller ranges of the file.
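A sketch of that trick with a toy titles table (table and column names are hypothetical): a prefix scan over the primary-key B-tree touches one small contiguous key range, so the HTTP VFS only has to fetch a few adjacent pages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_titles (title TEXT PRIMARY KEY) WITHOUT ROWID")
conn.executemany("INSERT INTO article_titles VALUES (?)",
                 [("Harry Potter",), ("Harry S. Truman",), ("SQLite",)])

# Half-open range [prefix, prefix + U+FFFF) stays inside one contiguous
# stretch of the primary-key B-tree.
rows = conn.execute(
    "SELECT title FROM article_titles WHERE title >= ? AND title < ? ORDER BY title",
    ("Harry", "Harry\uffff"),
).fetchall()
print(rows)  # [('Harry Potter',), ('Harry S. Truman',)]
```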
I built a Wikipedia reader on the iPad years ago. I used two SQLite databases: one for the article titles, using SQLite FTS. I then compressed the articles themselves and merged the compressed chunks into a set of files under 2 GB; the title database held the offset into the compressed blobs. Including only the most popular pages, it came to just under 10 GB.
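That scheme is easy to sketch (toy articles, zlib for the compression; the original may well have used something else):

```python
import sqlite3
import zlib

# Build the "blob store": concatenated zlib-compressed articles, with
# (title, offset, length) recorded in a small SQLite index.
articles = {"Earth": b"Earth is the third planet...",
            "Moon": b"The Moon is Earth's satellite..."}
store = bytearray()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idx (title TEXT PRIMARY KEY, off INTEGER, len INTEGER)")
for title, text in articles.items():
    blob = zlib.compress(text)
    conn.execute("INSERT INTO idx VALUES (?, ?, ?)", (title, len(store), len(blob)))
    store += blob

def read_article(title: str) -> bytes:
    # One index lookup, then one seek into the blob file.
    off, n = conn.execute("SELECT off, len FROM idx WHERE title = ?", (title,)).fetchone()
    return zlib.decompress(bytes(store[off:off + n]))

print(read_article("Moon"))  # b"The Moon is Earth's satellite..."
```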
Great catch, simonw (and everyone else in this thread).
I've reviewed some other static hosts and settled on wasabi for now. When my S3 bill rolls in, I'll reply to this comment with the damage. A scary number will help others avoid my oversight.
Seems to be hosted at Surge.sh, which is hosted at DigitalOcean, so this doesn't seem to be OP's problem.
$ nslookup static.wiki
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
static.wiki canonical name = na-west1.surge.sh.
Name: na-west1.surge.sh
Address: 188.166.132.94
$ whois 188.166.132.94
% This is the RIPE Database query service.
% The objects are in RPSL format.
%
% The RIPE Database is subject to Terms and Conditions.
% See http://www.ripe.net/db/support/db-terms-conditions.pdf
% Note: this output has been filtered.
% To receive output for a database update, use the "-B" flag.
% Information related to '188.166.132.0 - 188.166.135.255'
% Abuse contact for '188.166.132.0 - 188.166.135.255' is '[email protected]'
inetnum: 188.166.132.0 - 188.166.135.255
netname: DIGITALOCEAN
country: NL
admin-c: PT7353-RIPE
tech-c: PT7353-RIPE
status: ASSIGNED PA
mnt-by: digitalocean
created: 2019-04-17T14:02:52Z
last-modified: 2019-04-17T14:02:52Z
source: RIPE
person: Network Operations
address: 101 Ave of the Americas, 10th Floor
address: New York, NY, 10013
address: United States of America
phone: +13478756044
nic-hdl: PT7353-RIPE
mnt-by: digitalocean
created: 2015-03-11T16:37:07Z
last-modified: 2019-04-17T14:37:51Z
source: RIPE # Filtered
org: ORG-DOI2-RIPE
% This query was served by the RIPE Database Query Service version 1.101 (HEREFORD)
But yes, it is wise to run storage/transfer-heavy websites on something that doesn't charge for traffic, the way AWS and most other infrastructure providers do.
Why do people even use S3 if that’s a risk? I really don’t get this, do most just not worry about accidentally wasting money or do they just not understand this risk?
I've been using static copies of Wikipedia for years; they're great to have when traveling. I mostly use the Kiwix app. Their ZIM format database for all English is 83 GB; 43 GB for one without pictures. Compares nicely to the 43 GB here.
I once spent an hour in the Louvre learning about Rubens' astonishing paintings of Marie de' Medici, thanks to Kiwix. Without it I was just like, "huh, nice paintings, I wonder what they are?" They're incredible, and I owe that moment of kismet to Wikipedia, Kiwix, and my mobile phone. https://en.wikipedia.org/wiki/Marie_de%27_Medici_cycle
Does it actually work? I tried, and it would always give up partway through the downloads. (And most of the time when I ask in these threads, it turns out the person recommending it doesn't actually use it; they just think it's a cool idea...)
You should compress the individual article texts with zlib. Right now, looking up the text of a decent-sized article ("Virus") takes 10 round trips: 5 appear to be B-tree traversals, but another 5 (!) are to read in the page content, as chunks of 8K, 8K, 16K, 32K, and 64K. Compressing the article might cut one round trip off that.
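Rough numbers for the zlib suggestion (toy repetitive text, so the ratio here is better than real prose would get):

```python
import zlib

# "Virus"-sized toy text; repetitive, so it compresses unusually well.
text = ("The virus replicates inside the living cells of an organism. " * 200).encode()
packed = zlib.compress(text, level=9)
assert zlib.decompress(packed) == text
print(len(text), len(packed))  # compressed size is a small fraction of the original
```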
Increasing the sqlite page size to a much larger size like 64KB or 128KB may improve average latency too, since fewer btree pages would need to be fetched to find the offset of the desired row, and most page contents could be read with one fetch. This matters more when fetches over the network take ~100ms.
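For reference, the page size has to be set before the database file is created (or applied afterwards with VACUUM):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Must run before the first table is created; on an existing database,
# set it and then VACUUM to rebuild with the new page size.
conn.execute("PRAGMA page_size = 65536")
conn.execute("CREATE TABLE wiki_articles (title TEXT PRIMARY KEY, text TEXT)")
print(conn.execute("PRAGMA page_size").fetchone()[0])  # 65536
```

Note that SQLite caps the page size at 64 KiB, so 128 KB isn't actually an option.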
I installed a reader with a Wikipedia dump (without multimedia) on the original iPhone. At the time (at least in my region) WiFi was quite rare and it really did feel like the Hitchhiker's guide and incredibly powerful.
I only kept it for a few weeks though since it consumed almost all of the space. (And I imagine it was probably my language mutation, the english Wikipedia must have been too big already.)
I'm not trying to kill the fun here, but since this thing relies on a working internet connection, it would probably be pretty useless on a remote island.
I remember having the iPhone 2G back in 2007, and that someone created a way to store a 5GB wikipedia dump (the complete wikipedia at that time, but without photos) on it, so you could have the whole Wikipedia in your pocket. At the time it felt pretty amazing.
Beautiful. I want this, and yet I have problems downloading even a 350 MB file on my carrier, tethered via two different devices. Being able to download something like this should be a requirement to be called an Internet carrier. Non-mobile is not an option for some, e.g. the low-income, unemployed, and homeless. Maybe they've been a software engineer for 20 years and can reach unreachable markets. (No doubt it's more profitable to spy on and steal from the vulnerable, if God allows.)
What if a customer needs their records and cannot afford to get them, and they lose their life or suffer significantly? Who pays the balance?
They didn't need to use the Internet? Okay, well, they did use it, and they do during pandemic times.
There's a bug when the target article is titled the same as the disambiguation page. Search for "Earthbound", and try clicking on the article for the video game. It'll bring you right back to the disambiguation page.
The autocomplete trickiness is fun. Looks (from links in GitHub) like it's using full-text search and currently needs to scan a lot to rank results. Doing just prefix search might be a cheap half-solution to autocomplete, and maybe layering on hacks around stopwords etc. can get you a bit further.
For another angle, enwiki's full compressed list of titles (enwiki-20210720-all-titles-in-ns0.gz) is 86MB. Grabbing the whole dang thing (in the background?) and doing something with it client-side is a different sort of approach from what this page is going for but not unthinkable.
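Once the title list is client-side, prefix autocomplete is just two binary searches over the sorted list (the all-titles dump is already sorted); a sketch with toy data:

```python
import bisect

# Assume the decompressed title list is held in memory, sorted.
titles = sorted(["Hacker News", "Harry Potter", "Harry S. Truman", "Haskell", "SQLite"])

def autocomplete(prefix: str, limit: int = 10) -> list[str]:
    # Two binary searches bound the half-open range [prefix, prefix + U+FFFF).
    lo = bisect.bisect_left(titles, prefix)
    hi = bisect.bisect_left(titles, prefix + "\uffff")
    return titles[lo:hi][:limit]

print(autocomplete("Har"))  # ['Harry Potter', 'Harry S. Truman']
```

No server round trips at all after the initial download, which is the appeal of this approach.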
Each time I post a link to Hacker News that becomes popular, it ends up being at least $100 in bandwidth bills from Google Cloud, and I've got a simple, tiny, unbloated website. This poor guy is going to go bankrupt. The Internet dark age is like Ma Bell all over again, when we'd get those $500 long-distance bills for dialing into BBSs.
Hi, I am just commenting to mention a bug. When I look at the page for The Smiths' self-titled album (http://static.wiki/en/The_Smiths_(album)), the track listing section doesn't populate. Other than that, I love the simplicity and cleanness of the design.
phiresky | 4 years ago:
Here's the demo also for Wikipedia: https://demo.phiresky.xyz/tmp-ytccrzsovkcjoylr/dist/index.ht... I'm not sure if it's more efficient than the SQLite version in this form, but it definitely has more upward potential and is more fun to work with.
And the corresponding draft PR: https://github.com/tantivy-search/tantivy/pull/1067
dnamlin | 4 years ago:
[1] https://github.com/mlin/sqlite_zstd_vfs
[2] https://github.com/mlin/sqlite_web_vfs
[3] https://mlin.github.io/GenomicSQLite
EDIT: I expanded on this comment in this gist: https://gist.github.com/mlin/ee20d7c5156baf9b12518961f36590c...
segfall | 4 years ago:
Some background: https://github.com/segfall/static-wiki
The datasets I generated: https://www.kaggle.com/segfall/markdownlike-wikipedia-dumps-...
The original author of this approach: https://news.ycombinator.com/item?id=28013514
segfall | 4 years ago:
http://static.wiki/en/United_States
http://static.wiki/zh/美国
http://static.wiki/fr/United_States
http://static.wiki/enwikiquote/United_States
http://static.wiki/simple/United_States
ignoramous | 4 years ago:
Another one: build a directed-acyclic-word-graph [0][1] of just the page-titles and download it to browser's localStorage for prefix-based searches.
In our usage, ~5M entries take ~18MB. Wikipedia has ~6.3M articles [2].
[0] https://github.com/smhanov/dawg
[1] http://stevehanov.ca/blog/?id=115
[2] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
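A DAWG is essentially a trie with identical suffix subtrees merged. This minimal trie shows only the lookup side (a real DAWG, like the smhanov library above, also does the merging, which is how ~5M entries fit in ~18 MB):

```python
def build_trie(words):
    """Build a nested-dict trie; '$' marks end-of-word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def has_prefix(trie, prefix):
    """True if any stored word starts with `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_trie(["harry potter", "harry s. truman", "haskell"])
print(has_prefix(trie, "harry"))  # True
print(has_prefix(trie, "sql"))    # False
```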
rycomb | 4 years ago:
(A link to it can be found in the GitHub page with Aard dictionaries)
pronoiac | 4 years ago:
(Oddly, it's crashing at launch on my iPad right now)