Not the inverse, but for any SPA (not framework or library) developers seeing this, it's probably worth noting that this is no better than using document.write, window.open, and similar APIs.
But could be very interesting for use cases where the main logic lives on the server and people try to manually implement some download- and/or lazy-loading logic.
Still probably bad unless you're explicitly working on init and redirect scripts.
I guess the next question will be if it does work in environments that let you share a single file, will they disable this ability once they find out people are using it.
PHP has a similar feature called __halt_compiler() which I've used for a similar purpose. Or sometimes just to put documentation at the end of a file without needing a comment block.
I was on board until I saw that those can't easily be opened from a local file. Seems like local access is one of the main use cases for archival formats.
HTML is already a good single-file format. Images can be inlined with data URIs. CSS and JavaScript have been inlineable since the very beginning. What more is needed? Fonts? Data URIs, once more.
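For concreteness, the data-URI inlining described above is mechanical; a minimal Python sketch (the image bytes here are an arbitrary stand-in, not a real GIF):

```python
import base64

def to_data_uri(data: bytes, mime: str) -> str:
    """Encode raw bytes as a data: URI suitable for inlining into HTML."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"

# Stand-in bytes; in practice this would be the raw contents of an image file.
img = b"GIF89a-stand-in-bytes"
tag = f'<img src="{to_data_uri(img, "image/gif")}">'
print(tag)
```

The same encoding works for fonts inside a CSS @font-face src, which is why a single .html file can carry everything.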
Hell, HTML is probably what word processor apps should be saving everything as. You can get pixel-level placement of any element if you want that.
Agreed, I was thinking it's like asm.js where it can "backdoor pilot" [1] an interesting use case into the browser by making it already supported by default.
But not being able to "just" load the file into a browser locally seems to defeat a lot of the point.
It sounds like it would be pretty easy to write a super simple app with a browser in it that you could associate with the file type to spin these up. IMO.
I mean `claude -p "spin up a python webserver in this directory please"` or alternately `python -m http.server 8080 --bind 127.0.0.1 --directory .` is not hard
In case the author is reading: please consider adding official fields for an optional Base64-encoded screenshot of the page, and for an optional description. It would also help to have an official field for the ISO timestamp of when the archival took place.
As final wish list, would be great to have multiple versions/crawls of the same URL with deduplication of static assets (images, fonts) but this is likely stretching too much for this format.
Allowing more metadata might be useful. You can add anything to the manifest at build time, as assets are not required to be loaded or ever used (because this is impossible to statically check). I suppose we'd have to define an official prefix like 'gwtar-metadata-*', with something like a 'gwtar-metadata-screenshot' and 'gwtar-metadata-description'... Not obvious what the best way forward is there; you don't want to add a whole bunch of ad hoc metadata fields, as everyone will have a different one they want. Exif...?
Multiple versions or multiple pages (maybe they can be the same thing?) would be nice but also unclear how to make that. An iframe wrapper?
I considered and rejected deduplication and compression. Those can be done by the filesystem/server transparent to the format. (If there's an image file duplicated across multiple pages, then it should be trivial for any filesystem or server to detect or compress those away.)
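To illustrate the prefix idea mooted above, a sketch of what such a manifest entry might look like; note the 'gwtar-metadata-*' keys are purely hypothetical here, not part of any published Gwtar spec:

```python
import json
from datetime import datetime, timezone

# Hypothetical manifest entry: the 'gwtar-metadata-*' keys are the
# prefix idea floated in the thread, not an actual part of the format.
entry = {
    "url": "https://example.com/page",
    "gwtar-metadata-archived-at": datetime(2026, 2, 15, tzinfo=timezone.utc).isoformat(),
    "gwtar-metadata-description": "Example snapshot",
    "gwtar-metadata-screenshot": "data:image/png;base64,...",  # placeholder, not real image data
}
print(json.dumps(entry, indent=2))
```

Since the loader ignores unknown manifest keys by construction, fields like these could be added without breaking existing readers.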
Very cool idea. I think single-file HTML web apps are the most durable form of computer software. A few examples of Single-File Web Apps that I wrote are: https://fuzzygraph.com and https://hypervault.github.io/.
The author dismisses WARC, but I don't see why. To me, Gwtar seems more complicated than a WARC, while being less flexible and while also being yet another new format thrown onto the pile.
WARC is mentioned, with a very specific reason it's not good enough: "WARCs/WACZs achieve static and efficient, but not single (because while the WARC is a single file, it relies on a complex software installation like WebRecorder/Replay Webpage to display)."
I would like to know why ZIP/HTML polyglot format produced by SingleFile [1] and mentioned in the article "achieve static, single, but not efficiency". What's not efficient compared to the gwtar format?
'efficiency' is downloading only the assets needed to render the current view. How does it implement range requests and avoid downloading the entire SingleFileZ when a web browser requests the URL?
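For concreteness, a small Python sketch of the request a viewer would issue for one asset, assuming a pack-time manifest that records each asset's byte offset and length (the header/parsing shapes follow RFC 7233; the function names are made up for illustration):

```python
def range_header(start: int, length: int) -> dict:
    """Build an HTTP Range header asking for `length` bytes beginning at `start`."""
    return {"Range": f"bytes={start}-{start + length - 1}"}  # end offset is inclusive

def parse_content_range(value: str):
    """Parse 'bytes start-end/total' from a 206 Partial Content response."""
    _unit, _, rest = value.partition(" ")
    span, _, total = rest.partition("/")
    start, _, end = span.partition("-")
    return int(start), int(end), int(total)

# One asset at offset 123456, 2048 bytes long, per the manifest:
print(range_header(123456, 2048))
print(parse_content_range("bytes 123456-125503/9999999"))
```

A format with offsets known up front can do this in one round trip per asset; a ZIP needs to locate its central directory at the end of the file first.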
> Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?
As far as I know, we do not have any hash verification beyond that built into TCP/IP or HTTPS etc. I included SHA hashes just to be safe and forward compatible, but they are not checked.
There's something of a question here of what hashes are buying you here and what the threat model is. In terms of archiving, we're often dealing with half-broken web pages (any of whose contents may themselves be broken) which may have gone through a chain of a dozen owners, where we have no possible web of trust to the original creator, assuming there is even one in any meaningful sense, and where our major failure modes tend to be total file loss or partial corruption somewhere during storage. A random JPG flipping a bit during the HTTPS range request download from the most recent server is in many ways the least of our problems in terms of availability and integrity.
This is why I spent a lot more time thinking about how to build FEC in, like with appending PAR2. I'm vastly more concerned about files being corrupted during storage or the chain of transmission or damaged by a server rewriting stuff, and how to recover from that instead of simply saying 'at least one bit changed somewhere along the way; good luck!'. If your connection is flaky and a JPEG doesn't look right, refresh the page. If the only Gwtar of a page that disappeared 20 years ago is missing half a file because a disk sector went bad in a hobbyist's PC 3 mirrors ago, you're SOL without FEC. (And even if you can find another good mirror... Where's your hash for that?)
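PAR2 uses Reed-Solomon erasure codes, which can repair many damaged blocks; a toy single-parity XOR scheme (RAID-4 style) in Python shows the basic recover-a-lost-block idea, though it is far weaker than what PAR2 actually does:

```python
def xor_parity(blocks) -> bytes:
    """Parity block: byte-wise XOR of equal-sized data blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def recover(surviving_blocks, parity: bytes) -> bytes:
    """Rebuild a single missing block from the survivors plus the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])

data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_parity(data)
# Lose the middle block entirely; XOR of the rest with parity restores it.
print(recover([b"aaaa", b"cccc"], parity))
```

The appeal of appending such recovery data to the archive itself is that the repair capability travels with the file through every mirror.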
> Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?
No idea. It sounds like you know more about them than I do. What threat do they protect against, exactly?
I’ve thought about doing something similar, but at the Service Worker layer so the page stays the same and all HTTP requests are intercepted.
Similar to the window.stop() approach, requests would truncate the main HTML file while the rest of that request would be the assets blob that the service worker would then serve up.
The service worker file could be a dataURI to keep this in one file.
So this is like SingleFileZ in that it's a single static inefficient HTML archive, but it can easily be viewed locally as well?
How does it bypass the security restrictions which break SingleFileZ/Gwtar in local viewing mode? It's complex enough I'm not following where the trick is and you only mention single-origin with regard to a minor detail (forms).
Interesting, but I'm kind of confused why you'd need lazy loads for a local file? Like, how big are these files expected to be? (Or is the lazy loading just to support lazy loading it's already doing?)
I believe the idea is that it's not local. It's a very large file on an HTTP server (required for range requests) and you don't want to download the whole thing over the network.
Of course, since it's on an HTTP server, it could easily handle doing multiple requests of different files, but sometimes that's inconvenient to manage on the server and a single file would be easier.
Maybe this is downstream of Gwern choosing to use MediaWiki for his website?
Hmm, I'm interested in this. Especially since it applies no compression, delta encoding might be feasible for daily scans of the data. But for whatever reason my Brave mobile on iOS displays a blank page for the example page. Hmm, perhaps it's a mobile rendering issue, because Chrome and Safari on iOS can't do it either: https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...
Hmm, so this is essentially the appimage concept applied to web pages, namely:
- an executable header
- which then fuse mounts an embedded read-only heavily compressed filesystem
- whose contents are delivered when requested (the entire dwarf/squashfs isn't uncompressed at once)
- allowing you to pack as many of the dependencies as you wish to carry in your archive (so, just like an appimage, any dependency which isn't packed can be found "live")
- and doesn't require any additional, custom infrastructure to run/serve
It also doesn't work on desktop Safari 26.2 (or perhaps it does, but not to the extent intended -- it appears to be trying to download the entire response before any kind of content painting).
It's fairly common for archivers (including archive.org) to inject some extra scripts/headers into archived pages or otherwise modify the content slightly (e.g. fixing up relative links). If this happens, will it mess up the offsets used for range requests?
The range requests are to offsets in the original file, so I would think that most cases of 'live' injection do not necessarily break it. If you download the page and the server injects a bunch of JS into the 'header' on the fly and the header is now 10,000 bytes longer, then it doesn't matter, since all of the ranges and offsets in the original file remain valid: the first JPG is still located starting at offset byte #123,456 in $URL, the second one is located starting at byte #456,789 etc, no matter how much spam got injected into it.
Beyond that, depending on how badly the server is tampering with stuff, of course it could break the Gwtar, but then, that is true of any web page whatsoever (never mind archiving), and why they should be very careful when doing so, and generally shouldn't.
Now you might wonder about 're-archiving': if the IA serves a Gwtar (perhaps archived from Gwern.net), and it injects its header with the metadata and timeline snapshot etc, is this IA Gwtar now broken? If you use a SingleFile-like approach to load it, properly force all references to be static and loaded, and serialize out the final quiescent DOM, then it should not be broken and it should look like you simply archived a normal IA-archived web page. (And then you might turn it back into a Gwtar, just now with a bunch of little additional IA-related snippets.) Also, note that the IA, specifically, does provide endpoints which do not include the wrapper, like APIs or, IIRC, the 'if_/' fragment. (Besides getting a clean copy to mirror, it's useful if you'd like to pop up an IA snapshot in an iframe without the header taking up a lot of space.)
I gave up a long time ago and started using the "Save as..." on browsers again. At the end of the day, I am interested in the actual content and not the look/feel of the page.
I find it easier to just mass delete assets I don't want from the "pageTitle_files/" directory (js, images, google-analytics.js, etc).
I find that 'save as' horribly breaks a lot of web pages. There's no choice these days but to load pages with JS and serialize out the final quiescent DOM. I also spend a lot of time with uBlock Origin and AlwaysKillSticky and NoScript wrangling my archive snapshots into readability.
> Just because it requires "special" zip software on the server?
Yes. A web browser can't just read a .zip file as a web page. (Even if a web browser decided to try to download, and decompress, and open a GUI file browser, you still just get a list of files to click.) Therefore, far from satisfying the trilemma, it just doesn't work.
And if you fix that, you still generally have a choice between either no longer being single-file or efficiency. (You can just serve a split-up HTML from a single ZIP file with some server-side software, which gets you efficiency, but now it's no longer single-file; and vice-versa. Because if it's a ZIP, how does it stop downloading and only download the parts you need?)
Zip stores its central directory at the end of the file. To find what's inside and where each entry starts, you need to read the tail first. That rules out issuing a single Range request to grab one specific asset.
Tar is sequential. Each entry header sits right before its data. If the JSON manifest in the Gwtar preamble says an asset lives at byte offset N with size M, the browser fires one Range request and gets exactly those bytes.
The other problem is decompression. Zip entries are individually deflate-compressed, so you'd need a JS inflate library in the self-extracting header. Tar entries are raw bytes, so the header script just slices at known offsets. No decompression code keeps the preamble small.
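That slicing can be demonstrated with Python's tarfile module, whose TarInfo.offset_data attribute gives exactly the offset a pack-time manifest would record (this is an illustrative sketch, not the real Gwtar packing code):

```python
import io
import tarfile

# Build a small uncompressed tar in memory.
payload = b"x" * 1024
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    info = tarfile.TarInfo(name="asset.jpg")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
raw = buf.getvalue()

# A manifest built at pack time could record offset_data and size for
# each member; slicing the raw bytes at that range recovers the file,
# exactly as a browser Range request against the archive would.
with tarfile.open(fileobj=io.BytesIO(raw)) as tf:
    member = tf.getmember("asset.jpg")
    start, size = member.offset_data, member.size

print(raw[start:start + size] == payload)
```

No inflate code, no central-directory lookup: the offsets alone are enough.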
> The main header JS starts using range requests to first load the real HTML, and then it watches requests for resources; the resources have been rewritten to be deliberately broken 404 errors (requesting from localhost, to avoid polluting any server logs)
What if a web server on localhost happens to handle the request? Why not request from a guaranteed-inaccessible place like http://0.0.0.0/ or http://localhost:0/ (port zero)?
Gwtar seems like a good solution to a problem nobody seemed to want to fix.
However, this website is... something else. It's full of inflated self-importance and overly bountiful prose, and feels like someone never learned to put in the time to write a shorter essay. Even the about page contains a description of the about page.
I don't know if anyone else gets "unemployed megalomaniacal lunatic" vibes, but I sure do.
gwern is a legendary blogger (although blogger feels underselling it… “publisher”?) and has earned the right to self-aggrandize about solving a problem he has a vested interest in. Maybe he’s a megalomaniac and/or unemployed and/or writing too many words but after contributing so much, he has earned it.
simonw|16 days ago
Apparently every important browser has supported it for well over a decade: https://caniuse.com/mdn-api_window_stop
Here's a screenshot illustrating how window.stop() is used - https://gist.github.com/simonw/7bf5912f3520a1a9ad294cd747b85... - everything after <!-- GWTAR END is tar compressed data.
Posted some more notes on my blog: https://simonwillison.net/2026/Feb/15/gwtar/
moritzwarhier|16 days ago
Lerc|15 days ago
I made my own bundler skill that lets me publish artifacts https://claude.ai/public/artifacts/a49d53b6-93ee-4891-b5f1-9... that can be decomposed back into the files, but it is just a compressed base64 chunk at the end.
8n4vidtmkvmk|16 days ago
BobbyTables2|14 days ago
tym0|16 days ago
NoMoreNicksLeft|15 days ago
avaer|16 days ago
[1] https://en.wikipedia.org/wiki/Television_pilot#Backdoor_pilo...
deevus|15 days ago
qingcharles|14 days ago
vessenes|15 days ago
nunobrito|15 days ago
gwern|14 days ago
calebm|16 days ago
zetanor|16 days ago
simonw|16 days ago
obscurette|16 days ago
gildas|15 days ago
[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG
gwern|15 days ago
westurner|16 days ago
Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?
WICG/webpackage: https://github.com/WICG/webpackage#packaging-tools
"Use Cases and Requirements for Web Packages" https://datatracker.ietf.org/doc/html/draft-yasskin-wpack-us...
gwern|16 days ago
pseudosavant|15 days ago
mr_mitm|16 days ago
Works locally, but it does need to decompress everything first thing.
gwern|16 days ago
overgard|15 days ago
skybrian|15 days ago
renewiltord|16 days ago
isr|15 days ago
Neat!
karel-3d|16 days ago
I will try on Chrome tomorrow.
woodruffw|15 days ago
Retr0id|16 days ago
gwern|16 days ago
iainmerrick|15 days ago
malkia|15 days ago
disce-pati|15 days ago
O1111OOO|16 days ago
mikae1|16 days ago
If you really just want the text content you could just save markdown using something like https://addons.mozilla.org/firefox/addon/llmfeeder/.
gwern|15 days ago
TiredOfLife|16 days ago
spankalee|16 days ago
gwern|15 days ago
newzino|16 days ago
bandie91|15 days ago
gwern|14 days ago
tefkah|15 days ago
great job
nullsanity|16 days ago
3rodents|16 days ago
isr|15 days ago
It's almost as if someone charged you $$ for the privilege of reading it, and you now feel scammed, or something?
Perhaps you can request a refund. Would that help?
fluidcruft|16 days ago