At the very least, WARC could have been used as the container ("tar") format after the preamble of Gwtar. But even there, given that this format doesn't work without a web server (unlike SingleFile, mentioned in the article), I feel like there's a lot to gain by separating the "viewer" (Gwtar's JavaScript) from the content, such that the viewer can be updated over time without changing the archives.
I certainly could be missing something (I've thought about this problem for all of a few minutes here), but surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwrc.cdx" with little to no loss of convenience, and call it a day?
gwern|14 days ago
And if you choose to require separate files and break single-file, then you have many options.
> surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwrc.cdx"
I'm not familiar with warcviewer.js and Googling isn't showing it. Are you thinking of https://github.com/webrecorder/wabac.js ?
zetanor|14 days ago
To expand on what I have in mind, it'd be a script like Gwtar, except it loads WARCs through URLs to CDX files. Alternatively, it might also load WARC files fully into memory, where an index could be constructed on the fly. In the latter case, that would allow the same viewer to be used with or without a web server. Though, I can imagine that loading archives without a web server was probably out of scope for Gwtar, otherwise something could have been figured out (e.g., putting the tar in a <textarea>'s RCDATA; do browsers support "binary" data in there correctly?).
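For illustration, here is a rough sketch of the "index constructed on the fly" idea, in Python rather than browser JavaScript to keep it short. It assumes an uncompressed WARC already held in memory, ignores per-record gzip, and indexes only `response` records. `build_warc_index` and `make_record` are hypothetical names invented for this example, and the toy records omit fields that a real WARC requires (such as WARC-Record-ID and WARC-Date).

```python
def make_record(uri: str, payload: bytes) -> bytes:
    """Build a simplified WARC record (hypothetical helper, for the demo only)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record block is followed by two CRLFs.
    return head + payload + b"\r\n\r\n"


def build_warc_index(data: bytes) -> dict[str, tuple[int, int]]:
    """Map WARC-Target-URI to (offset, length) of each response record."""
    index: dict[str, tuple[int, int]] = {}
    pos = 0
    while pos < len(data):
        # Skip the CRLFs that separate one record from the next.
        if data.startswith(b"\r\n", pos):
            pos += 2
            continue
        header_end = data.find(b"\r\n\r\n", pos)
        if header_end == -1:
            break  # truncated trailing record; stop indexing
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        if not lines[0].startswith("WARC/"):
            raise ValueError(f"no WARC version line at offset {pos}")
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        # The block starts after the blank line and runs for Content-Length bytes.
        record_end = header_end + 4 + int(headers["content-length"])
        uri = headers.get("warc-target-uri")
        if uri is not None and headers.get("warc-type") == "response":
            index[uri] = (pos, record_end - pos)
        pos = record_end
    return index


archive = make_record("https://example.com/", b"hello") + make_record(
    "https://example.com/a.css", b"body{}"
)
index = build_warc_index(archive)
offset, length = index["https://example.com/a.css"]
record = archive[offset : offset + length]  # one record, sliced out of memory
```

With a web server, the same (offset, length) pair could instead go into an HTTP `Range: bytes=<offset>-<offset+length-1>` request, which is roughly the role a CDX file plays: the index is precomputed so the viewer never has to scan the whole archive.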
While the WARC specs are a mess (sometimes quite ambiguous), I've never had much trouble reading or writing them. As for why WARC, having the option to preserve request/response metadata, as well as having interoperability with anything else in the WARC ecosystem, would be nice. Also, a separate viewer would naturally be updateable without changing the archive files themselves.