ikreymer | 3 years ago

A bit late to this thread, but I think WARC is a reasonable format for raw HTTP traffic. We should definitely have better tools to ensure WARC files produced are valid, and that's one of the things we build at Webrecorder.

Unless you're crawling really text-heavy content, most of the WARC data is binary content that doesn't really need to be in a db. However, sqlite or another database as a replacement for CDX is an appealing option, where WARC files can remain static data at rest and derived data (offsets, full-text search) can be put into a db.
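To make the "sqlite instead of CDX" idea concrete, here is a minimal sketch using only Python's standard library. The table layout, column names, and helper functions (`build_index`, `lookup`, `read_record`) are my own assumptions for illustration, not an existing Webrecorder schema. The WARC files stay untouched on disk; only the byte offsets live in the db.

```python
import sqlite3

def build_index(db_path, records):
    """Create a small capture index.

    records: iterable of (url, timestamp, filename, byte_offset, byte_length)
    tuples. This schema is hypothetical, chosen only for illustration.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS captures (
               url TEXT NOT NULL,
               timestamp TEXT NOT NULL,
               filename TEXT NOT NULL,
               byte_offset INTEGER NOT NULL,
               byte_length INTEGER NOT NULL
           )"""
    )
    # Composite index so "latest capture of URL" is a cheap lookup.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS captures_url_ts ON captures (url, timestamp)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO captures VALUES (?, ?, ?, ?, ?)", records)
    return conn

def lookup(conn, url):
    """Return (filename, byte_offset, byte_length) of the newest capture, or None."""
    return conn.execute(
        "SELECT filename, byte_offset, byte_length FROM captures "
        "WHERE url = ? ORDER BY timestamp DESC LIMIT 1",
        (url,),
    ).fetchone()

def read_record(path, byte_offset, byte_length):
    """Read one record's raw bytes straight out of the static WARC file."""
    with open(path, "rb") as f:
        f.seek(byte_offset)
        return f.read(byte_length)
```

Usage would be something like `lookup(conn, "https://example.com/")` followed by `read_record(...)` on the returned filename and offsets. Because timestamps are stored as fixed-width 14-digit strings, sorting them as text also sorts them chronologically.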

We are experimenting with a new format, WACZ, which bundles WARC files into a ZIP, while adding CDXJ and exploring sqlite as an option for full-text search. I agree that it's better to build on solid, existing formats that can be validated, especially when large amounts of data are concerned!
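As a rough illustration of the bundling idea (not the actual WACZ layout; see the spec for that), a ZIP holding WARC files plus a CDXJ-style index could be sketched like this. The `archive/` and `indexes/` paths, the helper names, and the use of the raw URL as the sort key (real tools typically use a canonicalized key) are all assumptions on my part.

```python
import json
import zipfile

def cdxj_line(url, timestamp, filename, offset, length):
    """Build one CDXJ-style line: sort key, timestamp, then a JSON block.

    Simplification: the raw URL is used as the key here.
    """
    fields = {"url": url, "filename": filename, "offset": offset, "length": length}
    return f"{url} {timestamp} {json.dumps(fields)}"

def bundle(out, warcs, index_lines):
    """Write WARC payloads and a sorted index into one ZIP.

    out: a path or writable binary file object.
    warcs: dict mapping file name -> raw bytes of an existing WARC file.
    """
    with zipfile.ZipFile(out, "w") as zf:
        for name, data in warcs.items():
            # Store without ZIP compression so the WARC bytes are kept verbatim.
            zf.writestr(f"archive/{name}", data, compress_type=zipfile.ZIP_STORED)
        # The index is plain text, so deflate it.
        zf.writestr(
            "indexes/index.cdxj",
            "\n".join(sorted(index_lines)) + "\n",
            compress_type=zipfile.ZIP_DEFLATED,
        )
```

Storing the WARC entries uncompressed is a design choice in this sketch: the files are usually gzipped already, and keeping the bytes verbatim means offsets from the index still apply inside the ZIP entry.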
