ikreymer | 3 years ago
Unless you're crawling really text-heavy content, most of the WARC data is binary content that doesn't really need to be in a db. However, sqlite or another database as a replacement for CDX is an appealing option: WARC files can remain static data at rest, and derived data (offsets, full-text search) can be put into a db.
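To make the idea concrete, here is a minimal sketch of what a sqlite stand-in for a CDX index could look like, using only Python's standard `sqlite3` module. The table name, column names, and the `build_index` / `lookup` helpers are hypothetical and chosen for illustration; they are not taken from any existing tool.

```python
import sqlite3


def build_index(records):
    """Build an in-memory index mapping captures to their WARC locations.

    `records` is an iterable of
    (url, timestamp, warc_file, warc_offset, warc_length) tuples.
    The schema is an assumption for illustration, not a standard.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE captures ("
        " url TEXT NOT NULL,"
        " timestamp TEXT NOT NULL,"
        " warc_file TEXT NOT NULL,"
        " warc_offset INTEGER NOT NULL,"
        " warc_length INTEGER NOT NULL)"
    )
    # Composite index so lookups by URL (and newest capture) avoid a full scan.
    db.execute("CREATE INDEX idx_url_ts ON captures (url, timestamp)")
    db.executemany("INSERT INTO captures VALUES (?, ?, ?, ?, ?)", records)
    db.commit()
    return db


def lookup(db, url):
    """Return (warc_file, warc_offset, warc_length) of the newest capture, or None."""
    return db.execute(
        "SELECT warc_file, warc_offset, warc_length FROM captures"
        " WHERE url = ? ORDER BY timestamp DESC LIMIT 1",
        (url,),
    ).fetchone()
```

The WARC files themselves are never touched: a reader would open `warc_file`, seek to `warc_offset`, and read `warc_length` bytes, so the db holds only derived data and can be rebuilt at any time.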
We are experimenting with a new format, WACZ, which bundles WARC files into a ZIP, while adding CDXJ and exploring sqlite as an option for full-text search. I agree that it's better to build on solid, existing formats that can be validated, especially when large amounts of data are concerned!
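As a rough sketch of the bundling idea (not the actual WACZ tooling), the snippet below packs one WARC and a CDXJ index into a ZIP with Python's standard `zipfile` module. The `bundle` helper and the entry names `archive/data.warc.gz` and `indexes/index.cdxj` are assumptions made for illustration. The WARC is stored without ZIP compression so its bytes sit unchanged inside the bundle.

```python
import io
import zipfile


def bundle(warc_bytes, cdxj_text):
    """Return the bytes of a ZIP containing one WARC file and its CDXJ index.

    Entry names are hypothetical and chosen for this example.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # ZIP_STORED: no recompression, so the WARC is kept byte-for-byte
        # and offsets into it remain meaningful.
        zf.writestr("archive/data.warc.gz", warc_bytes,
                    compress_type=zipfile.ZIP_STORED)
        # The text index is small and compresses well, so deflate it.
        zf.writestr("indexes/index.cdxj", cdxj_text,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()
```

Because ZIP is a well-specified format with widely available validators, a bundle like this can be checked with ordinary tools before anyone looks inside the WARC data.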