netborn | 6 years ago
Common Crawl (non-profit): Stores regular, broad, monthly crawls as WARC files. Provides a separate index that can be used to look data up (not a fulltext index, though). Used mostly in academia.
Mixnode (for-profit): Regularly crawls the web and lets users write SQL queries against the data. Not sure who the primary users are since it's in private beta.
There are some search engine APIs, but I don't think the conflict of interest would allow for cost-effective large-scale access and pricing...
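To make the Common Crawl lookup flow concrete: its CDX index API returns one JSON record per line, and each record points at a byte range inside a gzipped WARC file. A minimal sketch of reading such a record and building the corresponding HTTP Range request (the field values and the truncated filename here are illustrative, not from a real crawl):

```python
import json

# A sample record in the shape returned (one JSON object per line) by the
# Common Crawl CDX index API, e.g.
#   https://index.commoncrawl.org/CC-MAIN-2019-35-index?url=example.com&output=json
# Values below are made up for illustration.
record = json.loads(
    '{"urlkey": "com,example)/", "timestamp": "20190822034153", '
    '"url": "https://example.com/", "mime": "text/html", "status": "200", '
    '"filename": "crawl-data/CC-MAIN-2019-35/segments/.../example.warc.gz", '
    '"offset": "854038", "length": "21001"}'
)

# The record identifies a byte range inside a gzipped WARC file; the
# matching slice can be fetched with an HTTP Range request against
# https://data.commoncrawl.org/<filename>.
offset = int(record["offset"])
length = int(record["length"])
range_header = {"Range": f"bytes={offset}-{offset + length - 1}"}
print(range_header["Range"])  # bytes=854038-875038
```

This is why a separate index is enough for record retrieval even without fulltext search: the index answers "where is this URL stored", and a ranged GET pulls just that record out of a multi-gigabyte archive.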
wongarsu | 6 years ago
Not for existing search engine providers, but I think there is room for new players to do this at scale. Imagine an AWS service that offers high-performance access to crawled data, along with a number of indexes and a fairly simple search engine built on that data. That would commoditize one of Google's biggest advantages, and anyone could, at least in principle, run their own search engine from the data. Because the market for this is much wider than traditional search engines, just providing the data and indices for a pay-as-you-go fee could still be very profitable.
alshtico | 6 years ago