netborn | 6 years ago
Common Crawl (non-profit): Stores regular, broad, monthly crawls as WARC files. Provides a separate index that can be used to look data up (not a fulltext index, though). Used mostly in academia.
Mixnode (for-profit): Regularly crawls the web and lets users write SQL queries against the data. Not sure who the primary users are since it's in private beta.
There are some search engine APIs, but I don't think the conflict of interest would allow for cost-effective large-scale access and pricing...
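To make the Common Crawl lookup flow concrete: its CDX index API returns one JSON record per line, and each record points at a byte range inside a gzipped WARC file. A minimal sketch of reading such a record and building the corresponding HTTP Range request (the field values and the truncated filename here are illustrative, not from a real crawl):

```python
import json

# A sample record in the shape returned (one JSON object per line) by the
# Common Crawl CDX index API, e.g.
#   https://index.commoncrawl.org/CC-MAIN-2019-35-index?url=example.com&output=json
# Values below are made up for illustration.
record = json.loads(
    '{"urlkey": "com,example)/", "timestamp": "20190822034153", '
    '"url": "https://example.com/", "mime": "text/html", "status": "200", '
    '"filename": "crawl-data/CC-MAIN-2019-35/segments/.../example.warc.gz", '
    '"offset": "854038", "length": "21001"}'
)

# The record identifies a byte range inside a gzipped WARC file; the
# matching slice can be fetched with an HTTP Range request against
# https://data.commoncrawl.org/<filename>.
offset = int(record["offset"])
length = int(record["length"])
range_header = {"Range": f"bytes={offset}-{offset + length - 1}"}
print(range_header["Range"])  # bytes=854038-875038
```

This is why a separate index is enough for record retrieval even without fulltext search: the index answers "where is this URL stored", and a ranged GET pulls just that record out of a multi-gigabyte archive.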
wongarsu | 6 years ago
Not for existing search engine providers, but I think there is room for new players to do this at scale. Imagine an AWS service that offers high-performance access to crawled data, along with a number of indexes and a fairly simple search engine built on that data. That would commoditize one of Google's biggest advantages, and anyone could, at least in principle, run their own search engine from the data. Because the market for this is much wider than traditional search engines, just providing the data and indices for a pay-as-you-go fee could still be very profitable.
alshtico | 6 years ago