top | item 7732507


valjavec | 11 years ago

How much did all that cost? Or how much would it cost to process 1M pages like this, doing some entity extraction on each page?


Encosia | 11 years ago

A few years ago, I wrote a little crawler to run through the top 200k sites on Alexa, search for script references, and log them to a database, to get a sense of the real-world usage of Google's jQuery CDN[0]. IIRC, that took less than a day to run on the consumer broadband I was using at the time.

[0]: http://encosia.com/6953-reasons-why-i-still-let-google-host-...
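The core of a crawler like that is just fetching each homepage and scanning the HTML for script tags pointing at Google's CDN. A minimal sketch of the scanning step (the regex and function name are my own illustration, not the author's actual code):

```python
import re

# Matches <script> tags whose src points at Google's jQuery CDN
# (ajax.googleapis.com), capturing the jQuery version in the path.
GOOGLE_JQUERY_RE = re.compile(
    r'<script[^>]+src=["\']?(?:https?:)?//ajax\.googleapis\.com'
    r'/ajax/libs/jquery/([^/"\']+)/jquery[^"\'>]*',
    re.IGNORECASE,
)

def find_google_jquery(html):
    """Return the jQuery version referenced from Google's CDN, or None."""
    m = GOOGLE_JQUERY_RE.search(html)
    return m.group(1) if m else None
```

Fetching 200k homepages with a handful of concurrent workers and logging each hit to a database would then be a straightforward loop around this function.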

Smerity | 11 years ago

For only one million web pages, the job would likely be quite cheap. The Common Crawl corpus is hundreds of millions of pages and, given the right setup, takes only $10 to $100 to process, especially for relatively light entity extraction. Heavier operations, such as full parsing with NLP tools, will naturally cost more.
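A quick back-of-envelope calculation shows why one million pages is cheap. All the numbers below are illustrative assumptions (throughput of a light regex-based extractor, an assumed spot-instance price), not figures from the comment:

```python
# Back-of-envelope cost estimate; every constant is an assumption.
PAGES = 1_000_000
PAGES_PER_SEC_PER_CORE = 50   # assumed throughput for light entity extraction
CORES = 8                     # one modest multi-core instance
SPOT_PRICE_PER_HOUR = 0.10    # assumed spot price for that instance, in USD

seconds = PAGES / (PAGES_PER_SEC_PER_CORE * CORES)
hours = seconds / 3600
cost = hours * SPOT_PRICE_PER_HOUR
print(f"~{hours:.2f} hours of compute, ~${cost:.2f}")
```

Under these assumptions the whole job fits in well under an hour on a single cheap instance, which is consistent with the claim that even hundreds of millions of pages land in the $10 to $100 range.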