Encosia|11 years ago
A few years ago, I wrote a small crawler that ran through the top 200k sites on Alexa, searched each page for script references, and logged them to a database, to get a sense of how widely Google's jQuery CDN was actually used in the wild[0]. IIRC, the whole run took less than a day on the consumer broadband I had at the time.
[0]: http://encosia.com/6953-reasons-why-i-still-let-google-host-...
Smerity|11 years ago
For only one million web pages, the job would likely be quite cheap. The Common Crawl corpus is hundreds of millions of pages and, given the right setup, costs only $10 to $100 to process, at least for relatively light entity extraction. Heavier operations, such as parsing with NLP tools, will obviously cost more.
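The kind of crawler described above can be sketched in a few dozen lines. This is a hypothetical reconstruction, not the original code: it assumes Python with only the standard library, a caller-supplied list of site URLs, and a placeholder SQLite database path.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ScriptSrcParser(HTMLParser):
    """Collects the src attribute of every <script> tag it sees."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def extract_script_srcs(html):
    """Return all external script references found in an HTML document."""
    parser = ScriptSrcParser()
    parser.feed(html)
    return parser.srcs


def crawl(sites, db_path="scripts.db"):
    """Fetch each site, extract script references, and log them to SQLite.

    `sites` is any iterable of URLs, e.g. drawn from a top-sites list.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS script_refs (site TEXT, src TEXT)")
    for site in sites:
        try:
            req = Request(site, headers={"User-Agent": "tiny-crawler/0.1"})
            html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable or misbehaving sites
        for src in extract_script_srcs(html):
            conn.execute("INSERT INTO script_refs VALUES (?, ?)", (site, src))
        conn.commit()
    conn.close()
```

Counting rows in `script_refs` whose `src` matches `ajax.googleapis.com` would then give the CDN-usage figure the comment mentions; a real run would also want politeness delays and parallel fetches to finish within a day.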