HN links to over 6 million URLs in stories and comments. Many of those domains have expired or the content is no longer available. The Internet Archive has much of the content but throttles requests. What's the fastest way to get the historical content?
arinlen | 3 years ago
https://github.com/HackerNews/API
I'm not sure what rate limiting policy is in place, but in theory you can start with a request for maxitem and from that point on just GET all items down to zero until you hit some sort of blocker.
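A minimal sketch of that walk (assuming the documented /v0/maxitem.json and /v0/item/<id>.json endpoints from the repo above; single-threaded, no retries):

import json
import urllib.request

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch(path):
    # Every endpoint returns JSON; deleted or missing items come back as null.
    with urllib.request.urlopen(f"{BASE}/{path}.json") as resp:
        return json.load(resp)

max_item = fetch("maxitem")
for item_id in range(max_item, 0, -1):
    item = fetch(f"item/{item_id}")
    if item and item.get("url"):
        print(item["id"], item["url"])

In practice you would parallelize the item fetches; a strictly sequential crawl over the full ID range would take a very long time.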
krapp | 3 years ago
[0]https://hn.algolia.com/api
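One hedged way to use that endpoint for a bulk pull (a sketch, not from the thread) is to page backwards on created_at_i via search_by_date with a numericFilters bound, rather than relying on plain pagination:

import json
import urllib.request
from urllib.parse import urlencode

API = "https://hn.algolia.com/api/v1/search_by_date"

def page_before(ts):
    # Stories strictly older than ts, newest first, up to 1000 per request.
    qs = urlencode({
        "tags": "story",
        "hitsPerPage": 1000,
        "numericFilters": f"created_at_i<{ts}",
    })
    with urllib.request.urlopen(f"{API}?{qs}") as resp:
        return json.load(resp)["hits"]

cursor = 2**31  # start in the far future and walk back in time
while True:
    hits = page_before(cursor)
    if not hits:
        break
    for h in hits:
        print(h.get("created_at_i"), h.get("url"))
    cursor = min(h["created_at_i"] for h in hits)  # items sharing the boundary second may be skipped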
agencies | 3 years ago
As you said, the HN API is great, and there are at least two existing published crawls of it that help a lot.
jpcapdevila | 3 years ago
There's a dataset containing everything: bigquery-public-data.hacker_news.full
You can write SQL against it and it's super fast. Sample:
SELECT * FROM `bigquery-public-data.hacker_news.full` LIMIT 1
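A hedged sketch of running that query from Python with the google-cloud-bigquery client (assumes a GCP project and credentials are already configured; queries against public datasets still bill scanned bytes to your own project):

from google.cloud import bigquery

client = bigquery.Client()  # picks up default credentials and project
sql = """
SELECT id, url
FROM `bigquery-public-data.hacker_news.full`
WHERE url IS NOT NULL
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.id, row.url)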
python273 | 3 years ago