josefcullhed | 4 years ago
I suggest you start by using commoncrawl.org instead of implementing your own crawler. The problem with running a web crawler is that you need a lot of money, and almost all big websites are behind Cloudflare, so you will be blocked pretty quickly. Crawling is a big problem, and most of the issues are non-technical.
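For anyone wondering what "use Common Crawl instead" looks like in practice, here is a rough sketch, not a definitive implementation. It assumes the public CDX index at index.commoncrawl.org and the data host data.commoncrawl.org. The crawl ID is only an example, and the helper names (`index_query_url`, `parse_index_lines`, `warc_range_request`) are made up for illustration. It only builds the requests and leaves the actual HTTP calls to whatever client you prefer.

```python
import json
from urllib.parse import urlencode

# Assumption: an example crawl ID. Current IDs are listed in
# https://index.commoncrawl.org/collinfo.json
CRAWL_ID = "CC-MAIN-2023-50"


def index_query_url(url_pattern, crawl_id=CRAWL_ID):
    """Build a CDX index query URL for the given URL pattern."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{query}"


def parse_index_lines(body):
    """Parse an index response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def warc_range_request(record):
    """Return (url, headers) for an HTTP range request that fetches
    the single gzipped WARC record described by an index entry."""
    offset = int(record["offset"])
    length = int(record["length"])
    url = "https://data.commoncrawl.org/" + record["filename"]
    # HTTP byte ranges are inclusive on both ends.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers
```

You would GET the index URL, pass the response text to `parse_index_lines`, then issue one range request per record and gunzip the result. No crawling of the target site is involved, so the Cloudflare blocking problem does not come up.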
Seirdy|4 years ago
Some sort of partnership between crawlers could go a long way. Have you considered contributing content back towards the Common Crawl?
marginalia_nu|4 years ago
pmarreck|4 years ago
This seems like a reasonable fallback option, but it's also a weaker one. By "most of the issues are non-technical", do you mean that you need special permission from someone like Cloudflare to get "crawl rights"?