
Crawling More Politely Than Big Tech

43 points | pkghost | 1 year ago | cameronboehmer.com

17 comments


Mistletoe|1 year ago

Was all of our posting on the net on forums, HN, Reddit, digg, Slashdot, etc. just to train the AI of the future? I think about this a lot. AI has that "annoying forum poster" tone to everything and now I can't unsee it when I (rarely) use it. Maybe I'm just post-internet. I've been thinking about that a lot also. I'm tired of 99.75% of the internet.

userbinator|1 year ago

> AI has that "annoying forum poster" tone to everything

Does it? What I've seen has been more like "annoying customer service representative" instead.

AlienRobot|1 year ago

It would be pretty sad if AI stole all the content from a forum, killing it with the added crawler load, only to regurgitate it.

danpalmer|1 year ago

I've done crawling at a small startup and I've done crawling at a big tech company. This is not crawling more politely than big tech.

There are a few things that stand out, like:

> I fetch all robots.txts for given URLs in parallel inside the queue's enqueue function.

Could this end up DOS'ing or being "impolite" just in robots.txt requests?

All of this logic is per-domain, but nothing mentioned about what constitutes a domain. If this is naive, it could easily end up overloading a server that uses wildcard subdomains to serve its content, like Substack having each blog on a separate subdomain.
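The wildcard-subdomain pitfall can be sketched in a few lines (helper names are made up, and the two-label heuristic is deliberately naive — a real crawler would use the Public Suffix List, e.g. via the tldextract package, since two labels is wrong for suffixes like co.uk):

```python
from urllib.parse import urlsplit

def naive_domain(url: str) -> str:
    # Every subdomain becomes its own rate-limit bucket, so a crawler would
    # happily hit alice.substack.com and bob.substack.com in parallel even
    # though one backend serves both.
    return urlsplit(url).hostname or ""

def registrable_domain(url: str) -> str:
    # Collapse to the last two labels so wildcard subdomains share a bucket.
    # Naive on purpose: replace with a Public Suffix List lookup in practice.
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host
```

Keyed naively, `alice.substack.com` and `bob.substack.com` get independent budgets; keyed by registrable domain, they share one.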

When I was at a small startup doing crawling, the main thing our partners wanted from us was a maximum hit rate (varied by partner). We typically promised fewer than 1 request per second, which would never cause perceptible load, and was usually sufficient for our use-case.
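A promised maximum hit rate reduces to a per-domain minimum interval between requests; a minimal sketch (class and parameter names are illustrative, not from any particular crawler):

```python
import time
from collections import defaultdict

class PerDomainLimiter:
    """Block until at least min_interval seconds have passed since the
    last request to the same domain (1.0s default = at most 1 req/s)."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_hit = defaultdict(float)  # domain -> monotonic timestamp

    def wait(self, domain: str) -> None:
        now = time.monotonic()
        delay = self.last_hit[domain] + self.min_interval - now
        if delay > 0:
            time.sleep(delay)
        self.last_hit[domain] = time.monotonic()
```

Each domain gets its own clock, so throttling one slow site never stalls requests to the others.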

Here at $BigTech, the systems for ensuring "polite", and policy-compliant crawling (robots.txt etc) are more extensive than I could possibly have imagined before coming here.

It doesn't surprise me that OpenAI and Amazon don't have great systems for this, both are new to the crawling world, but concluding that "Big Tech" doesn't do polite crawling is a bit of a stretch, given that search engines are most likely doing the best crawling available.

dartos|1 year ago

It’s probably a huge liability to not have very advanced and compliant crawlers.

Accidentally ddosing several businesses seems like an expensive lawsuit.

Aloisius|1 year ago

I think a default max of 1 request every 5 seconds is unnecessarily meek, especially for larger sites. I'd also argue that requests that browsers don't slow down for, like following redirects to the same domain or links with the prefetch attribute, don't really necessitate a delay at all.

If you can detect a site has a CDN, metrics like time-to-first-byte are low and stable and/or you're getting cache control headers indicating you're mostly getting cached pages, I see no reason why one shouldn't speed up - at least for domains with millions of URLs.
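That adaptive policy could look something like this (the thresholds, bounds, and function name are assumptions for illustration; "cache hit" would come from headers like Age or X-Cache):

```python
def next_delay(current: float, ttfb_ms: float, cache_hit: bool,
               floor: float = 0.2, ceiling: float = 5.0) -> float:
    # Speed up (halve the delay) when the origin looks cheap to hit:
    # a fast time-to-first-byte or a response served from a CDN cache.
    if cache_hit or ttfb_ms < 100:
        return max(floor, current / 2)
    # Back off when the server is visibly slow.
    if ttfb_ms > 1000:
        return min(ceiling, current * 2)
    return current
```

Bounding the delay on both ends keeps the crawler from ramping to zero delay on a fast CDN or backing off forever on one slow response.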

I disagree with using HEAD requests for refreshing. A HEAD request is rarely cheaper, and for some websites more expensive, than a conditional GET with If-Modified-Since/If-None-Match. Besides, you're going to fetch the page anyway if it changed, so why issue two requests when you could do one?
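The one-request alternative is a conditional GET; a sketch using only the standard library (the function name is made up):

```python
import urllib.request

def build_conditional_request(url, etag=None, last_modified=None):
    # One request does both jobs: the server answers 304 Not Modified
    # (cheap, no body) if nothing changed, or sends the fresh page you
    # were going to fetch anyway.
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

# Sending it: urllib.request.urlopen(req) raises HTTPError with code 304
# when the page is unchanged, so catch that and keep your cached copy.
```

The ETag and Last-Modified values come from the previous response's headers, so the crawl store only needs to keep those two strings per URL.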

Having a single crawler per process/thread makes rate limiting easier, but it can lead to some balancing and under-utilization issues with distributed crawling due to the massive variation in URLs per domain and site speeds, especially if you use something like a hash to distribute them. For Commoncrawl, I had something that monitored utilization and shut down crawler instances which would redistribute URLs pending from the machines shutting down to the machines left (we were doing it on a shoestring budget using AWS spot instances, so it had to survive instances going down randomly anyway).

I'd say one of the most polite things to do when crawling is to add a URL to the crawler user agent pointing to a page explaining what it is, and maybe letting people opt out or explaining how to update their robots.txt to opt out.

registeredcorn|1 year ago

I'm just beginning to learn about curl and wget. Can anyone recommend similar resources to this one that emphasize politeness?

For example, I'd like to grab quite a few books from archive.org, but want to use their torrent option, when available. I don't like the idea of "slamming" their site because I'm trying to grab 400 books at once.
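For direct downloads where no torrent exists, the core of politeness is just serializing requests, pausing between them, and skipping files you already have; a minimal Python sketch (names and the 5-second delay are illustrative, not an archive.org recommendation):

```python
import pathlib
import time
import urllib.request

def polite_download(urls, dest_dir="books", delay_s=5.0):
    # One file at a time, a fixed pause between requests, and skip anything
    # already on disk so an interrupted run can be restarted without
    # re-hitting the server.
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        out = dest / url.rsplit("/", 1)[-1]
        if out.exists():
            continue
        urllib.request.urlretrieve(url, out)
        time.sleep(delay_s)
```

wget covers the same ground with flags like `--wait` and `--limit-rate`, but the skip-if-present loop is what makes a 400-book job safely resumable.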

pkghost|1 year ago

A few implementation details from building a hobby crawler

ndriscoll|1 year ago

If you have cache headers, why use HEAD? Are servers more likely to handle HEAD correctly than including them on the GET?

inetknght|1 year ago

> Are servers more likely to handle HEAD correctly than

In my experience there are a lot of servers that don't handle HEAD at all, let alone correctly.

thiago_fm|1 year ago

I doubt big tech cares enough if they are doing this to a website. They just want to fiercely battle the competition and make profits.

dsymonds|1 year ago

If the author reads this, you have a misspelling of "diaspora" in the first sentence.

giantrobot|1 year ago

Also padding: 1em would go a long ways to making the page readable.

Karupan|1 year ago

This is timely as I’m just building out a crawler in scrapy. Thanks!