

The bots are called "crawlers" and "spiders", which to me evokes the image of tiny little things moving rapidly and mechanically from one place to another, leaving no niche unexplored. Spiders exploring a vast web.

Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term.

It'd be like telling someone "I spent part of the last year travelling", and when they ask you where you went, you tell them you commuted to and from your workplace five times a week. That's technically travelling, although the other person would naturally expect you to talk about a vacation or a work trip or something to that effect.


JimDabell|7 months ago

> Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term.

It’s definitely not crawling as the original robots.txt specification defines the term:

> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

— https://www.robotstxt.org/orig.html

You will see that reflected in lots of software that respects robots.txt. For instance, if you fetch a URL with wget, then it won’t look at robots.txt. But if you mirror a site with wget, then it will fetch the initial URL, then it will find the links in that page, then before fetching subsequent pages it will fetch and check robots.txt.
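
To make that distinction concrete, here is a rough Python sketch of the two modes. It is a simplified model of the behaviour described above, not how wget is implemented. The in-memory `SITE` dict, the `sketch-bot` user agent, and the `fetch_one` and `crawl` helpers are all made up for illustration so that it runs without touching the network; only `urllib.robotparser`, `urllib.parse`, and `html.parser` are real standard-library modules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "site" standing in for real HTTP requests.
SITE = {
    "https://example.com/robots.txt": "User-agent: *\nDisallow: /private/\n",
    "https://example.com/": '<a href="/about">About</a> <a href="/private/x">X</a>',
    "https://example.com/about": "<p>About</p>",
    "https://example.com/private/x": "<p>secret</p>",
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def fetch_one(url):
    """A user hands over one URL: fetch it, robots.txt is never consulted."""
    return SITE[url]


def crawl(start, agent="sketch-bot"):
    """Recursive retrieval: robots.txt is checked before following links."""
    fetched = [start]
    queue = extract_links(start, SITE[start])  # the initial URL is fetched as given

    # Only now, before fetching discovered pages, load and parse robots.txt.
    robots = RobotFileParser()
    robots.parse(SITE[urljoin(start, "/robots.txt")].splitlines())

    seen = {start}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in SITE:
            continue
        seen.add(url)
        if not robots.can_fetch(agent, url):
            continue  # disallowed for robots, so the crawler skips it
        fetched.append(url)
        queue.extend(extract_links(url, SITE[url]))
    return fetched
```

With this toy site, `fetch_one("https://example.com/private/x")` returns the page because a person asked for that exact URL, while `crawl("https://example.com/")` visits `/` and `/about` but skips `/private/x`.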