(no title)
pi_22by7 | 7 months ago
Or maybe more like the opposite: robots.txt tells bots what not to touch, while sitemaps point them to what should be indexed. I didn't realize its original purpose was to manage duplicate-content penalties, though. That adds a lot of historical context to how we think about SEO controls today.
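For concreteness, here is a minimal sketch of that split using Python's standard urllib.robotparser (the example.com paths and sitemap URL are made-up illustrations, not something from the thread):

    import urllib.robotparser

    # A hypothetical robots.txt: Disallow asks crawlers to stay away,
    # while the Sitemap line points them at URLs the site wants indexed.
    robots_txt = """
    User-agent: *
    Disallow: /private/
    Sitemap: https://example.com/sitemap.xml
    """

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())

    print(rp.can_fetch("*", "https://example.com/private/page"))  # False: disallowed
    print(rp.can_fetch("*", "https://example.com/public/page"))   # True: not disallowed
    print(rp.site_maps())  # ['https://example.com/sitemap.xml'] (Python 3.8+)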
JimDabell | 7 months ago
That wasn’t its original purpose. It’s true that you didn’t want crawlers to read duplicate content, but it wasn’t because search engines penalised you for it – WWW search engines had only just been invented, and they didn’t penalise duplicate content. It was mostly about stopping crawlers from unnecessarily consuming server resources. This is what the original 1994 standard says:
> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
— https://www.robotstxt.org/orig.html
Quarrel | 7 months ago
Very much so.
Computation was still expensive, and HTTP servers were bad at running CGI scripts (particularly compared to the streamlined, highly optimised things they can be today).
SEO considerations came way, way later.
Robots.txt files were also used, and still are, by sites that have good reasons not to want to show up in search results. Lots of court files and transcripts, for instance, are hidden behind robots.txt.