top | item 44589372

Falkon1313 | 7 months ago

This is kinda amusing.

robots.txt's main purpose back in the day was curtailing search-engine penalties when you got stuck maintaining a badly built dynamic site that had tons of dynamic links and effectively got penalized for duplicate content. It was basically a way of saying "Hey search engines, these are the canonical URLs, ignore all the other ones with query parameters or whatever that give almost the same result."

It could also help keep 'nice' crawlers from getting stuck crawling an infinite number of pages on those sites.
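As a concrete illustration, a site like that might have shipped a robots.txt along these lines (the paths are hypothetical; the original 1994 convention only supports simple prefix matching, one Disallow per line):

```
# Hypothetical example: steer crawlers away from parameterized duplicates
User-agent: *
Disallow: /cgi-bin/
Disallow: /browse    # every sort/filter combination under /browse served near-identical pages
```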

Of course it never did anything for the 'bad' crawlers that would hammer your site! (And there were a lot of them, even back then.) That's what IP bans and such were for. You certainly wouldn't base it on something like User-Agent, which the user agent itself controlled! And you wouldn't expect the bad bots to play nicely just because you asked them.

That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.
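Technically, honoring DNT was always trivial; the catch was that nothing forced anyone to. A minimal sketch of what server-side compliance would look like (the `should_track` helper and plain-dict headers are illustrative assumptions, not any real framework's API):

```python
def should_track(headers: dict) -> bool:
    """Return False when the client sent 'DNT: 1'.

    Honoring the header is this simple; the hard part was never the code,
    it was that compliance was entirely voluntary.
    """
    return headers.get("DNT") != "1"


# Usage sketch with hypothetical request headers:
print(should_track({"DNT": "1"}))  # False: user opted out of tracking
print(should_track({}))            # True: no preference expressed
```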

Or the Evil Bit proposal, which suggested that malware should identify itself in its packet headers. "The Request for Comments recommended that the last remaining unused bit, the "Reserved Bit" in the IPv4 packet header, be used to indicate whether a packet had been sent with malicious intent, thus making computer security engineering an easy problem – simply ignore any messages with the evil bit set and trust the rest."

MiddleMan5|7 months ago

It should be noted here that the Evil Bit proposal was an April Fools' RFC: https://datatracker.ietf.org/doc/html/rfc3514

Y_Y|7 months ago

While we're at it, it should be noted that Do Not Track was not, apparently, a joke.

It's the same as a noreply email: if you can get away with sticking your fingers in your ears and humming when someone is telling you something you don't want to hear, and you have a computer to hide behind, then it's all good.

2OEH8eoCRo0|7 months ago

I like the 128-bit strength indicator for how "evil" something is.

pi_22by7|7 months ago

So it did the same work that a sitemap does? Interesting.

Or maybe more like the opposite: robots.txt told bots what not to touch, while sitemaps point them to what should be indexed. I didn’t realize its original purpose was to manage duplicate content penalties though. That adds a lot of historical context to how we think about SEO controls today.
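They do complement each other in practice: the (later) sitemaps convention even lets a robots.txt point crawlers at the sitemap, so the "don't touch this" and "do index this" signals live side by side. A hypothetical example:

```
User-agent: *
Disallow: /admin/

# The Sitemap directive (from the sitemaps.org protocol) advertises
# what *should* be crawled and indexed.
Sitemap: https://example.com/sitemap.xml
```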

JimDabell|7 months ago

> I didn’t realize its original purpose was to manage duplicate content penalties though.

That wasn’t its original purpose. It’s true that you didn’t want crawlers to read duplicate content, but it wasn’t because search engines penalised you for it – WWW search engines had only just been invented and they didn’t penalise duplicate content. It was mostly about stopping crawlers from unnecessarily consuming server resources. This is what the RFC from 1994 says:

> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

https://www.robotstxt.org/orig.html

tbrownaw|7 months ago

> And you wouldn't expect the bad bots to play nicely just because you asked them.

Well, yes, the point is to tell the bots what you've decided to consider "bad" and will ban them for, so that they can avoid doing that.

Which of course only works to the degree that they're basically honest about who they are or at least incompetent at disguising themselves.
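For the honest bots, the mechanics of compliance are simple; Python's standard library even ships a parser. A minimal sketch (the rules and URLs are made up for illustration):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally you'd call rp.set_url(...) and rp.read() to fetch the live file;
# here the rules are parsed inline to keep the example self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])

# A well-behaved crawler checks before every fetch.
print(rp.can_fetch("MyBot", "https://example.com/cgi-bin/vote"))  # False
print(rp.can_fetch("MyBot", "https://example.com/index.html"))    # True
```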

gbalduzzi|7 months ago

I think it depends on the definition of bad.

I always consider a bot "good" when it doesn't disguise itself and follows the robots.txt rules. I may not consider the bot's ultimate purpose, or the company behind it, good, but the crawler's behaviour is fundamentally good.

Especially considering that it is super easy to disguise a crawler and ignore the robots.txt conventions.

nullc|7 months ago

> That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.

It's usually a bad default to assume incompetence on the part of others, especially when many experienced and knowledgeable people have to be involved to make a thing happen.

The idea behind the DNT header was to back it up with legislation. Sure, you can't catch and prosecute all tracking, but there are limits to how far a criminal "move fast and break things" operation can scale before someone rats you out. :P

pjmlp|7 months ago

Some people just believe that because someone says so, everyone will nicely obey and follow the rules. I don't know, maybe it is a cultural thing.

vintagedave|7 months ago

Or a positive belief in human nature.

I admit I'm one of those people. After decades where I should perhaps be a bit more cynical, from time to time I am still shocked or saddened when I see people do things that benefit themselves over others.

But I kinda like having this attitude and expectation. Makes me feel healthier.

0xEF|7 months ago

It's easy to believe, though, and most of us do it every day. For example, our commute to work depends on the trust that other drivers will cooperate and follow the rules, so that we all get where we're going.

There are varying degrees of this through our lives, where the trust lies not in the fact that people will just follow the rules because they are rules, but because the rules set expectations, allowing everyone to (more or less) know what's going on and decide accordingly. This also makes it easier to single out the people who do not think the rules apply to them so we can avoid trusting them (and, probably, avoid them in general).

PaulHoule|7 months ago

Robots.txt was created long before Google and before people were thinking about SEO:

https://en.wikipedia.org/wiki/Robots.txt

The scenario I remember was that the underfunded math department had an underpowered server connected via a wide and short pipe to the overfunded CS department and webcrawler experiments would crash the math department's web site repeatedly.

LorenPechtel|7 months ago

Yup. Robots.txt was a don't-swamp-me thing.

EPendragon|7 months ago

It is so interesting to trace this technology back to its origin. It makes sense that it would come from a context of limited resources, where things would break if you overwhelmed them. It didn't take much to do so.

franga2000|7 months ago

I still see the value in robots.txt and DNT as a clear, standardised way of posting a "don't do this" sign that companies could be forced to respect through legal means.

The GDPR requires consent for tracking. DNT is a very clear "I do not consent" statement. It's a very widely known standard in the industry. It would therefore make sense that a court would eventually find companies not respecting it are in breach of the GDPR.

That was a theory at least...

EPendragon|7 months ago

Would robot traffic even be considered tracking under the GDPR? As far as I know, there are no regulatory rules enforcing robot behaviour beyond robots.txt, which is more of an honor system.