Where was this 10 years ago when I was reverse engineering the Google robots.txt parser by feeding example robots.txt files and URLs into the Google webmaster tool? I actually went so far as to build a convoluted honeypot website and robots.txt to see what the Google crawler would do in the wild.
Having written the robots.txt parser at Blekko, I can tell you that the standards that exist are incomplete and inconsistent.
Robots.txt files are usually written by hand in random text editors ("\n" vs "\r\n" vs a mix of both!) by people who have no idea what a programming language grammar is, let alone how to follow the BNF in the RFC. There are situations where adding a newline completely negates all your rules: specifically, newlines between user-agent lines, or between user-agent lines and their rules.
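Step one in a lenient parser is tolerating whatever line endings the webmaster's editor produced. A minimal sketch (illustrative, not Blekko's or Google's actual code):

```python
import re

def split_robots_lines(body: str) -> list[str]:
    # Tolerate "\n", "\r\n", "\r", and any mix of them in one file,
    # since hand-written robots.txt files use all three.
    return [line.strip() for line in re.split(r"\r\n|\r|\n", body) if line.strip()]

mixed = "User-agent: *\r\nDisallow: /tmp/\rDisallow: /private/\n"
print(split_robots_lines(mixed))
# → ['User-agent: *', 'Disallow: /tmp/', 'Disallow: /private/']
```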
My first inclination was to build an RFC compliant parser and point to the standard if anyone complained. However, if you start looking at a cross section of robots.txt files, you see that very few are well formed.
Then there are sitemaps, crawl-delay, and other non-standard syntax adopted by Google, Bing, and Yahoo (RIP). Clearly the RFC is just a starting point: what ends up on websites can be broken, and the author's meaning hard to interpret. For example, the Google parser allows for five possible spellings of DISALLOW, including DISALLAW.
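Misspelling tolerance is easy to sketch; the exact set of spellings below is illustrative (the real list lives in Google's robots.cc), but the idea is just a case-insensitive lookup against known variants:

```python
# Hypothetical spelling set, in the spirit of Google's parser;
# the authoritative list is in robots.cc, not here.
DISALLOW_SPELLINGS = {"disallow", "dissallow", "disalow", "dissalow", "disallaw"}

def is_disallow_key(key: str) -> bool:
    # Keys in the wild vary in case and spelling; normalize before matching.
    return key.strip().lower() in DISALLOW_SPELLINGS

print(is_disallow_key("Disallaw"))  # a variant actually seen in the wild
```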
If you read a few webmaster boards, you see that many website owners don't want a lesson in Backus–Naur form and are quick to get the torches and pitchforks if they feel some crawler is wasting their precious CPU cycles or cluttering up their log files. Having a robots.txt parser that "does what the webmaster intends" is critical. Sometimes, I couldn't figure out what some particular webmaster intended, let alone write a program that could. The only solution was to draft off of Google's de facto standard.
(To the webmaster with the broken robots.txt and links on every product page with a CGI arg with "&action=DELETE" in it, we're so sorry! but... why???)
Accidentally deleting someone's entire website because they don't understand the difference between GET and POST requests is virtually a rite of passage when writing a web crawler.
It's an easy fix if Google cared. Have an online tool that validates if the robots.txt is correct, and send out an announcement that files that don't meet spec will be penalized in terms of SEO.
I've been in disagreements with SEO people quite frequently about a "Noindex" directive for robots.txt. There seem to be a bunch of articles that are sent to me every time I question its existence [0][1]. Google's own documentation says that noindex should go in the HTML meta tags, but the SEO people seem to trust these shady sites more.
I haven't read through all of the code, but assuming this is actually what's running on Google's scrapers, this section [2] seems to be pretty conclusive evidence to me that this Noindex thing is bullshit.
Google has been very clear lately (via John Mueller) regarding getting pages indexed or removed from the index.
If you want to make sure a URL is not in their index then you have to 'allow' them to crawl the page in robots.txt and use a noindex meta tag on the page to stop indexing. Simply disallowing the page from being crawled in robots.txt will not keep it out of the index.
In fact, I've seen plenty of pages still rank well despite being disallowed in robots.txt. A great example of this is the keyword "backpack" in Google. You'll see the site doesn't want it indexed (it's disallowed in robots.txt), but the site still ranks well for a popular keyword.
> but the SEO people seem to trust these shady sites more.
It makes more sense when you realize that the SEO people (with a few exceptions) are usually pretty shady as well. You rarely hear them recommending that you write better content to get better results, it's always nonsense like "put nofollow on everything so your score doesn't leak".
The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one and almost every modern website deviates from it.
For instance it explicitly says "To exclude all files except one: This is currently a bit awkward, as there is no "Allow" field."
And the behavior is so different between different parsers and website implementations that, for instance, the default parser in Python can't even successfully parse twitter.com's robots.txt file because of the newlines.
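For comparison, Python's stdlib parser is easy to try. One divergence worth knowing (assuming current CPython behavior): `urllib.robotparser` applies the first matching rule, while Google's parser uses longest-match precedence, so an `Allow` for a more specific path after a broader `Disallow` loses in Python but would win with Google:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# parse() takes an iterable of lines; splitting ourselves avoids surprises
# with unusual line endings in fetched files.
rp.parse("User-agent: *\nDisallow: /private/\nAllow: /private/ok\n".splitlines())

print(rp.can_fetch("mybot", "https://example.com/private/x"))   # False
print(rp.can_fetch("mybot", "https://example.com/public"))      # True
# First-match precedence: the broader Disallow shadows the later Allow.
print(rp.can_fetch("mybot", "https://example.com/private/ok"))  # False
```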
Most search engines obey it as a matter of principle but not all crawlers or archivers [1] do.
It's a good example of missing standards in the wild.
> The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one and almost every modern website deviates from it.
When I read "This library has been around for 20 years and it contains pieces of code that were written in the 90's" my first thought was "that commit history must be FASCINATING".
From Google's perspective it's probably too much work. I would assume this was part of the crawler code and extracted over time into a library. While part of the monorepo, changesets probably didn't touch only this code but also other parts, and this code probably depended on internal libraries (now it depends on Google's public Abseil library). Publishing all that needs lots of review (also considering names and other personal information in commit logs, TODO comments, and the like).
// A user-agent line is expected to contain only [a-zA-Z_-] characters and must
// not be empty. See REP I-D section "The user-agent line".
// https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1
So you may need to adjust your bot’s UA for proper matching.
(Disclosure, I work at Google, though not on anything related to this.)
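The comment quoted above means only the product token portion of a UA participates in matching. A hedged sketch of the idea (not Google's actual matching code): extract the leading `[a-zA-Z_-]` run and compare case-insensitively, ignoring versions and URLs.

```python
import re

def product_token(user_agent_line: str) -> str:
    # Per the REP draft, only [a-zA-Z_-] characters count toward matching;
    # anything after them (version numbers, contact URLs) is ignored.
    m = re.match(r"[a-zA-Z_-]+", user_agent_line.strip())
    return m.group(0).lower() if m else ""

print(product_token("FooBot/1.2 (+http://example.org/bot)"))  # → "foobot"
```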
The strictness is in what may be listed in robots.txt, not in the User-Agent header as sent by bots. The example given in the linked draft standard [0] makes it abundantly clear that it's on the bot to understand how to interpret the corresponding lines of robots.txt.
Of course, in practice robots.txt tend to look less like [1] and more like [2].
I wonder how much noindex contributes to lax security practices like storing sensitive user data on public pages and relying on not linking to the page to keep it private. I wonder how much is in the gap between "should be indexed" and "really ought to restrict access to authorized users only".
> how should they deal with robots.txt files that are hundreds of megabytes large?
What do huge robots.txt files like that contain? I tried a couple domains just now and the longest one I could find was GitHub's - https://github.com/robots.txt - which is only about 30 kilobytes.
They enumerate every page on the site, sometimes separately for different crawlers.
Or they have a ton of auto-generated pages they don't want crawled and call them out individually because they don't realize robots.txt supports globbing.
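The globbing in question is the de facto wildcard syntax Google and Bing support (not part of the original 1994 spec): `*` matches any run of characters and a trailing `$` anchors at the end of the path. A sketch of translating such a pattern into a regex:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any character sequence; a trailing '$' anchors the match
    # at the end of the URL path. Everything else is matched literally,
    # and rules are prefix matches (hence re.match, not re.search).
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/products/*.php$")
print(bool(rule.match("/products/item.php")))   # True
print(bool(rule.match("/products/item.html")))  # False
```

One `Disallow: /products/*.php$` line replaces thousands of enumerated URLs.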
I doubt there are any vulns in the code, seeing as its job for the last 20 years has been to parse input from the wild west that is the internet, and survive.
Can this be seen as an initiative to make Google's robots.txt parser the internet standard? Every webmaster will want to be compliant with Google's corner cases...
There is a difference between robots.txt blocking a page and noindexing a page.
Blocking in robots.txt will stop Googlebot downloading that page and looking at the contents, but the page may still make it into the index on the basis of links to that page making it seem relevant (it will appear in the search results without a description snippet and will include a note about why).
To have a page not appear in the index you need to use a 'noindex' directive [1], either in the page itself or in the HTTP headers. However, note that if the page is blocked in robots.txt, Google cannot read that noindex directive.
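The interaction described above is counterintuitive enough to be worth spelling out as a decision table (a sketch of the described behavior, not Google's actual code):

```python
def may_appear_in_index(blocked_by_robots: bool, has_noindex: bool) -> bool:
    # A robots.txt block prevents the crawler from ever *seeing* a noindex
    # directive, so the page can still enter the index via external links.
    if blocked_by_robots:
        return True   # never fetched; noindex (if any) is invisible
    return not has_noindex

print(may_appear_in_index(blocked_by_robots=True,  has_noindex=True))   # True
print(may_appear_in_index(blocked_by_robots=False, has_noindex=True))   # False
```

In other words, to keep a page out of the index you must allow crawling and serve the noindex.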
Also, in the StackOverflow response you linked to, the user agent is listed just as 'Google', but it should be 'Googlebot', as per the 'User agent token (product token)' column in [2].
That's actually nice, straightforward, and relatively simple. I had expected something over-engineered, with at least parts of the code dedicated to demonstrating how much smarter the author is than you. But it's not. Just a simple parser.
I expected the same (complex project structure, too many files, difficult to read, etc), but I love everything about this library. Easy to read, concise code, in two simple files. Very well tested, both by automated tests and the real world. Sticks to the Unix philosophy: does one thing and does it well.
Can you imagine how many billions of times this code has been executed? I love software like this.
Seems strange to get excited about a robots.txt parser, but I feel oddly elated that Google decided to open source this. Would it be too much to hope that additional modules related to Search get released in the future? Google seems all too happy to play the "open" card except where it directly impacts their core business, so this is a good step in the right direction.
I don't understand the entire architecture behind search engines, but this seems like a pretty decent chunk of it.
What are the chances that Google is releasing this as a preemptive response to the likely impending antitrust action against them? It would allow them to respond to those allegations with something like, "All the technology we used to build a good search engine is out there. We can't help it if we're the most popular." (And they could say the same about most of their services: Gmail, Drive, etc.)
This is the sort of code you write a binding to and call it a day, since the entire point is to absolutely precisely match the behavior of this code, which is basically a specification-by-code. You can never be sure a re-implementation would be absolutely precisely the same in behavior, so it's not worth doing.
[+] [-] randomstring|6 years ago|reply
Here's the Perl for the Blekko robots.txt parser. https://github.com/randomstring/ParseRobotsTXT
[+] [-] saalweachter|6 years ago|reply
[+] [-] asdfman123|6 years ago|reply
[+] [-] vinay_ys|6 years ago|reply
[+] [-] jxcl|6 years ago|reply
[0] https://www.deepcrawl.com/blog/best-practice/robots-txt-noin...
[1] https://www.stonetemple.com/does-google-respect-robots-txt-n...
[2] https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613...
[+] [-] jxcl|6 years ago|reply
:D
[+] [-] bhartzer|6 years ago|reply
[+] [-] Youden|6 years ago|reply
[+] [-] inh|6 years ago|reply
Google has now clarified that they're removing the code behind the undocumented items, with noindex called out explicitly.
https://webmasters.googleblog.com/2019/07/a-note-on-unsuppor...
It wasn't officially supported / the recommended way, but it worked (in many cases).
[+] [-] adrianmonk|6 years ago|reply
For those who (like me) don't know a lot about this, which side of the argument is bullshit? Have you just been proved right or wrong?
[+] [-] seanlinmt|6 years ago|reply
[+] [-] wybiral|6 years ago|reply
[0] https://www.robotstxt.org/robotstxt.html
[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
[+] [-] pbowyer|6 years ago|reply
That is changing, and was announced today: https://news.ycombinator.com/item?id=20326067
[+] [-] C1sc0cat|6 years ago|reply
Disallow
The D got mangled and that disallow directive got ignored.
[+] [-] simonw|6 years ago|reply
[+] [-] lucasmullens|6 years ago|reply
[+] [-] circular_logic|6 years ago|reply
This may [0] be because it is exported from their monorepo.
[0] https://news.ycombinator.com/item?id=14811937
[+] [-] douglasfshearer|6 years ago|reply
Whilst I am sure there are good reasons for the omission, it would have been interesting to see the entirety of the commit history for this library.
[+] [-] johannes1234321|6 years ago|reply
[+] [-] rasmi|6 years ago|reply
[+] [-] glenneroo|6 years ago|reply
[+] [-] jchw|6 years ago|reply
https://github.com/google/robotstxt/blob/master/robots_test....
[+] [-] steventhedev|6 years ago|reply
[0]: https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1
[1]: https://github.com/robots.txt
[2]: https://wpengine.com/robots.txt
[+] [-] Causality1|6 years ago|reply
[+] [-] hitpointdrew|6 years ago|reply
[+] [-] rococode|6 years ago|reply
[+] [-] jedberg|6 years ago|reply
[+] [-] AceJohnny2|6 years ago|reply
http://www.antipope.org/charlie/blog-static/2009/06/how_i_go...
(reminds me how Y Combinator's co-founder Robert Morris has a bit of youthful notoriety from a less innocent program)
[1] and former code monkey from the dot-com era
[+] [-] orf|6 years ago|reply
1. https://github.com/google/robotstxt/blob/master/robots.cc#L6...
[+] [-] badrequest|6 years ago|reply
[+] [-] noir-york|6 years ago|reply
But I'm sure someone out there will fuzz it...
[+] [-] dev_dull|6 years ago|reply
[+] [-] pedrorijo91|6 years ago|reply
[+] [-] H8crilA|6 years ago|reply
[+] [-] jhabdas|6 years ago|reply
[+] [-] TomAnthony|6 years ago|reply
Good luck! :)
[1] https://support.google.com/webmasters/answer/93710?hl=en [2] https://support.google.com/webmasters/answer/1061943
[+] [-] nn3|6 years ago|reply
[+] [-] leftnode|6 years ago|reply
[+] [-] cmrdporcupine|6 years ago|reply
Honestly, excessive cleverness does not generally pass code review @ Google. Especially something that would get this many eyes.
[+] [-] brainfog|6 years ago|reply
[+] [-] jaredcwhite|6 years ago|reply
[+] [-] goddtriffin|6 years ago|reply
[+] [-] danielovichdk|6 years ago|reply
[+] [-] unchic|6 years ago|reply
[+] [-] Tepix|6 years ago|reply
There's already https://github.com/temoto/robotstxt
[+] [-] jerf|6 years ago|reply