Where was this 10 years ago when I was reverse engineering the Google robots.txt parser by feeding example robots.txt files and URLs into the Google webmaster tool? I actually went so far as to build a convoluted honeypot website and robots.txt to see what the Google crawler would do in the wild.
Having written the robots.txt parser at Blekko, I can tell you that the standards that exist are incomplete and inconsistent.
Robots.txt files are usually written by hand in random text editors ("\n" vs "\r\n" vs a mix of both!) by people who have no idea what a programming language grammar is, let alone how to follow the BNF in the RFC. There are situations where adding a newline completely negates all your rules: specifically, newlines between user-agent lines, or between user-agent lines and their rules.
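Step one in a lenient parser is tolerating whatever line endings the webmaster's editor produced. A minimal sketch (illustrative, not Blekko's or Google's actual code):

```python
import re

def split_robots_lines(body: str) -> list[str]:
    # Tolerate "\n", "\r\n", "\r", and any mix of them in one file,
    # since hand-written robots.txt files use all three.
    return [line.strip() for line in re.split(r"\r\n|\r|\n", body) if line.strip()]

mixed = "User-agent: *\r\nDisallow: /tmp/\rDisallow: /private/\n"
print(split_robots_lines(mixed))
# → ['User-agent: *', 'Disallow: /tmp/', 'Disallow: /private/']
```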
My first inclination was to build an RFC compliant parser and point to the standard if anyone complained. However, if you start looking at a cross section of robots.txt files, you see that very few are well formed.
Then there are sitemaps, crawl-delay, and other non-standard syntax adopted by Google, Bing, and Yahoo (RIP). Clearly the RFC is just a starting point: what ends up on websites can be broken, and the author's meaning hard to interpret. For example, the Google parser allows for five possible spellings of DISALLOW, including DISALLAW.
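Misspelling tolerance is easy to sketch; the exact set of spellings below is illustrative (the real list lives in Google's robots.cc), but the idea is just a case-insensitive lookup against known variants:

```python
# Hypothetical spelling set, in the spirit of Google's parser;
# the authoritative list is in robots.cc, not here.
DISALLOW_SPELLINGS = {"disallow", "dissallow", "disalow", "dissalow", "disallaw"}

def is_disallow_key(key: str) -> bool:
    # Keys in the wild vary in case and spelling; normalize before matching.
    return key.strip().lower() in DISALLOW_SPELLINGS

print(is_disallow_key("Disallaw"))  # a variant actually seen in the wild
```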
If you read a few webmaster boards, you see that many website owners don't want a lesson in Backus–Naur form and are quick to get the torches and pitchforks if they feel some crawler is wasting their precious CPU cycles or cluttering up their log files. Having a robots.txt parser that "does what the webmaster intends" is critical. Sometimes, I couldn't figure out what some particular webmaster intended, let alone write a program that could. The only solution was to draft off of Google's de facto standard.
(To the webmaster with the broken robots.txt and links on every product page with a CGI arg with "&action=DELETE" in it, we're so sorry! but... why???)
Accidentally deleting someone's entire website because they don't understand the difference between GET and POST requests is virtually a rite of passage when writing a web crawler.
It's an easy fix if Google cared. Have an online tool that validates if the robots.txt is correct, and send out an announcement that files that don't meet spec will be penalized in terms of SEO.
I've been in disagreements with SEO people quite frequently about a "Noindex" directive for robots.txt. There seem to be a bunch of articles that are sent to me every time I question its existence [0][1]. Google's own documentation says that noindex should go in the HTML meta tags, but the SEO people seem to trust these shady sites more.
I haven't read through all of the code, but assuming this is actually what's running on Google's scrapers, this section [2] seems to be pretty conclusive evidence to me that this Noindex thing is bullshit.
Google has been very clear lately (via John Mueller) regarding getting pages indexed or removed from the index.
If you want to make sure a URL is not in their index then you have to 'allow' them to crawl the page in robots.txt and use a noindex meta tag on the page to stop indexing. Simply disallowing the page from being crawled in robots.txt will not keep it out of the index.
In fact, I've seen plenty of pages still rank well despite being disallowed in robots.txt. A great example of this is the keyword "backpack" in Google. You'll see the site doesn't want it indexed (it's disallowed in robots.txt), but the site still ranks well for a popular keyword.
> but the SEO people seem to trust these shady sites more.
It makes more sense when you realize that the SEO people (with a few exceptions) are usually pretty shady as well. You rarely hear them recommending that you write better content to get better results, it's always nonsense like "put nofollow on everything so your score doesn't leak".
The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one and almost every modern website deviates from it.
For instance it explicitly says "To exclude all files except one: This is currently a bit awkward, as there is no "Allow" field."
And the behavior is so different between different parsers and website implementations that, for instance, the default parser in Python can't even successfully parse twitter.com's robots.txt file because of the newlines.
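For comparison, Python's stdlib parser is easy to try. One divergence worth knowing (assuming current CPython behavior): `urllib.robotparser` applies the first matching rule, while Google's parser uses longest-match precedence, so an `Allow` for a more specific path after a broader `Disallow` loses in Python but would win with Google:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# parse() takes an iterable of lines; splitting ourselves avoids surprises
# with unusual line endings in fetched files.
rp.parse("User-agent: *\nDisallow: /private/\nAllow: /private/ok\n".splitlines())

print(rp.can_fetch("mybot", "https://example.com/private/x"))   # False
print(rp.can_fetch("mybot", "https://example.com/public"))      # True
# First-match precedence: the broader Disallow shadows the later Allow.
print(rp.can_fetch("mybot", "https://example.com/private/ok"))  # False
```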
Most search engines obey it as a matter of principle but not all crawlers or archivers [1] do.
It's a good example of missing standards in the wild.
> The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one and almost every modern website deviates from it.
When I read "This library has been around for 20 years and it contains pieces of code that were written in the 90's" my first thought was "that commit history must be FASCINATING".
From Google's perspective it's probably too much work. I would assume this was part of the crawler code and extracted over time into a library. While part of the monorepo, changesets probably didn't touch only this code but also other parts, and this code probably depended on internal libraries (now it depends on Google's public Abseil library). Publishing all that needs lots of review (also considering names and other personal information in commit logs, TODO comments, and the like).
// A user-agent line is expected to contain only [a-zA-Z_-] characters and must
// not be empty. See REP I-D section "The user-agent line".
// https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1
So you may need to adjust your bot’s UA for proper matching.
(Disclosure, I work at Google, though not on anything related to this.)
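The comment quoted above means only the product token portion of a UA participates in matching. A hedged sketch of the idea (not Google's actual matching code): extract the leading `[a-zA-Z_-]` run and compare case-insensitively, ignoring versions and URLs.

```python
import re

def product_token(user_agent_line: str) -> str:
    # Per the REP draft, only [a-zA-Z_-] characters count toward matching;
    # anything after them (version numbers, contact URLs) is ignored.
    m = re.match(r"[a-zA-Z_-]+", user_agent_line.strip())
    return m.group(0).lower() if m else ""

print(product_token("FooBot/1.2 (+http://example.org/bot)"))  # → "foobot"
```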
The strictness is in what may be listed in robots.txt, not in the User-Agent header as sent by bots. The example given in the linked draft standard [0] makes it abundantly clear that it's on the bot to understand how to interpret the corresponding lines of robots.txt.
Of course, in practice robots.txt tend to look less like [1] and more like [2].
I wonder how much noindex contributes to lax security practices like storing sensitive user data on public pages and relying on not linking to the page to keep it private. I wonder how much is in the gap between "should be indexed" and "really ought to restrict access to authorized users only".
> how should they deal with robots.txt files that are hundreds of megabytes large?
What do huge robots.txt files like that contain? I tried a couple domains just now and the longest one I could find was GitHub's - https://github.com/robots.txt - which is only about 30 kilobytes.
They enumerate every page on the site, sometimes separately for different crawlers.
Or they have a ton of auto-generated pages they don't want crawled and call them out individually because they don't realize robots.txt supports globbing.
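The globbing in question is the de facto wildcard syntax Google and Bing support (not part of the original 1994 spec): `*` matches any run of characters and a trailing `$` anchors at the end of the path. A sketch of translating such a pattern into a regex:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any character sequence; a trailing '$' anchors the match
    # at the end of the URL path. Everything else is matched literally,
    # and rules are prefix matches (hence re.match, not re.search).
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/products/*.php$")
print(bool(rule.match("/products/item.php")))   # True
print(bool(rule.match("/products/item.html")))  # False
```

One `Disallow: /products/*.php$` line replaces thousands of enumerated URLs.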
I doubt there are any vulns in the code, seeing as its job for the last 20 years has been to parse input from the wild west that is the internet, and survive.
Can this be seen as an initiative to make Google's robots.txt parser the internet standard? Every webmaster will want to be compliant with Google's corner cases...
There is a difference between robots.txt blocking a page and noindexing a page.
Blocking in robots.txt will stop Googlebot downloading that page and looking at the contents, but the page may still make it into the index on the basis of links to that page making it seem relevant (it will appear in the search results without a description snippet and will include a note about why).
To have a page not appear in the index you need to use a 'noindex' directive [1], either in the page itself or in the HTTP headers. However, note that if the page is blocked in robots.txt, Google cannot read that noindex directive.
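The interaction described above is counterintuitive enough to be worth spelling out as a decision table (a sketch of the described behavior, not Google's actual code):

```python
def may_appear_in_index(blocked_by_robots: bool, has_noindex: bool) -> bool:
    # A robots.txt block prevents the crawler from ever *seeing* a noindex
    # directive, so the page can still enter the index via external links.
    if blocked_by_robots:
        return True   # never fetched; noindex (if any) is invisible
    return not has_noindex

print(may_appear_in_index(blocked_by_robots=True,  has_noindex=True))   # True
print(may_appear_in_index(blocked_by_robots=False, has_noindex=True))   # False
```

In other words, to keep a page out of the index you must allow crawling and serve the noindex.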
Also, in the StackOverflow response you linked to, the user agent is listed just as 'Google', but it should be 'Googlebot', as per the 'User agent token (product token)' column in [2].
That's actually nice, straightforward, and relatively simple. I had expected something over-engineered, with at least parts of the code dedicated to demonstrating how much smarter the author is than you. But it's not. Just a simple parser.
I expected the same (complex project structure, too many files, difficult to read, etc), but I love everything about this library. Easy to read, concise code, in two simple files. Very well tested, both by automated tests and the real world. Sticks to the Unix philosophy: does one thing and does it well.
Can you imagine how many billions of times this code has been executed? I love software like this.
Seems strange to get excited about a robots.txt parser, but I feel oddly elated that Google decided to open source this. Would it be too much to hope that additional modules related to Search get released in the future? Google seems all too happy to play the "open" card except where it directly impacts their core business, so this is a good step in the right direction.
I don't understand the entire architecture behind search engines, but this seems like a pretty decent chunk of it.
What are the chances that Google is releasing this as a preemptive response to the likely impending antitrust action against them? It would allow them to respond to those allegations with something like, "All the technology we used to build a good search engine is out there. We can't help it if we're the most popular." (And they could say the same about most of their services: Gmail, Drive, etc.)
This is the sort of code you write a binding to and call it a day, since the entire point is to absolutely precisely match the behavior of this code, which is basically a specification-by-code. You can never be sure a re-implementation would be absolutely precisely the same in behavior, so it's not worth doing.
[+] [-] randomstring|6 years ago|reply
Here's the Perl for the Blekko robots.txt parser. https://github.com/randomstring/ParseRobotsTXT
[+] [-] saalweachter|6 years ago|reply
[+] [-] asdfman123|6 years ago|reply
[+] [-] vinay_ys|6 years ago|reply
[+] [-] jxcl|6 years ago|reply
[0] https://www.deepcrawl.com/blog/best-practice/robots-txt-noin...
[1] https://www.stonetemple.com/does-google-respect-robots-txt-n...
[2] https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613...
[+] [-] jxcl|6 years ago|reply
:D
[+] [-] bhartzer|6 years ago|reply
[+] [-] Youden|6 years ago|reply
[+] [-] inh|6 years ago|reply
Google has now clarified that they're removing the code behind the undocumented items, with noindex called out explicitly.
https://webmasters.googleblog.com/2019/07/a-note-on-unsuppor...
It wasn't officially supported / the recommended way, but it worked (in many cases).
[+] [-] adrianmonk|6 years ago|reply
For those who (like me) don't know a lot about this, which side of the argument is bullshit? Have you just been proved right or wrong?
[+] [-] seanlinmt|6 years ago|reply
[+] [-] wybiral|6 years ago|reply
[0] https://www.robotstxt.org/robotstxt.html
[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
[+] [-] pbowyer|6 years ago|reply
That is changing, and was announced today: https://news.ycombinator.com/item?id=20326067
[+] [-] C1sc0cat|6 years ago|reply
Disallow
The D got mangled and that disallow directive got ignored.
[+] [-] simonw|6 years ago|reply
[+] [-] lucasmullens|6 years ago|reply
[+] [-] circular_logic|6 years ago|reply
This may [0] be because it is exported from their monorepo.
[0] https://news.ycombinator.com/item?id=14811937
[+] [-] douglasfshearer|6 years ago|reply
Whilst I am sure there are good reasons for the omission, it would have been interesting to see the entirety of the commit history for this library.
[+] [-] johannes1234321|6 years ago|reply
[+] [-] rasmi|6 years ago|reply
[+] [-] glenneroo|6 years ago|reply
[+] [-] jchw|6 years ago|reply
https://github.com/google/robotstxt/blob/master/robots_test....
[+] [-] steventhedev|6 years ago|reply
[0]: https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1
[1]: https://github.com/robots.txt
[2]: https://wpengine.com/robots.txt
[+] [-] Causality1|6 years ago|reply
[+] [-] hitpointdrew|6 years ago|reply
[+] [-] rococode|6 years ago|reply
[+] [-] jedberg|6 years ago|reply
[+] [-] AceJohnny2|6 years ago|reply
http://www.antipope.org/charlie/blog-static/2009/06/how_i_go...
(reminds me how Y Combinator's co-founder Robert Morris has a bit of youthful notoriety from a less innocent program)
[1] and former code monkey from the dot-com era
[+] [-] orf|6 years ago|reply
1. https://github.com/google/robotstxt/blob/master/robots.cc#L6...
[+] [-] badrequest|6 years ago|reply
[+] [-] noir-york|6 years ago|reply
But I'm sure someone out there will fuzz it...
[+] [-] dev_dull|6 years ago|reply
[+] [-] pedrorijo91|6 years ago|reply
[+] [-] H8crilA|6 years ago|reply
[+] [-] jhabdas|6 years ago|reply
[+] [-] TomAnthony|6 years ago|reply
Good luck! :)
[1] https://support.google.com/webmasters/answer/93710?hl=en [2] https://support.google.com/webmasters/answer/1061943
[+] [-] nn3|6 years ago|reply
[+] [-] leftnode|6 years ago|reply
[+] [-] cmrdporcupine|6 years ago|reply
Honestly, excessive cleverness does not generally pass code review @ Google. Especially something that would get this many eyes.
[+] [-] brainfog|6 years ago|reply
[+] [-] jaredcwhite|6 years ago|reply
[+] [-] goddtriffin|6 years ago|reply
[+] [-] danielovichdk|6 years ago|reply
[+] [-] unchic|6 years ago|reply
[+] [-] Tepix|6 years ago|reply
There's already https://github.com/temoto/robotstxt
[+] [-] jerf|6 years ago|reply