top | item 24338758

jd20 | 5 years ago

Some fun facts:

- Applebot was originally written in Go (and uncovered a user agent bug on redirects, revealing it's Go origins to the world, which Russ Cox fixed the next day).

- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

- In it's first week of existence, it nearly took Apple's internal DNS servers offline. It was then modified to do it's own DNS resolution and caching, fond memories...

Source: I worked on the original version.

ospider|5 years ago

> It was then modified to do it's own DNS resolution and caching, fond memories...

Unlike most other languages, Go bypasses the system's DNS cache and goes directly to the DNS server, which is the root cause of many problems.

Spivak|5 years ago

This is true but a little misleading. On Windows, Go uses GetAddrInfo and DnsQuery, which do the right thing. But on Linux there are two options, netgo and netcgo: a pure-Go implementation that doesn't know about NSS, and a C wrapper that uses NSS.

Since netgo is faster, by default Go will try its best to determine if it must use netcgo: parsing /etc/nsswitch.conf, looking at the TLD, reading environment variables, etc.

If you're building the code you can force it to use netcgo by adding the netcgo build tag.

If you're an administrator, the least intrusive method, I think, is setting LOCALDOMAIN to something (or to '' if you can't think of anything), which will force it to use NSS.
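You can also make the choice explicit per-lookup rather than process-wide. A minimal sketch (assumes only the standard library's net package): PreferGo: true forces the pure-Go netgo path, while false leaves the netgo/netcgo decision to the runtime heuristics described above.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// lookup resolves host with an explicit resolver choice. pureGo == true
// forces the pure-Go (netgo) path; false falls back to the runtime's
// usual netgo/netcgo selection heuristics.
func lookup(host string, pureGo bool) ([]net.IPAddr, error) {
	r := &net.Resolver{PreferGo: pureGo}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return r.LookupIPAddr(ctx, host)
}

func main() {
	addrs, err := lookup("localhost", true)
	fmt.Println(addrs, err)
}
```

Note there is no programmatic way to force the cgo path from inside the program; that still requires the netcgo build tag or GODEBUG=netdns=cgo.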

tylfin|5 years ago

Yeah, I've never had to implement my own DNS cache for a language before...

If you're on a system with cgo available, you can use `GODEBUG=netdns=cgo` to avoid making direct DNS requests.

This is the default on macOS, so if it was running on four Mac Pros I wouldn't expect it to be the root cause.

oasisbob|5 years ago

And Java.

As I understand it, Go and Java are both trying to avoid FFI and calling out to system libs for name resolution.

I tend to always offer a local caching resolver available over a socket.
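One way to wire that up from Go without touching GODEBUG or build tags is to point a custom net.Resolver at the local cache's socket. A sketch, assuming something like dnsmasq or systemd-resolved is listening on 127.0.0.1:53:

```go
package main

import (
	"context"
	"net"
	"time"
)

// localResolver returns a resolver that sends every query to the given
// local caching resolver instead of whatever /etc/resolv.conf points at.
func localResolver(addr string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // required so our Dial func is actually used
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, addr)
		},
	}
}

func main() {
	r := localResolver("127.0.0.1:53")
	_ = r // use r.LookupIPAddr(ctx, host) in place of net.LookupIP
}
```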

ksec|5 years ago

>- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

Considering the timeline, are those Trash Can Mac Pro? Or was it the old Cheese Grater ?

jd20|5 years ago

Trash cans :)

nothis|5 years ago

>Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

The scale of web stuff sometimes surprises me. 1B web pages sounds like just about the daily web output of humanity? How can you handle this with 4 (fast) computers?

raxxorrax|5 years ago

Computers are very fast. We just tend to not notice because today's software is obese.

thdrdt|5 years ago

Doesn't it depend on a lot of things? For example, you can do just HEAD requests to see if a page changed since a given timestamp. If it hasn't, there is no need to process it.

throwaway4good|5 years ago

I am particularly curious about data storage.

Does it use a traditional relational database or another existing database-like product? Or is it built from scratch, just sitting on top of a file system?

jd20|5 years ago

Nope, you don't really need a database. What you need for fast, scalable web crawling is more like key-value storage: a really fast layer (something like RocksDB on SSD) for metadata about URLs, and another layer that can be very slow for storing crawled pages (like Hadoop or Cassandra). In reality, writing directly to Hadoop/Cassandra was too slow (because it was in a remote data center), so it was easier to just write to RAID arrays over Thunderbolt and sync the data periodically as a separate step.
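The fast metadata layer boils down to a small per-URL record behind a get/put interface. A toy in-memory sketch of that shape (the field names and a plain map standing in for RocksDB are my own illustration, not Applebot's actual schema):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// urlMeta is the small per-URL record kept in the fast KV layer.
type urlMeta struct {
	LastCrawled time.Time
	ETag        string
	FailCount   int
}

// metaStore stands in for a fast local KV store (RocksDB-on-SSD in the
// comment above); crawled page bodies would go to a separate, much
// slower bulk store.
type metaStore struct {
	mu sync.RWMutex
	m  map[string]urlMeta
}

func newMetaStore() *metaStore {
	return &metaStore{m: make(map[string]urlMeta)}
}

func (s *metaStore) Put(url string, meta urlMeta) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[url] = meta
}

func (s *metaStore) Get(url string) (urlMeta, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	meta, ok := s.m[url]
	return meta, ok
}

func main() {
	s := newMetaStore()
	s.Put("https://example.com/", urlMeta{LastCrawled: time.Now()})
	meta, ok := s.Get("https://example.com/")
	fmt.Println(ok, meta.LastCrawled)
}
```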

ricardo81|5 years ago

Interesting stuff. I've used libcurl to crawl at that kind of pace; is the parsing/indexing separate from that count per day? I'm also interested in how you dealt with DNS and/or rate limiting.

edoceo|5 years ago

I've done something similar at a smaller scale. Instead of messing with the underlying DNS or adding caching in our code, we just dropped a tuned dnsmasq in front as the resolver. The crawler had a separate worker to pre-fill host entries, so the cache was mostly hot by the time the crawler was asking.
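The "tuned" part mostly means raising the tiny defaults. A sketch of what such a dnsmasq config might look like; the specific values are illustrative guesses, not the commenter's actual settings:

```
# /etc/dnsmasq.conf -- illustrative tuning for a crawl box
cache-size=150000     # default is only 150 entries
min-cache-ttl=300     # avoid churn on very short upstream TTLs
all-servers           # query all upstreams, take the fastest answer
no-resolv             # ignore /etc/resolv.conf, use the servers below
server=8.8.8.8
server=1.1.1.1
```

The crawler's resolv.conf then just points at 127.0.0.1.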

pronoiac|5 years ago

Roughly estimating, each Mac Pro could crawl around 3k pages per second.
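The arithmetic behind that estimate, spelled out (1B pages/day spread over 4 machines):

```go
package main

import "fmt"

// pagesPerSecPerMachine converts a daily cluster-wide page count into a
// steady-state per-machine rate.
func pagesPerSecPerMachine(pagesPerDay, machines float64) float64 {
	return pagesPerDay / machines / 86400 // 86400 seconds per day
}

func main() {
	// 1e9 / 4 / 86400 ≈ 2.9k pages/sec per machine
	fmt.Printf("%.0f pages/sec per machine\n", pagesPerSecPerMachine(1e9, 4))
}
```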

polote|5 years ago

Which is not possible

NiekvdMaas|5 years ago

Can you share some more details about the current state? Is it still written in Go?

jd20|5 years ago

No idea, it's been years since I last worked on it. It was also not the only Go service written at Apple (90% of cloud services at Apple were written in Java), though it may have been the first one used in production.

doh|5 years ago

Can you talk more about the specifics? What kind of parsers did you guys use? How about storage? How often did you update pages?

jd20|5 years ago

You should check out Manning's "Introduction to Information Retrieval", it has far more detail about web crawler architecture than I can write in a post, and served as a blueprint for much of Applebot's early design decisions.

dx034|5 years ago

With 1B pages per day, I guess you needed 1 Gbit/s connections on each of those machines? Especially if they also wrote back to centralized storage.

I guess there are not many places where you can easily get 4 Gbit/s sustained throughput from a single office (especially with proxy servers and firewalls in front of it). Is that standard at Apple, or did the infrastructure team get involved to provide that kind of bandwidth?
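A back-of-the-envelope check on that bandwidth guess. The average page size is my assumption (around 50 KB of HTML per page), not a figure from the thread:

```go
package main

import "fmt"

// clusterGbit estimates sustained inbound bandwidth for the whole
// cluster, given a daily page count and an assumed average page size.
func clusterGbit(pagesPerDay, avgPageBytes float64) float64 {
	bytesPerSec := pagesPerDay * avgPageBytes / 86400
	return bytesPerSec * 8 / 1e9 // bytes/sec -> Gbit/s
}

func main() {
	// 1e9 pages/day at ~50 KB each ≈ 4.6 Gbit/s across four machines,
	// i.e. in the ballpark of 1 Gbit/s per machine.
	fmt.Printf("%.1f Gbit/s sustained\n", clusterGbit(1e9, 50e3))
}
```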

thatwasunusual|5 years ago

Do you have a timeline of how AppleBot has evolved?

Silasdev|5 years ago

Was that including the ability to render JS-driven, asynchronously loaded pages, including subsequent XHR requests? If so, it's beyond impressive.

matthewhartmans|5 years ago

Thanks for sharing, mate. Those are amazing insights!

netsharc|5 years ago

Sorry to be pedantic, but your misuse of apostrophes in an otherwise perfect text annoys me.

All three uses of "it's" should be "its".

And I would just write "Mac Pros" instead of "Mac Pro's".