How Algolia Reduces Latency

[+] simonw|9 years ago|reply

This is really interesting. Impressive to see a startup with so much bare metal hardware (800+ servers in 47+ data centers), due to their need to power autocomplete-style search and hence offer the lowest latency possible.

One detail I particularly liked (buried deep in the article) was this one: "Once a machine is detected to be down, we push a DNS change to take it out of the cluster. The upper bound of propagation for that change is 2 minutes (DNS TTL). During this time, API clients implement their internal retry strategy to connect to healthy machines in the cluster, so there is no customer impact."

Offering client API libraries that have a retry strategy baked-in and relying on that for part of your high availability strategy is very neat.

[+] dzello|9 years ago|reply

Thanks for the comment simonw. It's not directly mentioned in the article but we're also using the API clients to implement a DNS fallback strategy. If the hosts are unreachable through their primary hostnames (.algolia.net), the clients try alternate names (.algolianet.com) that are hosted by a different provider.

[+] danielvf|9 years ago|reply

"When a query takes more than 1 second we send an alert into Slack."

One day, during a problem, every single query will take over a second, and this will be an exciting Slack channel to be in.

[+] andrewf|9 years ago|reply

Several people are typing.

[+] geeio|9 years ago|reply

I'd hope they have some sort of rate limiting built in

[+] faizshah|9 years ago|reply

Would be interesting to see the story of how you guys started from square one and built out your bare metal infrastructure.

Really love the docsearch project btw, great UX.

EDIT: Just found this blog post right after posting this comment https://stories.algolia.com/algolia-s-fury-road-to-a-worldwi...

[+] dzello|9 years ago|reply

Thanks! Glad you're liking DocSearch. That's a good post about the journey. There are a few more listed in our awesome-algolia repo: https://github.com/algolia/awesome-algolia#blog-posts.

[+] sergiotapia|9 years ago|reply

Algolia is a fantastic product and this article helped me peek behind the curtain as to what makes it tick.

[+] mythrwy|9 years ago|reply

Agreed. Very interesting to see behind the scenes.

Implanted search for a client using Algolia last month and was completely blown away. The speed queries returned at were amazing. (Although it wasn't my idea to use Algolia I'll definitely be looking for opportunities to use it again).

[+] lazyjones|9 years ago|reply

Isn't a low DNS TTL problematic? DNS lookups are often slow on clients. Wouldn't something like Wackamole (IP address takeover on local networks with microsecond-latency on failure; dead project now though apparently: https://github.com/postwait/wackamole) help avoid this? We built our load balancers this way at my previous company...

[+] jlemoine|9 years ago|reply

You're right that low DNS TTL is not perfect (we saw few providers that override the TTL to reduce the number of DNS queries going out of their network, this is a big hack but cause some trouble). This problem is addressed by our API clients that have different DNS endpoint to reach the 3 machines of a cluster.

We cannot use any local network IP or load-balancer as we distribute a cluster on several providers with different autonomous systems. This is how we are able to offer SLA of up to 99.999% with a big refund strategy: https://blog.algolia.com/for-slas-theres-no-such-thing-as-10...

[+] prashnts|9 years ago|reply

> The S3 bucket sits behind CloudFlare to make downloading the binaries fast from anywhere.

Why not CloudFront? Moving binaries through CF CDN might not do for you. It's also seen that they don't really like you moving large data through their service [0].

[0] https://news.ycombinator.com/item?id=12825719

[+] threeseed|9 years ago|reply

CloudFlare has 105 PoPs. CloudFront has 45 PoPs.

That's a big difference.

[+] jlemoine|9 years ago|reply

We were using CloudFront at the beginning but we had a lot of performance problems to deploy our binaries worldwide (especially in Africa and Russia). We have seen a big performance improvement by switching to Cloudflare that have a POP in all region where we deploy binaries

[+] gumby|9 years ago|reply

This is tangental but: Now the term "bare metal" has been co-opted to mean "uses an O/S but no virtualization" what are those of us who run our software on bare metal supposed to call what we do?

[+] kevinsimper|9 years ago|reply

Really interesting that they run their application as a nginx module, that really goes to "keep it simple" and that you may not always need a cluster (I know they have a cluster, but that handles the clients)

[+] hamandcheese|9 years ago|reply

> Before deployment begins, another process has encrypted our binaries and uploaded them to an S3 bucket.

Is this to ensure data integrity, or some other purpose?

[+] dzello|9 years ago|reply

The encryption is for security. The upload is so we can front the S3 bucket with a CDN for fast download of the binaries from any region.

[+] alainmf|9 years ago|reply

Great article!

21 comments