
Why Loggly Chose AWS Route 53 Over Elastic Load Balancing

77 points | jtblin | 11 years ago | loggly.com

48 comments

[+] former_loggly|11 years ago|reply
Former Loggly employee here. Loggly is at CTO #3 or 4 in about 3 years. The CEO, marketing guy with black turtle neck, "runs" engineering. It is NOT an engineering company and they are on their way to outsourcing all development to India.

Formerly, they had all of their EC2 instances configured to run without swap and didn't use EBS, such that instances would crash 1-3 times a day and lose all data, which would require 1-2 day customer data restores.

Additionally, this Java shop oversubscribed threads on every Solr box, which forced them to restart each Solr instance every hour. To think any revolutionary engineering ideas come from a former Apple marketing wannabe who puts outsourced Indian engineering in place as yes-men is a huge stretch.

Let's be honest: Loggly is in huge trouble and can't hire quality engineering talent, and as a result is trying to remarket itself as an engineering-driven company while it outsources to India.

The key question isn't "do you use DNS or Elastic Load Balancing?" It's "what is your VOLUNTARY RATE OF ATTRITION?" Hint: really bad!

[+] hijinks|11 years ago|reply
I interviewed there for a devops/sysadmin role and ran far away after that process, once I learned what is going on and what problems the ops group has to solve.

Then reading this fluff piece made me glad I never even considered working there after that phone interview. Whoever claims DNS round robin is a good way to handle failover doesn't really know what they are talking about. I haven't dug into how something like rsyslog handles a DNS request; my guess is it just passes it off to the OS.
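For what it's worth, "passing it off to the OS" would look roughly like this: the forwarder resolves the hostname once via the system resolver when it connects, and the socket stays pinned to that IP until it reconnects. A minimal sketch with the stdlib (the host/port are illustrative, not Loggly's):

```python
import socket

# A typical TCP log forwarder resolves the destination once, at connect
# time, via the OS resolver (getaddrinfo). After that, the socket is
# pinned to whatever IP was returned -- DNS record changes go unnoticed
# until the forwarder tears down and re-establishes the connection.
def connect_forwarder(host: str, port: int) -> socket.socket:
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    family, socktype, proto, _, addr = infos[0]  # first answer wins
    sock = socket.socket(family, socktype, proto)
    sock.connect(addr)
    return sock
```

So a 60-second TTL only helps clients that actually re-resolve; a long-lived forwarder connection never sees the updated record.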

But what I got from this is that Loggly is OK with losing customer data.

[+] boulos|11 years ago|reply
I'm disappointed this is the top rated comment. Your repeated bigotry towards "outsourced Indian engineering" combined with using a throwaway detract from what appears to be potentially useful information (Loggly CTO changes, running without EBS and without swap, etc.). Can someone else corroborate the factual bits without the xenophobia?
[+] fubu|11 years ago|reply
Serious question: Are people upvoting this to poke fun like some kind of daily wtf?

A logging platform that lists 1 of their 2 major requirements as "To not drop any data, ever" is using round robin DNS for fault tolerance? I can't see too many people on HN upvoting this for being insightful or impressive.

Edit: I just can't help myself. How are you going to send syslog when any server fails and not "drop any data, ever"? Even over TCP, the in-transit messages are lost when the connection is broken. So, like, their business is basically syslog and they don't know that?
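The in-transit loss point can be seen with a plain socket pair: a TCP send "succeeding" only means the kernel buffered the bytes, not that the receiver processed them. A small stdlib sketch (the syslog line is made up):

```python
import socket

# TCP gives no application-level delivery guarantee: sendall() returns
# as soon as the kernel has buffered the bytes. If the peer dies before
# reading, the in-flight log line is simply gone, and the sender never
# finds out unless the protocol layers its own acks on top.
a, b = socket.socketpair()
a.sendall(b"<14>app: important log line\n")  # "succeeds" immediately
b.close()  # receiver dies without ever reading the line
# Nothing here tells the sender the message was lost.
a.close()
```

That's why reliable log shippers need an on-disk buffer plus application-level acknowledgments, not just TCP.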

[+] latch|11 years ago|reply
I upvoted it in the hopes that someone would provide the missing piece. Like "oh, we forgot to mention that the DNS is pointing to our own haproxy servers that all have redundant power/network/whatever" or something.
[+] zimbatm|11 years ago|reply
> If there is an issue with a collector, Route 53 automatically takes it out of the service; our customers won’t see any impact.

Except when, for example, rsyslog caches DNS resolution forever. Or when the log forwarder doesn't have a buffer and logs get lost.

[+] korzun|11 years ago|reply
Yeah, I don't get their approach. There is no way this will ensure 100% delivery if one server within that rotation fails.

The chances of failure go up dramatically if 2+ hosts behind the round robin fail, etc.

Not to mention that once hosts resolve this to an IP, they will re-use the route. This approach is not balanced.
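A back-of-the-envelope version of that: with plain round robin and clients pinned to their cached answer, roughly k/n of the senders black-hole their logs when k of n collectors are down (numbers below are illustrative):

```python
# With round-robin DNS and clients that cache one resolved IP, a dead
# collector takes its whole slice of pinned senders down with it.
# Crude model: failures spread evenly across the rotation.
def affected_fraction(total_collectors: int, failed: int) -> float:
    return failed / total_collectors

# e.g. 2 of 8 collectors down -> ~25% of pinned clients losing logs
# until they happen to reconnect and re-resolve.
```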

I don't want to be /that/ guy, but if they can't scale with ELB, they should invest in dedicated load balancer infrastructure that can offload requests to their cloud instances.

This is a really bizarre post.

[+] skuhn|11 years ago|reply
Lots of other comments have torn this article apart (and justifiably so), but I still feel the need to pile on.

In their docs, Loggly only gives out one API endpoint: logs-01.loggly.com.

It is referenced as the endpoint for HTTP, HTTPS, syslog and syslog TLS. These seem to be the only methods available to send log data to them.

There is the obvious problem that a DNS record with a 60s TTL cannot possibly receive every single packet sent to it in the event of a server failure. Even if the returned IP address is an elastic IP, it takes a substantial amount of time to move to another instance in AWS.

I don't know why you would use the same service hostname for all of these endpoints. Separate names for each endpoint, even if they all pointed to the same pool of hosts, would at least give some flexibility in the future when they have enough traffic to get desperate about capacity. I would also think they might want to segregate native syslog from HTTP traffic, since I presume it uses different processes on the backend.

It's also curious that they chose to return only one A record. DNS RR is a poor substitute for real load balancing, but it's better than nothing. With multiple A records, there is at least a chance that some of their traffic will go to other servers -- rather than all of it potentially going to one as it is now.
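The multiple-A-records approach is easy to sketch with the stdlib: if the name returned several addresses, even a naive client could pick one at random instead of always taking the resolver's first answer (hostname and port below are illustrative):

```python
import random
import socket

# With multiple A records behind one name, a client can spread load by
# choosing a random address from the answer set instead of always
# connecting to the first one the resolver happens to return.
def pick_collector(host: str, port: int) -> tuple:
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addrs = sorted({info[4] for info in infos})  # de-duplicate answers
    return random.choice(addrs)
```

Returning a single A record forfeits even this weak form of balancing: every client that resolves during the same window lands on the same box.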

While they made no claims about using Route 53 for its geo-DNS capabilities, I still found it amusing that I was sent to a US East IP from California. Not that it's super critical that my log lines get delivered quickly, but it is ideal to shorten the path of an insecure and unreliable transport in order to improve durability. I would never ship syslog out to some host on the Internet in the first place, but a host 16 hops away is even more ludicrous.

I think their article says a lot more about how poorly ELBs function once you exceed the low traffic threshold they are seemingly designed for than about how well Route 53 works (and it is a decent static DNS service). The inability to robustly direct incoming traffic is the Achilles' heel of AWS.

[+] mbell|11 years ago|reply
There is a rather large technical divide between 'no logs left behind' and relying on DNS lookup to provide that guarantee.
[+] lfuller|11 years ago|reply
I was thinking the exact same thing while reading this - it reads like a company that doesn't understand the unique challenges involved with distributed computing.

I'm actually in the middle of deciding between Loggly, Papertrail, and Logentries for centralized log management. I guess that cuts it down to two.

[+] philip1209|11 years ago|reply
This is primitive. It seems like they are on the verge of discovering BGP, which could be used to provide scalability, load balancing, and clean failover without DNS caching issues.
[+] mey|11 years ago|reply
Why would you allow your clients to transmit potentially sensitive data to you as clear text over the internet?
[+] korzun|11 years ago|reply
As far as I can tell they use Route 53 to identify targets.

The transmission type is still up to the client/receiver.

[+] Goopplesoft|11 years ago|reply
I'm sure it's because rsyslog supports it (the article mentions staying compatible with rsyslog many times).
[+] ejain|11 years ago|reply
What are some alternatives to Loggly? I really like being able to aggregate my logs with minimal setup (and cost). I'm logging with Logback (Java), and there is a convenient extension that forwards log statements to Loggly.
[+] renaud34|11 years ago|reply
There are definitely some, like Logentries or Sumo Logic. But the best I have seen so far is logmatic.io; to be honest, I work for them :). Our objective is to build one tool for all by allying the power of search with true business intelligence. Here is a video of our product as it was 6 months ago: http://bit.ly/logsjava. We are still in private beta, our prices are similar to Loggly's, and we already have tens of customers.
[+] mrucci|11 years ago|reply
Interesting points. Here are a few things you'll miss by choosing Route 53 over ELB:

* HTTPS termination.

* Autoscaling group management. By connecting an ELB to an autoscaling group, the registration and deregistration logic is fully managed behind the scenes. With Route 53, you have to implement it yourself.

* Minimum autoscaling group size. If you enable ELB health checks, you can rely on the ELB to maintain a group of instances of constant size.
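The "implement it yourself" part of the second point boils down to issuing an UPSERT or DELETE change per instance as it comes and goes. A hedged sketch; the change-batch shape is the standard Route 53 API format, but the zone ID, record name, and IP are made up, and the boto3 call is an assumption about how one might wire it up:

```python
# Sketch of the register/deregister logic Route 53 makes you write
# yourself when you skip ELB. The dict below follows the Route 53
# ChangeResourceRecordSets change-batch format; names are illustrative.
def collector_change(action: str, name: str, ip: str, ttl: int = 60) -> dict:
    assert action in ("UPSERT", "DELETE")
    return {
        "Changes": [{
            "Action": action,
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }]
    }

# An autoscaling lifecycle hook could then apply it via boto3 (assumed):
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZEXAMPLE123",
#     ChangeBatch=collector_change("UPSERT", "logs-01.example.com.",
#                                  "203.0.113.7"))
```

Even then, the deregistered IP keeps receiving traffic from every client still holding a cached answer, which is exactly the gap the other commenters are pointing at.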

[+] spamizbad|11 years ago|reply
ELB's HTTPS termination is mediocre and, last I checked, doesn't offer the best ciphers. A year ago, it was impossible to get an A+ on SSL Labs' ssltest (https://www.ssllabs.com/ssltest/) using ELB to terminate SSL.

Not to mention it still needlessly includes a ton of dangerously insecure ciphers just begging to be misclicked.

[+] shackledtodesk|11 years ago|reply
ELB scales horribly and cannot scale to even tens of thousands of connections per second, let alone handle spikes of 100k/sec simultaneous connections. Even if you get AWS to prewarm the ELB to a higher peak rate, if you spike over those limits you will drop new incoming connections. HTTPS termination is trivial compared to a requirement to actually handle hundreds of thousands to millions of simultaneous connections per second.
[+] latch|11 years ago|reply
Is HTTPS termination really worth mentioning? They'll still be running some type of web server (nginx, Apache, whatever), and enabling HTTPS termination there is probably easier than going through the ELB wizard.
[+] kfnic|11 years ago|reply
What kind of TTL value would they use for these records? Should something happen to one of the collectors, couldn't that value still be cached by an endpoint or an intermediary?

Even with a short TTL, are there still resolvers out there that don't respect TTLs, or has that been eliminated by now?

[+] hnhipster|11 years ago|reply
Everyone should use hosted services for everything. Soon we'll have hosted services for hosted services. (I actually worked at a company that was a hosted service running mainly off of another hosted service + AWS.)
[+] hobs|11 years ago|reply
>Amazon Route 53 DNS Round Robin Was a Win

>If you’ve ever used the Internet, you’ve used the Domain Name System, or DNS, weather you realize it or not.

Interesting article, but "weather" is the wrong word in that sentence (should be "whether").

[+] KarenS|11 years ago|reply
Thanks for noticing this! It's fixed now.
[+] jcampbell1|11 years ago|reply
It seems odd to leave out any discussion of DNS TTLs, and the risk that something like 8.8.8.8 could end up sending them a thundering herd.
[+] lpgauth|11 years ago|reply
What kind of time granularity can you get for health checks on ELB vs Route 53?
[+] mrucci|11 years ago|reply
ELB Health Check: min, max = (1s, 300s)

Route 53 Health Check: either 10s or 30s