item 5215884

Heroku's Ugly Secret: The story of how the cloud-king turned its back on Rails

1763 points | tomlemon | 13 years ago | rapgenius.com | reply

423 comments

[+] teich|13 years ago|reply
This is Oren Teich, I run Heroku.

I've read through the OP, and all of the comments here. Our job at Heroku is to make you successful and we want every single customer to feel that Heroku is transparent and responsive. Getting to the bottom of this situation and giving you a clear understanding of what we’re going to do to make it right is our top priority. I am committing to the community to provide more information as soon as possible, including a blog post on http://blog.heroku.com.

[+] doktrin|13 years ago|reply
Thanks for the response, but I have to admit that the lack of a clear-cut answer here is a little worrisome.

Anyone who wants to like Heroku would hope that the OP is flat out, 100%, wrong. The fact that Heroku's official answer requires a bit of managing implies otherwise.

On a related tangent, I would also encourage Heroku to make future public statements a little less opaque than some it has put out previously.

For instance, the cause of the outage last year was attributed to "...the streaming API which connects the dyno manifold to the routing mesh" [1]. While that statement is technically decipherable, it's far from clear.

[1] https://status.heroku.com/incidents/372

[+] bambax|13 years ago|reply
What's the point of posting a link to the front page of your blog, where the most recent article is 15 days old (4 hours after the comment above)?

What we want to know:

- is the OP right or wrong? That is, did you switch from smart to naive routing, for all platforms, and without telling your existing or future customers?

- if you did switch from smart to naive routing, what was the rationale behind it? (The OP is light on this point; there must be a good reason to do this, but he doesn't really say what it is or might be)

- if the OP is wrong, where might his problems come from?

- etc.

[+] character0|13 years ago|reply
While I think it is appropriate for Heroku to respond to this thread (and other important social media outlets covering this), linking to a blog without any messaging concerning your efforts might not be the greatest move... This may not be a sink or swim moment for Heroku, but tight management of your PR is key to mitigating damage. Best of luck, Heroku is a helpful product and I want to see you guys bounce back from the ropes on this one.
[+] GhotiFish|13 years ago|reply
I'm looking forward to hearing why Heroku is using such a strange load balancing strategy.
[+] tjbiddle|13 years ago|reply
Looking forward to your blog post. Hoping things get cleared up!
[+] willvarfar|13 years ago|reply
Hint: just use a RabbitMQ queue or something. Don't have a 'smart' LB that has to know everyone's state; instead, have dynos that pull more work as fast as they can.
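A minimal Ruby sketch of that pull model (names are illustrative, not Heroku internals): each worker pops its next job only when it is actually free, so the dispatcher never has to track who is busy, and a slow request never piles work up behind one stuck worker.

```ruby
require 'thread'

jobs    = Queue.new  # the "router" only enqueues; it tracks no dyno state
results = Queue.new

# Each "dyno" pulls its next request only when it's free.
workers = 4.times.map do |i|
  Thread.new do
    while (job = jobs.pop)       # nil acts as a poison pill
      sleep(job[:cost])          # simulate handling the request
      results << [i, job[:id]]
    end
  end
end

10.times { |id| jobs << { id: id, cost: 0.01 } }
handled = 10.times.map { results.pop }
workers.size.times { jobs << nil }
workers.each(&:join)
puts "handled #{handled.size} requests"
```

The tradeoff is the one the comment names: there is now a single shared queue instead of a stateless random pick.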
[+] avodonosov|13 years ago|reply
I hope the solution won't break the ability of multithreaded apps to receive several requests at once.
[+] antihero|13 years ago|reply
Can "Dynos" serve multiple requests simultaneously? That's the question, really.
[+] toast76|13 years ago|reply
Wow. This explains a lot.

We've always been of the opinion that queues were happening on the router, not on the dyno.

We consistently saw performance problems that we could tie to particular user requests (file uploads, for example, now moved to direct S3 uploads), but we could never figure out why these would result in queued requests given Heroku's advertised "intelligent routing". We mistakenly thought the occasional slow request couldn't create a queue....although evidence pointed to the contrary.

Now that it's apparent that requests are queuing on the dyno (although, from what I can gather, we have no way to tell), our occasional "slow requests" become all the more fatal: data exports, reporting, and any other non-paginated data request.

[+] 46Bit|13 years ago|reply
Several Rails apps I develop have been suffering from similar issues. Perhaps 2-3% of requests take 0.4-2s in processing alone. If the allocation is even a little intelligent, the app won't perform too badly, and living with it is less work than much harder optimization. But if the allocation is random, requests queue up horribly.

I'm pissed. Spent way too much time unable to explain it to coworkers, thinking I just didn't understand Heroku's platform and that it was my fault.

Turns out, I didn't understand it, because Heroku never thought to clearly mention something that's pretty important.

Easiest fix: moving to EC2 next week. I've wanted to ever since our issues became evident but it's hard to make a good argument from handwaving about 'problems'.

[+] tbenst|13 years ago|reply
Best thing we can do is follow through on the article's call-to-action for emailing [email protected]:

"After reading the following RapGenius article (http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics), we are reevaluating the decision to use Heroku. I understand that using a webserver like unicorn_rails will alleviate the symptoms of the dyno queuing problem, but as a cash-strapped startup, cost-efficiency is of high importance.

I look forward to hearing you address the concerns raised by the article, and hope that the issue can be resolved in a cost-effective manner for your customers."

[+] jaggederest|13 years ago|reply
It's interesting, because initially the way that queue time detection worked within New Relic was via timestamps.

Currently, though, I believe it's just fed as a number of milliseconds: https://github.com/newrelic/rpm/blame/master/lib/new_relic/a...

This solves the issue of the application seeing out-of-whack queue times when there's clock skew between the front-end routing framework and the actual dyno box, but it misses all the time spent queued on the dyno itself, per Rap Genius's post.
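For the timestamp approach, here is a sketch of a Rack middleware that derives queue time from the `X-Request-Start` header the router stamps on each request. The header is real, but its format has varied (bare integer vs `t=<ms>`), so the parsing is illustrative, and clock skew between router and dyno distorts the result, which is exactly the weakness described above.

```ruby
# Hedged sketch: parsing is illustrative, not New Relic's actual code.
class QueueTimeLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    if (raw = env['HTTP_X_REQUEST_START'])
      start_ms = raw.sub(/\At=/, '').to_f           # tolerate "t=" prefix
      env['queue_time_ms'] = (Time.now.to_f * 1000.0) - start_ms
    end
    @app.call(env)
  end
end
```

Crucially, this only measures time from the router's stamp to the app seeing the request, so it captures dyno-level queueing only if the stamp is applied before the dyno's own backlog.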

[+] runarb|13 years ago|reply
Is it the case that a dyno can only handle a single user request at a time?

Why does it not use some kind of scheduling system to handle other tasks while one task is waiting on I/O?

[+] michaelrkn|13 years ago|reply
We ran into this exact same problem at Impact Dialing. When we hit scale, we optimized the crap out of our app; our New Relic stats looked insanely fast, but Twilio logs told us that we were taking over 15 seconds to respond to many of their callbacks. After spending a few weeks working with Heroku support (and paying for a dedicated support engineer), we moved to raw AWS and our performance problems disappeared. I want to love Heroku, but it doesn't scale for Rails apps.
[+] WillieBKevin|13 years ago|reply
We moved our Twilio app off Heroku for the same reasons. Extensive optimizations and we would still get timeouts on Twilio callbacks.

The routing dynamics should be explained better in Heroku's documentation. From an engineering perspective, they're a very important piece of information to understand.

We're with https://bluebox.net now and are very happy.

[+] FireBeyond|13 years ago|reply
This should be more prominent. I want to love Heroku, and am sure that I could.

But really, throwing in the towel at intelligent routing and replacing it with "random routing" is horrific, if true.

It's arguable that the routing mesh and scaling dynamics of Heroku are a large part, if not -the- defining reason for someone to choose Heroku over AWS directly.

Is it a "hard" problem? I'm absolutely sure it is. That's one reason customers are throwing money at you to solve it, Heroku.

[+] chc|13 years ago|reply
> But really, throwing in the towel at intelligent routing and replacing it with "random routing" is horrific, if true.

The thing is, their old "intelligent routing" was really just "we will only route one request at a time to a dyno." In other words, what changed is that they now allow dynos to serve multiple requests at a time. When you put it that way, it doesn't sound as horrific, does it?

[+] DeepDuh|13 years ago|reply
Why do you 'want to love' Heroku? Because their marketing speak is so great?
[+] badgar|13 years ago|reply
> That's one reason customers are throwing money at you to solve it, Heroku.

People are throwing money at Heroku because it's really easy to use, not because it's the best long-term technology choice. Seriously - what percentage of Heroku paying users do you think actually read up on the finest technical details like routing algorithms before they put in their credit card? Heroku knows. They know you can't even build a highly-available service on top of it, since it's singly-homed, and they're still making tons of money.

[+] lkrubner|13 years ago|reply
Good lord!!!!!

Percentage of the requests served within a certain time (ms)

  50%    844

  66%   2977

  75%   5032

  80%   7575

  90%  16052

  95%  20069

  98%  29282

  99%  30029

 100%  30029 (longest request)
Those numbers are amazingly awful. If I ever run ab and see 4 digits I assume I need to optimize my software or server. But 5 digits?

Why in the world would a company spend $20,000 a month for service this awful?

[+] CoffeeDregs|13 years ago|reply
Worse than that:

  * 89/100 requests failed (according to
    https://gist.github.com/a-warner/c8cc02565dc214d5f77d).
  * Heroku times out requests after 30 seconds, so the 30000ms
    numbers may be timeouts (I've forgotten whether *ab* includes
    those in the summary).
  * That said, the *ab* stats could be biased by overly large
    concurrency settings (not likely if you're running 50 dynos...),
    but still...
But still WTF. 89/100 requests failed? That's not happy-making.

Uncertainty is DiaI (death-in-an-infrastructure). I just created a couple of projects on Heroku and love the service, but this needs to be addressed ASAP (even if addressing it is just a blog post).

Also, I've never understood using round-robin or random algorithms for load balancers when fewest-connections is available...
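A fewest-connections pick is only a few lines; the real cost is that the balancer must track live per-dyno state, which is exactly the "smart LB that has to know everyone's state" objection raised upthread. A sketch (illustrative, not Heroku code):

```ruby
Dyno = Struct.new(:id, :active)   # active = in-flight request count

def pick_fewest_connections(dynos)
  dynos.min_by(&:active)          # requires accurate shared state
end

def pick_random(dynos)
  dynos.sample                    # requires no state at all
end

dynos  = [Dyno.new(0, 5), Dyno.new(1, 0), Dyno.new(2, 3)]
chosen = pick_fewest_connections(dynos)
puts "routing to dyno #{chosen.id}"   # the idle dyno
```

Keeping `active` accurate across many distributed routers is the hard part, not the selection itself.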

[+] JPKab|13 years ago|reply
I've been scanning through the comments, and I have yet to see anything written by a Heroku engineer to defend the company. I'm hoping it's in here and I missed it. I have a feeling that this all might be absolutely true, and they have lawyers/PR trying to think of a damage control plan.

I suspect that the reason they've been pushed to do this is financial, and it makes me think that Nodejitsu's model of simply not providing ANY free plans other than one-month trials is a good one. I realize it's apples and oranges, since NJ is focused on async and this wouldn't even be a problem for a Node app, but from a business perspective I feel like this would alleviate pressure. How many dynos does Heroku have running for non-paying customers? Do these free dynos actually necessitate this random routing mesh bullshit? If not, what does?

[+] csense|13 years ago|reply
High cost/risk associated with switching providers, and frog-in-heating-water syndrome.
[+] eli|13 years ago|reply
Well, at X level of concurrency, wouldn't most setups with load balancers start to spit out numbers like that?
[+] bignoggins|13 years ago|reply
Rap Genius is employing a classic rap-mogul strategy: start a beef
[+] parsnips|13 years ago|reply
Not only that, East Coast vs. West Coast at that...
[+] jussij|13 years ago|reply
We've had Battle Rap, Gangsta Rap, you name it Rap.

Maybe this is the start of Off the Rails Rap.

[+] mattj|13 years ago|reply
So the issue here is two-fold:

- It's very hard to do 'intelligent routing' at scale.

- Random routing plays poorly with request times that have a really bad tail (median is 50ms, 99th is 3 seconds).

The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).

[+] aristus|13 years ago|reply
I do perf work at Facebook, and over time I've become more and more convinced that the most crucial metric is the width of the latency histogram. Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.
[+] jholman|13 years ago|reply
Re the distribution, absolutely. That "FIFTY TIMES" is totally due to the width of the distribution. Although, you know, even if their app was written such that every single request took exactly 100ms of dyno time, this random routing would create the problem all over again, to some degree.

As for the intelligent routing, could you explain the problem? The goal isn't to predict which request will take a long time, the goal is to not give more work to dynos that already have work. Remember that in the "intelligent" model it's okay to have requests spend a little time in the global queue, a few ms mean across all requests, even when there are free dynos.

Isn't it as simple as just having the dynos pull jobs from the queue? The dynos waste a little time idle-spinning until the central queue hands them their next job, but that tax would be pretty small, right? Factor of two, tops? (Supposing that the time for the dyno-initiated give-me-work request is equal to the mean handling time of a request.) And if your central queue can only handle distributing to say 100 dynos, I can think of relatively simple workarounds that add another 10ms of lag every factor-of-100 growth, which would be a hell of a lot better than this naive routing.

What am I missing?

[+] lil_tee|13 years ago|reply
Simulation author here with some additional analysis using a faster distribution of request times. If you use a distribution with median 50 ms, 90th percentile 225 ms, and 99.9th percentile 898 ms, then you need 30 intelligent dynos to handle 9000 requests/minute without queueing. In the same scenario with 30 naive dynos, 44% of requests get queued.

Animations and results are in the explanation at http://rapgenius.com/1502046
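The effect is easy to reproduce in miniature. Below is a much-simplified discrete-time sketch (not the author's simulation; the request-time distribution is an assumed fast/slow mixture rather than the empirical one): with idle-first routing almost nothing queues at this load, while random routing queues a large fraction of requests on identical hardware.

```ruby
# Toy model: arrivals every 10 ms; 5% of requests take 2000 ms, the
# rest 50 ms. Counts how many requests land on a still-busy dyno.
def simulate(n_dynos, n_requests, random_routing:, seed: 42)
  rng     = Random.new(seed)
  free_at = Array.new(n_dynos, 0.0)   # time each dyno next becomes free
  queued  = 0
  t       = 0.0
  n_requests.times do
    t += 10.0
    cost = rng.rand < 0.05 ? 2000.0 : 50.0
    i = random_routing ? rng.rand(n_dynos) : free_at.index(free_at.min)
    queued += 1 if free_at[i] > t     # waits behind earlier work
    start = [free_at[i], t].max
    free_at[i] = start + cost
  end
  queued
end

puts "random:      #{simulate(30, 5000, random_routing: true)} queued"
puts "intelligent: #{simulate(30, 5000, random_routing: false)} queued"
```

The slow tail does the damage: one 2000 ms request parks every subsequent randomly-routed arrival behind it, while the idle-first picker simply routes around the busy dyno.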

[+] DannyBee|13 years ago|reply
Yes, it is very hard to do it at scale, but so what? I mean, isn't the whole premise of their company to do intelligent things at scale so you don't have to?

It's not an insurmountable problem by any measure, and it's definitely worth it.

[+] pdonis|13 years ago|reply
> The solution here is to figure out why your 99th is 3 seconds.

I'm not sure this applies to the OP. His in-app measurements were showing all requests being handled very fast by the app itself; the variability in total response time was entirely due to the random routing.

[+] nthj|13 years ago|reply
I'm inclined to wait until Heroku weighs in to render judgement. Specifically, because their argument depends on this premise:

> But elsewhere in their current docs, they make the same old statement loud and clear:

> The heroku.com stack only supports single threaded requests. Even if your application were to fork and support handling multiple requests at once, the routing mesh will never serve more than a single request to a dyno at a time.

They pull this from Heroku's documentation on the Bamboo stack [1], but then extrapolate and say it also applies to Heroku's Cedar stack.

However, I don't believe this to be true. Recently, I wrote a brief tutorial on implementing Google Apps' openID into your Rails app.

The underlying problem with doing so on a free (single-dyno) Heroku app is that while your app makes an authentication request to Google, Google turns around and makes an "oh hey" request to your app. With a single-concurrency system, your app times out waiting for Google to get back to you, and Google won't get back to you until your app gets back to it, so: deadlock.

However, there is a work-around on the Cedar stack: configure the unicorn server to supply 4 or so worker processes for your web server, and the Heroku routing mesh appropriately routes multiple concurrent requests to Unicorn/my app. This immediately fixed my deadlock problem. I have code and more details in a blog post I wrote recently. [2]
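For reference, that work-around looks roughly like this. It is a hedged sketch of a typical `config/unicorn.rb` from that era; the worker count, timeout, and ActiveRecord fork hooks are illustrative and should be tuned per app.

```ruby
# config/unicorn.rb
worker_processes Integer(ENV['WEB_CONCURRENCY'] || 4)  # concurrent requests per dyno
timeout 25          # give up before Heroku's 30 s router timeout
preload_app true

before_fork do |_server, _worker|
  # Disconnect shared sockets before forking workers.
  defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
end

after_fork do |_server, _worker|
  # Each forked worker needs its own database connection.
  defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
end
```

with a Procfile entry along the lines of `web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb`.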

This seems to be confirmed by Heroku's documentation on dynos [3]: > Multi-threaded or event-driven environments like Java, Unicorn, and Node.js can handle many concurrent requests. Load testing these applications is the only realistic way to determine request throughput.

I might be missing something really obvious here, but to summarize: their premise is that Heroku only supports single-threaded requests, which is true on the legacy Bamboo stack but I don't believe to be true on Cedar, which they consider their "canonical" stack and where I have been hosting Rails apps for quite a while.

[1] https://devcenter.heroku.com/articles/http-routing-bamboo

[2] http://www.thirdprestige.com/posts/your-website-and-email-ac...

[3] https://devcenter.heroku.com/articles/dynos#dynos-and-reques...

[edit: formatting]

[+] habosa|13 years ago|reply
Wow.

Normally when I read "X is screwing Y!!!" posts on Hacker News, I consider them an overreaction, or I can't relate. In this case, I think this was a reasonable reaction and I am immediately convinced never to rely on Heroku again.

Does anyone have a reasonably easy to follow guide on moving from Heroku to AWS? Let's keep it simple and say I'm just looking to move an app with 2 web Dynos and 1 worker. I realize this is not the type of app that will be hurt by Heroku's new routing scheme but I might as well learn to get out before it's too late.

[+] stevewilhelm|13 years ago|reply
Heroku Support Request #76070

To whom it may concern,

We are long time users of Heroku and are big fans of the service. Heroku allows us to focus on application development. We recently read an article on HN entitled 'Heroku's Ugly Secret' http://s831.us/11IIoMF

We have noticed similar behavior; namely, increasing dynos does not provide the performance increases we would expect. We continue to see wildly different performance across different requests that New Relic metrics and internal instrumentation cannot explain.

We would like the following:

1. A response from Heroku regarding the analysis done in the article, and

2. Heroku-supplied persistent logs that include information on how long requests are queued for processing by the dynos.

Thanks in advance for any insight you can provide into this situation and keep up the good work.

[+] barmstrong|13 years ago|reply
We were very surprised to discover Heroku no longer has a global request queue, and spent a good bit of time debugging performance issues to find this was the culprit.

Heroku is a great company, and I imagine there was some technical reason they did it (not an evil plot to make more money). But not having a global request queue (or "intelligent routing") definitely makes their platform less useful. Moving to Unicorn helped a bit in the short term, but is not a complete solution.

[+] rapind|13 years ago|reply
I'd been using Heroku since forever, but bailed on them for a high-traffic app last year (Olympics-related) due to poor performance once we hit a certain load (adding dynos made very little difference). We were paying for their (new at the time) critical app support, and I repeatedly brought up that it appeared to be continuously failing at the routing level. And this was with a Sinatra app served by Unicorn (which, at the time at least, was considered unsupported).

We went with a metal cluster setup and everything ran super smooth. I never did figure out what the problem was with Heroku though and this article has been a very illuminating read.

[+] gojomo|13 years ago|reply
They want to force the issue with a public spat. Fair enough.

But they might also be able to self-help quite a bit. RG makes no mention of using more than one unicorn worker per dyno. That could help, making a smaller number of dynos behave more like a larger number. I think it was around the time Heroku switched to random routing that they also became more officially supportive of dynos handling multiple requests at once.

There's still the risk of random pileups behind long-running requests, and as others have noted, it's that long-tail of long-running requests that messes things up. Besides diving into the worst offender requests, perhaps simply segregating those requests to a different Heroku-app would lead to a giant speedup for most users, who rarely do long-running requests.

Then, the 90% of requests that never take more than a second would stay in one bank of dynos, never having pathological pile-ups, while the 10% that take 1-6 seconds would go to another bank (by different entry URL hostname). There'd still be awful pile-ups there, but for less-frequent requests, perhaps only used by a subset of users/crawler-bots, who don't mind waiting.

[+] goronbjorn|13 years ago|reply
Aside from the Heroku issue, this is an amazing use of RapGenius for something besides rap lyrics. I didn't have to google anything in the article because of the annotations.
[+] zeeg|13 years ago|reply
If this is such a problem for you, why are you still on Heroku? It's not a be-all end-all solution.

I got started on Heroku for a project, and I also ran into limitations of the platform. I think it can work for some types of projects, but it's really not that expensive to host 15m uniques/month on your own hardware. You can do just about anything on Heroku, but as your organization and company grow it makes sense to do what's right for the product, and not necessarily what's easy anymore.

FYI I wrote up several posts about it, though my reasons were different (and my use-case is quite a bit different from a traditional app):

* http://justcramer.com/2012/06/02/the-cloud-is-not-for-you/

* http://justcramer.com/2012/08/30/how-noops-works-for-sentry/

[+] rdl|13 years ago|reply
Wow. I suspect Rap Genius has the dollars now where it's totally feasible for them to go beyond Heroku, but it still might not be the best use of their time. But if they have to do it, they have to do it.

OTOH, having a customer have a serious problem like this AND still say "we love your product! We want to remain on your platform", just asking you to fix something, is a pretty ringing endorsement. If you had a marginal product with a problem this severe, people would just silently leave.

[+] lquist|13 years ago|reply
Heroku implements this change in mid-2010, then sells to Salesforce six months later. Hmm...wondering how this impacted revenue numbers as customers had to scale up dynos following the change...
[+] bifrost|13 years ago|reply
I am only going to suggest a small edit -> s/Postgres can’t/Heroku's Postgres can't/

PG can scale up pretty well on a single box, but scaling PG on AWS can be problematic due to disk I/O issues, so I suspect they just don't do it. I'd love to be corrected :)