Come on, startups: you should be technically skilled and able to optimize in order to spend little money. If you sum EC2 and Heroku you are going to pay like 10x what it takes to run the same machine power on a dedicated server, all this because you can't handle the operations? This is absurd, IMHO.
Also, people who want to start a business: there is a huge opportunity here. Create software that makes managing Apache, Redis, PostgreSQL, ..., on dedicated servers very easy and robust. Target a popular, robust non-commercial distribution like Ubuntu LTS, and provide everything needed to deploy web nodes and database nodes, with backups, monitoring, and everything else made trivial.
Startups could pay you 5% of what they are now paying EC2 and Heroku, and you would still make a lot of money.
"I can only write my Ruby code but can't handle operations" is not the right attitude, grow up. (This is what, in a different context, a famous security researcher told me in a private mail 15 years ago, and it was the right advice)
> Also, people who want to start a business: there is a huge opportunity here. Create software that makes managing Apache, Redis, PostgreSQL, ..., on dedicated servers very easy and robust. Target a popular, robust non-commercial distribution like Ubuntu LTS, and provide everything needed to deploy web nodes and database nodes, with backups, monitoring, and everything else made trivial.
So the pitch is: clone Heroku, which has taken dozens of very smart engineer man-years to build and refine, then charge 5% of what the market will bear?
> "I can only write my Ruby code but can't handle operations" is not the right attitude, grow up.
I handled ops for 120+ Rails apps while managing a team of juniors and making time to write code. What a stupid waste of my time: ops is forever & never-ending, while implementing & delivering a new feature to 100K customers can bump sales' conversions permanently.
If I can outsource relatively-linearly-increasing Ops costs and instead focus on delivering value that multiplies compounding-interest style, that's not childish.
> "I can only write my Ruby code but can't handle operations" is not the right attitude, grow up.
How is such a condescending post at the top?
Everyone running a startup is an idiot because they choose not to waste their time on your priority?
Heroku isn't 10 times more expensive, and it wouldn't matter even if it were. Talented people are hard to find, and spending time on operations when you might not be around in 4 months may not be the most important thing to focus on. On a 6 to 8 month time scale, Heroku would be 10 times cheaper than taking the hit of setting up everything to emulate it: deployments, backups, monitoring. Those things take a lot of time to set up correctly.
I run a startup that is hosted on Heroku. I write Ruby code but can't handle operations (I have and can set up servers, but if shit hits the fan I wouldn't have a clue. Also I don't have the time). My startup is a one-man show. And you suggest I should grow up? By grow up you mean that I should watch my customers face days of downtime because I don't know how to fix the server, how to secure the server properly, how to scale, or have money to hire someone who can?
Of the "ShitHNSays" stuff I read here, this surely takes the cake.
The primary bottleneck for most startups isn't money, it's time.
Also for most startups (assuming you're not doing something CPU/bandwidth heavy) the actual cost of hosting is going to be a relatively small part of the budget. If your burn rate is a million dollars, reducing 10k of hosting costs to 1k isn't really worth the effort.
Definitely. This is possible with e.g. Puppet already - where I work, I've set up Puppet modules for provisioning everything automatically for developers; they just need to enter the customer name and whether they need dev, stage and/or live for that customer - the rest is automated. We use http://www.hetzner.de/en/hosting/produktmatrix/rootserver-pr... for hosting - let me just say that I can buy an entire rack with 64 GB of RAM in each server for the same cost as one Amazon instance :D
This feels like a distraction. How about instead of pontificating on how someone else chooses to do business, we discuss the actual merits (or lack thereof) of Heroku's routing issues.
I can perfectly well install, set up and maintain my own Ruby servers - but it takes time.
Alternatively, I can pay someone else to do that, and remove that timesink from the elapsed time between "start developing" and "find out how well we've achieved market fit".
I can always optimise later - move off Heroku, develop our own load balancing, all that stuff. Once I've got a working product/market fit, I probably will.
But doing that before I know if I'm going to chuck the entire infrastructure in the garbage and move on to idea #2 (and #3, and #4...), or indeed pivot so wildly that we'll have to reorganise all our server stuff anyway, is a waste of time. And time is valuable.
I'm honestly surprised there aren't tools available now replicating what Heroku does.
We should have puppet scripts to deploy, instrument and manage all the popular infrastructure choices by now.
The same way originally Linux was a build it yourself box of parts, we ought to have "cloud infrastructure" distributions, from bootup to app deployment.
The neat thing here is that as people improve the distribution you get cumulative savings. Back in the 90's you needed a skilled individual to setup a Unix/Linux system. Now, even an MBA can do it. The same could happen with infrastructure on a higher level.
"If you sum EC2 and Heroku you are going to pay like 10x what it takes to run the same machine power on a dedicated server, all this because you can't handle the operations? This is absurd IMHO."
It really depends. My current experience with heroku is that it is absurdly cheap - at least, for our use cases. We would have to do a lot more traffic for us to ever consider moving to a colo.
Not to mention I'd bet that they do have someone spending all their time dealing with Heroku ops... and my sense is it's not just because of this bug.
Inaka has a combined total of close to a billion pageviews/month across all our EC2-hosted apps for all of our clients and we have zero full time operations staff - we have 2 guys that spend (much less than) part time on it.
I'm watching all this debate with great interest. The company I'm consulting for has ~150 dynos, a paid support contract, etc. We're on Cedar... NodeJS latest (which just got upgraded to 0.8.19 after being ignored since October), with Mongo, Rabbit and Redis hosted by third party addon companies.
After a ton of H12 errors, they helped us find some slow points and optimize some things that were relatively slow. On our own, we did a huge amount of work to make things as fast as possible. While the H12s have gotten better, nothing has gotten rid of them completely. It really points to something fundamentally wrong with the routing layer, because at some level we just can't optimize our code any further. There are definitely quite a few times in the logging where we just can't explain how things are insanely slow, and we certainly can't explain why we still get H12 errors. To the point where we just gave up on it.
The thing that bothers me the most is that we have been complaining for a month now behind the scenes, through our paid support contract, about the things that are now being semi-admitted in public. No PaaS is perfect, and certainly hard problems are being worked on by smart people... the real issue here is the way that Heroku has pointed fingers at everyone but themselves, until finally someone had the time and balls to get a posting to the top of HN.
Depending on what side of Hanlon's razor you fall, the only conclusion I get from this is that they are either incompetent or dishonest. I have a very hard time believing that this issue remained unknown to them for years.
As for the post, it's pretty much just documentation. I didn't see any apology. And the only promise of a better tomorrow is a vague "Working to better support concurrent-request Rails apps on Cedar".
Not to be too harsh, but I'm not sure whether "we had no idea it was so bad" is better or worse than "we knew it was bad, but didn't tell anybody" for a platform company. The tone of the post is appropriately apologetic, but this does make you wonder what other problems they're missing.
It seems to me that Heroku has chosen to be dishonest:
Heroku's blog response:
"but until this week, we failed to see a common thread among these reports."
vs.
Adam's response to Tim Watson, a year ago:
"You're correct, the routing mesh does not behave in quite the way described by the docs. We're working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate. The current behavior is not ideal, but we're on our way to a new model which we'll document fully once it's done."
Hmm, no tangible solutions yet, but I expect that will be next.
From the discussion I've seen they have roughly two minimal options:
(1) Shard/tier the Bamboo routing nodes, so that a single router tends to handle any particular app, and thus the original behavior is restored. Consistent hashing on the app name could do the trick, or DNS tricks on the app names mapping to different routing subshards.
(2) Enable dynos to refuse requests, perhaps by refusing a connect or returning an error or redirect that tells a router to try the next dyno. (There are some indications a 'try another' logic already exists in their routers, so it might even be possible for customers to do this without Heroku's help. I have a question in with Heroku support about which request-shedding techniques might work without generating end-user visible errors.)
Both could potentially benefit from some new per-dyno load-monitoring features... which would also allow other more-sophisticated (but more costly and fragile at scale) balancing or load-shedding strategies.
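Option (1) can be sketched with a standard consistent-hashing ring. Everything below is illustrative (router names, vnode count); nothing here reflects Heroku's actual internals:

```ruby
require 'digest'

# Each router gets many virtual points on a hash ring so apps spread evenly.
class RouterRing
  def initialize(routers, vnodes: 100)
    @ring = routers.flat_map { |r|
      (0...vnodes).map { |v| [Digest::SHA1.hexdigest("#{r}-#{v}").to_i(16), r] }
    }.sort_by(&:first)
  end

  # An app maps to the first router clockwise from its own hash, so every
  # request for that app flows through the same router, which can then keep
  # an accurate view of that app's per-dyno backlogs.
  def router_for(app)
    h = Digest::SHA1.hexdigest(app).to_i(16)
    entry = @ring.find { |point, _| point >= h } || @ring.first
    entry.last
  end
end

ring = RouterRing.new(%w[router-a router-b router-c])
puts ring.router_for("rapgenius")   # same app always hashes to the same router
```

With virtual nodes, adding or removing a router only remaps a small fraction of apps, which is what makes this attractive compared to a naive modulo scheme.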
I can see the commentariat lynch mob is out, but definitive recommendations and fixes take time. As they've admitted and apologized for the problem, I'd guess they'll have a more comprehensive response before their end-of-the-month user conference.
I would suggest a slight variation on the first. Have initial routers pick a random appropriate dyno, then forward to the second-layer router in charge of that dyno. The second router picks a random appropriate dyno from the list it is in charge of.
If you make sure that the dynos for a given app are clumped behind few routers in the second layer, then you effectively get the old behavior, but in a much more scalable way. The cost, of course, is that you add an extra routing hop to everything.
(I emailed this suggestion to them. I have no idea whether they will listen.)
It sounds like they have one routing cluster for all of Heroku. If this is the case, and large routing clusters are the problem's root cause, they should just shard the cluster.
I.e., give their bigger customers like RapGenius (who said they pay Heroku $20k/month and whose HN post spurred this debate) their own dedicated routing cluster with 2-5 nodes. Once a single customer exceeds that size, they're probably paying Heroku enough that Heroku can afford to devote an engineer to working specifically with them to implement large-scale architecture choices like running the app's DB on a cluster, etc.
Pile several smaller customers on a shared routing cluster to cut costs and keep the cluster utilization high, but once the cluster gets to be a certain size (reaches a fixed number of backend dynos or metrics get bad enough), start putting customers on a new cluster.
It should be fairly trivial to use DNS or router rules to dynamically move existing customers from one routing cluster to another.
The problem, as documented by the customer who went public with this issue, is that their request distribution scheme went from intelligent (i.e., load-based) to random, and a random distribution of requests is almost guaranteed to cause significant queueing for some non-trivial number of requests unless one already has an absurd amount of extra capacity in place, with ruinous financial implications.
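That intuition can be checked with a toy simulation. All the numbers here (20 dynos, 2% slow requests, the arrival rate) are invented for illustration, and the model is far simpler than Heroku's real system; it only shows how random dispatch to single-threaded dynos queues far more at the tail than least-loaded dispatch:

```ruby
srand(42)

DYNOS = 20
REQUESTS = 10_000
ARRIVAL_GAP = 5.0  # ms between arrivals (~200 req/s total)
service = Array.new(REQUESTS) { rand < 0.02 ? 2000.0 : 50.0 }  # 2% slow requests

# Each dyno serves one request at a time; free_at[i] is when dyno i frees up.
def simulate(service, gap, dynos)
  free_at = Array.new(dynos, 0.0)
  service.each_with_index.map do |svc, i|
    now = i * gap
    d = yield(free_at)                  # the dispatch policy picks a dyno
    wait = [free_at[d] - now, 0.0].max
    free_at[d] = now + wait + svc
    wait
  end
end

random = simulate(service, ARRIVAL_GAP, DYNOS) { |f| rand(f.size) }
smart  = simulate(service, ARRIVAL_GAP, DYNOS) { |f| f.index(f.min) }

pct = ->(waits, p) { waits.sort[(waits.size * p).floor] }
puts "random       p95 wait = #{pct.(random, 0.95).round}ms"
puts "least-loaded p95 wait = #{pct.(smart, 0.95).round}ms"
```

The occasional 2000ms request is what does the damage: under random dispatch it strands a backlog behind one dyno while others sit idle, which least-loaded dispatch avoids almost entirely.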
Well, this certainly calls into question their competence. They're a PaaS company that doesn't understand or measure their load balancing performance.
If you are a PaaS company, and you only have 5 metrics you can record, then 99th-percentile latency across all apps should be one of them.
On another note: why is Rails single-threaded??? That seems unbelievable. So if you have a 2 second database query, your Rails process does nothing else for that 2 seconds? I mean people complain about the GIL in Python, which actually has reasons behind it, but this is just crazy.
Welp, I was waiting for their official response to decide if I should deploy my app with Heroku or roll up my sleeves and rig up AWS servers (which I've done before but was looking forward to not having to deal with it.) Based upon this post, it sounds like there are really no concrete steps that they have planned to fix the underlying issue. So, AWS it is.
I am still considering having Heroku manage my PostgreSQL instance. This would be a large burden lifted leaving me to just manage the app servers, etc. Is there any reason to be concerned about their PostgreSQL hosting? Any horror stories?
If you're considering Heroku, don't automatically dismiss it because of all this. For one, it's unlikely your site/app will ever be as big as RapGenius. That's not a shot, just reality. They ran into problems as an edge case. Is Heroku and their architecture at fault? Hell yeah it is, but I have faith that they will fix it.
Why?
Because I really think that when push comes to shove, Heroku was actually trying to do the right thing with the changes they made and perhaps didn't consider or understand some of the ramifications of the changes they made to the Rails community. They may have fallen in love a little too much with the new node.js hotness and the like. Their CORE audience is startups/new business where Rails is very popular and they understand what this has done to their reputation. If they don't address this in a serious way it will damage their business severely.
I don't personally use Heroku but I have used it in the past and would not hesitate to use it on an appropriate project.
Or, use their Cedar stack and multi-worker dynos, where the problem is much less acute (and is only going to affect you once you need many dynos). Figure that in a month or two they'll have learned and deployed more than you would on your own.
If your app is Railsian and will have requests that take a few seconds to be served, I'd suggest AWS. If it's ultimately a simple CRUD app then Heroku should be fine until you're at significant scale - and this issue will never be as severe.
1) Releasing a press release at 7 AM on a Saturday morning (CET).
2) The release reads mostly like what a politician's spin doctor would ask the politician to say: don't promise/admit too much.
3) They clearly state that they want to continue with this extremely inefficient way of routing. The right thing to do would be to make smaller clusters of load balancers that could then do proper routing, e.g. measuring the number of requests per dyno, the last processing time, etc.
I'm currently working on a large project on Heroku and I'm very disappointed about this. We chose Heroku because we believed we could just `heroku scale web=X` when needed. Instead, now we know that it will be of very little use.
In the next week, I will be looking into a solution where I can utilize Heroku's add-on system without running my apps in Heroku dynos. Creating a small system to host LXCs on AWS EC2 seems within my capabilities (or I could use Cloud Foundry's application server component) - and I believe I can configure a load balancer better than Heroku.
Let me know if anyone else is interested - we could make an open source project for this :-)
This is still not ideal. Even if you're running unicorn, you're still susceptible to queueing spikes due to random load balancing. The concurrency just gives you a small buffer and/or some smoothing on 95th percentile responses. Right?
At least there's a commitment to update the reporting tools... getting bad data in New Relic was (IMHO) the worst -- even worse than out-of-date docs.
Actually, no. Using unicorn with only 2 workers makes a tremendous difference, not just incremental. RapGenius' own statistical model demonstrates this.
Picture each individual dyno in that case as its own "intelligent router". Since it's not distributed and this requires no network coordination, the job of knowing which workers are available becomes trivial.
If you're inclined to read up on queueing theory, you'll see that having at least 2 worker processes per dyno makes the problem much simpler.
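A rough sketch of that claim (a toy model with invented numbers, not RapGenius's actual statistical model): holding the total number of worker processes constant, packing 2 workers per dyno lets a dyno route around its own slow request, which cuts tail waits under random routing.

```ruby
srand(1)

REQUESTS = 10_000
DYNOS = 20
GAP = 5.0  # ms between arrivals
service = Array.new(REQUESTS) { rand < 0.02 ? 2000.0 : 50.0 }

def p95(waits)
  waits.sort[(waits.size * 0.95).floor]
end

# The router picks a dyno at random; the dyno hands the request to whichever
# of its workers frees up first. With workers = 1 this is random routing onto
# single-threaded dynos.
def simulate(service, gap, dynos, workers)
  free_at = Array.new(dynos) { Array.new(workers, 0.0) }
  service.each_with_index.map do |svc, i|
    now = i * gap
    slots = free_at[rand(dynos)]
    w = slots.index(slots.min)
    wait = [slots[w] - now, 0.0].max
    slots[w] = now + wait + svc
    wait
  end
end

one = simulate(service, GAP, DYNOS, 1)
two = simulate(service, GAP, DYNOS / 2, 2)  # same total worker count
puts "1 worker/dyno:  p95 wait = #{p95(one).round}ms"
puts "2 workers/dyno: p95 wait = #{p95(two).round}ms"
```

The buffer is more than incremental because a fast request randomly routed behind a 2-second request no longer has to wait for it: the dyno's second worker picks it up.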
Rap Genius cofounder here. Below is the full unedited text of https://help.heroku.com/tickets/37665, a Heroku support ticket I logged about 1 year ago. Sorry it is so long, but I think you'll find it interesting:
Tom@Rapgenius| about 1 year ago
I know this is a bit of a vague problem, but I've been getting a bunch of Error H12 (Request Timeout)s recently, and I'm not sure what to do about it.
It's not like I have some particularly slow actions; I'm getting this error for actions that under most circumstances work totally fine (i.e., return in less than 300ms). Also I don't have a deep request queue (I'm running 40 dynos which is more than enough). Maybe I'm doing some slow queries? Should I upgrade my DB?
Also, I do notice that most of my app's time (according to New Relic) is being spent in Ruby (http://cl.ly/29132F272W2D0K1l2I3P). Would upgrading Ruby to 1.9 noticeably help this performance? (I'm a bit nervous it'll create a ton of problems).
Phil@Heroku
Hello - I can look into this, but I'll need access to your New Relic account. Will you make sure '[email protected]' has access?
Also, from your screenshot I notice your DB times are ~ 100 ms. We recommend keeping those times closer to 50 ms. You might be able to speed things up with a database upgrade.
I'll look into New Relic once I have access and let you know what I find.
Tom@Rapgenius
Thanks, Phil!
How do I give you access to my New Relic account? I tried clicking "account settings" and got this: http://cl.ly/0V2J3i0826400I2s3b2c
Phil@Heroku
Tom, I have access now. I'm not sure what was blocking me earlier.
After looking at New Relic and the database server, I think a larger database will help. At the very least, it will be helpful to try the next level for a week and compare performance statistics in New Relic with the prior week.
Your app is using an Ika right now, and the next step up is the Zilla database. We've made the upgrade process very simple, and it's outlined here - http://devcenter.heroku.com/articles/fast-database-changeove...
Your database is ~ 5.4 GB in size (via the 'heroku pg:info' command) so an upgrade shouldn't take too long. You will be able to test the process by adding a Follower and timing it via the 'heroku pg:wait' command. This should give you a good idea of how long it will take to spin up the new database. Also, should the Zilla not help much, the downgrade process to an Ika will be the same. You only pay for the resources used.
The current database server appears to be a bit under-powered when it comes to Compute Units. The Zilla has more power and should provide some room to grow.
As for an upgrade to Ruby 1.9.2, I'm not sure how much that would help. It would be an involved upgrade that would take time to plan and deploy. The database upgrade should be a quicker solution.
Long-term you may want to consider moving to the Cedar stack and Ruby 1.9.2.
Tom@Rapgenius
Thanks! I'm upgrading now
Tom@Rapgenius
I'm still getting a ton of "Request Timeout" errors. E.g.:
2011-12-08 14:46:53.222 219 1 2011-12-08T14:46:53+00:00 d. heroku router - - Error H12 (Request timeout) -> GET rapgenius.com/Wale-ambition-lyrics dyno=web.17 queue= wait= service=30000ms status=503 bytes=0
one weird thing: there aren't any values listed for the "queue" and "wait" parameters. Could that indicate a problem?
Could an exception have been thrown earlier in the request before the timeout? Or does the timeout error just indicate that the request took too long? If it's the latter I'm not sure how to troubleshoot all these errors since the associated actions are fast the vast majority of the time
Tom@Rapgenius
Here's another interesting example:
2011-12-08 15:59:32.293 222 1 2011-12-08T15:59:32+00:00 d. heroku router - - Error H12 (Request timeout) -> GET rapgenius.com/static/templates_for_js dyno=web.17 queue= wait= service=30000ms status=503 bytes=0
This action is extremely simple – it doesn't access the DB or any external services. Here's the template:
Ballin!
<% unless current_user %>
<% form_for User.new, :html => { :id => '' } do |f| %>
Tired of entering your email address? Create a Rap Genius account and you'll never have to worry about it (or anything else) ever again:
<%= render :partial => "/users/form", :object => f %>
<%= f.submit "Create Account" %> <small>(Already have an account? <%= link_to 'Sign in', login_path, :class => :facebox %>)</small>
<% end %>
<% end %>
Besides a big request queue (which there isn't), how could this action possibly time out?
Phil@Heroku
Tom - sorry for not getting back to you sooner.
It's possible for H12s to occur even for simple actions if there is already queueing for the app. With a busy site like yours, even a few H12s can cause a cascade of H12s for successive requests.
It looks like New Relic has not reported any downtime over the past 24 hours. Can we let the site run through the weekend and see how things look Monday after 3 days of New Relic data with the new Zilla?
Tom@Rapgenius
> It's possible for H12s to occur even for simple actions if there is already queueing for the app.
I feel what you're saying, but I don't think my app's queuing. For one thing, New Relic shows 0 time spent in the queue during the period in which I'm getting all these timeouts. For another, I'm running 40 dynos and my average request time is <400ms. So:
400 ms * 3000 requests / minute * 1 min / 60000 ms = 20 simultaneous requests (i.e., 20 dynos)
so 40 dynos should definitely be more than enough.
Also, shouldn't Heroku be showing me the queue / wait stats at the time of the timeout? That would help prove whether my app was queuing at the time in question
> It looks like New Relic has not reported any downtime over the past 24 hours.
New Relic isn't great at catching intermittent problems like this; you really feel it when you're using the site continuously for an hour or whatever. Also, users make many more HTTP requests than New Relic (since every page load kicks off several AJAX requests).
Tom@Rapgenius
Here's some additional data: At 5am this morning (EST), Rap Genius went down. I woke up at 11am (it's a Saturday!), did a logs --tail and observed that basically every request was timing out. I did heroku restart, and now every request started returning a backlog too deep error
Finally, I added another 10 dynos (bumping the total to 50, which is a lot of dynos!), and this seems to have fixed the problem – perhaps because my app needs the additional capacity, or perhaps because merely changing the number of dynos reset something else. Either way, I'm sticking with 50 dynos for now out of fear, even though I doubt my app needs that many (right?)
Either way, the 5 hours of unexplained downtime (there weren't any application-level exceptions or anything) that was fixable by tweaking my dyno count further supports my theory that something's going on with my app on Heroku's end.
Phil@Heroku
Tom -
I've been looking over your New Relic stats.
First - the good news - the upgrade to a Zilla seems to have helped. Database times are down a bit, which can only help. I checked the actual database server and it's not showing signs of over-work like the previous Ika was.
Second, I notice that downtimes reported by New Relic over the past two weeks are in the early morning hours - 3 to 6 AM PST. Do you have any scheduled tasks that run during these times?
Also, request queueing is nearly zero, so 50 dynos does seem like a lot.
What are your usage patterns like? The RPM graph in New Relic indicates the normal cyclical usage pattern, lower during the night, but what does Google Analytics tell you?
Finally, the Heroku platform has been having issues over the past week, but none of them correspond to the downtime you had Saturday morning.
That all sounds so familiar. Just like what Phil was telling us too. Get a New Relic account. "Why do we need a $7k/mo NR account?" Oh, because you have some slow requests...
We go fix all of our slow requests, but we still have H12 errors.
Can you tell us how many dynos we actually need to serve our requests and not get H12 errors? No.
Hey Rap Genius... thanks for having the balls to call Heroku out in public on this stuff. We are in the same boat.
Have you guys considered suing Heroku to get some of your money back? Given the nature of Heroku's deception and the resulting ill-gotten gains across its entire customer base,
it would seem like you could work with an enterprising attorney to form a class-action suit against the company and get money back not just for yourselves but for the entire affected customer base.
I had a string of similar requests with Heroku between Feb 2011 and June 2012, before we migrated off their platform.
I would complain about H12 errors; they would tell me to upgrade my resources and/or that it was my problem and there was nothing they could do. We ended up with a solution that was easily 10x as expensive (over-powered DB, too many dynos) as our initial configuration, and it still didn't fix the issue.
I'm happy to provide the full text support requests, but they don't tend to be quite as juicy as the one you posted.
Definitely interesting, but from this 'full unedited text' it looks like both you and Heroku had better things to do at the time than investigate more deeply. (At the time you seemed satisfied but suspicious that 50 dynos fixed things; Phil@Heroku shares your suspicions but his followup questions get no response. Case closed, everyone moves on to other things until another complaint or fresh info comes in.)
Tom, like many others here, I'd like to personally thank you for publicly surfacing this issue. Two months ago we ran into the same exact problem while doing some performance stress testing. After going back and forth with five different Heroku support staff members for a week, we ended up nowhere. Their response was simply to increase the number of dynos, but seeing as our average response time was 80ms with 0 request queueing, that didn't make any sense. In the end we dropped it, since we were just doing a stress test, but I'm glad they are finally "doing" something about it.
Our documentation recommends the use of Thin, which is a single-threaded, evented web server. In theory, an evented server like Thin can process multiple concurrent requests, but doing this successfully depends on the code you write and the libraries you use. Rails, in fact, does not yet reliably support concurrent request handling. This leaves Rails developers unable to leverage the additional concurrency capabilities offered by the Cedar stack, unless they move to a concurrent web server like Puma or Unicorn.
So, do/will they now recommend Puma/Unicorn over Thin?
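For reference, moving to Unicorn generally just means a `config/unicorn.rb` along these lines (a minimal sketch; the worker count and timeout are assumptions to tune for your dynos, not official Heroku values):

```ruby
# config/unicorn.rb - run several app processes per dyno
worker_processes Integer(ENV.fetch("WEB_CONCURRENCY", 3))
timeout 25          # kill stuck workers before the router's 30s H12 window
preload_app true

before_fork do |server, worker|
  # Close inherited DB connections before forking workers.
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |server, worker|
  # Each worker gets its own DB connection.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```

plus a Procfile entry like `web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb`.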
I'm not able to follow parts of the post.
> Our routing cluster remained small for most of Bamboo’s history, which masked this inefficiency.
If you went from 1 router to 2, 50% of routers can't optimally route a request. If you went from 2 to 3, you would have 66% which can't route. 3 to 4, 75%.
Once you get to, say, 10 routers, you are already at 90% sub-optimal routing. So are they saying they had only 1 or 2 routers earlier?
Ruby on Rails is using a default configuration where each process can serve one request at a time. There is no cooperative switch (as in Node.js) or (near) preemptive switch (as in Erlang, Haskell, Go, ...).
The routing infrastructure at Heroku is distributed. There are several routers and one router will queue at most one message per back-end dyno in the Bamboo stack and route randomly in the Cedar stack. If two front-end routers route messages to the same Dyno, then you get a queue, which happens more often on a large router mesh.
Forgetting who is right and wrong, there are a couple of points to make in my opinion.
The RoR model is very weak. You need to handle more than one connection concurrently, because under high load queueing will eventually happen. If one expensive request goes into the queue, then everyone further down the queue waits. In a more modern system like Node.js you can manually break up the expensive request and thus give service to other requests in the queue while the back-end works on the expensive one. In stronger models - Haskell, Go and Erlang - this break-up is usually automatic, and preemption makes sure it is not a problem. If you have a 5000ms job A and 10 50ms jobs, then after 1ms the A job will be preempted and the 50ms jobs will get service. Thus an expensive job doesn't clog the queue. Random queueing in these models is often a very sensible choice.
Note that Heroku is doing distributed routing. Thus the statistical model Rapgenius has made is wrong. One, requests do not arrive as a Poisson process. Usually one page load gives rise to several other calls to the back-end, and this makes the requests dependent on each other. Two, there is not a single queue and router but multiple ones. This means:
* You need to take care of state between the queues - if they are to share information. This has overhead. Often considerable overhead.
* You need to take care of failures of queues dynamically. A singular queue is easy to handle, but it also imposes a single point of failure and is a performance bottleneck.
* You have very little knowledge of what kind of system is handling requests.
Three, nobody is discussing how to handle the overload situation. Forgetting about routing for a moment: what if your dynos can take 2000 req/s but the current arrival rate is 3000? How do you choose which requests to drop? Because you will have to drop some.
If you want to solve this going forward, you probably need dyno queue feedback. Rapgenius uses the length of the queue in their test, but this is also wrong. They should use the sojourn time spent in the queue, which indicates how long you wait in the queue before being given service. According to Rapgenius, they have a distribution where requests usually take 46ms (median) but the maximum is above 2000ms. A queue of length 43 (of mostly median-speed requests) and a queue of length 1 (holding a single slow request) can then have roughly the same sojourn time. Given this, you can feed back to the routers how long a request will usually stay in queue.
But again, this is without assuming distribution of the routers. The problem is way way harder to solve in that case.
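The sojourn-time signal described above is cheap to compute dyno-side (a minimal sketch, illustrative only - not any real router's API): timestamp requests on enqueue and report how long the head of the queue has been waiting.

```ruby
# A queue that tracks how long its oldest waiting request has been queued.
class SojournQueue
  def initialize
    @q = []
  end

  def push(req, now)
    @q << [req, now]   # remember when each request arrived
  end

  def pop
    @q.shift&.first
  end

  # Sojourn time of the head of the queue. Routers could back off from
  # dynos whose sojourn time exceeds a threshold, regardless of queue length.
  def sojourn(now)
    @q.empty? ? 0.0 : now - @q.first.last
  end
end

q = SojournQueue.new
q.push(:a, 0.0)
q.push(:b, 1.0)
puts q.sojourn(5.0)   # how long :a, the oldest request, has been waiting
```

Unlike raw queue length, this metric is comparable across apps with very different request-time distributions, which is exactly the point made above.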
From the perspective of someone who might be looking at Heroku as a host in the future, this is a bit scary. Their response appears to be mostly apologetic in that they're sorry that it happened - but does nothing to address the issue. It's more of a "we screwed up, oh well" than anything else.
They would have warranted a better response if they said they were actively looking into how to improve the routing system, but by the looks of things they're going to sit by and hope developers switch practices so they don't have to solve their problem.
I appreciate the honesty, but I don't see any "this is how we'll fix it", rather, just "we promise to document it and make it clear to anyone who wants to measure it".
Effectively, they have a fundamental architectural problem, and don't know how to work past it.
They also have a fundamental cultural problem, and don't appear to have recognised it.
[+] [-] antirez|13 years ago|reply
Also, people that want to start a business: there is a huge opportunity here. Create software that makes managing Apache, Redis, PostgreSQL, ..., on dedicated servers very easy and robust. Target a popular and robust non-commercial distribution like Ubuntu LTS, and provide all that is needed to deploy web nodes and database nodes, with backups, monitoring, and everything else made trivial.
Startups can give you 5% of what they are giving now to EC2 and Heroku and you still will make a lot of money.
"I can only write my Ruby code but can't handle operations" is not the right attitude, grow up. (This is what, in a different context, a famous security researcher told me in a private mail 15 years ago, and it was the right advice)
[+] [-] nthj|13 years ago|reply
So the pitch is: clone Heroku, which has taken dozens of very smart engineer man-years to build and refine, then charge 5% of what the market will bear?
> "I can only write my Ruby code but can't handle operations" is not the right attitude, grow up.
I handled ops for 120+ Rails apps while managing a team of juniors and making time to write code. What a stupid waste of my time: ops is forever & never-ending, while implementing & delivering a new feature to 100K customers can bump sales' conversions permanently.
If I can outsource relatively-linearly-increasing Ops costs and instead focus on delivering value that multiplies compounding-interest style, that's not childish.
It's great business.
[+] [-] nahname|13 years ago|reply
How is such a condescending post at the top?
Everyone running a startup is an idiot because they choose not to waste their time on your priority?
Heroku isn't 10 times more, and it wouldn't matter even if it was. Talented people are hard to find, and spending time on operations when you might not be around in 4 months may not be the most important thing to focus on. On a 6 to 8 month time scale, Heroku would be 10 times cheaper than taking the hit of setting up everything to emulate it. Deployments, backups, monitoring: those things take a lot of time to set up correctly.
[+] [-] sleepyhead|13 years ago|reply
Of the "ShitHNSays"-stuff I read here this surely takes the cake.
[+] [-] ig1|13 years ago|reply
Also for most startups (assuming you're not doing something CPU/bandwidth heavy) the actual cost of hosting is going to be a relatively small part of the budget. If your burn rate is a million dollars, reducing 10k of hosting costs to 1k isn't really worth the effort.
[+] [-] tijs|13 years ago|reply
Looks pretty decent, would make sense to see something similar for other stacks.
[+] [-] thenomad|13 years ago|reply
I can perfectly well install, set up and maintain my own Ruby servers - but it takes time.
Alternatively, I can pay someone else to do that, and remove that timesink from the elapsed time between "start developing" and "find out how well we've achieved market fit".
I can always optimise later - move off Heroku, develop our own load balancing, all that stuff. Once I've got a working product/market fit, I probably will.
But doing that before I know if I'm going to chuck the entire infrastructure in the garbage and move on to idea #2 (and #3, and #4...), or indeed pivot so wildly that we'll have to reorganise all our server stuff anyway, is a waste of time. And time is valuable.
[+] [-] Adirael|13 years ago|reply
"create software that makes managing Apache, Redis, PostgreSQL": like Plesk, but better, faster, and less resource-hungry.
[+] [-] spitfire|13 years ago|reply
We should have puppet scripts to deploy, instrument and manage all the popular infrastructure choices by now.
The same way originally Linux was a build it yourself box of parts, we ought to have "cloud infrastructure" distributions, from bootup to app deployment.
The neat thing here is that as people improve the distribution you get cumulative savings. Back in the 90's you needed a skilled individual to setup a Unix/Linux system. Now, even an MBA can do it. The same could happen with infrastructure on a higher level.
[+] [-] scottschulthess|13 years ago|reply
It really depends. My current experience with heroku is that it is absurdly cheap - at least, for our use cases. We would have to do a lot more traffic for us to ever consider moving to a colo.
[+] [-] rubyrescue|13 years ago|reply
Inaka has a combined total of close to a billion pageviews/month across all our EC2-hosted apps for all of our clients and we have zero full time operations staff - we have 2 guys that spend (much less than) part time on it.
[+] [-] snowwrestler|13 years ago|reply
http://news.ycombinator.com/item?id=5222581
What's more likely, that everyone else in the world is too dumb to see this opportunity, or that perhaps you have underestimated how hard it is to do?
[+] [-] latchkey|13 years ago|reply
After a ton of H12 errors, they helped us find some slow points and optimize some things that were relatively slow. On our own, we did a huge amount of work to make things as fast as possible. While the H12s have gotten better, nothing has gotten rid of them completely. It really points to something fundamentally wrong with the routing layer, because at some level we just can't optimize our code any further. There are definitely quite a few times in the logging where we just can't explain how things are insanely slow, and we certainly can't explain why we still get H12 errors. To the point where we just gave up on it.
The thing that bothers me the most is that we have been complaining for a month now behind the scenes through our paid support contract about the things that are now being semi-admitted in public. No PaaS is perfect, and certainly hard problems are being worked on by smart people... the real issue here is the way that Heroku has pointed fingers at everyone but themselves, until finally someone had the time and balls to get a posting to the top of HN.
[+] [-] jkat|13 years ago|reply
As for the post, it's pretty much just documentation. I didn't see any apology. And the only promise of a better tomorrow is a vague "Working to better support concurrent-request Rails apps on Cedar".
[+] [-] azarias|13 years ago|reply
Heroku's blog response: "but until this week, we failed to see a common thread among these reports."
vs.
Adam's response to Tim Watson, a year ago:
"You're correct, the routing mesh does not behave in quite the way described by the docs. We're working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate. The current behavior is not ideal, but we're on our way to a new model which we'll document fully once it's done."
https://groups.google.com/forum/?fromgroups=#!msg/heroku/8eO...
[+] [-] gojomo|13 years ago|reply
From the discussion I've seen they have roughly two minimal options:
(1) Shard/tier the Bamboo routing nodes, so that a single router tends to handle any particular app, and thus the original behavior is restored. Consistent hashing on the app name could do the trick, or DNS tricks on the app names mapping to different routing subshards.
(2) Enable dynos to refuse requests, perhaps by refusing a connect or returning an error or redirect that tells a router to try the next dyno. (There are some indications a 'try another' logic already exists in their routers, so it might even be possible for customers to do this without Heroku's help. I have a question in with Heroku support about which request-shedding techniques might work without generating end-user visible errors.)
Both could potentially benefit from some new per-dyno load-monitoring features... which would also allow other more-sophisticated (but more costly and fragile at scale) balancing or load-shedding strategies.
I can see the commentariat lynch mob is out, but definitive recommendations and fixes take time. As they've admitted and apologized for the problem, I'd guess they'll have a more comprehensive response before their end-of-the-month user conference.
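Option (1) above is essentially consistent hashing of app names onto router shards. A minimal sketch in Ruby, where the `RouterRing` class, the shard names, and the replica count are all invented for illustration (this is not Heroku's implementation):

```ruby
require 'digest'

# Map each app to one of N router shards via a hash ring, so that
# all traffic for a given app tends to hit the same few routers.
class RouterRing
  def initialize(shards, replicas: 100)
    # Place several virtual points per shard on the ring to smooth
    # out the distribution across shards.
    @ring = shards.flat_map do |shard|
      (0...replicas).map { |i| [point("#{shard}##{i}"), shard] }
    end.sort_by(&:first)
  end

  def shard_for(app_name)
    h = point(app_name)
    # First virtual point clockwise from the app's hash position,
    # wrapping around to the start of the ring if needed.
    entry = @ring.find { |p, _| p >= h } || @ring.first
    entry.last
  end

  private

  # Stable 32-bit hash of a key, derived from SHA-1.
  def point(key)
    Digest::SHA1.hexdigest(key)[0, 8].to_i(16)
  end
end

ring = RouterRing.new(%w[router-a router-b router-c])
# The same app always lands on the same shard, restoring the
# "one router sees all of an app's dynos" property.
shard = ring.shard_for('rapgenius')
```

The appeal of a hash ring over plain modulo hashing is that adding or removing a router shard only remaps a fraction of apps, rather than reshuffling everything.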
[+] [-] btilly|13 years ago|reply
If you make sure that the dynos for a given app are clumped behind a few routers in the second layer, then you effectively get the old behavior. But you get it in a much more scalable way. The cost is, of course, that you add an extra layer of routing to everything.
(I emailed this suggestion to them. I have no idea whether they will listen.)
[+] [-] jlgaddis|13 years ago|reply
So Rap Genius, a customer, was able to figure out the issues (from the outside looking in) but Heroku, "on the inside" wasn't able to figure them out?
Or they're playing the "we didn't know, we're going to fix it right away" angle?
EDIT: also, s/failed to/did not/ makes more sense. "failed" implies they tried.
[+] [-] csense|13 years ago|reply
I.e., give their bigger customers like RapGenius (who said they pay Heroku $20k/month and whose HN post spurred this debate) their own dedicated routing cluster with 2-5 nodes. Once a single customer exceeds that size, they're probably paying Heroku enough that Heroku can afford to devote an engineer to working specifically with them to implement large-scale architecture choices like running the app's DB on a cluster, etc.
Pile several smaller customers on a shared routing cluster to cut costs and keep the cluster utilization high, but once the cluster gets to be a certain size (reaches a fixed number of backend dynos or metrics get bad enough), start putting customers on a new cluster.
It should be fairly trivial to use DNS or router rules to dynamically move existing customers from one routing cluster to another.
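A toy version of that placement policy might look like the following. The thresholds and cluster names are invented for illustration, not Heroku's actual numbers:

```ruby
# Greedy placement: big customers get a dedicated routing cluster
# outright; smaller ones fill the current shared cluster until it
# hits a dyno threshold, at which point a new cluster is opened.
DEDICATED_THRESHOLD = 100  # dynos; invented number
CLUSTER_CAPACITY    = 250  # dynos per shared cluster; invented number

def place_customers(customers)
  clusters = []
  shared = nil
  customers.each do |name, dynos|
    if dynos >= DEDICATED_THRESHOLD
      clusters << { name: "dedicated-#{name}", apps: [name], dynos: dynos }
    else
      # Open a fresh shared cluster when the current one would overflow.
      if shared.nil? || shared[:dynos] + dynos > CLUSTER_CAPACITY
        shared = { name: "shared-#{clusters.size}", apps: [], dynos: 0 }
        clusters << shared
      end
      shared[:apps] << name
      shared[:dynos] += dynos
    end
  end
  clusters
end

clusters = place_customers([['rapgenius', 120], ['blog', 4], ['shop', 10]])
```

Moving a customer between clusters then reduces to updating the DNS record their `*.herokuapp.com` name resolves to, which is what makes the scheme operationally plausible.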
[+] [-] chubot|13 years ago|reply
If you are a PaaS company, and you only have 5 metrics you can record, then 99th-percentile latency across all apps should be one of them.
On another note: why is Rails single-threaded??? That seems unbelievable. So if you have a 2 second database query, your Rails process does nothing else for that 2 seconds? I mean people complain about the GIL in Python, which actually has reasons behind it, but this is just crazy.
[+] [-] gfodor|13 years ago|reply
I am still considering having Heroku manage my PostgreSQL instance. This would be a large burden lifted leaving me to just manage the app servers, etc. Is there any reason to be concerned about their PostgreSQL hosting? Any horror stories?
[+] [-] homosaur|13 years ago|reply
Why?
Because I really think that, when push comes to shove, Heroku was actually trying to do the right thing with the changes they made, and perhaps didn't consider or understand some of the ramifications for the Rails community. They may have fallen in love a little too much with the new Node.js hotness and the like. Their CORE audience is startups/new businesses, where Rails is very popular, and they understand what this has done to their reputation. If they don't address this in a serious way it will damage their business severely.
I don't personally use Heroku but I have used it in the past and would not hesitate to use it on an appropriate project.
[+] [-] 46Bit|13 years ago|reply
Bit I wrote earlier about time-consuming requests - http://news.ycombinator.com/item?id=5216593
[+] [-] oellegaard|13 years ago|reply
1) Releasing a press release at 7 AM on a Saturday morning (CET)
2) The release reads mostly like what a politician's spin doctor would ask the politician to say: don't promise/admit too much.
3) They clearly state that they want to continue with this extremely inefficient way of routing. The right thing to do would be to make smaller clusters of load balancers that could then do proper routing, e.g. measuring the number of requests per dyno, last processing time, etc.
I'm currently working on a large project on Heroku and I'm very disappointed about this. We chose Heroku because we believed we could just `heroku scale web=X` when needed. Instead, now we know that it will be of very little use.
In the next week, I will be looking into a solution where I can utilize Heroku's add-on system without running my apps in Heroku dynos. Creating a small system to host LXCs on AWS EC2 seems within my capabilities (or I could use Cloud Foundry's application server component), and I believe I can configure a load balancer better than Heroku.
Let me know if anyone else is interested - we could make an open source project for this :-)
[+] [-] jacques_chester|13 years ago|reply
I think they aimed to put out a response ASAP.
[+] [-] tomlemon|13 years ago|reply
Just put it up on Rap Genius – create an account to help explain this post to the Heroku users it affects!
[+] [-] beambot|13 years ago|reply
At least there's a commitment to update the reporting tools... getting bad data in New Relic was (IMHO) the worst -- even worse than out-of-date docs.
[+] [-] bgentry|13 years ago|reply
Picture each individual dyno in that case as its own "intelligent router". Since it's not distributed and this requires no network coordination, the job of knowing which workers are available becomes trivial.
If you're inclined to read up on queuing theory, you'll see that having at least 2 processes per worker makes the problem much simpler.
[+] [-] tomlemon|13 years ago|reply
Tom@Rapgenius| about 1 year ago I know this is a bit of a vague problem, but I've been getting a bunch of Error H12 (Request Timeout)s recently, and I'm not sure what to do about it. It's not like I have some particularly slow actions; I'm getting this error for actions that under most circumstances work totally fine (i.e., return in less than 300ms). Also I don't have a deep request queue (I'm running 40 dynos which is more than enough). Maybe I'm doing some slow queries? Should I upgrade my DB? Also, I do notice that most of my app's time (according to New Relic) is being spent in Ruby (http://cl.ly/29132F272W2D0K1l2I3P). Would upgrading Ruby to 1.9 noticeably help this performance? (I'm a bit nervous it'll create a ton of problems).
Phil@Heroku Hello - I can look into this, but I'll need access to your New Relic account. Will you make sure '[email protected]' has access? Also, from your screenshot I notice your DB times are ~ 100 ms. We recommend keeping those times closer to 50 ms. You might be able to speed things up with a database upgrade. I'll look into New Relic once I have access and let you know what I find.
Tom@Rapgenius Thanks, Phil! How do I give you access to my New Relic account? I tried clicking "account settings" and got this: http://cl.ly/0V2J3i0826400I2s3b2c
Phil@Heroku Tom, I have access now. I'm not sure what was blocking me earlier. After looking at New Relic and the database server, I think a larger database will help. At the very least, it will be helpful to try the next level for a week and compare performance statistics in New Relic with the prior week. Your app is using an Ika right now, and the next step up is the Zilla database. We've made the upgrade process very simple, and it's outlined here - http://devcenter.heroku.com/articles/fast-database-changeove... Your database is ~ 5.4 GB in size (via the 'heroku pg:info' command) so an upgrade shouldn't take too long. You will be able to test the process by adding a Follower and timing it via the 'heroku pg:wait' command. This should give you a good idea of how long it will take to spin up the new database. Also, should the Zilla not help much, the downgrade process to an Ika will be the same. You only pay for the resources used.
The current database server appears to be a bit under-powered when it comes to Compute Units. The Zilla has more power and should provide some room to grow. As for an upgrade to Ruby 1.9.2, I'm not sure how much that would help. It would be an involved upgrade that would take time to plan and deploy. The database upgrade should be a quicker solution. Long-term you may want to consider moving to the Cedar stack and Ruby 1.9.2.
Tom@Rapgenius Thanks! I'm upgrading now
Tom@Rapgenius I'm still getting a ton of "Request Timeout" errors. E.g.: 2011-12-08 14:46:53.222 219 1 2011-12-08T14:46:53+00:00 d. heroku router - - Error H12 (Request timeout) -> GET rapgenius.com/Wale-ambition-lyrics dyno=web.17 queue= wait= service=30000ms status=503 bytes=0 one weird thing: there aren't any values listed for the "queue" and "wait" parameters. Could that indicate a problem? Could an exception have been thrown earlier in the request before the timeout? Or does the timeout error just indicate that the request took too long? If it's the latter I'm not sure how to troubleshoot all these errors since the associated actions are fast the vast majority of the time
Tom@Rapgenius Here's another interesting example:
2011-12-08 15:59:32.293 222 1 2011-12-08T15:59:32+00:00 d. heroku router - - Error H12 (Request timeout) -> GET rapgenius.com/static/templates_for_js dyno=web.17 queue= wait= service=30000ms status=503 bytes=0 This action is extremely simple – it doesn't access the DB or any external services. Here's the template: Ballin! <% unless current_user %> <% form_for User.new, :html => { :id => '' } do |f| %> Tired of entering your email address? Create a Rap Genius account and you'll never have to worry about it (or anything else) ever again: <%= render :partial => "/users/form", :object => f %> <%= f.submit "Create Account" %> <small>(Already have an account? <%= link_to 'Sign in', login_path, :class => :facebox %>)</small> <% end %> <% end %>
Besides a big request queue (which there isn't), how could this action possibly time out?
Phil@Heroku Tom - sorry for not getting back to you sooner.
It's possible for H12s to occur even for simple actions if there is already queueing for the app. With a busy site like yours, even a few H12s can cause a cascade of H12s for successive requests.
It looks like New Relic has not reported any downtime over the past 24 hours. Can we let the site run through the weekend and see how things look Monday after 3 days of New Relic data with the new Zilla?
Tom@Rapgenius > It's possible for H12s to occur even for simple actions if there is already queueing for the app.
I feel what you're saying, but I don't think my app's queuing. For one thing, New Relic shows 0 time spent in the queue during the period in which I'm getting all these timeouts. For another, I'm running 40 dynos and my average request time is <400ms. So:
400 ms * 3000 requests / minute * 1 min / 60000 ms = 20 simultaneous requests (i.e., 20 dynos) so 40 dynos should definitely be more than enough.
Also, shouldn't Heroku be showing me the queue / wait stats at the time of the timeout? That would help prove whether my app was queuing at the time in question
> It looks like New Relic has not reported any downtime over the past 24 hours.
New Relic isn't great at catching intermittent problems like this; you really feel it when you're using the site continuously for an hour or whatever. Also, users make many more HTTP requests than New Relic (since every page load kicks off several AJAX requests).
That said, there has been downtime in the past 24 hours (though less than in the previous 24): http://cl.ly/3D2b1Z170B0w1f1m113m
Tom@Rapgenius Here's some additional data: At 5am this morning (EST), Rap Genius went down. I woke up at 11am (it's a Saturday!), did a logs --tail and observed that basically every request was timing out. I did heroku restart, and now every request started returning a backlog too deep error
Finally, I added another 10 dynos (bumping the total to 50, which is a lot of dynos!), and this seems to have fixed the problem – perhaps because my app needs the additional capacity, or perhaps because merely changing the number of dynos reset something else. Either way, I'm sticking with 50 dynos for now out of fear even though I doubt my app needs that many (right?)
Either way, the 5 hours of unexplained downtime (there weren't any application-level exceptions or anything) that was fixable by tweaking my dyno count further supports my theory that something's going on with my app on Heroku's end.
Phil@Heroku Tom - I've been looking over your New Relic stats.
First - the good news - the upgrade to a Zilla seems to have helped. Database times are down a bit, which can only help. I checked the actual database server and it's not showing signs of over-work like the previous Ika was. Second, I notice that downtimes reported by New Relic over the past two weeks are in the early morning hours - 3 to 6 AM PST. Do you have any scheduled tasks that run during these times?
Also, request queueing is nearly zero, so 50 dynos does seem like a lot. What are your usage patterns like? The RPM graph in New Relic indicates the normal cyclical usage pattern, lower during the night, but what does Google Analytics tell you?
Finally, the Heroku platform has been having issues over the past week, but none of them correspond to the downtime you had Saturday morning.
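Tom's dyno arithmetic a few messages up is an application of Little's law, L = λ × W: average concurrency equals arrival rate times time in system. Checking his numbers:

```ruby
# Little's law: busy workers L = arrival rate (lambda) * mean time
# in system (W). Assumes single-threaded dynos and no queueing delay.
arrival_rate = 3000 / 60.0                  # 3000 requests/minute, in req/s
service_time = 0.400                        # 400 ms per request, in seconds
busy_dynos   = arrival_rate * service_time  # about 20 dynos kept busy
```

The catch, as the rest of the thread illustrates, is that this is only an average: with random routing to single-threaded dynos, a handful of slow requests can still pile traffic up behind individual dynos even when average utilization looks comfortable.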
[+] [-] latchkey|13 years ago|reply
We go fix all of our slow requests, but we still have H12 errors.
Can you tell us how many dynos we actually need to serve our requests and not get H12 errors? No.
Hey Rapgenius... thanks for having the balls to call Heroku out in public on this stuff. We are in the same boat.
[+] [-] MediaSquirrel|13 years ago|reply
Just a thought. ;)
[+] [-] WillieBKevin|13 years ago|reply
I would complain about H12 errors; they would tell me to upgrade my resources and/or that it was my problem and there was nothing they could do. We ended up with a solution that was easily 10x as expensive (over-powered DB, too many dynos) as our initial configuration, and it still didn't fix the issue.
I'm happy to provide the full text support requests, but they don't tend to be quite as juicy as the one you posted.
[+] [-] eaurouge|13 years ago|reply
So, do/will they now recommend Puma/Unicorn over Thin?
[+] [-] jeswin|13 years ago|reply
If you went from 1 router to 2, 50% of routers can't optimally route a given request. Going from 2 to 3, that's 66%; from 3 to 4, 75%.
Once you get to, say, 10 routers, you are already at 90% sub-optimal routing. So are they saying they had only 1 or 2 routers earlier?
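That fraction is just (n-1)/n: if the routers share no state, any one router only ever sees 1/n of an app's traffic, so the share of routing decisions made blind to a given dyno's queue grows with n. A trivial sketch (the function name is mine):

```ruby
# Fraction of routing state a single share-nothing router is blind
# to when there are n independent routers.
def blind_fraction(n)
  (n - 1) / n.to_f
end
```

So even a modest router mesh is already making almost all of its decisions without global knowledge, which is why the behavior degraded so sharply as Heroku scaled the mesh out.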
[+] [-] jlouis|13 years ago|reply
Ruby on Rails is using a default configuration where each process can serve one request at a time. There is no cooperative switch (as in Node.js) or (near) preemptive switch (as in Erlang, Haskell, Go, ...).
The routing infrastructure at Heroku is distributed. There are several routers and one router will queue at most one message per back-end dyno in the Bamboo stack and route randomly in the Cedar stack. If two front-end routers route messages to the same Dyno, then you get a queue, which happens more often on a large router mesh.
Forgetting who is right and wrong, there are a couple of points to make in my opinion.
The RoR model is very weak. You need to handle more than one connection concurrently, because under high load queueing will eventually happen. If one expensive request goes into the queue, then everyone further down the queue waits. In a more modern system like Node.js you can manually break up the expensive request and thus give service to other requests in the queue while the back-end works on the expensive one. In stronger models (Haskell, Go, Erlang) this break-up is usually automatic, and preemption makes sure it is not a problem. If you have a 5000ms job A and 10 50ms jobs, then after 1ms the A job will be preempted and the 50ms jobs will get service. Thus an expensive job doesn't clog the queue. Random queueing in these models is often a very sensible choice.
Note that Heroku is doing distributed routing, so the statistical model Rapgenius has made is wrong. One, requests do not arrive as a Poisson process. Usually one page load gives rise to several other calls to the back-end, and this makes the requests dependent on each other. Two, there is not a single queue and router but several. This means:
* You need to take care of state between the queues - if they are to share information. This has overhead. Often considerable overhead.
* You need to take care of failures of queues dynamically. A singular queue is easy to handle, but it also imposes a single point of failure and is a performance bottleneck.
* You have very little knowledge of what kind of system is handling requests.
Three, nobody is discussing how to handle the overload situation. Forgetting about routing for a moment: what if your dynos can take 2000 req/s but the current arrival rate is 3000? How do you choose which requests to drop? Because you will have to drop some.
If you want to solve this going forward, you probably need dyno queue feedback. Rapgenius uses the length of the queue in their test, but this is also wrong. They should use the sojourn time spent in the queue, which indicates how long you wait in the queue before being given service. According to Rapgenius, they have a distribution where requests usually take 46ms (median) but the maximum is above 2000ms. A queue of length 43 (all median requests) and a queue of length 1 (one worst-case request) can then have roughly the same sojourn time. Given this, you can feed back to the routers how long a request will usually stay in queue.
But again, this is without taking the distribution of the routers into account. The problem is way, way harder to solve in that case.
(edit for clarity in bullet list)
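The gap between a single shared queue and naive random routing, which is the crux of the Rapgenius complaint, shows up even in a toy simulation. This is only a sketch: the dyno count, arrival spacing, and service-time mix are invented, arrivals are deterministic rather than bursty, and it ignores the arrival-dependence caveat above.

```ruby
# Compare mean queueing delay for two routing policies over the same
# job stream: a single shared FIFO queue feeding every dyno, versus
# routing each job to a uniformly random dyno's private queue.
def simulate(jobs, dynos, mode, rng)
  free_at = Array.new(dynos, 0.0) # time each dyno next becomes idle
  t = 0.0
  total_wait = 0.0
  jobs.each do |service|
    t += 0.010 # one arrival every 10 ms (invented, steady rate)
    i = mode == :shared ? free_at.index(free_at.min) : rng.rand(dynos)
    start = [t, free_at[i]].max # wait if the chosen dyno is busy
    total_wait += start - t
    free_at[i] = start + service
  end
  total_wait / jobs.size
end

# Mostly 46 ms requests with an occasional 2 s outlier, echoing the
# median/max figures quoted from Rapgenius.
gen  = Random.new(42)
jobs = Array.new(5_000) { gen.rand < 0.02 ? 2.0 : 0.046 }

shared_wait = simulate(jobs, 20, :shared, Random.new(1))
random_wait = simulate(jobs, 20, :random, Random.new(1))
```

On this workload the random policy's mean wait comes out far above the shared queue's, because one 2-second job stalls every request unlucky enough to be routed behind it while other dynos sit idle.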
[+] [-] Shank|13 years ago|reply
They would have warranted a better response if they said they were actively looking into how to improve the routing system, but by the looks of things they're going to sit by and hope developers switch practices so they don't have to solve their problem.
[+] [-] Tekker|13 years ago|reply
Effectively, they have a fundamental architectural problem, and don't know how to work past it.
[+] [-] timv|13 years ago|reply
In short, they went for ~2 years with documentation that advertised features that the implementation didn't have, while receiving a string of support issues that they wouldn't acknowledge as their problem.
Yet, the blog posts show no indication that they are interested in working out why they offered such terrible service to their customers and how they can fix the company culture to take these issues seriously in the future.
There's a chance they'll come up with an architectural solution for their routing problem. But unless they do some serious introspection and work out why they (as a team) stuffed up so badly, then there's no chance that they'll fix that problem.
And if they don't fix that, then why would anyone have confidence that this sort of issue isn't going to be commonplace?
(Disclaimer: Not a Heroku customer)