Rap Genius (YC S11) responds to Heroku’s call for ‘respect’

tvladeck|13 years ago

It doesn't matter how efficient or inefficient RG was with their Rails app. It's almost certainly true that they could have done things better on their end, and their performance penalty wouldn't have been as severe -- but that really is not the point.

The point is that one company promised a level of service with their product that they did not deliver, and the difference was significant and persistent. The fact that the consumer could have used the product more efficiently is immaterial to that.

Other things that don't matter:

-that RG could/should move to another provider. That is of course their choice now, but it does not change the money they've spent and wasted with Heroku.

-that the routing problem is hard. If anything this makes it worse - it's a hard problem so people would pay a lot of money for a solution. What matters is that Heroku claimed to solve it and did not.

-that other consumers of the product managed to figure this out before RG. Heroku was still advertising through their documentation that they offered a routing solution, and they did not make clear to their customers that a significant feature of their product was now different.

Furthermore, Heroku appeared to obfuscate this fact and shift blame to the customer during the time RG was trying to diagnose their issues.

Now, by attacking RG's tone, Heroku have employed argument-level DH2 [1], which at least according to pg is not even worth considering. They have at least acknowledged their mistake, but to me that means that by extension they have sold something that they did not deliver on. The only honest way to move forward is for Heroku to offer some kind of compensation to the customers that were affected.

[1]: http://www.paulgraham.com/disagree.html

DeepDuh|13 years ago

Yes, the comment quality on HN seems to be quite bad when it comes to Heroku threads. Why do so many CS professionals appear to be attaching themselves emotionally to software tools? That's pretty much what I have to conclude if you can't admit that this PaaS provider has screwed up and deserves more scrutiny when deciding on the platform for your next project/migration.

Isn't one of the great things about the software startup scene that we can decide freely on what tools to use? Except for very niche markets we always have alternatives, even if it means a bit more work on our side.

brown9-2|13 years ago

"You have to feel comfortable that those people will generally give you good value for your money (since you can’t literally observe everything they do) and that they will tell you when something’s wrong as soon as they know, rather than covering it up.

I used to feel this way about Heroku, and I might again in the future, but I don’t right now."

I have a hard time understanding why, for all the money Rap Genius pays Heroku, they don't simply set up their own instances on EC2 and run the app there themselves. It seems like with a few days' work with Puppet or Chef you could automate getting your code onto dozens of EC2 instances and installing the necessary tools/server processes, plus you wouldn't have to complain anymore about how you can't run Unicorn.

Yes I get that there is a certain amount of value in being able to pay someone else to do all these things for you and saving time - but if you aren't happy with the result and the value given the money you are paying (and RG is not), then at a certain point it's time to just bite the bullet and fix things yourselves instead of continuing to be hamstrung by problems that the hosting provider won't/can't fix. There comes a point where you get large enough, and you are paying enough to Heroku, that it would be worth it to do things yourself and eliminate the problems.

rubyrescue|13 years ago

This is so true. The fact of the matter is that Rap Genius has obviously had to have someone spend a ton of time diagnosing problems with Heroku, and it is objectively cheaper just to host some servers than to pay for Heroku dynos.

This is why I always tell people that Heroku is actually NOT a good solution if you truly need scale. They're good for staging, launch, and an early traffic emergency or two. After that, ONCE YOU NEED TO SCALE, it's cheaper just to run your own servers, because the problem that Heroku is solving for you becomes a smaller and smaller percentage of your overall operations budget.

znowi|13 years ago

I agree with this point, however, how Rap Genius spends its money isn't an issue here. Whatever the reason, they paid and expected to get an adequate service from the company, which they didn't. And on top of this, they found the shady practice at work. And this is a big fucking issue, if you ask me.

JangoSteve|13 years ago

"I have a hard time understanding why, for all the money Rap Genius pays Heroku, they don't simply set up their own instances on EC2 and run the app there themselves."

Who says they won't do that now?

Obviously when they started, they had no idea they'd have these problems or that they'd spend so much time diagnosing them, because Heroku told them that they wouldn't have these problems to begin with.

andrewvc|13 years ago

The fact of the matter w/ anything outsourced is that you can outsource responsibility, but you can't outsource accountability.

Ultimately RG's devs are responsible for their choice to leave all the admin work up to heroku.

oijaf888|13 years ago

Yeah you would think the cost savings from EC2 and the 60K they spent on New Relic would cover paying for a quality sysadmin to run that stuff.

thraxil|13 years ago

"Yes, one solution is to run a concurrent web server like Unicorn, but this is very difficult on Heroku since concurrent servers use more memory and Heroku’s dynos only have 512mb of ram, which is low for even processing one request simultaneously."

Is this really accurate? 512mb is barely adequate for serving a single request at a time? I'm not a Rails developer, but that sounds terrible. I'm all for trading off some performance for rapid development, but that seems a bit extreme.

I'm currently running twelve Django apps on one 512MB Rackspace VM. It's a bit tight, and I don't get a lot of traffic on them, but it's basically fine. And that's with Apache worker mpm + mod_wsgi (with an Nginx reverse proxy in front) which probably isn't even the lightest approach. And having been writing apps in Erlang and Go recently, I'm starting to feel like Python/Django are unforgivably bloated in comparison.

thomasmeeks|13 years ago

It really depends on your application. A fresh rails app will take up ~30mb of memory (iirc, been a while since I checked). Thirty gems and 11,000 lines of code later, yes, it can spike to 256mb.

If I were to toss down an average, seems like ~100mb is what I see most of the time for non-trivial rails apps.

fomb|13 years ago

I have many apps on Heroku, all running Rails, and mostly running on Unicorn, with three or four workers. Most of the apps I've seen pass me by use no more than 150Mb per worker, and there's a fair amount of work going on in many of them with image processing and the like.

512Mb for a single application sounds incredibly high to me.

EDIT: After looking at the docs, it seems like 512Mb isn't even a hard limit: https://devcenter.heroku.com/articles/dynos#memory-behavior

kmfrk|13 years ago

Atwood's new Discourse thingy recommends 1GB of RAM:

    We also recommend a minimum 1 Gb RAM to host Discourse,
    though it may work with slightly less.

~ http://www.discourse.org/faq/

Guess it depends.

benologist|13 years ago

Reading things like 512mb isn't enough for more than one request at a time, that a dyno serves only one request at a time, and that the performance of that one request looks terrible even though it's obviously got an entire VM dedicated to it...

What are (edit:) Rails developers getting in exchange for these enormous penalties that makes it worth choosing?

grey-area|13 years ago

Most rails apps use nothing like that amount of memory; the norm is more like 80-150MB. There are various factors which affect how much memory you use, and of course if your processes are leaking memory they could easily grow over time and hit any limit. Rails itself takes up around 30MB, so this is all about the specific app code. Another common problem is loading lots of records into memory (say, fetching all your user records at once): this allocates lots of memory that then isn't freed. Personally I find Passenger handles this perfectly well out of the box without having to worry too much about memory usage, routing or other issues, but it does require keeping an eye on the app code as the app grows and fixing any issues that come up with memory usage or response time. Those are not problems specific to rails.
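The "fetch all your user records at once" problem is easy to demonstrate outside Rails too. This is a minimal sketch in Python, with an in-memory SQLite table standing in for a real database (the `users` table and the `iter_users_in_batches` helper are hypothetical). It contrasts one query that pulls every row into memory with walking the table in primary-key order in fixed-size batches, which is roughly the idea behind ActiveRecord's `find_each` (default batch size 1000):

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_users_in_batches(
    conn: sqlite3.Connection, batch_size: int = 1000
) -> Iterator[List[Tuple[int, str]]]:
    """Walk the hypothetical users table in primary-key order, batch_size rows at a time."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, email FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last id we saw


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(2500)],
)

# Anti-pattern: every row is materialised in memory at once.
everyone = conn.execute("SELECT id, email FROM users").fetchall()

# Batched: only about one batch of rows is alive at any one time.
processed = batches = largest = 0
for batch in iter_users_in_batches(conn):
    batches += 1
    largest = max(largest, len(batch))
    processed += len(batch)

print(len(everyone), processed, batches, largest)  # 2500 2500 3 1000
```

Both approaches touch every row; the batched one simply caps how many rows are held at once, which is what keeps a long-running worker process from ballooning.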

Without knowing the specifics it's hard to say for sure, but I really think RG should try a comparison with running their own real VM (not a web worker on heroku) and see how well it runs. If they'd done that they'd probably find and fix the reasons that their processes are taking so long to respond and taking up such a huge amount of memory, because they'd feel more ownership of those problems, instead of playing a blame game with heroku.

This is not rocket science but it is a series of trade-offs, and heroku seem to have optimised for short-running processes which don't take up lots of memory - many web apps run that way and would be happiest with random routing. Yes, heroku could do better, but at some point you have to take responsibility for your own ops instead of expecting some service to abstract away all the hard stuff, particularly if you're seeing performance issues and have a busy site. The amount they're paying heroku would easily pay for far more VPSes than they need.

So in summary, Heroku is not for everyone, and rails isn't really the problem here, so there are no enormous penalties for using it, just the sort of problems you see running any web app.

dustym|13 years ago

Rails developers

seivan|13 years ago

What kind of bullshit is this? That's 512mb of shared resources; you decide how many requests it actually serves.

Usually a larger Rails app can handle 2-3 requests at once on a dyno. Just configure that many Unicorn workers and set it in your Procfile. This has been known since 2011 (a week after Cedar was announced).
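A back-of-the-envelope way to choose that worker count: divide the dyno's memory, minus some headroom, by the measured memory per worker. This is only a sketch. The 512 MB dyno size comes from this thread, the 100/150/256 MB per-worker sizes are the rough figures other commenters quote, and the 64 MB reserve for the Unicorn master and other overhead is a made-up placeholder you would replace with your own measurements. The result is what you would then put in Unicorn's `worker_processes` setting.

```python
def max_unicorn_workers(dyno_mb: int, per_worker_mb: int, reserve_mb: int = 64) -> int:
    """Rough upper bound on how many worker processes fit in one dyno.

    reserve_mb is a hypothetical allowance for the Unicorn master and other
    overhead; measure your own app rather than trusting this default.
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = dyno_mb - reserve_mb
    return max(usable_mb // per_worker_mb, 0)


for per_worker_mb in (100, 150, 256):
    print(per_worker_mb, max_unicorn_workers(512, per_worker_mb))
# 100 4
# 150 2
# 256 1
```

With the ~150 MB workers mentioned above this lands on 2 workers, in line with the "2-3 requests" estimate; a leaky app whose workers drift toward 256 MB only fits one.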

grk|13 years ago

Speed and ease of development, mostly.

aelaguiz|13 years ago

The complaints of what amounts to essentially support contract extortion are something that I've personally experienced.

They were literally ignoring our repeated customer service tickets pleading for assistance or a phone call or something. We were paying them hundreds of dollars per month at the time.

When we finally got through, the only people we could get ahold of were salesmen. Essentially we were made to believe that only with a $1000/mo support contract would we receive customer support.

FWIW, our issue was frequent network timeouts to other EC2 services which were... They did eventually resolve those, after months, and never did they assist us.

Heroku's platform is a significant accelerator of development for a startup. Using the platform has enabled us to do things faster and better than we'd otherwise be able to do them for the money and time we've invested.

That being said, I look forward to the day they have a true/viable competitor and are forced to compete on service. I'm extremely bitter towards them at the moment as a result of my customer support torture experience.

ollysb|13 years ago

Yes, I got bitten by their lack of customer support a couple of weeks ago. I did a release and the rails asset pipeline stopped precompiling the resources. I'd tested in staging so this came as a bit of a surprise. I promptly rolled back to the previous release (which had been working fine for days) only to find that it was now broken as well. With my production app now broken I fired off a request for support. At this time we were running 8 dynos and 3 workers (not to mention a bunch of addons). This was also Saturday afternoon, which turned out to be a bit of a problem: I received an auto-response saying that support was only available Monday to Friday! Paying the premium rates for heroku and not receiving support for a production failure really was a bitter pill to swallow. We're running fast at the moment and don't have time to switch providers, but when we do we will certainly be looking at the options.

wmf|13 years ago

Nah, I think Heroku is pretty principled. There's no amount of money you can pay them to get working load balancing or multi-region reliability.

kmfrk|13 years ago

Rap Genius gets a (YC) tag, but Heroku don't?

I've always wondered whether the cut-off is time- or success-based. Maybe pg should write a Boolean return function for that. :P

Big props to Rap Genius for explaining the problem so plainly in the article. Unfortunately, many people of prominence in tech aren't even capable of talking about what they do to laymen.

rcavezza|13 years ago

RE: YC Tag - I think it is because Heroku was acquired.

sologoub|13 years ago

This entire thing against Heroku is so disingenuous... The fact that New Relic didn't expose these metrics is not great, but has very little to do with the Rap Genius team not knowing about the metric.

Apparently, the fact that requests can be queued at the dyno level was common public knowledge back in 2011! Here's a quote from a Stack Overflow answer:

"Your best indication if you need more dynos (aka processes on Cedar) is your heroku logs. Make sure you upgrade to expanded logging (it's free) so that you can tail your log.

You are looking for the heroku.router entries and the value you are most interested is the queue value - if this is constantly more than 0 then it's a good sign you need to add more dynos. Essentially this means than there are more requests coming in than your process can handle so they are being queued. If they are queued too long without returning any data they will be timed out."

Source: http://stackoverflow.com/a/8428998/276328
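A minimal sketch, in Python, of the check the quoted answer describes: pull the key=value fields out of a `heroku[router]` line and flag requests whose `queue` value is above zero. The sample is the router line quoted later in this thread (with the space before `fwd=` restored); the field names come from that line, and the helper names are hypothetical. Per the Rap Genius article, these `queue`/`wait` values may read 0 even when requests are waiting at the dyno, so a check like this is only as good as the numbers the router reports.

```python
import re
from typing import Dict

# key=value pairs; values may be double-quoted, e.g. fwd="157.55.33.98".
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_router_line(line: str) -> Dict[str, str]:
    """Extract the key=value fields from a heroku[router] log line."""
    return {key: value.strip('"') for key, value in FIELD_RE.findall(line)}


def millis(value: str) -> int:
    """Turn a duration such as '366ms' into an integer number of milliseconds."""
    return int(value[:-2] if value.endswith("ms") else value)


def is_queued(fields: Dict[str, str]) -> bool:
    """True if the router reported a non-zero queue for this request."""
    return int(fields.get("queue", "0")) > 0


# Sample line from the thread (space before fwd= restored).
SAMPLE = ('2013-03-02T15:41:24+00:00 heroku[router]: at=info method=GET '
          'path=/Asap-rocky-pretty-flacko-lyrics host=rapgenius.com '
          'fwd="157.55.33.98" dyno=web.234 queue=0 wait=0ms connect=3ms '
          'service=366ms status=200 bytes=25582')

fields = parse_router_line(SAMPLE)
print(fields["dyno"], millis(fields["service"]), is_queued(fields))  # web.234 366 False
```

Run over a tailed log, counting the lines where `is_queued` is true gives the "constantly more than 0" signal the answer talks about.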

When you use a PaaS, it doesn't mean you don't need to be serious about it and can completely forget about all technical aspects. Granted, it should have been included with New Relic from day one, but that hardly justifies such a direct and persistent attack on Heroku.

amatix|13 years ago

From the article, it sounds like they were well aware of the logs & queue values, but they were misleading:

Their logs are STILL incorrect. Here’s a sample line:

  2013-03-02T15:41:24+00:00 heroku[router]: at=info method=GET path=/Asap-rocky-pretty-flacko-lyrics host=rapgenius.com fwd="157.55.33.98" dyno=web.234 queue=0 wait=0ms connect=3ms service=366ms status=200 bytes=25582

Those queue and wait parameters will always read 0, even if the actual value is 20000ms. And this has been the case for years.

cbs|13 years ago

"Here's a quote from Stackoverflow answer"

I tend to read (and trust) official documentation before Stack Overflow. I use Stack Overflow and it is a great tool and all, but it can be really hit or miss. It doesn't cover every corner of every tech, and unless the answer is available somewhere on the internet, or the person answering has first-hand experience, it can lead to misleading, wishy-washy answers.

Ultimately, by pointing to SO you're lowering the expectations of a paid service from "the documentation reflects the product" all the way down to "users should read everything googleable about the product they're using, and trust that OVER the official docs. Including mailing list posts from 2011 and a stack overflow question that asks a different question than you're asking"

vannevar|13 years ago

The problem wasn't that queuing delay was impossible to detect. The problem was that the documentation described a specific load balancing setup that would have guaranteed better performance per dyno, and that setup was not in fact what was being delivered. It was clearly a material misrepresentation, and in any other service context would constitute a deceptive trade practice. That Heroku is being defended at all is a testament to the goodwill they've built up in the tech community, but it doesn't change the fact that they misrepresented their service, even if it was negligence rather than malice.

jonmc12|13 years ago

Why does Lehman say Heroku is "one of a kind in the world"? Isn't Cloud Foundry equivalent? http://www.quora.com/What-are-the-main-differences-between-C...

spronkey|13 years ago

I'm astounded at the number of "$60k hires a good sysadmin and some EC2 resources" comments. You guys clearly don't understand exactly what Heroku (or a similar service) offers - providing it works.

There's a concept called a Bus Factor. Basically, it's the number of people who would have to be hit by a bus (or otherwise made unavailable) before your business is completely derailed.

With $60k spent on a single sysadmin and an army of EC2, that's a pretty effing small bus factor - 1. So... that one guy gets taken out of action, and they're more or less toast? Yeah, no. Heroku gives them a massive bus factor for perhaps a little bit more money than it would take to do it cheaply themselves. It's a cheap way to avert risk.

They're probably at the size now where they could handle taking it in-house, but you've still then got to factor in hiring, developing the procedures for ops in-house etc., and migrating. It's not easy to just flip the switch.

In any case, Heroku's behaviour is pretty shoddy. Though, knowing how much of a pain documentation is, I'm not surprised. I don't think they realised just how bad the change from intelligent to random routing actually was - and they didn't treat it as such. This is giving them the benefit of the doubt, though, because the other option is that they didn't publicise it precisely because they knew how bad it is. Scary thought.

plasma|13 years ago

I think it's obvious that Rap Genius would be happy with an "I see how it's a problem, let us fix it" quote from Heroku - just acknowledging that there is an underlying problem and that there is a future on the platform.

dkhenry|13 years ago

This is the tech world equivalent of tabloids. Please don't promote this mindless back-and-forth. If you have a problem with Heroku, leave and go to one of the other providers. If you don't, stay and push them to fix this problem. Either way, stop pretending this is some huge event that we must mindlessly obsess over.

jfim|13 years ago

Indeed, especially considering it's painfully obvious that the problem isn't on Heroku's side but rather their app's dismal performance. You should be able to easily do a couple of dozen requests per second; this is the kind of performance we're getting out of a single Heroku dyno on a dynamic page with no caching:

  $ ab -n 1000 -c 20 https://*****-staging.herokuapp.com/**********
  This is ApacheBench, Version 2.3 <$Revision: 655654 $>
  Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
  Licensed to The Apache Software Foundation, http://www.apache.org/
  
  Benchmarking *****-staging.herokuapp.com (be patient)
  Completed 100 requests
  Completed 200 requests
  Completed 300 requests
  Completed 400 requests
  Completed 500 requests
  Completed 600 requests
  Completed 700 requests
  Completed 800 requests
  Completed 900 requests
  Completed 1000 requests
  Finished 1000 requests
  
  
  Server Software:        
  Server Hostname:        *****-staging.herokuapp.com
  Server Port:            443
  SSL/TLS Protocol:       TLSv1/SSLv3,AES256-SHA,2048,256
  
  Document Path:          /**********
  Document Length:        9670 bytes
  
  Concurrency Level:      20
  Time taken for tests:   7.130 seconds
  Complete requests:      1000
  Failed requests:        0
  Write errors:           0
  Total transferred:      10034000 bytes
  HTML transferred:       9670000 bytes
  Requests per second:    140.25 [#/sec] (mean)
  Time per request:       142.606 [ms] (mean)
  Time per request:       7.130 [ms] (mean, across all concurrent requests)
  Transfer rate:          1374.25 [Kbytes/sec] received
  
  Connection Times (ms)
                min  mean[+/-sd] median   max
  Connect:       55   59  31.7     58    1057
  Processing:    37   82  43.8     66     308
  Waiting:       35   74  42.7     57     298
  Total:         92  141  53.4    124    1096
  
  Percentage of the requests served within a certain time (ms)
    50%    124
    66%    138
    75%    153
    80%    166
    90%    199
    95%    239
    98%    282
    99%    301
   100%   1096 (longest request)

Edit: formatting.
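For reading the ApacheBench output above, it helps to see how the summary lines follow from the raw totals. A minimal sketch in Python, assuming ab's usual definitions (requests/sec = completed / elapsed; mean time per request = concurrency × elapsed / completed; the "across all concurrent requests" figure = elapsed / completed; transfer rate = total bytes / 1024 / elapsed). The `ab_summary` function is hypothetical; the inputs are the figures from the run above.

```python
def ab_summary(complete: int, concurrency: int, seconds: float, total_bytes: int) -> dict:
    """Recompute ApacheBench-style summary figures from the raw totals."""
    return {
        "requests_per_sec": complete / seconds,
        # Mean latency seen by each of the `concurrency` simulated clients.
        "ms_per_request": concurrency * seconds * 1000 / complete,
        # Wall-clock time per request when spread across all clients.
        "ms_per_request_all": seconds * 1000 / complete,
        "kbytes_per_sec": total_bytes / 1024 / seconds,
    }


summary = ab_summary(complete=1000, concurrency=20, seconds=7.130, total_bytes=10034000)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

This reproduces the 140.25 req/s and ~142.6 ms figures; the transfer rate comes out at 1374.31 rather than 1374.25 because ab divides by the unrounded elapsed time, not the 7.130 s it prints. Note that the ~142 ms "time per request" is per client with 20 in flight, while a single-threaded dyno behind a random router sees requests one at a time.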

olefoo|13 years ago

I'm not Heroku's biggest fan, and haven't used it for more than a couple of one-off fiddles to play with the platform.

But my sympathy is going to them, because what I see coming from Rap Genius looks like a classic blame game. So a vendor's documentation was unclear and your server sucked publicly for some time? Shameful. You didn't know about it because you expected your vendors to give you extra hand-holding? That's really rough. Instead of fixing the issues and moving on, you make it the one thing that everyone thinks about when your company is mentioned... that might not be in your best long-term interests.

After this, I would be hesitant to enter into any sort of relations with Rap Genius, and I'm not that sure of what they do or what their product is.

gregd|13 years ago

I'm having a hard time understanding your justification for your sympathy going to Heroku.

RG is paying for PaaS from Heroku based on documentation, sales pitches, etc. They're also paying good money for the tools necessary to make business decisions based on data collected from that PaaS. In the realm of customer service, why wouldn't you expect "hand-holding" from your vendor? Why is it unreasonable to have that expectation? Why is it acceptable for your vendor to fall back on "optimize your web-stack" as a response? How do you expect them to "fix" this problem without the vendor's involvement? What did you expect them to do, change platforms? How are they supposed to "move on" when the issue hasn't been resolved?

Have we gotten so far away from customer service with the likes of Google, that we don't even know what that means anymore? Are we to settle for mediocrity from any PaaS because our expectations are just too high?

paul_f|13 years ago

We were promised flying cars and got online Rap lyrics instead.

dtweney|13 years ago

Here's the other side of the story, from Heroku: http://venturebeat.com/2013/02/28/heroku-chief-opens-up-abou...

neya|13 years ago

Just curious - why, after all this mess, didn't Rap Genius recommend Engine Yard (Heroku's competitor)? Is it because they had similar issues too, or did they simply not try switching over to a different provider altogether? Just curious.

aptwebapps|13 years ago

Seems like it would just muddy the waters further if they recommended someone else.

hashset|13 years ago

Did they seriously sell a Gem 'New Relic' as a diagnostic tool that flat-out makes up queuing and response latency numbers on requests to their platform? If this is true then hell yes they need to refund all their customers!

wmf|13 years ago

New Relic is a third party tool that Heroku resells. The numbers aren't made up; they are measured, but in the wrong place. The result is still wrong numbers, but it's not obvious where to pin the blame.

ChuckMcM|13 years ago

So what happens when Heroku says "Ok, fine, we can't give you the service you want, please download any data you want to keep and we'll re-allocate those resources to our other customers in 60 or 90 days." ?

This has taken on the patina of a really huge fight between operations and engineering with nobody to step in and say "Hey, we both want to make progress here, let's see what we can do." There is no common point of contact here, sadly.

What is the end goal? One of these companies being out of business? What? It's pretty clear that Heroku doesn't have any ideas on how to implement routing the way Rap Genius believed it worked; they even said as much. So what is the next step?

wmf|13 years ago

For $60,000 per month they can't create a mode where all your dynos are behind a single HAProxy with "intelligent" least-connections load balancing?
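To make the random vs. "intelligent" routing difference argued about in this thread concrete, here is a toy simulation in Python. It is a sketch under stated assumptions, not a model of Heroku's actual router: Poisson arrivals, exponentially distributed service times, dynos that each run one request at a time and queue the rest FIFO (like a single-threaded Rails dyno), and two dispatch policies: pick a dyno uniformly at random, or pick the dyno with the fewest outstanding requests (the least-connections idea above). The function name, parameters and defaults are all hypothetical.

```python
import random
from collections import deque
from statistics import mean


def simulate(policy: str, n_dynos: int = 10, n_requests: int = 20_000,
             utilisation: float = 0.8, seed: int = 1) -> float:
    """Return the mean queueing delay, in units of the mean service time.

    Toy model (hypothetical parameters): Poisson arrivals, exponential
    service times with mean 1.0, and dynos that each run one request at
    a time and queue the rest in FIFO order.
    """
    rng = random.Random(seed)
    arrival_rate = utilisation * n_dynos    # each dyno serves 1.0 requests per time unit
    free_at = [0.0] * n_dynos               # time at which each dyno clears its backlog
    outstanding = [deque() for _ in range(n_dynos)]  # completion times, oldest first
    waits = []
    t = 0.0
    for _ in range(n_requests):
        t += rng.expovariate(arrival_rate)
        service = rng.expovariate(1.0)
        # Forget requests that have already finished by the arrival time t.
        for q in outstanding:
            while q and q[0] <= t:
                q.popleft()
        if policy == "random":
            d = rng.randrange(n_dynos)
        elif policy == "least_connections":
            # Fewest in-flight plus queued requests; ties go to the lowest index.
            d = min(range(n_dynos), key=lambda i: len(outstanding[i]))
        else:
            raise ValueError(f"unknown policy: {policy!r}")
        start = max(t, free_at[d])          # wait behind whatever this dyno already has
        free_at[d] = start + service
        outstanding[d].append(free_at[d])
        waits.append(start - t)
    return mean(waits)


for policy in ("random", "least_connections"):
    print(f"{policy:>18}: mean wait {simulate(policy):.2f} x mean service time")
```

Under these assumptions random routing turns each dyno into an independent single-server queue, so its mean wait should come out near the textbook M/M/1 value of ρ/(1 − ρ) = 4 service times at 80% load, while least-connections steers new requests away from a dyno stuck on a slow one and waits far less. Other service-time distributions change the numbers, but typically not that ordering.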

ctovision|13 years ago

Heroku should make this right if they want long term success.
