
How and Why Mixpanel Switched from Erlang to Python

139 points | ankrgyl | 14 years ago | code.mixpanel.com

80 comments

[+] mhd|14 years ago|reply
I might get flamed/downvoted for this, but where's the content of this article? Apart from some vague tropes ("Erlang is bad at string processing!" "We need scalability, and that means async Python", "Pick the right Python libraries!"), there's really nothing interesting about the implementation details, especially considering such an attention-grabbing headline (to put it nicely).

What specifically was bad about the Erlang code? Isn't this just saying that if nobody in your company really understands a language, don't use it?

This is more about the technical competency of a specific company than general technical issues. Or, to put it bluntly, it's more "Mixpanel sucks at Erlang" than "Erlang sucks". Don't get me wrong, I'd be really, really interested in a good analysis of why Erlang was the wrong choice in this case, but this article didn't even get close to anything technically interesting (even with the ubiquitous requests/s graph).

[+] buro9|14 years ago|reply
The thing I learned from the article is that they return 0 for failure and 1 for success. And then, instead of fixing this meaningfully, they added a bit of sticky plaster that enables you to ask for verbose info when you get a 0.

I can't help but think, when I spot such code smells, that perhaps the issue runs deeper. So when I then read about Erlang vs Python, I also can't help but think that it's probably not the language choice that is the problem.

I don't think I fully understood that I judge code at first sight like that, but I clearly do.

[+] noelwelsh|14 years ago|reply
I agree with you. The most important lesson I got from this post is that you'll get a bunch of hits if you choose a provocative title.
[+] killerswan|14 years ago|reply
There isn't even a comparison on the graph between Erlang and Python versions!
[+] fleaflicker|14 years ago|reply
I think it did make a technically interesting point--if your infrastructure is written in a language that nobody in your team has mastered, it might be worth the investment to port it to something else.

Erlang is widely regarded as a great platform for servers with lots of concurrent connections because its processes are so lightweight (Facebook chat is built on Erlang). And yet Mixpanel decided to rewrite in a language that would be more maintainable for them in the future.

[+] gms|14 years ago|reply
I agree with you, for what it's worth.
[+] acgourley|14 years ago|reply
I just hope you're not suggesting the author should never have written it - not every blog post needs to be a dazzler. I'll assume you're angry about the upvotes instead.
[+] trefn|14 years ago|reply
Pretty harsh response to a post from an intern, dudes and dudettes.

Yeah, the erlang server code was pretty gross. It was one of the very first things written when we started Mixpanel over 2 years ago, and it's only been updated a few times since then.

I feel like the big thing you guys are missing is how little time we have. It's not like we don't know when we have bad code, or we don't realize that we made mistakes in the initial server design - we just have a million things on our collective plate. Fixing a very simple server - (accept request, validate json, put on queue) that is doing its job okay hasn't been a high priority thing.
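For readers who want to picture the "accept request, validate json, put on queue" flow, here is a minimal Python sketch. It is not Mixpanel's code: the name `handle_request` and the in-process `queue.Queue` standing in for the backend queue are assumptions made for illustration, and the 1/0 return values simply mirror the success/failure convention mentioned elsewhere in this thread.

```python
import json
import queue

# Hypothetical in-process stand-in for the real backend queue.
event_queue = queue.Queue()


def handle_request(body):
    """Validate a JSON request body and enqueue it.

    Returns 1 on success and 0 on failure.
    """
    try:
        event = json.loads(body)
    except (TypeError, ValueError):
        # Not parseable as JSON (json.JSONDecodeError is a ValueError).
        return 0
    if not isinstance(event, dict):
        # Assumption: a valid event is a JSON object, not a list or scalar.
        return 0
    event_queue.put(event)
    return 1
```

A real server would wrap this in an HTTP handler and hand the event to an external queue, which is where most of the operational complexity is likely to live.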

When this code was written, Mixpanel had zero customers and we weren't sure what we were building yet. In that regard, Erlang has been a rock. We've barely had to touch it during the rampup from 0 to thousands of requests per second.

Now that we have the manpower, and we've learned what we really need, we can rewrite it to make things easier on ourselves. If we can get acceptable performance in python, there is no reason to use erlang.

I think there's some merit to the other complaints (error codes, etc), but that's another symptom of this thing being written so long ago. We want to improve things incrementally (and backwards-compatibly) for now, but it will be dramatically simpler for us to make changes to the server now that it's written in python.

Ultimately, we have skeletons in the closet, just like the rest of you - I'm sure all of you have some bad code in production somewhere. Now we're saying "Look, we're getting rid of our skeletons!" and you guys are like "OMG WHY YOU HAVE SKELETONS" instead of "sweet, no more skeletons".

[+] rvirding|14 years ago|reply
Seriously, it would be interesting to see your code. As an Erlang inventor/developer it would be interesting to see how the language is actually used and how that relates to the problems people have.

I agree that not having Erlang competence in your company IS a good reason to change language.

[+] rednum|14 years ago|reply

  Finally, we use a few stateful, global data structures to
  track incoming requests and funnel them off to the right 
  backend queues. In Erlang, the right way to do this is to 
  spawn off a separate set of actors to manage each data 
  structure and message pass with them to save and retrieve 
  data. Our code was not set up this way at all, and it was 
  clearly crippled by being haphazardly implemented in a 
  functional style.
Seriously?! I have used only a little Erlang, but this makes no sense to me - it's like you were writing some big Java project and put everything in one huge class, with all methods and variables static. It's hard for me to imagine why and how someone would write a production Erlang app with no actors, especially some kind of server. No wonder the thing sucked in the first place.
[+] tzs|14 years ago|reply
From the article:

   Because of these performance requirements, we originally wrote the
   server in Erlang (with MochiWeb) two years ago. After two years of
   iteration, the code has become difficult to maintain.  No one on
   our team is an Erlang expert, and we have had trouble debugging
   downtime and performance problems. So, we decided to rewrite it
   in Python, the de-facto language at Mixpanel.
My first impulse would have been to have one or more team members become Erlang experts. Was that considered?
[+] plinkplonk|14 years ago|reply
"After two years of iteration (on an Erlang codebase) ...no one on our team is an Erlang expert,"

This sounds a little strange. How is this possible? High turnover on the team?

[+] mhd|14 years ago|reply
It seems in a lot of cases, Erlang is just used because of its reputation at being really good at concurrency, mostly in rather minimal API implementations -- or in other words, servers that could easily be done in almost any language that provides some decent event-handling functions. Ruby, Python, node, C/libev, etc.

So unless we're talking about thousands of lines of code, it really doesn't matter what library or language you choose for something like this. If this is your only use of Erlang, it's probably not worth it. Erlang is pretty great at building distributed, high-concurrency applications that are good at coping with errors. For one out of those three, you have plenty of other options…

[+] Aloisius|14 years ago|reply
Out of curiosity, how long do you think it takes to become an expert in Erlang?
[+] mononcqc|14 years ago|reply
“Finally, we use a few stateful, global data structures to track incoming requests and funnel them off to the right backend queues. In Erlang, the right way to do this is to spawn off a separate set of actors to manage each data structure and message pass with them to save and retrieve data.”

Nope, that’s not the right way. The way you were doing it ended up making all calls sequential and bound to single processes that could lose state. That’s not right.

The best way to do it would have been to use ETS tables (which can be optimized for either parallel reads or writes), which also allow destructive updates, for the best possible performance and memory usage. Note that you could then have used a memory-only Mnesia table (adding transactions, sharding and distribution on top of ETS) to do it.

As for string performance, I’m wondering whether you used lists-as-strings, binary strings or io-lists. This can have a significant impact on performance and memory use.

Then again, if you had a bunch of Python and no Erlang experts, I can’t really say anything truly convincing against a language switch. Go for what your team feels good with.

[+] breck|14 years ago|reply
> The biggest challenge for me was pushing the server from working 99.9% of the time to 99.99% of the time, because those last few bugs were especially hard to find.

Could you expand upon this some more? How do you know the server works 99.99% of the time (or 99.9%)? Do you run regression tests using actual past requests?

[+] sayrer|14 years ago|reply
:)

Bob's software sucks, let's switch to Bob's software.

[+] dreamdu5t|14 years ago|reply
That's pretty much all I got from that article.
[+] staunch|14 years ago|reply
It sounds like they're accepting really simple HTTP requests (event updates) and inserting a job in a queue.

Really simple + rarely changing + needs to scale to really high req/sec = perfect candidate for being written in C. Maybe as an nginx module?

[+] megaman821|14 years ago|reply
This is only true if the queue isn't the bottleneck. If the queue can only handle 2,500 req/s and the Python program can send at 3,000 req/s, what use is it writing a C program that sends at 12,000 req/s?
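The arithmetic behind that point can be sketched in a few lines of Python. The rates are the hypothetical figures from the comment above, not measurements, and `pipeline_throughput` is a name made up for this illustration; it assumes the stages run in series with no batching or buffering tricks.

```python
def pipeline_throughput(*stage_rates):
    """Sustained throughput of stages in series is capped by the slowest one."""
    return min(stage_rates)


QUEUE_RATE = 2500    # req/s the queue can absorb (figure from the comment)
PYTHON_RATE = 3000   # req/s the Python server can send
C_RATE = 12000       # req/s a hypothetical C server could send

print(pipeline_throughput(PYTHON_RATE, QUEUE_RATE))  # 2500
print(pipeline_throughput(C_RATE, QUEUE_RATE))       # 2500
```

Under these assumptions the faster sender changes nothing end to end until the queue itself is sped up.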
[+] jerf|14 years ago|reply
What was the original Erlang performance?

I mean, good enough is good enough, and local culture counts, no problem there, just curious.

[+] theclay|14 years ago|reply
This is exactly what I want to know. What were the numbers?
[+] tigerthink|14 years ago|reply
"The main difference is that eventlet can’t influence the Python runtime, but actors are built into Erlang at a language level, so the Erlang VM can do some cool stuff like mapping actors to kernel threads (one per core) and preemption. We get around this problem by launching one API server per core and load balancing with nginx."

The actor model is for concurrency, which is when your threads are communicating with one another, right? What about the task that the API server does requires inter-thread communication?

[+] j2labs|14 years ago|reply
The author is wrong about simplejson performing 10x better than the json included with python.

Here is my proof: http://j2labs.tumblr.com/post/7305664569/python-vs-javascrip...

[+] ankrgyl|14 years ago|reply
No, we ran an extensive benchmark against log data and found that simplejson was indeed 10x faster. Your benchmark assumes a different "shape" of json dictionary than ours, and I would recommend updating your methodology to use real data instead. I added ujson to our benchmark, and here are the results (seconds):

    $ python json_bench.py history.log.1
    json 106.270362854
    simplejson 11.336577177
    cjson 5.63336491585
    ujson 3.81600308418
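The `json_bench.py` script itself is not shown in the thread. A rough sketch of how such a comparison might be structured follows; it assumes one JSON document per input line, the function names are invented for illustration, and it only times libraries that expose a `loads` function and happen to be installed (cjson is left out because it uses a different entry point).

```python
import json
import time


def time_parser(loads, lines):
    """Return the seconds spent parsing every line with the given loads function."""
    start = time.perf_counter()
    for line in lines:
        loads(line)
    return time.perf_counter() - start


def available_parsers():
    """Collect loads functions for whichever JSON libraries are installed."""
    parsers = {"json": json.loads}
    for name in ("simplejson", "ujson"):
        try:
            module = __import__(name)
        except ImportError:
            continue  # Optional third-party library is not installed.
        parsers[name] = module.loads
    return parsers


def run_benchmark(lines):
    """Map each available library name to its total parse time in seconds."""
    return {
        name: time_parser(loads, lines)
        for name, loads in available_parsers().items()
    }
```

To try it on real data, read the log into a list first, for example `lines = open(path).read().splitlines()`, and pass that to `run_benchmark`. As both commenters suggest, the results are likely to depend heavily on the shape of the JSON being parsed and on the library versions in use.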

[+] martincmartin|14 years ago|reply
There's not much about "why." In fact, these are the only sentences that are at all relevant to "why":

After two years of iteration, the code has become difficult to maintain. No one on our team is an Erlang expert, and we have had trouble debugging downtime and performance problems.

Erlang is historically bad at string processing, and it turns out that string processing is very frequently the limiting factor in networked systems because you have to serialize data every time you want to transfer it. There’s not a lot of documentation online about mochijson’s performance, but switching to Python I knew that simplejson is written in C, and performs roughly 10x better than the default json library.

I was able to provide some important operations in constant time along with other optimizations that were cripplingly slow in the Erlang version.

The [Python] community is extremely active, so many of my questions were already answered on Stack Overflow and in eventlet’s documentation.

[+] carbonica|14 years ago|reply
If string processing is a bottleneck in your system, either your system isn't doing anything else interesting to take up CPU time, or you've done something very, very wrong. Serialization is a damn-near solved problem.
[+] mattdeboard|14 years ago|reply
Wow, this post made me feel like the world's most incompetent intern.
[+] hello_moto|14 years ago|reply
Don't beat yourself up.

I once met an intern who was very good at abstraction and at writing "OK"-designed OOP code (OK because it looked and sounded good minus the ability to unit-test, but other than that it was simple enough for other people to understand and quite flexible). On the flip side, he was not that good when it came to networking code (pretty much system programming stuff). He could have become good at it, but at that time, software design (in an OOP environment) was his forte.

You might have your own pluses. Besides, we don't know what the code looks like or whether what this intern wrote is the truth. If you've been in this industry long enough you'll start to take a lot of things with a large grain of salt.

[+] simplegeek|14 years ago|reply
I'm pretty sure he must have received a lot of input from his senior peers, so let's spare the kid ;)
[+] socratic|14 years ago|reply
It seems like the lesson here is that basically any language (Python, Ruby, ...) will perform about the same with non-blocking I/O.

Does this mean that Erlang and node.js are mostly compelling because of the prevalence of async versions of common libraries? Or are they not that compelling in web contexts in the first place?

[+] megaman821|14 years ago|reply
A lot of the languages will probably perform similarly on non-blocking I/O because they are all leveraging epoll (or select or kqueue) underneath it all. There is great variation, however, in how the green threads are exposed. Node.js has callbacks, Python has yields, and Erlang has messages. Some of these approaches are easier to reason about and maintain than others.

I always found Haskell's take on parallelism interesting, and maybe it is faster. In Haskell you create a unit of work called a 'spark'. You can have billions of these, they get mapped to lightweight Haskell threads (powered by epoll) and those get mapped onto OS threads.
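As a small illustration of the "Python has yields" style mentioned above, here is a toy cooperative scheduler built on plain generators. It is only a sketch of the general idea: it does no real I/O, the names `worker` and `run_round_robin` are invented for this example, and it is not how eventlet is implemented.

```python
from collections import deque


def worker(name, steps, log):
    """A task that records its progress and yields control after each step."""
    for i in range(steps):
        log.append((name, i))
        yield  # Hand control back to the scheduler.


def run_round_robin(tasks):
    """Run generator-based tasks in turn until all of them have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # Task finished; do not reschedule it.
        ready.append(task)


log = []
run_round_robin([worker("a", 2, log), worker("b", 2, log)])
print(log)  # [('a', 0), ('b', 0), ('a', 1), ('b', 1)]
```

Each task reads as straight-line code, and the explicit `yield` marks every point where another task may run, which is one reason some people find this style easier to follow than nested callbacks.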

[+] monopede|14 years ago|reply
I understand that this is meant just as an experience report, but I have to say this article didn't convince me in any way that this rewrite was a good idea. Obvious questions:

1. How does the performance of the new system compare to the old system?

2. What exactly were those maintenance issues with the Erlang server? Did just no-one in your team find the time to learn Erlang well enough? I know Erlang isn't the prettiest of languages, but async I/O isn't the only advantage of Erlang. A battle-tested concurrent runtime and built-in support for fault-tolerance are two obvious examples.

[+] cageface|14 years ago|reply
Erlang is compelling because it's been built from the ground up to support reliable distributed computing and heavily battle tested in incredibly high-volume applications. Non-blocking I/O is just the plumbing in a far more sophisticated machine.
[+] ankrgyl|14 years ago|reply
That's exactly the point we wanted to convey. Erlang and node are great, but we know Python really well and were able to write a performant server with the tools we're familiar with.
[+] Vitaly|14 years ago|reply
After 2 years in production you have no one on the inside who knows the core part of your system? Duh! Start investing time in your core technology. Blaming Erlang for poor R&D management choices is not going to fly here.
[+] gnubardt|14 years ago|reply
Do they run the message queue on the same box as the gateway server in production? If not then the test he ran isn't a direct comparison (since network latency between the app server & queue isn't accounted for). Running both of those services on the same box isn't great either, since they could slow each other down, and you lose both if the box dies.

Still, very cool, congrats ankrgyl, it's awesome to be able to write stuff like that as an intern!

[+] nivertech|14 years ago|reply
This sums it up:

    "No one on our team is an Erlang expert"
Regarding mochijson, we switched to jiffy [1] (a NIF-based native C parser).

Also, I would love to see a comparison between the 2-year-old (probably badly written) Erlang server and the new Python/eventlet server.

[1] https://github.com/davisp/jiffy

[+] nirvana|14 years ago|reply
Riak is a fairly large, open source, NoSQL database, written in erlang. I've looked at its source code on occasion knowing little about its internals, and found them to be really comprehensible. Sometimes it is shocking to see how elegant the code is.

At the same time, I have gone and looked at code I wrote back when I was first looking at erlang, that does much less and is much more verbose, confusing and sprawling.

I don't think erlang lacks maintainability. I think it just requires some discipline, like any language.

It sounds like your company has a culture of python hackers and erlang was chosen because you felt you needed to choose something "serious" for this bit of work, rather than because you loved erlang and would use erlang even if you needed to write something trivial. There's nothing wrong with that, but I don't see this article as revealing any hidden weaknesses in erlang.

Regarding the JSON parsing issue, erlang has excellent support for code written in other languages, specifically C, and you could wrap any C based JSON parser and use it, though I bet someone has already done this for you. I believed I was watching such a project on GitHub but can't for the life of me find it now.