top | item 8631022

Node.js in Flame Graphs

778 points | stoey | 11 years ago | techblog.netflix.com

240 comments

[+] ChuckMcM|11 years ago|reply
The money quote:

"We made incorrect assumptions about the Express.js API without digging further into its code base. As a result, our misuse of the Express.js API was the ultimate root cause of our performance issue."

This situation is my biggest challenge with software these days. The advice to "just use FooMumbleAPI!" is rampant, and yet the quality of the implemented APIs and the amount of review they have had varies all over the map. Consequently, any decision to use such an API seems to require that one first read and review the entire implementation of the API, otherwise you get the experience that Netflix had. That is made worse by good APIs, where you spend all that time reviewing them only to note they are well written, but each new version, which could have not-so-clued-in people committing changes, might need another review. So you can't just leave it there. And when you find the 'bad' ones, you can send a note to the project (which can respond with anything from "great, thanks for the review!" to "if you don't like it, why not send us a pull request with what you think is a better version?").

What this means in practice is that companies that use open source extensively in their operations become slower and slower to innovate, as they are carrying the weight of a thousand different systems of checks on code quality and robustness, while people using closed source will start delivering faster and faster, as they effectively partition the review/quality question to the person selling them the software and focus on their product innovation.

There was an interesting, if unwitting, simulation of this going on inside Google when I left, where people could check in changes to the code base that would have huge impacts across the company, causing other projects to slow to a halt (in terms of their own goals) while they ported to the new way of doing things. In this future world, changes like the recently hotly debated systemd change will incur costs while the users of the systems stop to re-implement in the new context, and there isn't anything to prevent them from paying this cost again and again. A particularly Machiavellian proprietary source vendor might fund programmers to create disruptive changes expressly to inflict such costs on their non-customers.

I know, too tin hat, but it is what I see coming.

[+] akkartik|11 years ago|reply
You're assuming that your closed source vendors are perfectly aligned with you. In practice they almost inevitably seem to cause capture (https://en.wikipedia.org/wiki/Regulatory_capture).

Open/closed is a red herring here. Projects slowing down as they succeed seems to be a universal phenomenon, from startups to civilizations. Specialization leads to capture. I think almost exclusively about how to fix this: http://akkartik.name/about (you've seen and liked this), http://www.ribbonfarm.com/2014/04/09/the-legibility-tradeoff

Disclosure: google employee

[+] vorador|11 years ago|reply
> What this means in practice is that companies that use open source extensively in their operations become slower and slower to innovate, as they are carrying the weight of a thousand different systems of checks on code quality and robustness, while people using closed source will start delivering faster and faster, as they effectively partition the review/quality question to the person selling them the software and focus on their product innovation.

To contrast with what you said, I've worked at Microsoft, which is almost the company that invented NIH, and we had the same problem. I think it's because, to paraphrase Alan Perlis, programming nowadays is less about building pyramids than fitting fluctuating myriads of simpler organisms into place.

[+] chx|11 years ago|reply
This is why I like choosing open source technologies with some sort of commercial support available. The best support I've ever gotten was from MySQL AB (before Sun) -- the 10k we paid them for two years and three servers was affordable back then even for a small startup. I had a MySQL engineer (if memory serves he is now a MySQL community manager at Oracle) SSH into my server 34 minutes after a desperate call.

Disclaimer: I work for a company providing commercial support (scaling) for an open source project (Drupal).

[+] nissimk|11 years ago|reply
Actually, open source allows something that is not possible with commercial software. You don't have to read all of the code and understand all of the implementation details, but you have that option if you need to. I think having the option is a great benefit as we can see from this example. When they started debugging they could reference the express source and figure out what was going on. If they were using some sort of commercial framework they would have had to either refer to the docs or call the help desk.

Just because you have the option to read the implementation details of every library and service you are using doesn't mean that you have to. You only have to learn enough about it to decide whether you think it is a good addition to your stack, and to use it to do whatever you are trying to get done. But open source gives you the ability to figure out why you're using it wrong, or how it is broken when that time comes.

If you are saying that open source is not documented well enough because the developers fall back on "check the source," that is a different argument where commercial software may be better, but this is not true for any of the more common open source software I've used.

Making a decision of which software to depend on in your application is something that is always difficult whether you have access to the source code of all of the choices or not. It's a decision that you make with limited information. You only have extensive knowledge of the tools that you already have experience with, so using alternatives is always a risk, but understanding and managing that risk is part of the developer's job.

And with regard to keeping up with changes, you can always remain on previous versions for some time to avoid the slowdown associated with shifting to new APIs.

[+] rgawdzik|11 years ago|reply
You can't misuse closed source APIs? How would you know something is O(n) without seeing the source code?
[+] runT1ME|11 years ago|reply
> What this means in practice is that companies that use open source extensively in their operations become slower and slower to innovate, as they are carrying the weight of a thousand different systems of checks on code quality and robustness, while people using closed source will start delivering faster and faster, as they effectively partition the review/quality question to the person selling them the software and focus on their product innovation.

I think... your experiences at Google have altered your world view to the point where you don't see how things are happening at other organizations. Google's monolithic codebase, where everything builds against everything else, may work(?) for them, but the alternative is disciplined module management.

You don't have to ever bump a version of working code if it's doing its job. Good open source projects should absolutely (publicly) test against performance regressions. New versions should be minor/incremental and source compatible.

I've never worked on a codebase the scale of Google's, but I fail to see how you can't mitigate your concerns, nor do I see commercial software as the solution.

[+] dantiberian|11 years ago|reply
> Consequently any decision to use such an API seems to require one first read and review the entire implementation of the API, otherwise you get the experience that Netflix had.

There's no getting around this: you or someone that you trust (not necessarily at your company) needs to read and review both the API and the implementation details of open source software that you use. Open source software isn't a hardware store you can dip into to get the latest parts that you need for your project. Adding a dependency makes your code depend on other people's code. Just like internal code should go through a code review, so should external dependencies. This also explains why, for a lot of companies, it's easier to write their own thing than to keep on top of other people's changes.

I happened to write about this yesterday which explains my thoughts further http://danielcompton.net/2014/11/19/dependencies.

[+] dpweb|11 years ago|reply
Especially in high availability/load systems like Netflix, IMO you need to reduce complexity, and that means fewer modules that depend on 69 other modules, each of which depends on 69 modules, etc. What's going on as this software proliferates is insane.

Then, you've got to be intimately familiar with every piece. There's no excuse when the source is readily available. I give these guys credit though they seem to take responsibility instead of just saying "express sucks". Some of their design choices seemed a little shaky.

[+] gcv|11 years ago|reply
This is precisely why — for some products in some industries — NIH is a reasonable strategy for writing good programs.
[+] darkandbrooding|11 years ago|reply
Please forgive me if I'm misinterpreting you, but the lament you make about open source software seems to me to be more about distributed systems. It is an interesting observation, and now that you've pointed it out I can observe the pattern at past and present employers. But I have seen that pattern on internal software, written by the company for the company. This makes me think the problem is an architectural one.

I hesitate to comment about the relative velocities of open source vs proprietary software because I do not have enough experience with commercial, third party software. My sample size is too small, but I'm inclined to agree with you.

I don't disagree with your Machiavellian conspiracy, either, but I've worked in marketing and advertising so I know that some of the villains are on the payroll. Maybe there needs to be a third category? There's open source software, written by someone who has no particular relationship to you. There's commercial software, written by someone who has a positive economic relationship with you. And then there is ... corporate?... software, written by someone who might think they have a zero sum relationship with you.

[+] collyw|11 years ago|reply
I would say this is more a matter of mature software versus new software rather than open versus closed source. Node isn't even at version 1 yet.
[+] thedufer|11 years ago|reply
> It’s unclear why Express.js chose not to use a constant time data structure like a map to store its handlers.

It's actually quite clear: most routes are defined by a regex rather than a string, so there is no built-in structure (if there's a way at all) to do O(1) lookups in the routing table. A router that only allowed string route definitions would be faster but far less useful.

I can't explain away the recursion, though. That seems wholly unnecessary.

Edit: Actually, I figured that out, too. You can put middleware in a router so it only runs on certain URL patterns. The only difference between a normal route handler and a middleware function is that a middleware function uses the third argument (an optional callback) and calls it when done to allow the route matcher to continue through the routes array. This can be asynchronous (thus the callback), so the router has to recurse through the routes array instead of looping.
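A minimal sketch of that pattern (a hypothetical `runStack`, not Express's actual source): each layer signals completion through a `next` callback, which may fire on a later tick of the event loop, so the dispatcher has to recurse via the callback rather than loop over the array.

```javascript
// Toy middleware dispatcher: each layer receives (req, next) and
// calls next() when finished -- possibly asynchronously -- which is
// why a plain for-loop over the stack would not work in general.
function runStack(stack, req, done) {
  let i = 0;
  function next(err) {
    if (err || i >= stack.length) return done(err);
    const layer = stack[i++];
    layer(req, next); // control returns here only via the callback
  }
  next();
}

// usage: these layers happen to be synchronous, but any of them
// could defer next() into an I/O callback without changing runStack
const order = [];
runStack(
  [
    (req, next) => { order.push('auth'); next(); },
    (req, next) => { order.push('log'); next(); },
    (req, next) => { order.push('handler'); next(); },
  ],
  {},
  () => console.log(order.join(' -> ')) // auth -> log -> handler
);
```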

[+] quotemstr|11 years ago|reply
Of course there's a faster way! Combine all the routes into a DFA, then run the DFA over the URL. It's guaranteed to run in constant space and O(n) (n=URL length) time! The union of any set of regular languages is itself a regular language.

You can use Ragel[1] to build your automaton.

[1] http://www.colm.net/open-source/ragel/
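Even without Ragel, the single-pass idea can be sketched in plain JavaScript by folding the route patterns into one alternation. (JS RegExp is a backtracking engine, not a true DFA, so this keeps the one-scan structure but not the DFA guarantees; `dispatch` and the route table are hypothetical.)

```javascript
// Hypothetical route table: each pattern is an anchored regex source
// string with no capture groups of its own.
const routes = [
  { pattern: '^/users/\\d+$',  handler: 'showUser' },
  { pattern: '^/posts/\\d+$',  handler: 'showPost' },
  { pattern: '^/healthcheck$', handler: 'health' },
];

// Fold every pattern into one alternation; wrapping each branch in a
// capture group lets us recover which route matched.
const combined = new RegExp(
  routes.map(r => `(${r.pattern})`).join('|')
);

function dispatch(url) {
  const m = combined.exec(url);
  if (!m) return null;
  // the first defined capture group identifies the matching branch
  const idx = m.slice(1).findIndex(g => g !== undefined);
  return routes[idx].handler;
}

console.log(dispatch('/posts/42')); // → "showPost"
```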

[+] rwaldin|11 years ago|reply
I'm surprised nobody has mentioned that express has a built in mechanism for sublinear matching against the entire list of application routes. All you have to do is nest Routers (http://expressjs.com/4x/api.html#router) based on URL path steps and you will reduce the overall complexity of matching a particular route from O(n) to near O(log n).
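What nesting buys you can be sketched without Express at all: partition routes by their leading path segment, so a lookup scans only one bucket rather than the full route list. (The `addRoute`/`findRoute` helpers and the `:param` matcher below are hypothetical and deliberately toy-sized.)

```javascript
// Buckets keyed by first path segment, mimicking what mounting a
// Router per prefix achieves in Express.
const buckets = new Map(); // 'users' -> [{ rest, handler }]

function addRoute(path, handler) {
  const [, head, ...rest] = path.split('/'); // '/users/:id' -> 'users'
  if (!buckets.has(head)) buckets.set(head, []);
  buckets.get(head).push({ rest: '/' + rest.join('/'), handler });
}

function findRoute(url) {
  const [, head, ...rest] = url.split('/');
  const tail = '/' + rest.join('/');
  for (const r of buckets.get(head) || []) {
    // toy matcher: ':name' segments match any single segment
    const a = r.rest.split('/'), b = tail.split('/');
    if (a.length === b.length &&
        a.every((seg, i) => seg.startsWith(':') || seg === b[i]))
      return r.handler;
  }
  return null;
}

addRoute('/users/:id', 'showUser');
addRoute('/posts/:id', 'showPost');
console.log(findRoute('/users/42')); // → "showUser"
```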
[+] remon|11 years ago|reply
I wonder what the thought process was behind moving their web service stack (partially?) to node.js in the first place. For a company with the scale and resources of Netflix it's not exactly an obvious choice.
[+] personZ|11 years ago|reply
Netflix seems to operate like a tech start-up that is trying to glue together a ragtag collection of often unsuitable solutions because of limited funding. It is a deeply perplexing company.

Similar is LinkedIn, as an aside -- despite being fairly formidable now, I regularly have entire feeds disappear, their caching is abhorrent, they can't markup text properly, and so on. It seems very amateur hour, yet they regularly publish "how it's done" documents that see wide applause despite often completely contradicting their prior missives.

[+] yourad_io|11 years ago|reply
What are the arguments against node.js in their use case?

Not looking to start any wars, but I was under the impression that if you know what you're doing* node.js is pretty awesome.

This particular bug had to do with a misunderstanding regarding the express API.

* for the most part: understand async and closures/memory leaks.

[+] tjholowaychuk|11 years ago|reply
I share this thought, I'm not trolling, I really believe node is a bad solution for something like Netflix.

Node has its perks but for a money making machine that relies solely on being available and providing a good customer experience, not so much.

I can't imagine the ops nightmares at that size, one buggy code path and the entire cluster could be down. These are issues that drove me away from Node to Go, in my opinion Node has way too many issues to run in money-making scenarios.

[+] vkjv|11 years ago|reply
> ...as well as increasing the Node.js heap size to 32Gb.

> ...also saw that the process’s heap size stayed fairly constant at around 1.2 Gb.

This is because 1.2 GB is the max allowed heap size in v8. Increasing beyond this value has no effect.

> ...It’s unclear why Express.js chose not to use a constant time data structure like a map to store its handlers.

It is non-trivial (not possible?) to do this in O(1) for routes that use matching / wildcards, etc. This optimization would only be possible for simple routes.

[+] tedchs|11 years ago|reply
That seems like a pretty low size to me... how are people getting around this when they need to handle >1.2GB of data on Node?
[+] herge|11 years ago|reply
> It is non-trivial (not possible?) to do this in O(1) for routes that use matching / wildcards

I'd be impressed if they did it consistently in O(1) for static routes. I think they were looking for O(log(number of different routes)) instead of O(n).
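One hedged sketch of such a hybrid: an O(1) Map hit for exact static paths, with a linear scan reserved for patterned routes only. (`route`/`lookup` are hypothetical names, and the pattern detection is deliberately crude.)

```javascript
const staticRoutes = new Map();  // exact path -> handler
const patternRoutes = [];        // { re, handler } scanned linearly

function route(path, handler) {
  // crude test for a patterned path; a real router would parse it
  if (/[:*()]/.test(path)) {
    const source = '^' + path.replace(/:\w+/g, '[^/]+') + '$';
    patternRoutes.push({ re: new RegExp(source), handler });
  } else {
    staticRoutes.set(path, handler);
  }
}

function lookup(url) {
  const hit = staticRoutes.get(url);   // O(1) fast path
  if (hit) return hit;
  const p = patternRoutes.find(r => r.re.test(url)); // O(patterns)
  return p ? p.handler : null;
}

route('/about', 'about');
route('/users/:id', 'showUser');
console.log(lookup('/about'));   // → "about"
console.log(lookup('/users/9')); // → "showUser"
```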

[+] tjholowaychuk|11 years ago|reply
Sounds like a documentation issue, or lack of a staging environment. I've written and maintained countless large Express applications and routing was never even remotely a bottleneck, thus the simple & flexible linear lookup. I believe we had an issue or two open for quite a while in case anyone wanted to report real use-cases that performed poorly.

Possibly worth mentioning, but there's really nothing stopping people from adding dtrace support to Express, it could easily be done with middleware. Switching frameworks seems a little heavy-handed for something that could have been a 20 minute npm module.

[+] _Marak_|11 years ago|reply
I read:

"This turned out to be caused by a periodic (10/hour) function in our code. The main purpose of this was to refresh our route handlers from an external source. This was implemented by deleting old handlers and adding new ones to the array"

refresh our route handlers from an external source

This is not something that should be done in a live process. If you are updating the state of the node, you should be creating a new node and killing the old one.

Aside from hitting a somewhat obvious behavior by messing with the state of Express in a running process, once you have introduced the idea of programmatically putting state into your running node you have seriously impeded the ability to create a stateless, fault-tolerant distributed system.
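The failure mode quoted above reduces to a few lines. This is only a guess at the shape of the bug, not Netflix's code: a periodic "refresh" that appends replacement handlers without actually clearing the stale ones, so the array the router scans grows without bound.

```javascript
const handlers = []; // the array the router iterates per request

function refreshHandlers(newHandlers) {
  // intended: handlers.length = 0;  // drop stale entries first
  handlers.push(...newHandlers);     // buggy: only ever appends
}

// three "hourly" refreshes of the same two routes
for (let i = 0; i < 3; i++) {
  refreshHandlers(['staticHandler', 'apiHandler']);
}
console.log(handlers.length); // → 6, not the expected 2
```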

[+] emeraldd|11 years ago|reply
When I concluded what they had to be doing and then read the actual confirmation of it, I was somewhat shocked. Why on Earth would you want to programmatically recreate the routes in an express app?!?!? It would be really interesting to see a write-up on what/why they think this kind of behavior is needed in the first place ....
[+] TheLoneWolfling|11 years ago|reply
> benchmarking revealed merely iterating through each of these handler instances cost about 1 ms of CPU time

1ms / entry? What is it doing that it's spending 3 million cycles on a single path check?

[+] clebio|11 years ago|reply
> I can’t imagine how we would have solved this problem without being able to sample Node.js stacks and visualize them with flame graphs.

This has me scratching my head. The diagrams are pretty, maybe, but I can't read the process calls from them (the words are truncated because the graphs are too narrow). And I can't see, visually, which calls are repeated. They're stacked, not grouped, and the color palette is quite narrow (color brewer might help here?).

At least, I _can_ imagine how you could characterize this problem without novel eye-candy. Use histograms. Count repeated calls to each method and sort descending. Sampling is only necessary if you've got -- really, truly, got -- big data (which Netflix probably does), but I don't think the author means 'sample' in a statistical sense. It sounds more like 'instrumentation', decorating the function calls to produce additional debugging information. Either way, once you have that, there are various common ways to isolate performance bottlenecks. Few of which probably require visual graphs.
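The histogram alternative is simple enough to sketch (a hypothetical `histogram` helper; real profilers sample whole stacks, while this just tallies leaf frames).

```javascript
// Count how often each sampled call site appears, then sort
// descending -- a flat, text-friendly alternative to a flame graph.
function histogram(samples) {
  const counts = new Map();
  for (const frame of samples) {
    counts.set(frame, (counts.get(frame) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// usage with made-up sampled frames
const samples = ['matchRoute', 'matchRoute', 'parseUrl', 'matchRoute', 'gc'];
console.log(histogram(samples)[0]); // → [ 'matchRoute', 3 ]
```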

There are also various lesser inefficiencies in the flame graphs: is it useful (non-obvious) that every call is a child of `node`, `node::Start`, `uv_run`, etc.? Vertical real-estate might be put to better use with a log scale? Etcetera, etc.

[+] donavanm|11 years ago|reply
> The diagrams are pretty, maybe, but I can't read the process calls from them (the words are truncated because the graphs are too narrow).

Flame Graphs provide SVGs by default. You should be able to zoom if your browser supports it. The current version also supports "zooming" in to any frame in the stack, resetting that frame as the base of the display. Also, WRT the base frames of 'node' et al., it's because Flame Graphs are a general-use tool for stack visualization; it might be 'main' for a C program, or the scheduler when looking at a whole system.

> They're stacked, not grouped, and the color palette is quite narrow (color brewer might help here?).

Colors by default have no meaning and the palette is configurable. The current lib can also assign colors by instruction count/ipc and width by call count, if you have access to that.

> Sampling is only necessary if you've got -- really, truly, got -- big data (which Netflix probably does), but I don't think the author means 'sample' in a statistical sense.

It is sampling. Flame graphs are typically used with something like perf/dtrace/oprofile, which dumps stacks at a couple hundred to a few thousand hertz. Actual call tracing is (typically) not feasible for large/prod stacks.

[+] jasonkester|11 years ago|reply
You're looking at a screenshot. The actual diagram isn't static. Hover over it and it will expand each box with all the info you need to see what was run, what called it, etc.
[+] drderidder|11 years ago|reply

  > our misuse of the Express.js API was the 
  > ultimate root cause of our performance issue
That's unfortunate. Restify is a nice framework too, but mistakes can be made with any of them. Strongloop has a post comparing Express, Restify, hapi, and LoopBack for building REST APIs, for anyone interested. http://strongloop.com/strongblog/compare-express-restify-hap...
[+] wpietri|11 years ago|reply
From the article:

> What did we learn from this harrowing experience? First, we need to fully understand our dependencies before putting them into production.

Is that the lesson to learn? That scares me, because a) it's impossible, and b) it lengthens the feedback loop, decreasing systemic ability to learn.

The lesson I'd learn from that would be something like "Roll new code out gradually and heavily monitor changes in the performance envelope."

Basically, I think the approach of trying to reduce mean time between failure is self-limiting, because failure is how you learn. I think the right way forward for software is to focus on reducing incident impact and mean time to recovery.

[+] akkartik|11 years ago|reply
Without over-training on this one incident, and without guidance on how to get from here to there (I'm still working on that):

1. Don't get suckered by interfaces, share code. If you create code for others to share ("libraries"), stop trying to hide its workings.

2. You don't have to learn how everything works before you do anything. But you should expect to learn about internals proportional to the time you spend on a subsystem. Current software is too "lumpy" -- it requires days or months of effort before yielding large rewards. The first hour of investigation should yield an hour's reward.

3. "Production" is not a real construct. There will always be things that break so gradually that you won't notice until they've gone through all your processes. Give up on up-front prevention, focus instead on practicing online forensics. And that starts with building up experience on your dependencies.

More elaboration: http://akkartik.name/post/libraries2

My attempt at a solution: http://akkartik.name/about

My motto: reward curiosity.

[+] quaunaut|11 years ago|reply
> I think the right way forward for software is to focus on reducing incident impact and mean time to recovery.

So in this case, guarantee you have a strong means of evaluating performance and maybe even include it by default just to be sure.

[+] forrestthewoods|11 years ago|reply
If I had to pick one line to highlight (not to criticize, but was a wise lesson worth sharing) it would be this one:

"First, we need to fully understand our dependencies before putting them into production."

[+] gdulli|11 years ago|reply
In my experience developers constantly overestimate the gain of using a new dependency and underestimate the amount of effort it will take to sufficiently understand it. (Or fail to make the effort, not understanding the risks.)

This is why developers without significant experience should not be making decisions about the tech stack.

[+] _RPM|11 years ago|reply
[Subjective] Not to criticize the Express.js code base, but have you tried reading it? It is very complicated and there are a bunch of clever things going on. I think it could have been written simpler and easier to understand. A problem with frameworks is that they are written by people who want to show off how clever they are.
[+] augustl|11 years ago|reply
A surprising number of path recognizers are O(n). Paths/routes are a great fit for radix trees, since there are typically repetitions, like /projects, /projects/1, and /projects/1/todos. The performance is O(log n).

I built one for Java: https://github.com/augustl/path-travel-agent
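A segment-trie version of this idea fits in a few lines of JavaScript (hypothetical `insert`/`match` helpers, not path-travel-agent itself): shared prefixes are stored once, so lookup cost tracks path depth rather than route count.

```javascript
// Trie node: literal children by segment, one wildcard child for
// ':param' segments, and an optional handler at the node.
function makeNode() {
  return { children: new Map(), param: null, handler: null };
}
const root = makeNode();

function insert(path, handler) {
  let node = root;
  for (const seg of path.split('/').filter(Boolean)) {
    if (seg.startsWith(':')) {
      node.param = node.param || makeNode();
      node = node.param;
    } else {
      if (!node.children.has(seg)) node.children.set(seg, makeNode());
      node = node.children.get(seg);
    }
  }
  node.handler = handler;
}

function match(url) {
  let node = root;
  for (const seg of url.split('/').filter(Boolean)) {
    node = node.children.get(seg) || node.param; // literal beats param
    if (!node) return null;
  }
  return node.handler;
}

insert('/projects', 'listProjects');
insert('/projects/:id', 'showProject');
insert('/projects/:id/todos', 'listTodos');
console.log(match('/projects/1/todos')); // → "listTodos"
```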

[+] degobah|11 years ago|reply
tl;dr:

* Netflix had a bug in their code.

* But Express.js should throw an error when multiple route handlers are given identical paths.

* Also, Express.js should use a different data structure to store route handlers. EDIT: HN commenters disagree.

* node.js CPU Flame Graphs (http://www.brendangregg.com/blog/2014-09-17/node-flame-graph...) are awesome!

[+] bcoates|11 years ago|reply
It's not just the extra lookups -- the static middleware in Express is deceptively dog-slow. For every request it processes, it stats every filename that might satisfy the URL. This results in an enormous amount of useless syscall/IO overhead. This bit me pretty hard on a high-throughput webservice endpoint with an unnoticed extra static middleware. I wound up catching it with the excellent NodeTime service.

Now that I look at it, there's a TOCTOU bug on the fstat/open callback, too: https://github.com/tj/send/blob/master/index.js#L570-L605

This should be doing open-then-fstat, not stat-then-open.

[+] jaytaylor|11 years ago|reply
I am upset that the title has been changed from "Node.js in Flames". Which is not only the real title of the article, but also a reasonable description of what they've been facing with Node.

#moderationfail

[+] ajsharma|11 years ago|reply
This is the first I've heard of restify, but it seems like a useful framework for the main focus of most Node developers I know, which is to replace an API rather than a web application.