top | item 13930476

Performance bugs – the dark matter of programming bugs

129 points | Ono-Sendai | 9 years ago | forwardscattering.org

107 comments

[+] dahart|9 years ago|reply
> Performance bugs are out there, lurking and unseen

Is it a bug if it's unseen? Code that runs slower than it could in theory is literally everywhere. But is it worth the time it takes to fix, if you aren't experiencing any delays?

I love the way Michael Abrash talks about optimizing. He emphasizes and re-emphasizes that optimization is for people, not computers, and that if you aren't experiencing a delay, you're done.

"In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless.

"Notice that the above definition most emphatically does not say anything about making the software as fast as possible. It also does not say anything about using assembly language, or an optimizing compiler, or, for that matter, a compiler at all. It also doesn’t say anything about how the code was designed and written. What it does say is that high-performance code shouldn’t get in the user’s way—and that’s all."

http://www.jagregory.com/abrash-black-book/#the-human-elemen...

[+] shanemhansen|9 years ago|reply
User focused metrics:

1. It's a user-facing bug if I have to purchase a bigger laptop because someone decided to get the smallest element of a list by typing list.sort()[0].

2. It's a user-facing bug if my app heats up someone's phone, drains the battery, and leaves them stranded without a way to get home safely on a Friday night.

3. It's a business-facing bug if my AWS bill is in the hundreds of thousands per month when I could be paying a mere tens of thousands. (This is more common than you'd think.)

4. It's a user-facing bug if a sudden increase in traffic (due to my product being featured somewhere) takes down my services and results in lost sales and brand damage.

I understand your point; it's not necessarily a problem if a backend feed takes 45s instead of 30s, or if the text input subroutine for an interactive app allocates.

As far as I can tell as a software developer and daily computer user, people aren't erring on the side of making their software too fast.

[+] vilya|9 years ago|reply
I think that was true back when he wrote those words, but less so now. Even if you're not noticing a delay, code that completes more quickly is generally code that uses less power, and you can still notice the effect on your phone's battery life - or on the size of your monthly bill from your data center.
[+] datenwolf|9 years ago|reply
> I love the way Michael Abrash talks about optimizing. He emphasizes and re-emphasizes that optimization is for people not computers, and that if you aren't experiencing a delay, you're done.

In the age of battery-powered computers, CPU cycles spent equate to energy consumed, which equates to reduced battery life.

Even if your program does not impact the user's experience of runtime behaviour (latencies, delays), it still impacts the user's experience of their device's recharge cycle.

Also, spinning fans generate noise, which is just as much of a nuisance as a laggy program.

[+] pdpi|9 years ago|reply
"Unseen" is relative

A performance issue that is hard to detect at 10 rps becomes an issue when you're trying to push 1000 rps. A performance issue in a mobile app (or a desktop app on a laptop) might be undetectable from a responsiveness or frame rate point of view, but burn through your battery.

If you count memory footprint as a performance issue, those don't become apparent until you try to run more things concurrently, and find yourself thrashing when memory should be plentiful.

[+] izacus|9 years ago|reply
Performance issues cost you money: you're losing users to bounce rate (there are studies showing that pages taking more than 300-500ms or so to load significantly increase your bounce rate), you're paying huge prices for your cloud hosting, or you're burning your users' batteries and thus losing them as well.

The issue is that smart design usually avoids significant performance issues, while design that doesn't consider performance at all ("we'll optimize it later") usually has to be scrapped and rewritten due to how many systemic issues accumulate over time. When you hit a scaling wall, it's usually orders of magnitude more expensive to rewrite the application than it would have been to spend a bit of time getting the basics right before you started.

[+] btilly|9 years ago|reply
Meh, there is a balance.

A performance issue that isn't an issue when doing a single thing can become huge when you're trying to automate doing it many thousands of times. On the other hand, there are few better ways to obfuscate a code base than trying to optimize for performance.

The rule of thumb is: don't do anything stupid, but don't particularly worry about efficiency until there is a real need. And then, as Knuth advised, start with a profiler, because the performance problem probably isn't where you think it is.

[+] 0x07c0|9 years ago|reply
>"In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless.

And then define "pointless". To make a quantified decision about optimization you should use a performance model; a common one is the roofline model. https://en.wikipedia.org/wiki/Roofline_model
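For reference, the roofline bound on a kernel with arithmetic intensity I (flops per byte moved), given peak compute P and peak memory bandwidth B, is:

```latex
\text{Attainable FLOP/s} = \min\left(P,\; I \times B\right)
```

Below the ridge point (where the two terms cross) the kernel is memory-bound; above it, compute-bound - which tells you whether further optimization is even possible, or pointless.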

[+] craigvn|9 years ago|reply
> Is it a bug if it's unseen?

It's not a bug, it's a feature. Gives you time to get coffee while the page loads.

[+] btilly|9 years ago|reply
I have seen too many of these to keep track of.

But I will never forget once having to work with some Windows software written by someone else. He decided to parse a CSV file by reading the whole thing into memory, and then literally chopping off one character at a time, copying the rest of the string each time. It turned what should have been an instantaneous import of a fairly small file into a 20-minute ordeal. Simply reading in one line at a time would have been a massive speedup.

I did not have access to change it. I asked him to fix it. I begged him to fix it. His response was to shrug and say, "Who cares? This runs as part of a batch process with nobody in front of it. It works fine." That may be, but I cared because I had to sit through it repeatedly while debugging certain misbehavior that arose in said batch process.

Years later I was talking to an ex-coworker who stayed at said company after I left. He reported that the original developer eventually had to make a fix to said batch process. While doing so he fixed the performance bug, and was an instant hero among a number of other people who had to sit through the same import repeatedly over time.

Yeah... it turns out that he had made the fix I'd originally asked for. And nobody realized that he had caused the problem originally AND refused to fix it.

That's one developer that I'm glad I will never have to work with again...

[+] dahart|9 years ago|reply
That would be pretty frustrating and annoying!

This reminds me of a story I heard about a lead game programmer who made a habit of pre-allocating an unused megabyte at the beginning of a project. The last week before ship, when the game would crash out of memory often and everyone was freaking out and trying to squeeze every last byte out of their code, he'd comment out the allocation nobody had noticed, free up a megabyte, and be the hero who made the game fit.

[+] wolfgang42|9 years ago|reply
Quote from the linked article by John Carmack[1]:

> The fly-by-wire flight software for the Saab Gripen (a lightweight fighter) went a step further... Control flow went forward only. Sometimes one piece of code had to leave a note for a later piece telling it what to do, but this worked out well for testing: all data was allocated statically, and monitoring those variables gave a clear picture of most everything the software was doing.... No bug has ever been found in the “released for flight” versions of that code.

When I started writing Arduino code, I independently re-invented this technique. I write my code with only fixed-size loops and no delay() calls (basically cooperative multitasking), making heavy use of finite-state machines to keep track of what should happen when.

This style of code takes some getting used to, but its advantages are enormous: I see other people using delays and busy-loops and then struggling to get interrupts working to keep their device responsive, whereas I have a clean linear set of logic that can be easily understood and is free of strange race conditions and the like, and is guaranteed to keep responding to events.

This style also makes it easy to literally eyeball performance: attach a bicolor LED (one that has e.g. red and green LEDs in the same package) and tell it to toggle colors once per loop. If the main loop is taking more than 8 milliseconds or so the LED will start to flicker visibly; if something pauses the LED will go solid and it's obvious the main loop is frozen.

[1] http://number-none.com/blow/blog/programming/2014/09/26/carm...

(Edits for clarity and writing style.)

[+] dahart|9 years ago|reply
One of the interesting side effects of this style is that you sometimes choose to let code run slower than it could, and choose to use more memory than you need.

The point is (as you say) to make all loops & allocation as predictable as possible, so when you make them fixed size, you have to fix it on the worst case. Ideally there is no worst case, and the amount of work you do per frame or cycle is constant. But for anything non-trivial, chances are fixed allocations & loops aren't always needed, even though they're extremely useful.

This does exactly what you say, in my experience: it makes performance eyeball-able, you can see where problems are. Code becomes so easy to reason about that debugging is easier and time is saved.

[+] RangerScience|9 years ago|reply
I would like to see more of this style of code (specifically in the Arduino setting) as I'm having trouble seeing what you're up to from here. Is some place I can read more - ideally, actual code?
[+] z3t4|9 years ago|reply
Thanks for sharing the link. In JavaScript you can put functions inside functions, so if a function is only used once you can put it where it's used, which has the same effect as inlining it. But you also get to have your cake and eat it, since you keep the advantages of using functions.

I like the advice of setting a breakpoint and then stepping through all the code for, say, a frame. If what actually happens is way off from how the code is structured, your team will have problems comprehending what the code does, which will make it very hard to work with.

[+] rosshemsley|9 years ago|reply
One of my favourite examples:

A friend once sped up some code 90% by changing a

    list.size() == 0

to

    list.empty()

Yes, std::list::size() can be _linear_ in C++98.

http://www.cplusplus.com/reference/list/list/size/

[+] logophobia|9 years ago|reply
std::list is a linked list, so it makes sense that size() is linear. Only if you store the size separately do you get constant behavior (which C++11 enforces).

Your friend probably could've sped up his code even more by using a std::vector. Linked lists have very bad processor cache behavior (they're not allocated in contiguous memory); there are not a lot of things a list does better than a vector.

Only if you prepend a lot, or insert in the middle of the list, might it have some advantages over a vector; even then, there are often better alternatives.

[+] johnfn|9 years ago|reply
What a fun example.

I once had some code that I managed to speed up around 90%. It was returning a massive data structure that another function would then operate on. I sped it up by removing the data structure entirely, and instead adding a callback that would be called for each element of the data structure.

It's not always possible to do, and half the time it doesn't even help, but every now and then it's a good way to optimize.

[+] ptx|9 years ago|reply
One recent example of this is Synapse, the Matrix messaging server.

Synapse was generally considered to be inefficient and a huge memory hog, so they have been experimenting with various ways to rewrite it in Go, saying[1] at the end of last year:

"Synapse is in a relatively stable state currently ... we’re starting to hit some fundamental limitations of the architecture: ... python’s single-threadedness and memory inefficiency; ...; the fact the app papers over SQL problems by caching everything in RAM (resulting in synapse’s high RAM requirements); ..."

But in January[2] it was discovered, during the investigation of another problem, that the vast majority of the memory usage was due to a bug:

"It’s also revealed the root cause of why Synapse’s RAM usage is quite so bad – it turns out that it actually idles at around 200MB with default caching, but there’s a particular codepath which causes it to spike temporarily by 1GB or so – and that RAM is then not released back to the OS."

I wonder if they would still have started the Go rewrite if it hadn't been for that bug. (It seems the new implementation has a better architecture and is substantially faster, so I guess it's still a good idea, though maybe not as urgent as before?)

[1] https://matrix.org/blog/2016/12/26/the-matrix-holiday-specia...

[2] https://matrix.org/blog/2017/01/06/synapse-0-18-6-is-out-ple...

[+] Namrog84|9 years ago|reply
This is a great example of why you can't take perf-related issues intuitively or at face value. Profiling and in-depth analysis are always needed. I wonder if this could have been spotted earlier if a better analysis had been done.
[+] Arathorn|9 years ago|reply
We would absolutely still be doing the Go rewrite: while fixing that bug improved the memory usage, it didn't help actual performance much. Meanwhile Dendrite (the rewrite) is currently benchmarking two orders of magnitude faster than Synapse. Unfortunately Synapse's architecture simply has many design inefficiencies :)
[+] BurningFrog|9 years ago|reply
So the analysis of why it was slow was just informed guesswork.

One thing I keep learning again and again: You never know what the performance problem is until you've measured it.

[+] bluetomcat|9 years ago|reply
Back in 2001, Joel Spolsky coined "Schlemiel the Painter's algorithm" to illustrate the same class of inefficient programming techniques: https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Pai...

His particular example was the standard C library function strcat, where each new invocation traverses the enlarged string over and over from the beginning in order to find the terminating NUL character and append the new part.

[+] bastijn|9 years ago|reply
I'd define a special category for distributed-system performance bugs. The fact that proper distributed tracing is not yet standard (especially outside of cloud environments) is a real pain here. The Google white paper on Dapper [0] and its many implementations [1][2][3] are appearing in one form or another, but they usually target web environments. That means that for my legacy stack, with multiple nodes connected via networks of different kinds, I have to profile on multiple systems, aggregate different log types manually, think about time sync (not everything logs UTC...), and so on. To be honest, we just reached the point where we are trying to get a European subsidiary project rolling to address this problem. The time lost tracking down unexplainable gaps between a request and the first displayed result in the app is weeks, and on multiple occasions months, just because we have no easy way to profile our distributed ecosystem.

After identifying who is to blame: network, db nodes, processing nodes, application logic, cache (invalidation); we can talk about fixing.

For fixing of course it would be great if you could extract a meta model of the application and another of the hardware. Changes to those models and some nice prediction of impact would be the ideal end goal. Or, if we are wishing anyway, some nice DSE on those models to optimize would be even better.

[0] https://research.google.com/pubs/pub36356.html [1] http://opentracing.io/ [2] http://lightstep.com/ [3] https://aws.amazon.com/xray/

[+] pritianka|9 years ago|reply
OpenTracing has libraries for C++ (though it needs work tbh) and Java which can be used for mobile. The idea is to cover your entire stack ...
[+] glangdale|9 years ago|reply
Occasionally I like the idea of producing some sort of dynamic instrumentation code (I privately think of it as "shitgrind") that detects dynamic code that is doing redundant work and flags it. Of course, this doesn't tell you that the code is useless (i.e. I might have to initialize something that subsequently isn't used down some deep and not-anticipatable-at-compile-time set of branches), but it might provide hints that I'm doing a lot of writes that never match up with reads, or that the writes/calculations that I'm doing are going into places that already have that value. It would be incredibly slow but there are plenty of uses I can imagine where you could detect performance problems in a toy version of the problem (maybe running 100-1000x slower) that are the same problems that you would have in the real version of the problem at full scale. Would be a fun project, IMO.
[+] AstralStorm|9 years ago|reply
Assembly instruction level profiler? Sounds like an extension to callgrind.
[+] unsoundInput|9 years ago|reply
Even though I agree that >>insert method of std::map is roughly 7 times slower than it should be<< is bad, these kinds of problems are not too hard to find and solve if they are actually problematic for your software.

The most problematic performance issues I've come across were usually bad/premature optimizations that were not (correctly) validated against a simpler implementation as a performance baseline. Things like parallelism (multi-threading, web workers) or caching can absolutely tank performance if not done correctly. Plus they usually tend to make stuff more complex and bug-prone.

[+] pjc50|9 years ago|reply
This is one of the few areas where adding more levels of abstraction cannot help you fix the problem. It's remarkably easy for some library function or low level of the program to have O(n) performance that ends up getting called in a loop with the same input or output.
[+] wtetzner|9 years ago|reply
> This is one of the few areas where adding more levels of abstraction cannot help you fix the problem.

I'm not sure that's true. It is of course true that adding levels of abstraction often degrades performance, but I don't think it's necessary.

In fact, programming languages themselves are abstractions. It's pretty well agreed at this point that a good C compiler will generate faster code than someone writing assembly. It's of course possible to write code that's at least as fast in assembly, but in practice, nobody would do it, because the program would be unmaintainable.

You can make abstractions fast, and one of the benefits of doing so is that you can make your abstraction be the easy way of doing things. That means, by picking the easiest solutions, the people on your team are also using an efficient solution.

[+] gmfawcett|9 years ago|reply
It's true that libraries can cause performance issues, but it's not a given. Using template metaprogramming libraries (like STL and Boost in C++) can result in clean, high-level code which has very good performance. Often the "loops" and other high-level constructs are optimized right out of the program.

Also, link-time and whole-program optimization can remedy some performance issues across modules/libraries. [edit: not algorithmic problems, of course, LTO isn't going to rewrite your algorithms for you.]

[+] Const-me|9 years ago|reply
Modern hardware is incredibly complex. It’s nearly impossible to reliably predict the performance outcome of the changes. Personally, when I work on optimizing performance of my code, I throw away like 25% of my changes after profiling.

Good architecture with well-defined levels of abstractions makes it easier to change algorithms, to replace implementation of algorithms with different ones, and so on. Even though abstractions are not always free performance-wise, often they do help fixing performance problems, albeit not directly.

[+] sbov|9 years ago|reply
Ex post facto adding, no. But if the usage of the function corresponds to a concept within your codebase, and if you were able to abstract the usage of that function away, it would obviously help you fix the problem.
[+] toast0|9 years ago|reply
Abstraction in software almost always involves using something that does more things than you actually need; of course it's not going to help with performance. Sometimes it means calculating related results one at a time, when they would be better calculated at the same time in a single pass, because the library to do the calculation can do either calculation on an array, but only one at a time.
[+] johnfn|9 years ago|reply
Unless you add those layers of abstraction into the compiler for the programming language, of course! ;-)
[+] hacker_9|9 years ago|reply
Depends on the problem? Abstractions can usually make assumptions about the data, and thus be faster.
[+] ericb|9 years ago|reply
I'm open sourcing a Java Selenium library with page performance assertions built in as part of my startup. It will let you do things like:

   test.config.setSLAMaxTime(2000); // set time in MS

   assertSLAMet(); // add this to any page

It will also allow you to assert on specific URLs and window timings. You can then use these assertions to pass/fail builds. I also have some other nifty ideas on preventing performance regressions that I'm building into this (smart benchmarking vs. previous runs).

If anyone is interested, I'll be announcing it here when we open source it:

http://signup.browserup.com/

[+] orf|9 years ago|reply
> What if your Selenium tests had performance asserts for CI/CD

Isn't this just... a single function? Like, how the heck do you build a startup around that. We do this as part of our Python selenium tests, I didn't realize it wasn't built in to be honest.

[+] kazinator|9 years ago|reply
It's only a bug if the stakeholders that surround the program agree that there should be (or already is) in place a performance requirement which is not being met.

Opportunities to uncover performance that could be improved and impose requirements against it are out there, lurking unseen. Only problem is you have to convince people to care on a case-by-case basis.

[+] tcopeland|9 years ago|reply
At one point I was working on a utility to detect suboptimal sequences of method calls; the canonical example is using "[1,2,3].select {|x| x > 1 }.first" rather than "[1,2,3].detect {|x| x > 1 }". These can be performance issues, although the bigger win is in readability and communicating developer intent. More details and examples here if you're interested:

https://thomasleecopeland.com/2014/10/22/finding-suboptimal-...

I haven't worked on it much recently though because the problems it found weren't that significant. But I like the idea of runtime analysis for finding issues, especially in a dynamically-typed language.

[+] bitwize|9 years ago|reply
An internet commenter points out the obvious - that performance matters in computing. Hackernews nods in agreement, then goes back to coding Ruby on Rails applications in an editor implemented in HTML5 and JavaScript that consumes 13% CPU just to make the cursor blink.
[+] luckydude|9 years ago|reply
"It takes an experienced programmer, with a reasonably accurate mental model of the problem and the correct solution, to know how fast the operation should have been performed, and hence if the program is running slower than it should be."

I love this sort of approach to performance. If you know the hardware, the software, and the application, you can predict how fast it should go. That's surprisingly rare, in my experience, sadly. Around the time I wrote lmbench I was sitting in a lot of meetings at Sun where people were saying "we should do this," and I could do a mental estimate of how fast it would go (I had memorized most of the stuff that lmbench measured, so I knew how many packets/sec we could do, how many IOPS, how much memory bandwidth we had, etc.). You'd be amazed at the number of times "architects" were proposing to build something that couldn't possibly work.

It's pretty systems-y and not that common in programmers, what with today's frameworks and all. But there is still some value in being able to predict how fast something should run. I love it when I see people doing that with the needed knowledge to be accurate, very fun to watch.

[+] makecheck|9 years ago|reply
Performance sort of has a paradoxical "coffee test": something that takes a few seconds is aggravating and you will wait for it, but something that routinely takes 5 minutes will make you go get coffee or switch tasks completely.

With enough alternate tasks, waits are absorbed. Oddly, overall throughput can be worse with sporadic tiny delays that are technically indicative of a fast task.

You don't just want fast operations, you want programs that can be smart about gathering unavoidable delays into chunks. This can even pay dividends when consuming resources, as the gathered steps may use less (e.g. laptop wakes dormant hardware just once to perform several grouped tasks, instead of repeatedly over a short time frame).

[+] base698|9 years ago|reply
In practice the majority of web app performance problems have more to do with the latency of IO. People seem to forget that 4ms may seem fast to humans, but it's glacial to computers. Add 100 sequential 4ms calls to the DB and all of a sudden your code is waiting nearly half a second, even if it does nothing else.
[+] AstralStorm|9 years ago|reply
Bonus points: in a real real-time system (anything with a GUI) you might have only 1 ms or less to handle an event in a bigger stream, e.g. dragging. This is why so much UI is, or should be, really asynchronous.
[+] glangdale|9 years ago|reply
We run into this a fair bit of the time with Hyperscan. One particularly noxious anti-pattern that can arise is the "prefilter that isn't" anti-pattern: specifically, a multiple stage pipeline where the initial stage is meant to filter out some portion of false positives when feeding to a more expensive check, but isn't doing its job.

The painful thing is that the expensive check downstream winds up doing everything, but the code still "works" because the expensive check is the only necessary part (well, as long as the pre-filter isn't totally hosed and isn't producing false negatives).