top | item 21743424

O(n^2), again, now in Windows Management Instrumentation

501 points | harikb | 6 years ago | randomascii.wordpress.com | reply

224 comments

[+] ww520|6 years ago|reply
Ah, this brings back memories of my own O(n^2) fiasco. Years ago I wrote a low-level file system storage driver. In the caching layer, I needed to sort entries by LRU time for cache eviction. It was a minor thing that didn't run often, and I wanted to move quickly. Since it was low-level kernel-mode code, I also wanted it to be simple and correct. The number of cache entries wasn't big, so bubble sort was adequate for small N: quick to get the job done and simple enough to ensure correctness. Plus I could replace it later if needed. So it was done and forgotten.

Things worked fine and performance was good. Then one day Windows Explorer suddenly hung at 100% CPU for a couple of seconds. This was one of the worst kinds of bugs: there's no crash to pinpoint the problem, and things still work most of the time, just slowing down intermittently. Luckily I was able to catch a slowdown and deliberately crash the process in time. The call trace stopped in the bubble sort function. I immediately kicked myself: it was the classic O(n^2) blowup. The cache had scaled up to a couple thousand entries, and the exponential O(n^2) blowup to tens of millions of iterations was having a real impact. I switched to merge sort and performance was back to normal.

Edit: I picked merge sort because its worst case is O(n log n), unlike quicksort, whose worst case is O(n^2). Once burnt, I needed to be extra careful with edge cases.
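
As a rough illustration of the trade-off (the entry layout, sizes, and timestamps here are invented), here is the difference between the two choices in Python, using the built-in sort (Timsort, whose worst case is also O(n log n)) as a stand-in for merge sort:

```python
import random

def bubble_sort(items, key):
    """O(n^2) comparison sort: fine for a handful of entries,
    ruinous once the cache grows to thousands of them."""
    items = list(items)
    n = len(items)
    for i in range(n):
        for j in range(n - 1 - i):
            if key(items[j]) > key(items[j + 1]):
                items[j], items[j + 1] = items[j + 1], items[j]
    return items

# Cache entries as (entry_id, lru_timestamp) pairs.
entries = [(i, random.random()) for i in range(2000)]

# ~2,000 entries means ~4,000,000 comparisons for bubble sort, versus
# roughly 2,000 * 11 for an O(n log n) sort such as merge sort.
by_lru = sorted(entries, key=lambda e: e[1])  # evict from the front
```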

[+] hinkley|6 years ago|reply
I had the honor of filing a bug against Together's design tool. We had a monorepo project with a few dozen distinct compilation units in it, and some people had complained that the more modules you added to the project, the longer it took to open. One person was up to half an hour, and he didn't even have the whole thing.

I had a slow week babysitting something else, so I just let it chew away in the background while doing other things like documentation and requirements work, noting down the time, adding one more module, and running it again. The last run I was willing to do took 18 hours to load. It was still running when I got back to the office the following day.

To this day, I can't recall ever seeing any production performance bug with greater than n cubed complexity. This bug progressed at n to the fifth. Truly, a thing of singular beauty.

Thankfully they had a workaround that dropped it to something like n^2 (30 minutes became less than 5) and a bug fix not too long after.

[+] vecter|6 years ago|reply
n^2 is polynomial, not exponential. If it was truly exponential (which a sort should never be unless you've decided to solve an arbitrary 3-SAT problem while sorting a list), then it would've blown up in your face much sooner.
[+] mark-r|6 years ago|reply
I once coded a bubble sort in an application where I expected the worst case to be n=4. Later that assumption was invalidated, but I had long ago moved on to something else. I hope the person who inherited my code doesn't still hate me.
[+] harikb|6 years ago|reply
A gem

> Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there

[+] rcthompson|6 years ago|reply
I think the other reason O(n^2) is a "sweet spot" is that it often arises from one O(n) algorithm calling another O(n) algorithm in each iteration, resulting in O(n^2) overall. Very often it's ultimately because the inner O(n) algorithm should have been implemented as O(1), but nobody bothered because it was never intended to be called in a loop.
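A common Python instance of that pattern (the names here are illustrative): deduplicating with a membership test against a list puts an O(n) scan inside the loop, while the same logic against a set keeps the inner lookup O(1).

```python
def dedupe_quadratic(items):
    """O(n^2) overall: the 'in' check scans the list on every iteration."""
    seen = []
    for x in items:
        if x not in seen:   # O(n) membership test on a list
            seen.append(x)
    return seen

def dedupe_linear(items):
    """O(n) overall: same logic, but the inner lookup is O(1) with a set."""
    seen, out = set(), []
    for x in items:
        if x not in seen:   # O(1) average membership test
            seen.add(x)
            out.append(x)
    return out
```

Both keep first occurrences in order; only the cost of the inner lookup differs.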
[+] mehrdadn|6 years ago|reply
Speaking of O(n^2) algorithms (if it's OK for me to promote my own recent writing on this):

I recently illustrated how they can even occur as a result of algorithms being slow by a constant factor.

I thought it might be useful for others, so I posted it here: https://news.ycombinator.com/item?id=21745911

[+] rwmj|6 years ago|reply
I have my own O(n^2) blow-up story. Back in the day I wrote some Google AdWords analysis software in a functional language. Being written in a functional language meant that using association lists (linked lists of key/value pairs) was very natural. These were only used to load the analysis inputs, which were written in an Excel spreadsheet (the customer insisted on this!), exported to CSV, and loaded into the software. The spreadsheet was only perhaps 5 rows, so, whatever.

One day I got a call from the customer that his analysis was taking 6 hours to run, for a program that should have finished in a fraction of a second. It turned out the customer had tried to load a 30,000-row spreadsheet of input. The program would iterate over the input association list on every loop, resulting in classic O(n^2) performance overall.

After changing all the relevant places to use a hash table, it ran in under a second again, although the code was nowhere near as elegant.
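
A minimal sketch of the difference (the data is invented, and the original was in a functional language, but the shape is the same): an association-list lookup scans the list on every call, so calling it once per input row is quadratic; building a hash table first makes each lookup O(1) on average.

```python
def assoc_get(alist, key):
    """Association-list lookup: a linear scan, so O(n) per call.
    Calling it once per input row makes the loading loop O(n^2)."""
    for k, v in alist:
        if k == key:
            return v
    raise KeyError(key)

rows = [("kw1", 1.0), ("kw2", 0.5), ("kw3", 2.0)]

# O(n^2): every lookup rescans the list from the front.
total = sum(assoc_get(rows, k) for k, _ in rows)

# O(n): build a hash table once, then each lookup is O(1) on average.
table = dict(rows)
total_fast = sum(table[k] for k, _ in rows)
```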

[+] masklinn|6 years ago|reply
Why would the code be "nowhere near as elegant"? Lack of a functional / persistent map?

Because I'd expect all maps to provide roughly similar interfaces whether they're assoc lists, hashmaps, btrees, HAMT, …: iterate all entries, presence of a key, get value for key, insert (key, value), remove key (and value), possibly some other niceties on top (e.g. update value in-place, merge maps, …) but those are extras.

[+] jedberg|6 years ago|reply
The thing that impresses me most about this writeup is how much instrumentation there is on Windows now. I stopped using Windows about 15 years ago, but I don't think this kind of analysis was possible back then.

Or maybe I just didn't know the tools well enough?

[+] AaronFriel|6 years ago|reply
15 years ago? You might have just missed it. Event Tracing for Windows gained a lot of functionality in Windows Vista, which some[1] have described as the most forward-looking and instrumental release in Windows history. Vista had a lot of flaws, but it introduced almost everything fundamental to Windows 7 through Windows 10, save perhaps the more recent developments in virtualization and containerization: from differential application-consistent full backups, to transparent full-disk encryption, to the use of a TPM to verify boot integrity.

Vista was a massive release, but also much maligned so maybe you didn't miss much leaving when you did. The tooling has certainly gotten a lot better since then, and so has Windows.

Rather more recently, Windows has gained dtrace support: https://techcommunity.microsoft.com/t5/Windows-Kernel-Intern...

[1] - See this self-described eulogy by @SwiftOnSecurity: https://twitter.com/SwiftOnSecurity/status/85185740489147187...

[+] mixmastamyk|6 years ago|reply
Same here. WMI and perf counters have been around a long time though.

I recently used Win10 for work and was amazed at the combination of awe-inspiring tech and plain awe-full complexity. The control panel alone goes four layers deep in an attempt to make things easier, but in practice it's several times harder to find options than it was in Win2k.

If there were a distribution with all the corporate goals stripped out I'd jump on it.

[+] hnick|6 years ago|reply
At work we had a daily script that took about 30 minutes to run, and about 3 hours on the weekend due to more data. One day we got a report that a downstream service was missing its SLA because the script wasn't finishing. I'd come onto the team late, so I wasn't aware this was abnormal.

It turned out it was now taking about 1-2 hours daily and 6-12 hours on the weekend depending on data size. This had been going on for months, gradually getting worse as the data grew, until it became unbearable and was finally reported.

A senior programmer had removed the shell call to sort on an indexed text file and written their own ad-hoc sorter using every fresh programmer's favourite: (you guessed it) bubble sort. To make things worse, this was Perl, which has a perfectly functional built-in sort if you really have to do it that way. I still have no idea why this was done; I don't think asking would have been productive in that place at that time.

[+] peter_d_sherman|6 years ago|reply
Excerpt: "I decided that the interactions were too complex to be worth analyzing in detail, especially without any thread names to give clues about what the 25 different threads in svchost.exe (3024) were doing."

Future OS/compiler programming note: It would be nice if threads could be individually named, and those thread names shown/sorted/stopped/debugged in whatever tool implements Task Manager-like functionality...

[+] muststopmyths|6 years ago|reply
This already exists.

Windows had a janky way of naming threads before 10, but it has a supported API now.

The problem is that svchost is a host that runs a thread pool with various services being invoked as needed. You'd have to rename each thread every time you ran a function for a service.

That's probably more doable now with the supported API. The previous way involved raising an exception, so I can see why it wasn't done.

Seems like Windows has a more systemic testing problem though. We used to stress-test the crap out of various NT components back in the day. Don't know if the elimination of SDETs means there's no one doing that other than the NT perf team, which never did targeted stress.

[+] brucedawson|6 years ago|reply
The "thread names" clause was a hyperlink to an article I wrote a few years ago listing all of the problems with thread naming on Windows. Microsoft took the suggestions seriously and basically landed all of the features and fixes that I requested, so now when I'm profiling or debugging Chrome the threads all have nice names.

I just kinda like trolling Microsoft for not using their own thread naming API very much.

https://randomascii.wordpress.com/2015/10/26/thread-naming-i...

[+] retrovm|6 years ago|reply
POSIX threads can be individually named. They inherit the name of their creator if you don't set one.
[+] bobbyi_settv|6 years ago|reply
That would help, but I think the root problem here (of the debugging difficulty, not the actual perf problem) is the Windows approach of having lots of unrelated services running together in a single process (svchost)
[+] gameswithgo|6 years ago|reply
In the world of database-backed web apps, a related pattern is asking the database for some rows from a table, and then asking the database for some more rows based on each row you just got, rather than joining it all together into one query.

Makes it into production but falls down eventually!
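
That anti-pattern (often called "N+1 queries") versus a single join, sketched with Python's sqlite3 (the schema and data are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'p1'), (2, 1, 'p2'), (3, 2, 'p3');
""")

# N+1 anti-pattern: one query for the parents, then one query per parent.
# With n parents this is n+1 round trips; costs grow with the data.
pairs_slow = []
for author_id, name in con.execute("SELECT id, name FROM authors"):
    for (title,) in con.execute(
            "SELECT title FROM posts WHERE author_id = ?", (author_id,)):
        pairs_slow.append((name, title))

# One round trip: let the database do the join.
pairs_fast = list(con.execute("""
    SELECT a.name, p.title FROM authors a
    JOIN posts p ON p.author_id = a.id
    ORDER BY p.id
"""))
```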

[+] kbenson|6 years ago|reply
There are sometimes reasons you don't want to join, such as large parent rows whose data gets repeated a lot if there are many child relations.

But that should be two queries: one for all the parents, and one to get the children with a sub-select (or at worst a manually generated parent-id list). Then you can join them manually (if that's actually needed and you can't just use the data sets as they are).
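
A sketch of that two-query shape in Python with sqlite3 (table names and data are invented): each large parent payload is fetched exactly once, the children come back in one batch via an IN list, and the join happens client-side.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parents (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE children (id INTEGER PRIMARY KEY, parent_id INTEGER, v INTEGER);
    INSERT INTO parents VALUES (1, 'big-blob-a'), (2, 'big-blob-b');
    INSERT INTO children VALUES (1, 1, 10), (2, 1, 20), (3, 2, 30);
""")

# Query 1: all the parents (each large payload is fetched exactly once).
parents = {pid: payload for pid, payload in
           con.execute("SELECT id, payload FROM parents")}

# Query 2: all the children, restricted to those parents with one IN list.
placeholders = ",".join("?" * len(parents))
children = con.execute(
    f"SELECT parent_id, v FROM children WHERE parent_id IN ({placeholders})",
    list(parents)).fetchall()

# Join client-side: group the children under their parent ids.
joined = {pid: [] for pid in parents}
for pid, v in children:
    joined[pid].append(v)
```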

[+] toast0|6 years ago|reply
I've actually had a lot of success going the other way. A database query with a join takes down production, so do a client-side join. Query one: get a bunch of ids from table A. Query two: get a bunch of rows from table B, often as "get one row from table B UNION get next row from table B", etc.

You can try using IN on the second query, but usually if that was going to work in a reasonable amount of time, your join would have also worked in a reasonable amount of time.

The real problem people run into with the client-side join is making it query one: get a bunch of ids from table A, then queries two through N: get one row from B apiece, with each query requiring a round trip. Even a pretty small client-to-server round trip of 1 ms gets nasty quickly with repeated queries.

[+] 72deluxe|6 years ago|reply
I had this on Magento 2.3.3 for a "custom stock" module that was running loops to grab data for a product and its child configurable products. As Magento is written in a modular way, asking a subcomponent for data can involve a database lookup so it is really slow to use and painful to debug.

Anyway, it was running 29,000 database queries because it repeatedly used an "in" clause of 1 item across hundreds of queries, instead of 1 query with hundreds of items in the "in" clause.

I replaced them with 1 query that fetched everything it needed instead of the 29,000 queries (8,000+ of which were duplicates).

Terrible code.

[+] rcthompson|6 years ago|reply
Probably the most important thing I learned about algorithmic complexity analysis in grad school was that N is not constant, but rather it grows over time (probably at an exponential rate). That is, as our computing power increases, we continue to aspire to analyze larger and larger data sets. So anything beyond O(n*log(n)) that works right now is more or less guaranteed to blow up in the not-so-distant future, if it remains in use that long.
[+] ihaveajob|6 years ago|reply
This reminds me I once wrote a hyper-exponential graph processing algorithm, in the range of O(2^2^n), and I still believe it was the right choice because: 1) It was much easier to understand than a faster alternative, and 2) I could mathematically guarantee that the largest value of n was something like 8 or 9, so you could argue that it was in fact a constant time operation.
[+] vecter|6 years ago|reply
Could you explain more? As the other commenter pointed out, 2^2^9 = 2^512 which is an astronomically large number (far more than the number of atoms of ordinary matter in the observable universe). Something doesn't seem right.
[+] Google234|6 years ago|reply
2^2^9 is 1.340781e+154
[+] lousken|6 years ago|reply
seems like building Chrome is a royal pain in the ass; so many bugs found in Windows because of it :)
[+] techntoke|6 years ago|reply
It's actually pretty easy, and there are package manifests in Arch that show you how.
[+] JonathonW|6 years ago|reply
Honestly, O(n^2) is probably okay for a diagnostic tool that's only meant to be run occasionally. I'd like to know why Google's IT department felt the need to schedule it to run so frequently.
[+] JoeAltmaier|6 years ago|reply
That could be underestimating the truly catastrophic nature of O(n^2). It blows up very, very fast: from milliseconds to seconds to days to decades. Pretty much anything that grows at all must not be subject to the terrible power of O(n^2).
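A quick back-of-envelope makes the point, assuming a made-up throughput of 10^9 cheap operations per second:

```python
# Seconds needed to perform n^2 "cheap" operations at an assumed
# 10^9 ops/sec (real throughput varies widely by workload and hardware).
OPS_PER_SEC = 1e9

for n in (10_000, 1_000_000, 100_000_000):
    seconds = n * n / OPS_PER_SEC
    print(f"n = {n:>11,}: {seconds:,.0f} s (~{seconds / 86_400:,.1f} days)")
```

Under that assumption, n = 10,000 costs a tenth of a second, n = 1,000,000 costs about 17 minutes, and n = 100,000,000 costs on the order of months.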
[+] btilly|6 years ago|reply
The problem is that available disk space is growing exponentially over the years with a current doubling time of around 18 months for flash drives. And therefore repository size has been as well. What today are 1 GB repositories fairly predictably will be 10+ GB repositories in 5 years. And 100+ GB in 10 years. Which means that the period of locking your computer predictably will go from minutes to hours to days in the same time period.

What is kinda OK today is going to definitely not be OK going forward. This algorithm has to get fixed.

[+] loeg|6 years ago|reply
It's sort of unclear why 3 billion bytes need to be processed by an O(n^2) algorithm in the first place, but it seems especially egregious that it also blocks other processes from running.

One other aspect of the issue is that it's unclear why that database is now so large. It seemed like different machines had different sized databases — perhaps one angle of the solution is trimming and vacuuming.

[+] ComputerGuru|6 years ago|reply
WMI is not a diagnostic tool. It's an abstraction/API, and unfortunately lots of developers build over it rather than directly against the underlying Win32 APIs.
[+] jandrese|6 years ago|reply
That's the part that seemed insane to me. Dial it back so it only runs at 4 AM every day and nobody would have noticed. Running it every hour seems like outright paranoia on the part of the IT staff.
[+] aneutron|6 years ago|reply
Reminds me of the time we found out a team was running O(n^2) code, but in terms of SQL queries (that is, O(n^2) database queries, not O(n^2) comparisons or anything), and they tried to blame the other team when a client sent a request that never finished.

And they tried blaming them for migrating data to a new server with an SSD that shaved 20 minutes from their processing time.

And if you're wondering, they refused to fix it because "it would need too many sprints" and "maybe we'll talk about it in a workshop".

It's still not fixed.

[+] 72deluxe|6 years ago|reply
Funny that you used the word "team" when the outcome is anything but teamwork.
[+] fourseventy|6 years ago|reply
I love this guy's blog. I've been reading it for a while. He is a beast.
[+] radicalbyte|6 years ago|reply
Ten years ago the hash-map implementation in Internet Explorer was also O(n^2).

I had to inject JavaScript into some vendor code to avoid this after our production environment died. I ended up replacing the underlying hash map with multiple smaller maps keyed by a hash of the items, so I'd have 50 maps of 1,000 items instead of one map of 50,000 items.
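
A rough sketch of that sharding idea in Python (the shard count and map type are stand-ins; the broken piece in the story was IE's own hash map):

```python
NUM_SHARDS = 50

class ShardedMap:
    """Split one big map into NUM_SHARDS smaller ones, routing each key
    by hash. If the underlying map degrades at large sizes, each shard
    stays small enough to behave."""

    def __init__(self):
        self.shards = [{} for _ in range(NUM_SHARDS)]

    def _shard(self, key):
        # Route every key to a fixed shard based on its hash.
        return self.shards[hash(key) % NUM_SHARDS]

    def __setitem__(self, key, value):
        self._shard(key)[key] = value

    def __getitem__(self, key):
        return self._shard(key)[key]

m = ShardedMap()
for i in range(50_000):
    m[f"item-{i}"] = i
```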

[+] Etheryte|6 years ago|reply
I'm interested to know: what practical application had you loading 50,000 items into the frontend 10 years ago?
[+] codetrotter|6 years ago|reply
I have over 100,000 images on my iPhone, and some apps used to be really, really slow when I browsed albums in them.

One of the apps that used to be maddeningly slow was Instagram, but Instagram has gotten better lately. No idea why though.

[+] tinus_hn|6 years ago|reply
Then again, your disk space, processing power and memory is cheap for Microsoft.
[+] zackmorris|6 years ago|reply
If you're building a library or framework, then just assume that worst-case performance will be encountered by your users, because you can't predict their use cases.

So, for example, I almost always use associative arrays (maps) instead of lists. I really wish there were a MAPP language, because I view the map as a potentially better abstraction than the list in LISP, but I digress.

I also tend to use atomic operations instead of locks. Good starting points are understanding how compare-and-swap (CAS) works, and how functional programming with higher-order functions and immutable variables works, because that mindset greatly simplifies threading: with no shared mutable state you get the Actor model. Lazy evaluation is another good one. Vector languages like GNU Octave and MATLAB are also good, because they favor a level of abstraction above the bare-hands programming of C-style languages like C++ and JavaScript. There you tend to see that most algorithms are embarrassingly parallel at some level (especially the things we think of as computationally expensive, like multimedia processing).

Also (this may be controversial), I think poor performance can be politically motivated. For example, I run Safari with JavaScript disabled so I can have thousands of tabs open (it's disabled as I write this). But when I disable JavaScript in Chrome, performance grinds to a halt. You can try it right now on the Mac by force-quitting Chrome with a bunch of tabs open and relaunching it from Terminal.app with:

  open -a "Google Chrome" --args --disable-javascript
Or manually with:

https://www.computerhope.com/issues/ch000891.htm

I don't know what causes it, but my guess is that Google either wrote some of the loops under the assumption that JavaScript would always be on, or has a blind spot in the implementation because so much of their business model depends on ads having dynamic behavior.

So when you're in a meeting and someone shouts down your concern about edge-case performance because they don't see it as a priority, graciously humor them and then write your code the right way, because you know it doesn't take any longer than doing it the wrong way. You might catch some flak during code review, so have a good excuse handy, something about having tried it the other way and run into problems. Often I'll write the easy imperative solution in a comment above the simple functional solution, or even put both solutions behind a preprocessor directive or feature flag, to leave it up to the team lead/project manager and have our keisters covered if/when something melts down.