top | item 2837214

Ask HN: When was the last time you used big-o in a web app?

50 points | methodin | 14 years ago | reply

Go into detail about the last time you've found it necessary to analyze a particular algorithm for either speed or memory using the typical tools taught by CS programs in your web application. Was it useful? Did it work according to plan?

35 comments

[+] perlgeek|14 years ago|reply
Whenever I write a loop, I keep the big-O in the back of my mind. Whenever I linearly scan a data structure for a value, I ask myself whether I couldn't use a hash lookup instead. When loops are nested, my alarm bells start to ring, and I estimate the size of each list I loop over and try to think of ways to reduce the number of iterations.

So I always track my O()s, even if it doesn't boil down to a rigorous analysis.
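The linear-scan-to-hash-lookup swap described above can be sketched in a few lines (the names and data are illustrative, not from any particular codebase):

```python
# Linear scan: O(n) per lookup.
def has_user_linear(users, name):
    for u in users:          # scans the whole list in the worst case
        if u == name:
            return True
    return False

# Hash lookup: O(1) average per lookup, after a one-time O(n) build.
def make_user_index(users):
    return set(users)

users = ["alice", "bob", "carol"]
index = make_user_index(users)
assert has_user_linear(users, "bob")
assert "bob" in index            # same answer, constant-time membership test
```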

[+] methodin|14 years ago|reply
I imagine the intuitive, unconscious use of big-o is actually more beneficial as you can immediately identify problem areas without having to bust out paper for complex analysis.
[+] jhuckestein|14 years ago|reply
You don't "use" big-O, it's just a way of writing down different classes of (mathematical) functions; hence, "big-O-notation".

At all times when writing software the engineer should know a) the complexity of the code they are trying to write and b) what the memory complexity of the program is and, in many cases (slightly off-topic), c) the approximate times it takes to do certain simple tasks (this may be memory access, simple disk/network IO, database queries etc)

This is useful because it protects you from writing programs that "perform" unexpectedly badly (I'm confused why anyone would expect them to "perform" without thinking about it).

It may sound like a lot of work, but thankfully most things you program are really constructed of simpler things that have well known complexities (simple loops, sorting, hashing, stuff like that)

At our company (pretty straight-forward social web-app), we have surprisingly many pieces of code where the trivial implementation wouldn't work. We have a graph of different users and relations and want to maintain a transitive closure for each type of edge while also maintaining a bunch of indexes.

Even on the client side we have simple things, such as an autocomplete box for usernames. With many users, we can't just keep an array of all existing users and iterate through it on every keystroke. At that point you have so many options to choose from (e.g. have an index on the server and do a request for the filtered list after short timeout or build a simple index (searchtree) on the client and use that for searching) and you couldn't make a decision without knowing about runtime complexity, constant time factors and memory usage.
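The client-side "simple index (searchtree)" option mentioned above can be sketched with a sorted list plus binary search; this is an illustrative Python sketch of the idea, not the actual client code:

```python
import bisect

def build_index(usernames):
    # One-time O(n log n) sort; each keystroke is then
    # O(log n) plus the size of the matching output.
    return sorted(usernames)

def complete(index, prefix):
    # Binary-search for the first name >= prefix, then walk
    # forward while names still match the prefix.
    i = bisect.bisect_left(index, prefix)
    out = []
    while i < len(index) and index[i].startswith(prefix):
        out.append(index[i])
        i += 1
    return out

idx = build_index(["dave", "dan", "erin", "dana"])
print(complete(idx, "da"))  # ['dan', 'dana', 'dave']
```

Compare this to scanning the full user array on every keystroke, which is O(n) per key press regardless of how selective the prefix is.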

[+] ColinWright|14 years ago|reply
I have no idea what tools are taught in CS programs, but three of my recent web apps used big-O analysis extensively.

I can't tell you what they are, because they are all projects that might soon be opened, and while they're not exactly in stealth mode, they are good ideas that I can only work on slowly. I need to get enough of an advantage to be able to stay ahead of those with more web implementation experience.

However, one involves analysing the cross-connections between hundreds of users, and hundreds of small units of data they each have. The naive algorithm is O(n^3 m^2) and is too slow for more than 30 users or 100 units of data. The analysis showed that early, so the architecture of the system was changed to cause the data to "flow" around the system and aggregate closer to where it was needed.

Another application uses proximity detection on potentially thousands of points. Again, the naive algorithm worked for small numbers of users (I use the term loosely - they're more "agents" than users), and again, big-O analysis showed the critical points in the system.

I'm sure that most web development consists of gluing libraries together, stuffing databases, and populating forms or similar. Interesting stuff that I do uses algorithmic analyses.

[+] ww520|14 years ago|reply
The first problem sounds like the join operation in an RDBMS. It might be useful to look into the various join algorithms, as there's extensive research in that area.
[+] methodin|14 years ago|reply
Interesting. So your use of big-o was primarily for an algorithm that ran on data after it was stored and/or used in real-time when fetching data? I tend to agree. Either I tend to do very simple input/output pages or very complex modeling with relationships between data. I would imagine the latter leads to more compelling products.
[+] billybob|14 years ago|reply
I actually learned about Big O from a friend with a computer science background (I don't have one) when I was working on some jQuery show/hide code a few years ago. The idea was: you have a bunch of products on a page, and as you check and uncheck features you want, only the products matching your criteria are shown.

Originally, my logic was something like "for each checkbox that's checked or unchecked.... for each product that may or may not have that feature..." The nested for loop was fine with a small number of checkboxes and products, but as I got more products and more checkboxes, it began to slow down. My friend pointed out that it was Big O of N squared and explained what that meant. It was eye-opening.

I rewrote it to first compile a list of checked features (one pass through the checkboxes) and then show/hide the products (one pass through the products), which scales linearly.

I can't say that I think about Big O every time I write an algorithm, but I do have a sense that nested loops are dangerous.
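The two-pass rewrite described above can be sketched like this (the product and feature names are hypothetical; the original was jQuery, this is just the shape of the fix):

```python
# Hypothetical data: each product maps to the set of features it has.
products = {
    "laptop":   {"wifi", "camera"},
    "e-reader": {"wifi"},
    "tripod":   set(),
}

def visible_products(products, checked_features):
    # Pass 1 happens at the call site: 'checked_features' is collected in
    # one scan of the checkboxes. Pass 2 scans the products once, so the
    # whole thing is O(n + m) instead of the nested-loop O(n * m).
    wanted = set(checked_features)
    return [name for name, feats in products.items() if wanted <= feats]

print(sorted(visible_products(products, ["wifi"])))  # ['e-reader', 'laptop']
```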

[+] jmreardon|14 years ago|reply
Big O notation is intended to convey the algorithm's runtime as a function of the inputs. By the sounds of it, the number of checkboxes you had was actually fixed and relatively small. In this case, the algorithm is still O(n). The real problem with your code is that you went through the list (an expensive operation) multiple times. This gives you a large constant factor, not O(n^2). Because Big O notation is only concerned with large (asymptotic) values, the constants are generally ignored.

If you had a form where the number of checkboxes scaled with the number of items on the list, then you might end up with an O(n^2) algorithm, depending on how exactly they scale.

[+] Shenglong|14 years ago|reply
How many checkboxes were there, that you actually noticed the delay in an N^2 algorithm?
[+] Me1000|14 years ago|reply
I had a major hand in writing the CPTableView class in Cappuccino. The tableview is designed to handle hundreds of thousands of rows without slowing down. This is accomplished both through very aggressive caching, and the smart use of algorithms. If you look at the source code, you can actually see where we document the time complexity.

A standard tableview (static row heights) is laid out in constant time, whereas using variable row heights is laid out in logarithmic time.

An early implementation provided variable row heights in linear time (or perhaps n^2, I don't recall)... this caused scrolling to be absolutely terrible and unusable if you had more than a couple dozen rows.

With constant time we can, theoretically, have any number of rows in your tableview and it will remain just as performant as if you only had 1.

If you're building real software and throw out the concept of time complexity, then you're setting yourself up for failure in the long run.
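The logarithmic-time lookup for variable row heights can be sketched with a prefix-sum table plus binary search; this is an illustrative Python sketch of the technique, not the actual CPTableView source (which is Objective-J):

```python
import bisect
from itertools import accumulate

def build_offsets(row_heights):
    # O(n), done once up front: cumulative y-offset of each row's bottom edge.
    return list(accumulate(row_heights))

def row_at(offsets, y):
    # O(log n) per scroll event: binary-search the offset table
    # to find which row contains the scroll position y.
    return bisect.bisect_right(offsets, y)

offsets = build_offsets([20, 35, 20, 50])   # offsets == [20, 55, 75, 125]
assert row_at(offsets, 0) == 0    # y=0 falls inside row 0
assert row_at(offsets, 20) == 1   # first pixel of row 1
assert row_at(offsets, 74) == 2   # inside row 2 (spans y 55..74)
```

With static row heights the multiplication `row * height` replaces the search, which is the constant-time case described above.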

[+] Me1000|14 years ago|reply
Quick follow up to clear up a couple questions that came up:

We only lay out visible rows, thus that layout is done in linear time, but calculating the position of those rows use the time complexities I described above.

For variable row heights we cache a few values for every single row up front, that is also linear.

The time complexity I'm describing in my original post is about calculating the position of those views as you scroll, draw, and lay out (not the initial setup)... which is what is most expensive.

[+] PaulHoule|14 years ago|reply
Web apps stick to O(N) (iterate through a list) and O(N log N) (sort, some operations on indices).

If something is O(N^2) or worse, it "doesn't scale" and you won't usually see it done in a web app. That is, there's a whole shadow universe of web apps that you've never seen because the resource demands are too high.

[+] bunchesofdonald|14 years ago|reply
I disagree as well. Certainly most code that runs when you actually view a page in a web app is O(N), but there are many web apps that do backend computing that is much more complicated. Take Google and Netflix as examples.
[+] diolpah|14 years ago|reply
Well, I strongly disagree with this. One example from our own applications would be O(m*n) for computing levenshtein distances between arbitrary product sets. Of course many of these are precomputed, but they are still used in a web application. Computing sales relationships would be another example.
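For reference, the O(m*n) cost mentioned here comes from the classic dynamic-programming edit-distance algorithm; a minimal sketch (using a rolling row to keep memory at O(n)):

```python
def levenshtein(a, b):
    # Classic DP edit distance: O(m * n) time, O(n) memory
    # with a rolling previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
assert levenshtein("abc", "abc") == 0
```

Running this over arbitrary product sets is O(m*n) per pair, which is exactly why precomputing the distances makes sense.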
[+] derwiki|14 years ago|reply
I built a recommendation engine: http://github.com/causes/suggestomatic

Because of the scale of data I'm using, I'm acutely aware of big-O. I guess what's most interesting is that gcc is able to take some of those "slow" operations and make them super fast (if it can fit them in a register) and other times I get the full O(n) effect. Valgrind has been a very useful tool, and can empirically show you where your program is going to blow up (if you doubt your big-O analysis).

[+] th0ma5|14 years ago|reply
In all of my work with Google App Engine, I thought about complexity in loops all of the time. If the problems were way more mathematical or even more complex then sure, I'd have to break out a textbook or something, but mostly this was the first-day algo class sort of stuff about growth over time, how deep the tree-recursion goes, etc. Mostly I just thought about this stuff in my head without even specifically conceptualizing it as a big-O thing, but the principle is the same, just way simpler, more off-the-cuff, and in the middle of coding.
[+] mindcrime|14 years ago|reply
Go into detail about the last time you've found it necessary to analyze a particular algorithm for either speed or memory using the typical tools taught by CS programs in your web application.

I agree with the folks who mentioned doing a sort of naive, intuitive "big O" analysis as you work, as opposed to a formal analysis. I was recently working on some social graph stuff, and I started down the path of doing an "all pairs shortest paths" determination with Floyd-Warshall (which is O(|V|^3))... needless to say, it only took a few minutes to realize that this won't work for any real-world-size dataset and would require a different approach. If nothing else, you quickly find that the naive approach doesn't work for any meaningful amount of data: just building an adjacency matrix for a few hundred entries, if the datatype is an int (which is what I was using at the time), will blow past the heap size limit on a 32-bit JVM.

I'm actually still researching and considering exactly what I'm going to do. I don't necessarily need every shortest path in the system, so the first cut at optimizing this will probably be to cook up something that only calculates a portion of the paths. Another approach might be to look into a parallelized algorithm that can do this on a cluster. Beyond that, I've been poking around trying to see if there's a way to apply a sort of incremental evaluation approach to the problem, where I build the initial matrix of path distances, and then incrementally recompute just the part I care about, when something changes.
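One concrete alternative for the "only a portion of the paths" idea, assuming an unweighted social graph, is computing shortest paths on demand from a single source with BFS: O(V + E) per query and adjacency-list memory instead of a |V| x |V| matrix. A sketch (not the author's eventual solution):

```python
from collections import deque

def shortest_paths_from(graph, source):
    # BFS from one node: O(V + E) time with adjacency lists,
    # versus O(V^3) time and O(V^2) memory for all-pairs Floyd-Warshall.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

graph = {"a": ["b"], "b": ["c"], "c": []}
assert shortest_paths_from(graph, "a") == {"a": 0, "b": 1, "c": 2}
```

Results can be cached per source and invalidated incrementally as edges change, which fits the incremental-evaluation approach mentioned above.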

Anyway, at the moment, this is the largest extent to which I've actually had to spend time thinking about the kind of stuff that I studied in some of my CS classes... I've got all my algorithms books off the shelf and am digging in as I research this. :-)

[+] rledge21|14 years ago|reply
All the time. I think about this stuff for every new feature that needs to be implemented server-side. Our system is very read-heavy (posts need to be circulated into different feeds for users), and all the MySQL joins were starting to slow things down. We ended up using Redis sets to store post_ids for specific feeds, which sped things up considerably.
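The shape of the Redis-set approach, sketched in plain Python (the real system would use Redis commands like SADD/SMEMBERS; the feed names here are made up):

```python
# One set of post ids per feed, so "posts in feed X" is a direct
# O(1)-average lookup instead of a multi-table join at read time.
feeds = {}  # feed name -> set of post ids

def publish(feed, post_id):
    feeds.setdefault(feed, set()).add(post_id)

def feed_posts(feed):
    return feeds.get(feed, set())

publish("alice:home", 101)
publish("alice:home", 102)
publish("bob:home", 101)
assert feed_posts("alice:home") == {101, 102}
```

The trade-off is fan-out work at write time (a post is added to every feed that should see it) in exchange for cheap reads, which fits a read-heavy system.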
[+] hamidpalo|14 years ago|reply
While I never sit down and write out a formal proof, I always think about it and always check the implementations of standard library functions that work on collections before using them.

The real value of Big-O isn't knowing how to prove that red-black trees have O(1) amortized insert, but rather understanding the tools used and the code written better.

[+] roel_v|14 years ago|reply
What is special about a web app in this context? When programming, shouldn't/doesn't one always estimate the complexity of algorithms or select data structures based on the anticipated usage patterns?
[+] methodin|14 years ago|reply
A majority of web-apps are forms, data storage and display. I was more interested to see what sort of problems people do solve with Big-O considering it's not something that you typically have to use when working with a majority of web apps and simple push/pull sites.
[+] snikolic|14 years ago|reply
I can't remember the last time I did a formal analysis, but I naturally think about this stuff whenever I work with sets/lists, loops, recursion, string manipulations, or non-trivial standard library implementations....which is basically whenever I sit at a keyboard.
[+] ohashi|14 years ago|reply
Last time I used it (I can still see it on my whiteboard) is when designing a spam filtering process. I wanted to get rid of the most spam as cost efficiently as possible. So that meant figuring out the cost of each method (Big O) and then figuring out approximately how much spam the process would catch. Thus, I could run the most expensive processes last on the spam which still hadn't been caught.
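The cheapest-first pipeline described above can be sketched like this (the filter names, costs, and predicates are all hypothetical placeholders):

```python
# Hypothetical filters: (name, cost, predicate). Costs are made-up units.
filters = [
    ("bayes",     5.0, lambda msg: "viagra" in msg),
    ("keyword",   0.1, lambda msg: "FREE!!!" in msg),
    ("dns_check", 2.0, lambda msg: "spam.example" in msg),
]

def is_spam(msg, filters):
    # Run the cheapest checks first, so the expensive ones only ever
    # see mail that the cheap ones failed to catch.
    for _name, _cost, check in sorted(filters, key=lambda f: f[1]):
        if check(msg):
            return True
    return False

assert is_spam("FREE!!! viagra", filters)      # caught by the 0.1-cost check
assert not is_spam("lunch at noon?", filters)
```

Weighting each filter's cost by its expected catch rate, as the comment describes, refines the sort key without changing the structure.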
[+] ww520|14 years ago|reply
I haven't had the chance to deal with big-O issues in frontend web apps, where most operations are short and data sets are small. For backend processing, looping and large data processing are always considered with efficiency in mind.

The most memorable big-O analysis that actually led to fixing a performance bug was when I wrote the caching layer of a file system using bubble sort for sorting the dirty blocks before flushing. It slowed down noticeably, like seconds of pause with 100% CPU, when the cache size exceeded 10K blocks. Replacing it with heap sort made the problem go away. O(n log n) really is much more scalable than O(n^2).
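The contrast between the two sorts is easy to sketch (in Python rather than the original file-system code; `heapq` gives heap sort almost for free):

```python
import heapq

def heap_sort(items):
    # O(n log n): heapify is O(n), each of the n pops is O(log n).
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

def bubble_sort(items):
    # O(n^2): fine for tiny inputs, seconds of CPU at ~10K items.
    items = list(items)
    for i in range(len(items)):
        for j in range(len(items) - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items

blocks = [42, 7, 19, 3, 25]
assert heap_sort(blocks) == bubble_sort(blocks) == sorted(blocks)
```

At 10K elements the difference is roughly 10^8 comparisons versus ~10^5, which is consistent with the seconds-long pauses described above.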

[+] nicholaides|14 years ago|reply
If you solve problems that deal with a lot of data, or make a lot of calculations, you'll probably care about big O.

Last time I really had to consider big O in a web app was 2009. Of course, it had nothing to do with the fact that it was a web app and not desktop or otherwise. It had everything to do with the problem I was trying to solve.

[+] baddox|14 years ago|reply
It depends what you mean. If you mean using asymptotic analysis on a "new" algorithm to figure out its running time (master theorem, recurrence relations, etc.), then probably never.

If you mean using your knowledge of computational complexity to inform your choice of algorithms and data structures, then all the freaking time.

[+] sushrutbidwai|14 years ago|reply
As recently as yesterday. We use algorithms like knapsack, shortest path, minimum spanning tree, and a lot of predictive-analysis algorithms. I have used three levels of analysis: memory complexity, CPU complexity, and database-call complexity, including cache hit/miss.
[+] tlrobinson|14 years ago|reply
The last time I did a pen-and-paper "big-O" analysis was probably in college or a job interview, but that's not really the point. It's much more important to have an intuition to guide you.
[+] ramses0|14 years ago|reply
Last week, I found an n^2 (actually n*m) loop in some graph drawing code. Reworked it to be 2n (actually n+m). Everything went better than expected.