top | item 24764605

14,000x Speedup (2015)

424 points | signa11 | 5 years ago | james.hiebert.name | reply

228 comments

[+] dang|5 years ago|reply
All: some comments are finding fault with the article and the author in ways that aren't in the intended spirit of HN. When someone made a mistake or didn't find the best answer, it's great to explain why and supply correct information, but please avoid mixing in putdowns, and if your comment is just a putdown, please don't post.

Otherwise we end up with basically this: Person 1: "I am very smart." Person 2: "Rubbish. It is I who am very smart." That is not interesting, regardless of who is smarter. Worse, it all-too-easily turns a community like HN into an unfriendly place.

When in doubt, simply remember that the intended spirit is curious conversation, and then your comments will naturally contain interesting details and insights without the other stuff.

https://news.ycombinator.com/newsguidelines.html

[+] mynegation|5 years ago|reply
Algorithms are important, but they are especially powerful in combination with knowledge of computer architecture and programming language intricacies.

Many years ago I was asked to look at a program written in C++ that calculated a Kendall tau correlation matrix for a large amount of data. Basically, Kendall tau is a robust replacement for Pearson correlation, and it had to be calculated for 0.5M^2 matrix elements, where each element required computing Kendall tau on two numerical series of length N. Both M and N were on the order of a few thousand, and the naive implementation has complexity O(N^2 x M^2). The program ran for 6+ hours.

After looking at the C++ I realized that a lot of time was spent needlessly copying data (a result of not knowing the difference between passing a vector by value and by reference) and on needless memory allocations and deallocations. I did a rewrite from scratch, keeping the tests: no copying, a minimum amount of memory allocated only once, cache locality exploited, and the calculation spread across cores with OpenMP. And finally, I realized that Kendall tau can be computed in O(M^2 x N log N). I found that out on my own, just by squinting and noticing that calculating Kendall tau for two series is essentially sorting, but later learned that e.g. the scipy implementation uses the same approach.

The resulting program ran in 4 minutes on a machine with 2 cores and 4 threads.
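The sorting trick described above can be sketched like this: sort y by x, then count inversions with a merge sort; with no ties, the inversions are exactly the discordant pairs. This is a minimal tau-a version (names are mine, not from the original program):

```python
# Kendall tau (tau-a, no ties) in O(N log N) via merge-sort inversion counting.

def _sort_and_count(a):
    """Merge sort that also counts inversions (= discordant pairs)."""
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, inv_left = _sort_and_count(a[:mid])
    right, inv_right = _sort_and_count(a[mid:])
    merged, inv = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            inv += len(left) - i      # right[j] jumps ahead of the rest of `left`
            merged.append(right[j]); j += 1
    merged += left[i:] + right[j:]
    return merged, inv

def kendall_tau(x, y):
    # Sort y by x; discordant pairs are exactly the inversions in that order.
    y_by_x = [yi for _, yi in sorted(zip(x, y))]
    _, discordant = _sort_and_count(y_by_x)
    total = len(x) * (len(x) - 1) // 2
    return (total - 2 * discordant) / total
```

scipy's `kendalltau` uses the same idea, with extra bookkeeping for ties.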

[+] roter|5 years ago|reply
> In climate science we do a lot of downscaling. We take temperature and precipitation readings from a coarse scale Global Climate Model grid and map them to a fine scale local grid. Let’s say the global grid is 50x25 and the local grid is 1000x500. For each grid cell in the local grid, we want to know to which grid cell in the global grid it corresponds.

Climate scientist here. Colour me a little confused. We have a GCM which is (typically) on a structured grid, i.e. with latitudes & longitudes on a pixel-like grid. Finding the GCM (coarse) grid cell for a given latitude/longitude of the downscaled (fine) grid is an integer operation, i.e. invert lat = lat0 + i x delta_lat and lon = lon0 + j x delta_lon. Must be missing something here. Even if there is a strange projection, you can map it such that the deltas are uniform.
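The integer inversion described above is a one-liner per axis. A sketch, where the grid origin and deltas are made-up example values (not from the article), matching the 50x25 global grid:

```python
# Invert the affine grid mapping lat = LAT0 + i*DLAT, lon = LON0 + j*DLON.
# LAT0/LON0 and the deltas are illustrative assumptions, not article values.
LAT0, LON0 = -90.0, -180.0
DLAT = 180.0 / 25   # 25 rows in the coarse global grid
DLON = 360.0 / 50   # 50 columns

def coarse_cell(lat, lon):
    # integer division recovers the (i, j) cell indices directly, O(1)
    i = int((lat - LAT0) / DLAT)
    j = int((lon - LON0) / DLON)
    return i, j
```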

[+] sillysaurusx|5 years ago|reply
Gamedev! One of the best parts of being in gamedev was that the problem space forces you to work this stuff out. People won’t wait 30 minutes for a frame to render.

A few ways to solve this pop to mind. As Aeolun mentions in a parallel comment, you can just divide coordinates if your grid is regular. But it often isn’t.

In that case, you can attack it in a few ways. One time Feynman was giving a lecture and said “This triangle looks a little cockeyed. Obtuse! That’s the word! I know about the triangles but not their names.” And I’m not Feynman, but I did forget the name of this: you build a tree of squares. The first one is your whole grid; split it into four smaller squares, and those are its children; repeat until your resolution is sufficient. Then you can attach whatever you want to the child nodes (in this case, grid squares). Since each level subdivides space, you can do a search in O(log n).
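A minimal sketch of that tree of squares (the class name and `locate` API are illustrative, not from any particular engine):

```python
# A "tree of squares": each node covers a square region and splits into four.
class QuadTree:
    def __init__(self, x, y, size, depth):
        self.x, self.y, self.size = x, y, size
        self.children = None
        if depth > 0:
            half = size / 2
            self.children = [
                QuadTree(x,        y,        half, depth - 1),  # lower-left
                QuadTree(x + half, y,        half, depth - 1),  # lower-right
                QuadTree(x,        y + half, half, depth - 1),  # upper-left
                QuadTree(x + half, y + half, half, depth - 1),  # upper-right
            ]

    def locate(self, px, py):
        """Descend to the leaf square containing (px, py): O(depth) steps."""
        if self.children is None:
            return (self.x, self.y, self.size)
        half = self.size / 2
        idx = (1 if px >= self.x + half else 0) + (2 if py >= self.y + half else 0)
        return self.children[idx].locate(px, py)
```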

But what if you want to go faster? What if it’s absolutely crucial for speed, like a physics engine, to be able to ask “which objects occupy this region of space?”

My favorite algorithm is absolutely delightful: a radix sort on floats. http://www.codercorner.com/RadixSortRevisited.htm
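The core trick in that article is a bit-twiddle that maps each float's bit pattern to an unsigned integer whose numeric order matches the float order, so a plain integer radix sort "just works". A sketch (valid for non-NaN IEEE-754 singles):

```python
import struct

def float_key(f):
    """Map a float's bit pattern to a uint32 whose order matches float order."""
    (u,) = struct.unpack("<I", struct.pack("<f", f))
    if u & 0x80000000:
        return ~u & 0xFFFFFFFF   # negative: flip every bit
    return u | 0x80000000        # non-negative: flip just the sign bit

vals = [3.5, -1.0, 0.0, -7.25, 2.0]
# sorting by the integer keys reproduces the float order
assert sorted(vals, key=float_key) == sorted(vals)
```

In the real algorithm you'd feed these keys to a byte-at-a-time radix sort instead of `sorted`; the key function is the part that makes floats radix-sortable.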

You end up getting O(k*N) time, just like radix sort normally gives you. But the real magic is that it works with bounding boxes. There’s an article from the same author, which I can no longer find, detailing it. The idea is that you sort all your objects along one spatial dimension first, say the X axis, and then you don’t need to do anything to search quickly. No trees. Suppose you want to ask “does this region of the X-axis have any objects?” You just... do something, ha. I forgot the next step, so I don’t want to give a mistaken answer and tarnish poor Pierre’s idea, which is quite nice. But he’s been working on this problem for basically decades, so here’s one of his N papers on the topic (of spatial partitioning, not applying radix sort to solve it): http://www.codercorner.com/ZeroByteBVH.pdf

I could work it out if I sat down with it for a bit, but I’ll leave it to one of you younger devs to take up the torch. It’s a fun puzzle.

[+] dharmon|5 years ago|reply
The second algorithm you are thinking of is probably what is known as "sort-and-sweep". Each object takes up an interval along each axis, so a necessary condition for two objects to collide is that their intervals overlap along every axis (which means that if there is no overlap along any one axis, you can guarantee they are not colliding).

You compute the bounds along each axis in linear time. Sorting those bounds can also be done in linear time (e.g. with a radix sort), and then a single pass over each axis, checking the marked intervals, is linear as well.
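The single-axis sweep can be sketched like this. This is a readable version, not the fully linear-time one (the inner list rebuild makes the worst case quadratic; production versions keep the active set incremental):

```python
# Sort-and-sweep along one axis: emit candidate pairs whose x-intervals overlap.
def sweep_and_prune(boxes):
    """boxes: list of (min_x, max_x); returns index pairs with overlapping intervals."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    active, pairs = [], []
    for i in order:
        min_i = boxes[i][0]
        # retire intervals that ended before this one starts
        active = [j for j in active if boxes[j][1] >= min_i]
        pairs.extend((j, i) for j in active)  # every survivor overlaps box i
        active.append(i)
    return pairs
```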

[+] sp332|5 years ago|reply
That's a quadtree, or an octree in 3D.
[+] Cthulhu_|5 years ago|reply
I've learned a few simple tricks about things like random numbers, rounding, simple physics (acceleration, inertia, gravity), etc just from following some tutorials in the pico-8 fantasy console.

And some people have taken that to extremes, building 3d rendering engines and tech demos in what is a very constrained environment.

[+] djmips|5 years ago|reply
And I would point out that some of the best game devs, who write frighteningly fast code, don't have CS degrees and are sometimes fixing up after people who do.
[+] wang_li|5 years ago|reply
Binary Space Partitioning tree.
[+] CyberDildonics|5 years ago|reply
If you sort bounding boxes along only one dimension by their min values, you still have to brute-force test their max values and the other dimensions, unless you add a secondary space partition or acceleration structure.
[+] blisterpeanuts|5 years ago|reply
Years ago, I was a contract programmer at a large university, working on Perl reporting scripts for the library system.

Once a month, they reloaded the patron database; it took about 24 hours. If there were a network interruption or power glitch, it would have to be rolled back and run again the next day. Pretty bad.

It wasn't my task, but I was curious why this program took so long to run, so I started picking it apart. It was written by a PL/I programmer -- his first Perl project. No comments. He was no longer there. No one understood the script.

I started changing things. Among other things, I moved the Oracle-specific DBD calls out into a separate module, because someone on the internet said Perl ran more efficiently that way. And poof! Just like that, the program ran in 8 minutes, down from 24 hours.

I looked like a hero, but I very openly said I wasn't quite sure what I did. Just got curious and moved things around, and apparently discovered some bug in the DBD layer by pure accident.

Sometimes curiosity and a willingness to experiment are all it takes.

[+] mlang23|5 years ago|reply
Most code is bad code. And if it is not algorithmically bad, there are many other errors lurking in dark corners.

I am an HPC cluster admin. Many years ago, we had a (for us, back then) rather large project: several million hours of CPU time. During a support case, I happened to stumble across the source code for the project, and I was pretty surprised: it was a few hundred lines of Pascal, compiled with fpc. I knew about the language being used and was told other compilers don't make a lot of difference "because the code only uses 32-bit values". Hmm, suspicious, but you can't dictate how people solve their problems. At least not easily. But the small LOC count, and an upcoming weekend without a lot to do, got me thinking. So I sat down and more or less mechanically translated the code to C++, making use of templates to move a value from runtime to compile time. And bang, I got a speedup of 5x. So a few hours of time spent by someone NOT involved in the project at all helped save around 5 million CPU-hours (worth around 1.5x my yearly salary).

Of course, this is a pretty extreme example and not really representative of the scientific computing community. Still, from my experience watching people run big jobs, I estimate that we waste around 50% of our computing time (worldwide) on insufficiently thought-out implementations and a lack of knowledge about the actual architecture the code runs on. Sometimes I am happy if a user knows the difference between OpenMP and OpenMPI :-)

[+] gridlockd|5 years ago|reply
> ...not really representative for the scientific computing community.

I disagree.

Source: dude, trust me.

[+] chrismorgan|5 years ago|reply
Point of document implementation: please don’t use <script type="math/tex">…</script> for equations, or else the equations are simply missing in environments where MathJax doesn’t run (e.g. JS disabled, a text-mode browser, or JS failing to load; all up, it’s more common than you realise). Use a form of markup that fails visibly rather than invisibly. TeX mathematics notation is far better than a void.

You could make this abuse of <script> a bit more tolerable with something like the CSS `script[type^="math/tex"]:not([id]) { display: inline }`, but that will still be hidden to text-mode browsers or browsers where the CSS fails to load. It’s much better to just not use <script> in this way.

[+] thangalin|5 years ago|reply
> please don’t use <script type="math/tex">…</script> for equations

KeenWrite[0], my desktop text editor, has an Export HTML using TeX/SVG feature that encodes math glyphs as vector paths, embedding the SVG into the resulting HTML file.

On the subject of optimizations, KeenWrite can render 1,000 simple TeX formulas to SVG format in about 500 milliseconds (on my desktop machine), sufficient for real-time previews. Two libraries needed changes to accomplish this feat.

First, JFreeSVG and Apache Batik both suffer from the same performance problem: they use Java's NumberFormat to convert floating point numbers to strings. It's slow because of internationalization rules, which aren't necessary for a machine-readable format. The fix entailed replacing NumberFormat with the Ryū algorithm[1].

Second, JMathTeX[2] was parsing TeX by throwing exceptions for flow control. In Java, throwing an exception forces the virtual machine to fill out a stack trace. Normally this wouldn't result in abysmal performance, but the flow control was being executed for each character in every macro name encountered. Pre-parsing macro names without any flow control resolved the issue.

Further optimizations were made, but those two were the lion's share. Combined, the total time went from ~5 seconds to render a single TeX formula to SVG to ~0.5 seconds for 1,000 of them.

[0]: https://github.com/DaveJarvis/keenwrite

[1]: https://github.com/DaveJarvis/JMathTeX

[2]: http://jmathtex.sourceforge.net

[+] thewebcount|5 years ago|reply
I would go farther and say don't use special math mark-up for writing simple things like variable names and the 4 basic math operations. The only thing in the entire post that might need it is the log2 written twice near the end of the article.
[+] CogentHedgehog|5 years ago|reply
That's a good speedup but why are they not using locality-sensitive hashing? Or doing some basic math to convert X and Y to binned indices for a lookup table -- you can easily precompute values for a few million cells and hold them in memory. If the lookup table is small enough it would even fit in a single CPU cache, and lookup would be nearly instant. Both of these approaches provide O(1) lookup for grid cells rather than O(log n).

I used a similar optimization when I was younger to enable interactively plotting very large (at the time) 2D point datasets. We stripped out points that were so close together that the difference would not be visible. The X,Y coordinates were each binned to an integer pixel location and both were packed into a single long integer. Then we stored a hashset of longs so we knew what pixels had points occupied. When adding a point to render, we did O(1) hashtable lookups to see if the location already had a point in it. The result was an extreme speedup because we could avoid rendering all points which were not visible, and it scaled linearly with the number of points.
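The packing-and-dedupe scheme described above can be sketched like this (assuming non-negative pixel coordinates that fit in 32 bits; names are mine):

```python
# Pack two 32-bit pixel coordinates into one integer key for O(1) set lookups.
def pack(x, y):
    return (x << 32) | y

occupied = set()

def should_render(px, py):
    """True the first time a pixel is seen; later points at the same pixel are skipped."""
    key = pack(px, py)
    if key in occupied:
        return False
    occupied.add(key)
    return True
```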

[+] Aeolun|5 years ago|reply
I’m not sure why we can’t map a position in a big grid to a position in a small grid by doing:

X = Math.floor(bigX / bigXMax * smallXMax)

Is there some magic here I’m not seeing?

[+] mattmanser|5 years ago|reply
Or, if the grids never change, just store the data of the parent grid with the child grid.

We haven't got the whole picture, but you don't need CS training to know that a loop within a loop within a loop within a loop is a bad idea.

[+] rocqua|5 years ago|reply
Grid doesn't need to be regular
[+] thamer|5 years ago|reply
Recently I got a speedup that was literally 6-7 orders of magnitude by changing a few characters in a regex, which made it match its input in linear time instead of O(k^N) due to a crazy backtrack.

The speedup I measured could have been even greater if longer strings were being matched, but in practice the longest string I tested it with took over 14 hours to match before the change, and ~2 milliseconds after the change. This alone is a ~25 million times speedup.

It was such a fun problem to solve, and by far the largest speedup I had ever seen. It was also the first time I encountered this issue outside of my compilers class in college, but the way this regex was constructed made the issue pretty obvious.
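The comment doesn't say what the actual regex was, but the classic shape of the bug, and of the few-character fix, looks like this (illustrative, not the poster's pattern):

```python
import re

# Nested quantifiers like (a+)+ force the engine to try exponentially many
# ways of splitting the input before it can report a failed match.
slow = re.compile(r"^(a+)+b$")

# Removing the redundant nesting matches the same strings in linear time.
fast = re.compile(r"^a+b$")

assert slow.match("aaab") and fast.match("aaab")  # same language on benign input
assert fast.match("a" * 40) is None               # fails instantly
# slow.match("a" * 40) would grind through roughly 2^40 backtracking steps
```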

[+] vecter|5 years ago|reply
What was the actual change?
[+] cestith|5 years ago|reply
Hacking through this and thinking himself so superior because he understands CS just makes the article feel pompous and self-aggrandizing. It's highlighting a personal win with notes about the shorter runtime being a team win.

CS is a wonderfully useful, deep and broad field. There's nothing so special about some simple Big O complexity that he couldn't have taught every developer in the department this rather than bragging and upstaging the person who wrote a correct but slow implementation. Patting all of CS on the proverbial back for this smells of gatekeeping.

[+] lanecwagner|5 years ago|reply
Trying to get Bootcamp grads to learn CS is like pulling teeth. The worst thing I hear as an employer is, "I dropped out of my CS degree because the last couple years of classes weren't applicable to daily programming".

I've started Qvault to try to address the problem... We'll see where it goes.

https://app.qvault.io

[+] moron4hire|5 years ago|reply
My senior year of CS classes may not have been applicable to "daily programming", but are certainly applicable to "monthly programming".

There are just jobs I never would have gotten or kept if I wasn't able to solve complex problems, on my own, the few times they come up.

This is something that is so hard to get across to boot camp grads. They complain about how hard it is to get a job, how hard it is to progress at the jobs they do get, how hard it is to write basic programs on their own, how hard it is to even figure out what is important to study. But then, if I suggest that constantly focusing on shallow, broad study of the JavaScript framework du jour might not be sustainable and that computer science might actually be relevant to software development, suddenly I'm the asshole for calling into question the teaching methods of the boot camp system.

For me, it's also a quality of life issue. I just can't imagine being satisfied with a career where I'm just a glue coder. I want to make new things. If I were working in the automotive industry, I wouldn't want to be on the assembly line, I'd want to be in the engineering department. And most of the people I talk to want to make new things, too. They just don't seem to want to hear that "building something from the ground up" requires "understanding fundamentals".

[+] noisy_boy|5 years ago|reply
I remember, many years ago, coming upon a giant shell script doing some data processing (the data not even that large, maybe a few GBs) that took close to 25 minutes, mainly due to unnecessary I/O from tons of temp-file creation - written by "developers". I rewrote it in Perl and it completed in less than a minute (and it wasn't even optimized, because I was a non-CS newbie who was just learning Perl). I even set up a cronjob to compare the output from both and email it to the concerned folks. It was never adopted because I was new and Perl was considered "too-risky".
[+] ignoramous|5 years ago|reply
> It was never adopted because I was new and Perl was considered "too-risky".

What's not to love about bureaucracy?

[+] dumbfounder|5 years ago|reply
Computer Science for the... loss?

I did not study computer science in any substantial way, and I hardly consider myself a computer scientist, but I do have a nose for algorithms despite not really thinking about them in a structured way. I think a lot of people get wrapped up in the notation without realizing they can approach the problem in a completely different way. Case in point: a professor at a university in the US created a part-of-speech tagger and released the source in the early 2000's. I wanted to leverage this tagger to create a topic-oriented search engine, but it was WAY too slow. It took about 10 seconds to tag a document (back in 2003, on decent hardware), and since I needed to tag hundreds of millions of documents, even with about 15 servers at my disposal this was not going to work. So I dug into the algorithm and realized it was testing all of its conditions all of the time to figure out which rule would apply. I redesigned it to index the rules and only apply them if it was possible for a rule to have some effect. Or something like that; the details are now fuzzy. The speedup was roughly 1000x, and it made the tagger usable at scale.

Plot twist: I took a computer science class (as a Mech E major) taught by that same professor years earlier and failed it. That class was Data Structures.

[+] Al-Khwarizmi|5 years ago|reply
The tagger's speed was probably enough for what the professor intended at the time. They probably knew how to optimize it, but didn't because they had no reason to.

I say this as a professor myself, because we do this all the time in my team: we create some software system (also in NLP, by the way), we do experiments to show that it improves whatever metric we are interested in or that it teaches us something interesting, we publish a paper, make the code available, and then move on to the next paper. That's what we are typically incentivized to do. The time spent optimizing a system that we are not going to take to production anyway is time not spent doing more research. Which doesn't mean that we don't know how to optimize.

[+] jerf|5 years ago|reply
There is nothing a computer science course teaches, or indeed, any university course really, that can't be learned by someone who didn't go through that course. Undergrad computer science is indeed one of the easiest things to self-teach because of the ready availability of the "lab materials", i.e., computers are cheap and abundant now.

But a structured program will take you through these things, ensure that you understand them, be available to answer questions in a manner other than "bang your head on the problem for weeks" or "just give up", and provide a structured progression through to more advanced topics.

I think it's important to keep both sides of this in mind; the universities do not have a monopoly on the knowledge and arguably it's easier than ever to pick up yourself, but that doesn't mean the structured experience is useless either. I've worked with some very good self-taught people, but they all have also tended to have gaping holes in their education relative to a good computer science education. (I agree with those people that they are much better than some people who end up with bad computer science educations. Unfortunately, if one optimizes one's computer science education for "the easiest 4.0s", one can end up with very similar gaping holes in one's formal education!)

[+] lmilcin|5 years ago|reply
So, one of the applications I have worked on was an embedded credit card terminal app that needed a transactional database. Since I could not find a product that would fit all the requirements, I decided to write one.

Now, you can imagine smart algorithms, trees, hashes...

Nothing of that sort. The database was written as an append-only transactional log. To retrieve data, the entire file was scanned for the initial record and all its updates. No indexes. Some algorithms were plainly O(n^3).

Yeah, I got into many heated discussions with "computer scientists" on the team. Yet for all their might, they were not able to improve on the program, because they forgot that algorithmic complexity is only one factor in performance.

Because I used extremely simple algorithms (like linear search), the operations were very simple and progressed quite fast on a CPU that was good at prefetching from storage.

The total amount of memory available was just 0.5 MB, meaning that the "super inefficient" scans still completed within the user's perception threshold.

While O(n^3) was used for some operations, the application state machine limited the actual number of steps. Most transactions follow the same basic pattern of execution, so I don't care what happens when somebody does 500 updates to a transaction when I can determine that there will be at most 5 state changes.

There were other considerations for the program. For example, it had to use very few instructions (we were very short on space) and it had to work in constant memory (to be able to statically calculate the necessary stack space, for example).
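The design described above, an append-only log with last-write-wins linear scans, can be sketched like this (in-memory and illustrative; the real terminal persisted to storage, and the names are mine):

```python
# Append-only transactional log: writes only ever append; reads scan the
# whole log and keep the last matching record (the latest update wins).
class AppendLog:
    def __init__(self):
        self.records = []          # stands in for the on-disk log file

    def put(self, key, value):
        self.records.append((key, value))   # never rewrite in place

    def get(self, key):
        found = None
        for k, v in self.records:  # O(n) linear scan, no indexes
            if k == key:
                found = v
        return found
```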

[+] pc86|5 years ago|reply
I'm biased: I studied Poli Sci, I'd like to think I'm a pretty successful software architect and senior developer, and for a long time I had a big chip on my shoulder because I didn't study Comp Sci (and barely graduated college at all, if we're being honest).

But your story shows the gulf that exists between Computer Science and software development. You needed to actually use a piece of software constrained by business requirements (not spinning up 250 servers), and when it didn't do what you needed it to, you changed it. The fact that you failed a data structures class didn't prevent you from refactoring and improving the code. The fact that that professor wrote code that performs poorly on good hardware doesn't negate the fact that there was probably some novel CS implementation in there somewhere.

I know in the past I've gotten caught up on CS profs having terrible websites, or programmers not understanding the intricacies of bubble sorting, but I think it helps to keep in mind that they are two different disciplines that inform each other, and the gap seems to be getting wider over time.

[+] amw-zero|5 years ago|reply
I don't understand - how is this a loss for CS? You used algorithm analysis to identify an inefficiency and then removed it. That is what computer science is. CS is not its notation - it is the act of doing what you just described. The notation is just supposed to help.

I don't understand the reduction of all of math and CS to complaints about people who overindulge in notation. I've been reading this recently, and it affirms how important and ubiquitous mathematical thinking is everywhere in life: https://www.gatesnotes.com/Books/How-Not-to-be-Wrong.

[+] m12k|5 years ago|reply
Computer Science is just academic/theoretical programming. While having a relevant education usually beats not having one, real-world experience usually beats theoretical knowledge when it comes to achieving real-world gains. E.g. if I want something computed as quickly as possible I'm probably better off asking an average game engine programmer than an average CS professor, since they'll optimize for cache efficiency and branch prediction rather than just big O notation.
[+] PaulHoule|5 years ago|reply
One time I had a famous computer science professor mail me a C program (maybe 200 lines of code) that would compile OK but would crash when I would try to run it.

I set a breakpoint at the beginning of the main() method and it crashed before it even got to main().

That had me scratching my head for a minute, until I realized that the program was statically initializing a 2^32-byte array. The prof had a 64-bit computer (probably a DEC Alpha) where you could do that, but it didn't work on my 32-bit x86 machine.

It turned out that the program never used the array it allocated, so deleting one line from the code got it to run.

Most CS profs never have to really finish the systems they work on, so they lack many of the skills and attitudes of the professional programmer. Some of them are artisans with code, but mostly they make a living writing papers and teaching classes, and the code is tertiary.

[+] cabaalis|5 years ago|reply
There's a happy medium between just getting a job done and applying theorems/big-O/fancy data structures all the time. Probably what you came across and subsequently optimized was an implementation of the former.

Fun fact: just getting the job done works out well most of the time; technical debt is introduced, but as in your example, it serves its purpose.

[+] arcticbull|5 years ago|reply
Data structures is a weird one, it's basically a memorization class. It's the kind of thing you just have to do, a bunch of times, then it clicks. That was my experience, anyways. It's largely only relevant when interviewing, and I always cram for that like I did in college, by implementing a whole ton of data structures the week before.
[+] ricardo81|5 years ago|reply
Brill tagger, perchance? I had tinkered with its code at some point; I don't think I realised a 1000x speedup, though.
[+] AshamedCaptain|5 years ago|reply
I used to know a professor who switched from the hardware department to the software department. His reason? With hardware I can reduce runtime by a factor of 10 or 100. With software I can reduce it from n to log(n).
[+] nightcracker|5 years ago|reply
If your grid is well-behaved enough you can go a lot faster still using locality sensitive hashing.
[+] innagadadavida|5 years ago|reply
> global grid is 50x25 and the local grid is 1000x500. For each grid cell in the local grid, we want to know to which grid cell in the global grid it corresponds

Isn’t this just a simple quantization? (Lat/20,Lon/20) should do it?

[+] salmon|5 years ago|reply
I find optimization porn like this to be so satisfying.
[+] Frost1x|5 years ago|reply
The only issue I have with optimization porn is that it often handwaves away the time and thought that went into finding clever optimizations. In some cases it's clear how a problem can be better optimized. In many cases, it's not so clear. It takes someone experienced in development, an understanding of software and computational complexity, and the insight to see optimization opportunities or have used them before.

I also live in the scientific computing world the author inhabits, and many other scientists don't appreciate the complexity of the optimizations the author hand-waves away. It's easy, after the fact, to say "well, obviously, look at this..." and show the runtime improvements; it's another story to explain how you arrived there: the time spent thinking about the problem at dinner and before bed, while exercising or having coffee, etc.

My biggest gripe with developers in this regard is how clever people like to act. They face a challenging problem and often spend a significant amount of time on it, then pretend it happened over lunch. That does happen, but in my experience not regularly, and not as often as many like to portray. You often iterate through several other optimization approaches or ideas that fail or are discarded.

It's similar to seeing obnoxiously compressed mathematical notation in a presentation, where the author jumps through something they spent years on as if it were trivial. If it were trivial, it wouldn't have taken years; and although a solution, once presented and checked, is much clearer than it was before, it's by no means obvious. This gives other professionals irrational expectations about resource allotment (in terms of your paid time) and makes your life more difficult than it needs to be.

Imagine if, after spending years developing special relativity, Einstein had skipped it all, hand-waved, and said, "well, clearly E=mc^2..." No: not clearly at all. That's amazing insight.

[+] CyberDildonics|5 years ago|reply
Step 1: start with something grossly, ridiculously inefficient.

Step 2: Brag about your speed up

[+] hnaccy|5 years ago|reply
I was lucky enough to come upon a similar speedup in some heavily used code at my first job: a multiple-orders-of-magnitude speedup from applying some data structure knowledge.

I feel like baseline algorithm/data-structure knowledge is rising, so these wins will disappear; even if you don't know the exact solution, your intuition will tell you "there's no way this needs to be so slow", and then you do some research/thinking.