top | item 9100107

Proving that Android’s, Java’s and Python’s sorting algorithm is broken

608 points | amund | 11 years ago | envisage-project.eu | reply

140 comments

[+] bsdetector|11 years ago|reply
Their corrected version:

    if (n > 0 && runLen[n-1] <= runLen[n] + runLen[n+1]  
    || n-1 > 0 && runLen[n-2] <= runLen[n] + runLen[n-1])
In the first clause they add earlier to later runLen elements, but in the second they add later to earlier elements. Switching the order just makes the expression harder to understand, like reusing variables within a scope. Addition is commutative and there's nothing technically wrong with it, but this construction makes it appear like there may be something special about element n when there's not.

The programmer also has to do mental arithmetic to check the bounds. Bounds checks can be written so that the largest index subtracted is also the value tested:

    if (n >= 1 && runLen[n-1] <= runLen[n] + runLen[n+1]  
     || n >= 2 && runLen[n-2] <= runLen[n-1] + runLen[n])
This makes it easier to see that the bounds are correct. Formal methods found an important bug that would not have been found otherwise, but a lot of lesser bugs can be prevented just by writing clear and consistent code.
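The two forms really are equivalent, and that's easy to confirm by brute force. A minimal sketch (function names are illustrative, not the JDK's):

```python
import itertools

def original(runLen, n):
    # Original clause order: the second test adds later to earlier elements
    return (n > 0 and runLen[n-1] <= runLen[n] + runLen[n+1]
            or n - 1 > 0 and runLen[n-2] <= runLen[n] + runLen[n-1])

def rewritten(runLen, n):
    # Consistent form: the largest subtracted index is also the bound tested
    return (n >= 1 and runLen[n-1] <= runLen[n] + runLen[n+1]
            or n >= 2 and runLen[n-2] <= runLen[n-1] + runLen[n])

# Exhaustively compare on all small arrays with values 1..4
for size in range(3, 6):
    for runLen in itertools.product(range(1, 5), repeat=size):
        for n in range(size - 1):          # n+1 must stay in bounds
            assert original(runLen, n) == rewritten(runLen, n)
```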
[+] mkesper|11 years ago|reply
From the article: The reaction of the Java developer community to our report is somewhat disappointing: instead of using our fixed (and verified!) version of mergeCollapse(), they opted to increase the allocated runLen “sufficiently”. As we showed, this is not necessary. In consequence, whoever uses java.util.Collections.sort() is forced to over-allocate space. Given the astronomical number of program runs that such a central routine is used in, this leads to a considerable waste of energy.
[+] tveita|11 years ago|reply
It looks like the Python version that's good for 2^49 items uses 1360 bytes for this array, allocated on the stack.

I wouldn't worry about an extra couple of bytes nearly as much as I would worry about changing the behaviour of a function used in an "astronomical number of program runs", so this looks like a pretty reasonable and conservative choice, at least for an immediate patch.

[+] pron|11 years ago|reply
They fixed it. The question of whether or not it was the "proper" fix (out of the two proposed solutions) is complicated. The maximum working array length was changed from 40 (ints) to 49. The difference in this case means that even an additional cache line won't be necessary, and if it is -- only for arrays 120K long. This may have been judged to be better than adding more branches to each iteration (and the code lengthened) which will need to run no matter the input size.

My guess is that they ran a JMH benchmark with each proposed fix, and picked the better one. In either case, I doubt it makes much difference.

[+] bsdetector|11 years ago|reply
Which costs more, doubling the number of runLength comparisons or allocating more memory? Without knowing that answer it's hard to say which is the better solution (assuming more memory actually fixes the bug).
[+] muyuu|11 years ago|reply
Really symptomatic of the way the JDC works.
[+] xamuel|11 years ago|reply

[deleted]

[+] thomasahle|11 years ago|reply
This is actually really cool. They tried to verify Timsort as implemented in Python and Java using formal methods. When it didn't seem to work, they discovered that there was actually a missing case in both implementations, which could lead to array out of bounds exceptions.

Really shines as an example of how important proof is in computer science.

[+] stingraycharles|11 years ago|reply
In the Haskell community there is a tool called QuickCheck [1]: it is able to generate inputs based on preconditions, and you provide a function to verify the postconditions.

I am not sure why these kind of testing methods aren't used more often in other communities, since it makes it really easy to catch corner cases; in the Haskell world, at least, using QuickCheck is somewhat pervasive.

[1] https://wiki.haskell.org/Introduction_to_QuickCheck2
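For readers outside Haskell, the idea can be sketched in a few lines of plain Python: generate random inputs, then check the postconditions of sorting. This is a toy version of the QuickCheck approach, not the real library, and `check_sort_properties` is an illustrative name:

```python
import random
from collections import Counter

def check_sort_properties(trials=1000):
    """Toy property-based test: random inputs, checked postconditions."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 50))]
        ys = sorted(xs)
        # Postcondition 1: output is ordered
        assert all(a <= b for a, b in zip(ys, ys[1:]))
        # Postcondition 2: output is a permutation of the input
        assert Counter(ys) == Counter(xs)

check_sort_properties()
```

The real QuickCheck additionally shrinks failing inputs to minimal counterexamples, which this sketch doesn't attempt.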

EDIT: I have never done this before, but could anyone explain why I am being downvoted? I wasn't making the claim that this is a substitute for a formal proof, I was merely adding this information to the discussion since it seemed relevant.

[+] bglazer|11 years ago|reply
Has anyone used the KeY project in industry?

I'd certainly prefer proofs over unit tests. However, I don't understand formal proof systems well enough to know whether they would work for your typical "app" that makes RPC calls, makes DB changes, and generally has lots of moving parts and statefulness.

[+] Animats|11 years ago|reply
Nice. As usual, entry and exit conditions aren't that hard to write; it's loop invariants that are hard.

  /*@ loop_invariant
    @  (\forall int i; 0<=i && i<stackSize-4;
    @             runLen[i] > runLen[i+1] + runLen[i+2])
    @  && runLen[stackSize-4] > runLen[stackSize-3]
    @*/
It's surprising how close their notation is to our Pascal-F verifier from 30 years ago.[1] Formal verification went away in the 1980s because of the dominance of C, where the language doesn't know how big anything is. There were also a lot of diversions into exotic logic systems (I used to refer to this as the "logic of the month club"). The Key system is back to plain old first-order predicate calculus, which is where program verification started in the 1970s.

For the invariant, you have to prove three theorems: 1) that the invariant is true the first time the loop is executed, given the entry conditions, 2) that the invariant is true for each iteration after the first if it was true on the previous iteration, and 3) that the exit condition is true given that the invariant is true on the last iteration. You also have to prove loop termination, which you do by showing that a nonnegative integer gets smaller on each iteration. (That, by the way, is how the halting problem is dealt with in practice.) #2 is usually the hardest, because it requires an inductive proof. The others can usually be handled by a simple prover. There's a complete decision procedure by Oppen and Nelson for theorems which contain only integer (really rational) addition, subtraction, multiplication by constants, inequalities, subscripts, and structures. For those, you're guaranteed a proof or a counterexample. But when you have an internal quantifier (the "forall int i") above, proof gets harder. Provers are better now, though.
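The three obligations can be illustrated on a toy loop. A sketch in Python, with runtime asserts standing in for the prover (summing 0..n-1, invariant `total == i*(i-1)//2`):

```python
def sum_to(n):
    """Sum 0..n-1, with the loop-invariant proof obligations as asserts."""
    assert n >= 0                         # entry condition
    total, i = 0, 0
    assert total == i * (i - 1) // 2      # (1) invariant true on loop entry
    while i < n:
        total += i
        i += 1
        assert total == i * (i - 1) // 2  # (2) invariant preserved each iteration
        # termination: the nonnegative integer n - i decreases every iteration
    assert i == n and total == n * (n - 1) // 2  # (3) exit condition follows
    return total
```

A prover discharges obligation (2) by induction over the loop body instead of by running it, which is why it is usually the hard one.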

A big practical problem with verification systems is that they usually require a lot of annotation. Somebody has to write all those entry and exit conditions, and it's usually not the original programmer. A practical system has to automate as much of that as possible. In the example shown, someone had to tell the system that a function was "pure" (no side effects, no inputs other than the function arguments). That could be detected automatically. The tools have to make the process much, much easier. Most verification is done by people into theory, not shipping products.

[1] http://www.animats.com/papers/verifier/verifiermanual.pdf

[+] inglor|11 years ago|reply
This is an excellent use case of formal verification. I can definitely see the merit in using tools like this in my code.

They mention KeY http://www.key-project.org/ . Is anyone using this here? Are there any good resources on it except for the official site (and this blog post)?

[+] iso8859-1|11 years ago|reply
It is in use for undergraduate courses at the universities that develop it, and I have used it too. It ships with small training problems. What would you like to know? There are many documents on JML. The prover works for you: you just left-click logic expressions in KeY and can prove large parts by point-and-click. It can also finish a proof for you if it's trivial. Often, you only need to specify the loop invariants.

I'd recommend opening up KeY, loading one of the trivial examples first: Contraposition. This will be a quick reminder on the logic concepts like implication, but you pretty much cannot screw it up. Try to understand why the proof tree branches when you choose certain steps.

Afterwards, try something else from "Getting started", like the examples the proof searcher can prove automatically (for example SumAndMax): explore the proof tree, then try it yourself from scratch. The automatic proofs are not always pretty, so it's better to do the manual work first. KeY will only let you take valid proof steps, so you learn quickly how proofs work.

[+] amund|11 years ago|reply
Hi, I will ask the authors of the blog post and corresponding academic paper about additional KeY resources and follow-up.
[+] ezyang|11 years ago|reply
Everyone here is talking about Timsort, but you should also check out the materials they've published about the KeY project, which they used to carry out this verification. http://www.key-project.org/~key/eclipse/SED/index.html describes their "symbolic execution debugger", which lets you debug any Java code even if you don't know all the inputs (by simply taking the input as a symbolic value). The screencast is very accessible.
[+] ericfrederich|11 years ago|reply
Where can I get that PS1 from the shell in the video? It appears to have a green check mark if the previous command returned 0 otherwise some red symbol. Looks cool
[+] tenfingers|11 years ago|reply
This is a good example of why formal verification is incredibly useful. I tried to invest some time in learning some of the proof verification languages and tools, but so far I wasn't too successful.

It seems that to formulate a proof, I always have to rewrite the algorithm/problem in the tool's language first, which is often not easy. I could see myself making mistakes writing the proof just as easily as I do when programming.

Proof validation is also tricky. Coq isn't fully automatic, as I initially expected it to be. I actually used "prover9", which is first-order only but does automatic validation. I guess Coq is really useful when you need to understand the proof and interactive validation can guide you, whereas prover9 could help with automation.

The thing is, it's still too much work, even for seemingly simple algorithms, to write a proof in either system in order to improve on the current situation of unit testing (that is: if I wanted to get something with more intrinsic value than a test case).

Formally verified languages are nice, but for a gazillion of reasons you still need to verify what's running currently.

[+] imaginenore|11 years ago|reply
And I always thought sort functions were tested with millions of randomly generated sequences, with the results compared against other implementations known to be good.
[+] dagw|11 years ago|reply
Sure, but what is the probability that you generate exactly the type of sequence needed to trigger this bug?
[+] thomasahle|11 years ago|reply
Very often random sequences are very different from sequences encountered in the wild.
[+] TillE|11 years ago|reply
As long as your < operator works, you don't really need another sort implementation to compare with.

Also, it seems like the bug is only truly present in the Java version due to a slightly different implementation, even though the original Python (and the "fixed" Java) is technically incorrect.

[+] nichochar|11 years ago|reply
This is bad-ass. Thanks for taking the time to investigate something that so many overlook yet use every day (myself included).
[+] andrewstuart2|11 years ago|reply
I'm surprised that modern implementations don't use something like quicksort, since it has linear space requirements (sorts in-place) and good constants on average for its n*log2(n) running time.
[+] Arnt|11 years ago|reply
Quicksort has poor worst-case behaviour, so it's easy to carry out a DoS attack on network services that use it. Language maintainers don't like restricting their library to input from friendly users, at least not when there are other algorithms that work reasonably with unfriendly input.
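The worst case is easy to provoke: with a fixed (say, first-element) pivot, already-sorted input degrades quicksort to quadratic comparisons. A minimal sketch, counting one logical comparison per element against the pivot:

```python
def quicksort(xs, counter):
    """Textbook quicksort with a first-element pivot; counter[0] tallies
    one pivot comparison per partitioned element."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    counter[0] += len(rest)
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return quicksort(left, counter) + [pivot] + quicksort(right, counter)

c_random, c_sorted = [0], [0]
quicksort([3, 1, 4, 1, 5, 9, 2, 6], c_random)
quicksort(list(range(200)), c_sorted)   # sorted input: ~n^2/2 comparisons
```

On sorted input of length n, every partition splits (n-1)/0, giving 199 + 198 + ... + 1 = 19900 comparisons for n = 200, versus roughly n log n for a good pivot.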
[+] masklinn|11 years ago|reply
Timsort is stable, has an O(n) best case and O(n log n) worst case and has specific optimisations for e.g. partially-sorted sequences.

It can also be implemented in-place to reduce space requirements from O(n) to O(log n).

> since it has linear space requirements (sorts in-place)

Quicksort has an O(log n) worst-case space requirement (provided it recurses into the smaller partition first).

[+] JanKanis|11 years ago|reply
Timsort is especially good at reducing the number of comparisons needed by taking advantage of already ordered subsequences in the input. This advantage comes at a cost of more overhead compared to e.g. quicksort. Therefore Timsort is faster if the comparison operations are expensive, quicksort if they are cheap. Java takes advantage of this by using Timsort to sort arrays of objects with a user specified comparison function, and quicksort to sort arrays of primitives. (Or at least it used to.) In Python everything including comparisons is relatively slow so they just use Timsort all the time.
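The comparison-saving is easy to observe from Python: wrap elements in a class that counts `__lt__` calls. On already-sorted input, CPython's Timsort detects a single run and performs exactly n - 1 comparisons:

```python
class Counted:
    """Wrapper whose __lt__ tallies every comparison the sort makes."""
    calls = 0

    def __init__(self, v):
        self.v = v

    def __lt__(self, other):
        Counted.calls += 1
        return self.v < other.v

n = 1000
Counted.calls = 0
sorted(Counted(i) for i in range(n))   # already-sorted input: one long run
```

After this runs, `Counted.calls` is n - 1: one comparison per adjacent pair while scanning the run, and nothing to merge afterwards.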
[+] geophile|11 years ago|reply
Many years ago, I ran into an intermittent performance problem caused by a self-propagating worst case in our quicksort implementation. If you sorted an otherwise-sorted array with one item out of place, you would get an (n-1)/1 split, and the pattern would repeat itself on the larger side of the split.

We fixed this by picking a pseudo-random pivot, I think.

[+] thaumasiotes|11 years ago|reply
> its n*log2(n) running time

If you're going to talk about asymptotic running time, why specify log_2? As soon as you elide a multiplicative constant, all logs are equivalent.
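Concretely, log₂ n = ln n / ln 2, so any two log bases differ only by a constant factor that big-O notation absorbs. A quick check:

```python
import math

# log base 2 is the natural log scaled by the constant 1 / ln 2
for n in (10, 1000, 10**6):
    assert math.isclose(math.log2(n), math.log(n) / math.log(2))
```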

[+] xxs|11 years ago|reply
The JDK uses quicksort (a variation of it, rather) for sorting primitives. (Not being stable doesn't matter for primitives.)
[+] irascible|11 years ago|reply
TimSort is supposed to have better performance on real world datasets.
[+] wheaties|11 years ago|reply
Wish they'd put that up on Github or Bitbucket. There are so many things to be learned from that. I don't want to just download things or just use a tool (although it's pretty awesome that they made the tool available.)
[+] noahl|11 years ago|reply
Just a nitpick, but the article makes a mathematical mistake. In the last paragraph of section 1.2, it says

    For performance reasons, it is crucial to allocate as
    little memory as possible for runLen, but still enough to
    store all the runs.  *If the invariant is satisfied by all
    runs, the length of each run grows exponentially (even faster
    than fibonacci: the length of the current run must be
    strictly bigger than the sum of the next two runs lengths).*
However, fibonacci growth is strictly faster than exponential. In fact, this is why n * log(n) is the lower bound on the number of comparisons a comparison-based sorting algorithm must use: because log(n!) is approximately n * log(n).