Bug Prediction at Google

[+] nickolai|14 years ago|reply

So in summary, their system highlights the files that had 'a lot of bugs lately'. While useful, this is more of a trend analysis and extrapolation rather than actual prediction. The title made me hope of something that would highlight some potential bug nests from static code analysis or the like. Interseting, but a little disappointing.

[+] Lewisham|14 years ago|reply

Hi. I'm Chris, the author of the blog post/work.

You're not wrong, although bug prediction is a very nebulous term in the academic literature which encompasses a pretty broad range of techniques. The term is used to properly ground the work in what's come before.

It's perhaps useful to talk a little bit about how I got where I ended up, which wasn't discussed much in the post for length reasons. I originally spent a lot of time trying to implement FixCache, a pretty complicated algorithm that looks at things such as spacial locality of bugs (bugs appear close to each other), temporal locality of bugs (bugs are introduced at the same time) and files that change together (bugs will cross cut these). It then puts it all in a LIFO cache and flags files that way. This most definitely fits in the "bug prediction" line of work, and a number of follow-up papers independently verified its success. but I was struggling deploying FixCache. This was largely because the open source code that had been written beforehand at the university lab [1] wasn't designed for MapReduce, and also because I was becoming worried that it was very difficult to understand how it flagged files, which could lead to the tool being written off as outputting false positives and then ignored.

About halfway through the internship, a pre-print of the Rahman et al. [2] paper was handed to me, which indicated that maybe everything FixCache was doing could be massively simplified and get largely the same result. We trialled both algorithms using user studies, and found that developers thought the output of the Rahman algorithm better matched their intuition as to where the hotspots in their code base was, so we went with that.

So, after this very roundabout discussion, we reach the question of "If Rahman is a subset of the FixCache algorithm, and FixCache is a bug prediction algorithm, is Rahman a bug prediction algorithm, too?" and I'll leave that for others to judge, but that's why it's been described as so :)

[1] https://github.com/SoftwareIntrospectionLab/FixCache [2] http://macbeth.cs.ucdavis.edu/fse2011.pdf

[+] esrauch|14 years ago|reply

Without considering the number of submits or file length you will end up with files that are simply longer and edited more getting undue high scores. A single 100 line file will be scored to be twice as buggy as the exact same code split into two 50 line files.

Considering source file lengths roughly follow a Zipfian type distribution, it seems like at least the top 5% length files would get flagged regardless unless the bug distribution is extremely uneven.

[+] Lewisham|14 years ago|reply

Hi, I'm Chris, author of the blog post/work.

You're on the right lines, and I spent some time thinking about this. There is a couple of things that led me not to concern myself with file length. Before I start, I would note that a file just being big and getting changed isn't a problem: it has to be changed and have a bug ticket with the "bug" classification put against it.

1. As others have stated, the research does strongly point towards LOC as being part of the metric that should be looked at. [1] is a paper that discusses LOC, but it's behind the IEEE pay wall and I wasn't able to find a free version. Sorry. I'm on holiday and a bit pressed for time this morning, but I can root around for more papers if you would like. Pretty much all the models we have for bug prediction will invariably throw LOC into the pot (although as they often do PCA too, one could argue that the LOC going down could be one of the components ;) ).

2. Google, by and large, is an OOP company. The intuitive reason for why LOC leads to defects is that the file is no longer meeting the principle of single responsibility. I would hypothesize that once your file has gone past this, defects are going to start to snowball as your file is on the way to becoming what I called a Gateway file, where large amounts of functionality pass in and out of the file, even if it only does simple stuff. So not only is it going to attract defects because it's long, but it might well attract even more defects because when other functionality is modified, that file has to as well. The added problem is these files resistant to refactoring because of the possibility of breaking so much functionality.

A Very Big Philosophical Argument is whether you would put configuration files under this classification, as config files affect all sorts of functionality. Personally, I would, but most of the engineers disagreed with me, so they were left out.

3. In a very unscientific way, I did experiment with all sorts of modifiers to try and even out files that may have been "unfairly" flagged due to some metric like LOC or number of commits or whatever. I eventually abandoned this, not just because I found the results largely unsatisfying, but I realized I really was doing things in an unscientific manner without any real justification for why I was doing it. Without a sound justification for evening it out, which LOC doesn't have, I decided not to do it.

Hope this helps a little bit. I very much see your point, and a number of engineers brought this up during user studies, but those big files that were getting flagged really were problems.

[1] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4297...

[+] tesseract|14 years ago|reply

I've seen a few studies (sorry, no citations handy, if I have time I'll revisit this comment) which concluded that above some fairly low floor, bugginess is strongly positively correlated with file length.

Which is to say, always flagging the top 5% longest files as being among the buggiest has a good chance of being the right thing to do.

[+] nl|14 years ago|reply

A single 100 line file will be scored to be twice as buggy as the exact same code split into two 50 line files.

I suspect that isn't the case. Subjectively, longer files are harder to understand and in languages like Java (one file = one class, ignore inner classes etc) is likely to represent more complicated and tightly coupled code.

If the same file keeps appearing, then split it in half and see what happens...

[+] po|14 years ago|reply

I was thinking the same thing. I was expecting them to analyze the diffs and keep a cache of code blocks that are changing in each bugfix. Keeping track of bugs at the class/function/method level would seem to be more powerful.

[+] JesseAldridge|14 years ago|reply

You could cut up every file into chunks of 10 or so lines and then run those chunks through the algorithm. Seems like that should yield better results. You ought to be able to pull line numbers from the commits.

[+] jdp23|14 years ago|reply

If you're into defect prediction, it's worth checking out the work the Empirical Software Engineering group at Microsoft Research has done over the years -- http://research.microsoft.com/en-us/groups/ese/publications....

[+] igrigorik|14 years ago|reply

https://github.com/igrigorik/bugspots - for anyone that's curious to try it on their own git repo.

(improvements welcome! :-))

[+] Toddward|14 years ago|reply

This is awesome - might roll a Python port if I have some time in the next few days.

[+] silverlake|14 years ago|reply

looks good, but you might move the pattern matching in scanner.rb out to a config file so users can search for other keywords.

[+] Lewisham|14 years ago|reply

This is so awesome, thanks igrigorik :)

[+] mjbellantoni|14 years ago|reply

I came to the comments hoping to find that this existed!

[+] jacques_chester|14 years ago|reply

What I like about Google is that they take uncontroversial "should-dos" -- you should unit test, you should review, you should pay more attention to bug-prone models -- and then build and improve tools that supports those practices.

To me one of the reasons agile so swept our field was because it rode the back of the increasing availability of two genres of toolset: the unit testing framework and the bug/issue/story/feature tracker.

[+] nimrody|14 years ago|reply

Trouble is, lots of nontrivial bugs are not module (or file) specific. The occur due to interaction between modules. Implicit assumptions on input are violated, invalid output is produced and continues propagating in the system.

[Hope this technology will help make Chrome more stable. Do they even consider resource leaks bugs?]

[+] gujk|14 years ago|reply

That doesn't matter so much, because this project is about highlighting delicate code regardless of reason, not assigning blame.

[+] eric-hu|14 years ago|reply

I would gladly use this tool and provide feedback if it were released as something that tacks onto git.

[+] bostonvaulter2|14 years ago|reply

Or even better. On to Gerrit!

[+] hndl|14 years ago|reply

What about files with limited or no history. Cycolmatic Complexity [1] to the rescue?

[1] http://en.wikipedia.org/wiki/Cyclomatic_complexity

[+] unknown|14 years ago|reply

[deleted]

[+] cwills|14 years ago|reply

Prevention is better than cure... So the next step would be to warn the developer before they make the code changes or during. Hooking into the IDE to visually indicate hot spots within the source file being edited on would be neat.

[+] Lewisham|14 years ago|reply

Yes, this is exactly right. We didn't have time in the period of an internship to hook into the IDE (along with all the other tasks, like user studies, algorithm design etc etc), so just went with the warning in source control.

Given more time, I absolutely would have done this. Nicolas Lopez at UC Irvine is thinking about putting tools like this at the IDE level, if you're interested.

[+] piptastic|14 years ago|reply

Unfortunately there were 18 hot spots in the bug prediction code.

[+] tibbon|14 years ago|reply

Also of note, this blog is beautiful in the classic view.

[+] MaysonL|14 years ago|reply

I'd vehemently differ, other than perhaps as a piece of abstract art: the black bar is attention-grabbing to negative effect, and the opening animation is self-indulgent, as is the massive whitespace atop. Also, the width of the page seems rather strange: why so wide, with no content?

[+] hillbilly|14 years ago|reply

I cannot read beyond the first few lines on my Droid. Very disappointing.

[+] wazoox|14 years ago|reply

Well, meh. It doesn't scroll when pressing the spacebar, so this is quite bad from a usability point of view.

[+] jnazario|14 years ago|reply

neat idea. i whipped up a version in python (and for svn, our scm of choice) and plowed through some codebases. i mimic igrigorik's ruby/git version's output (it's very usable).

the code's up here:

http://monkey.org/~jose/blog//viewpage.php?page=bugspot

[+] ashutoshm|14 years ago|reply

http://youtu.be/SzRqd4YeLlM

[+] jayeshsalvi|14 years ago|reply

I'll recommend Nicholas Taleb's "The Black Swan" to the authors of this prediction tool. Statistics cannot be used to "predict" future behavior of a complex system.

[+] nl|14 years ago|reply

Statistics cannot be used to "predict" future behavior of a complex system

They certainly can, and commonly are! Think of weather prediction, Bayesian diagnostic systems, Google's statistical translator and spell checker, etc etc.

Taleb's point is that they fail sometimes - perhaps more often than you think - and so you should be very careful in placing bets on statistical predictions.

In this case the cost of a failed prediction is very low, and a successful prediction is worth quite a lot, so it is a net win for Google.

[+] adanto6840|14 years ago|reply

[deleted]

[+] T_S_|14 years ago|reply

I've got a great bug prediction tool: ghc, the haskell compiler. Whenever I use a value of the wrong type my program won't compile. In my code it is always a bug when that happens. (Some smarter people can write non-buggy code the compiler doesn't like.)

Sometimes I can trick ghc into using the wrong type. Like when I use "String" to mean "File". That's when all the bugs show up as run-time errors. But I should know a string is not a file. And hopefully somebody writes a module that knows what a "File" really is and stops me from writing buggy code if I use it.

[+] srean|14 years ago|reply

I am not surprised that you are being downvoted. In my experience, one may question the superiority of dynamic typing at HN only to the detriment of ones karma. Well, its not HN alone, dynamic languages are certainly popular and quite persuasively championed.

Now I belong to the "had been persuaded before, but now I am not so sure" category. More so after discovering absolutely stupid typos and compile time checkable errors in long running Python code that I had written. It really sucks when your code bails out after 16 hrs of number crunching because of some silly error on your part.

I used to think, "who makes type errors of all things?" it turns out that that idiot is me. I am not claiming that dynamic languages are bad, or that checked languages are always good, but I do benefit from checked code. A better programmer perhaps wouldn't benefit as much.

Rather than implementing a type checking system badly and in an ad-hoc manner, I now lean more towards compile time checked languages, especially for pieces of code that I expect will run for a long time.

I think the sweet spot would be optionally type-checked / type-inferred languages, for example Cython, Common Lisp. Strongly type-checked languages _tend_ to be all or nothing.

EDIT: @nl yeah! absolutely, they arent by any means a bug predicting algo.

[+] Lewisham|14 years ago|reply

Hi, I'm Chris, author of the blog/work.

While I love me some type systems, I worry you're thinking a bit too code-centric.

For example, let's say I describe to you a really hard problem, and you come up with a solution (that type checks!), and then we launch to users and they find a use case we never thought of, and our code suddenly breaks. I am sure this has happened to you. The problem isn't at the code level (that's just the manifestation), but our current understanding of the problem.

IMHO, once you reach a certain level of competency, a sizable minority, if not majority, of bugs that make their way into a source control system (particularly one with code review, like Google's) are not line-by-line problems, but wider problems with actually understanding the task at hand and all that encompasses. When you work on the complexity of problems that Google does, you're tending towards a very large chunk of bugs being members of this class.

However, we don't have good tools (yet) to have a computer properly check the semantic meaning of our code. Bug prediction sits as a sort of baby step. It's the computer making a best-effort guess of where issues will be. The algorithm designed here is saying "Look this file has been a struggle. It's probably going to continue being a struggle, so we need to focus attention when we change anything to it."

[+] jdp23|14 years ago|reply

Defect detection != defect prediction

[+] nl|14 years ago|reply

Type errors are only one type of bug.

[+] JabavuAdams|14 years ago|reply

These are trivial kinds of errors. The big fuckups are invariably more subtle and people / requirements oriented.

When all you've got is a hammer ....

65 comments