top | item 3455673

Why is C faster than Java: git vs JGit

325 points| acqq | 14 years ago |marc.info | reply

102 comments

order
[+] snprbob86|14 years ago|reply
I almost skipped this link; I assumed it was typical borring blog noise. It's not.

This is an insightful post from the git mailing list which shows some of the real limitations that a top tier developer hits when trying to write Java code as fast as neatly optimized C code. Definitely worth reading.

[+] jaylevitt|14 years ago|reply
Yep. The usual "Program X is faster in C than Java" gets a barrage of "That's because you know C better". Shawn is a performance-obsessed Java expert, Eclipse committer and longtime Google coder who works on JGit. If he says Java is slower than C at this, then Java is slower than C at this.

EDIT: but as wcoenen points out, this was written in 2009 and Java 1.7 does a better job with some of this.

[+] kokey|14 years ago|reply
Actually, it also makes me want to get into another project in C again.
[+] pconf|14 years ago|reply
I wonder why none of these benchmarks measure the performance of _optimized_ Java? All of our builds use Proguard optimization and some of the resulting bytecode is noticeably faster.
[+] nickik|14 years ago|reply
Same here. Nothing groundbreaing about Java vs C threw.

Edit: I mean most of the problems (maybe features) of java are known.

[+] ot|14 years ago|reply
All the points are valid but they are peculiar to Java, not to all managed high-level languages. C#/.NET, for example, have unsigned types, value-type arrays and structs, memory mapped files and specialized collections.

As an example, the C# port of Sqlite is sometimes faster than the C version on queries, although updates are slower, despite Sqlite is a highly optimized C library.

EDIT: link http://code.google.com/p/csharp-sqlite/wiki/Benchmarks

[+] kevingadd|14 years ago|reply
It's also worth pointing out that the C# port of SQlite omits using certain C mechanisms (like pointers) in favor of passing copies of byte arrays around. You'd expect this to make it slower, but in many cases, it doesn't! (C# supports pointers, but the port doesn't use them so that it'll work in limited environments like Silverlight)
[+] mdeicaza|14 years ago|reply
Additionally, C# collections do not need to box.

List<int> uses an actual array of integers as its backing store instead of an array of objects that contain boxed copies of the integers.

There is a port of JGit to .NET called NGit, which is what we use for MonoDevelop.

The port is maintained with an automatic tool that converts Java code to C# code, you can find it here:

https://github.com/slluis/ngit

[+] erichocean|14 years ago|reply
Although Sqlite is highly optimized, it's got what amounts to its own VM internally, and doesn't really use C to the fullest.

It's not surprising to me that the C# version of that VM does as well (or better) for this particular codebase.

[+] stcredzero|14 years ago|reply
All the points are valid but they are peculiar to Java, not to all managed high-level languages. C#/.NET, for example, have unsigned types, value-type arrays and structs, memory mapped files and specialized collections.

I heard about related stuff happening while I was working at a Smalltalk vendor. The programmer who was implementing the network cryptography library would just call up the VM engineer and ask for goodies like support for large bit arrays, and he'd get them in for the next VM release. The programmer was able to beat some of RSA data security's (poorly implemented) reference DLL's written in C by 3% with a Smalltalk program.

In the Smalltalk environments, it's easy to implement "primitives" coded in C, just in case you weren't internal to the vendor. With open VMs like Squeak, you can add your own bytecode to support optimizations if you want to.

[+] zokier|14 years ago|reply
Slightly offtopic, but I wonder how much overhead in those benchmarks comes from calling native code from .NET runtime? An interesting data point could be benchmarking equivalent implementation in C or C++, avoiding the overhead of native-managed transition.
[+] zxypoo|14 years ago|reply
This is an old email... there's been many improvements to both JGit, Java/JVM and other areas of interest.

Shawn and I gave a presentation at the Googleplex not so long ago about JGit [1]. In particular, you may be interested in the 'JGit at Google' section.

There are some cases where JGit is faster than CGit, but the benefits of JGit are that it's easy to embed. There are projects like gitblit and other IDEs that use the library. On top of that, you have crazy folks like NGit [2] who cross compile the library using Sharpen so it can be used by the .NET community...

[1] - https://docs.google.com/present/edit?id=0ATM14GNiXaXfZGZkeHp... [2] - https://github.com/slluis/ngit

[+] gelliott|14 years ago|reply
That's really interesting - according to this presentation JGit clone is significantly faster than native git clone (2.3x in the example).

I'd love to hear more about any code changes that lead to this result.

[+] cookiecaper|14 years ago|reply
Note the date: 2009-04-30 18:43:19

Many of us have already read this and it's been submitted to HN several times before. This, of course, does not mean that it's not worth reposting, but interested parties may want to dig up some of the past discussions.

[+] njs12345|14 years ago|reply
I find it kind of interesting that in Haskell, which is arguably even higher level than Java, most of these optimisations are eminently possible..

EDIT: This obviously came across a bit as language fanboyism, so I guess I should mention that the language features that let you do many of them let you shoot yourself in the foot just as easily as you can in C, and you can certainly argue that with a strong FFI you might as well just call into C if you really need that kind of low level performance..

[+] kmm|14 years ago|reply
I've heard the argument before that in need, one can use a FFI to optimize bottlenecks in high-level code, but I've never understood.

Won't using a high-level language incur an omnipresent speed slump? And even if a bottleneck exists, how would using a FFI remedy crucial problems in the language, like the absence of unsigned types or that all types are boxed. The types will have to be unboxed anyway, so whether that happens in foreign code or in the interpreter/JIT code won't matter.

[+] SeanLuke|14 years ago|reply
I build fairly high-performance Java code. And get hit with three major gotchas which prevent it from approaching C code.

- There's no way to do array access without null pointer and index checks each and every time.

- Generics with basic types, and their unfortunate embedding into syntax (like the new for() syntax), are awful. Boxing and unboxing incur a ludicrously high penalty, and generics push coders away from using arrays. Unlike in C++, generics have been the enemy of performance.

- Poor quality collections classes (ArrayList and HashMap are notoriously bad)

Sure there's a few other things like pointer walking etc. in C, and Java's poor floating point, but the big three above are the killers.

[+] jcdavis|14 years ago|reply
> - There's no way to do array access without null pointer and index checks each and every time.

Actually there is. Have you checked out the sun.misc.Unsafe class? Lots of very dangerous gems in there, among them the ability to calculate array offsets and access the array elements directly. (check out arrayBaseOffset + arrayIndexScale + getObject/getLong/etc)

[+] Yrlec|14 years ago|reply
I had a similar experience when I was doing some Galois Field arithmetic in Java. You pay a huge penalty because of the absence of unsigned types. In our case we had to use long instead of int, which is extra costly, since many basic operations in Java return int by default.
[+] buff-a|14 years ago|reply
More specifically, the problem of trying to write a binary compatible java implementation of a neatly optimized solution written in C. So, the program in question is executed, reads a whole bunch of binary data from a whole bunch of different files, does some calculations on that data and then exits.

The questions are, if you had to develop a distributed version control system in java: a) would you solve it the same way, b) would your solution be faster or slower, and c) would it take more or less time to write it and be easier to maintain?

Clearly you would not solve it the same way, for example it might stick around in memory as you worked. Could it then appear faster, from a user's perspective? Possibly. Might it be easier to maintain. Also possible.

Pretty much by definition, if you are writing it in C, a binary compatible solution is not going to run as fast if you port it to java.

I don't think that is a conclusion that has much value.

[+] willvarfar|14 years ago|reply
So why do they write and use jgit at google instead of just git?
[+] durin42|14 years ago|reply
Because cgit is a bunch of binaries that expect to call each other. That makes it harder to abstract out the storage layer, and we don't use vanilla repositories sitting on a filesystem. Things are backed by some other storage abstraction, which isn't always very posix-filesystem like.
[+] unwind|14 years ago|reply
From the project page:

The original goal of JGit/EGit was to provide an Eclipse plugin for working with software using the Git SCM. The Eclipse plugin is still the main goal of many of the developers, but we are open to anyone wanting to interface with other tools, Netbeans, Ant, Maven etc. For those, the JGit part provides a high performance API for working with Git repositories. The main other user of JGit, besides EGit, is Gerrit Code Review, which used by projects such as JGit (ofcourse), EGit (by implication) and Android.

Not sure if that provides a very good motivation, but there you go. :)

[+] aurelianito|14 years ago|reply
As they wrote at the end of the article:

"But, JGit performs reasonably well; well enough that we use internally at Google as a git server."

He is not advocating to never use Java. He is just pointing out some things that may or may not be important when choosing a language to write a program.

[+] masklinn|14 years ago|reply
It's a java library, making it much easier to use from java code than a command-line utility and all the ensuing parsing and munging of string data (which is not java's forte either)
[+] tommorris|14 years ago|reply
There is a few things that require jgit: a while back, I found a really nice library (I'd look it up, but am on 3G on a train) that let you use Amazon S3 as a git remote. The author had just written some bridge code between jgit and an AWS library. Works suprisingly well, you just have to remember to use `jgit push` rather than `git push`.
[+] rayiner|14 years ago|reply
A lot of this is poor API design, and the product of Java's baggage as something that needs to have well-defined safety semantics for internet applications. It is not a necessary constraint of high-level languages that they don't offer the ability to get down to the metal. SBCL, for example, offers a lot of mechanisms for unboxed primitive arrays, unsafe declarations, and these days even SSE intrinsics.
[+] babebridou|14 years ago|reply
I've had the issue using maps with primitive keys. I solved it by isolating the performance critical functionality and not using the Collections framework there, instead writing my own data structure for it (with heavy influence from the hashmap one).

This tends to be my general philosophy, by the way. Reuse code to get something working fast, isolate what really causes bad performances, then solve only those problems by going under the hood. If the performance issues remain, cheat by pretending it doesn't exist, by making sure we're never in a worst case scenario and handling the worst case scenario differently.

In my "IntHashMap" case, the worst case scenario was gathering the keySet. I made sure that I'd only call it when I really really needed it. The rest was "fast enough" once I had removed the underlying Integer Object on the key.

[+] bajsejohannes|14 years ago|reply
> when you do use Java NIO MappedByteBuffer, we still have to copy to a temporary byte[] in order to do any real processing

Does anyone know why this is the case?

[+] cube13|14 years ago|reply
>So. Yes, its practical to build Git in a higher level language, but you just can't get the same performance, or tight memory utilization, that C Git gets. That's what that higher level language abstraction costs you. But, JGit performs reasonably well; well enough that we use internally at Google as a git server.

I think that this is the key takeaway for the entire post.

One of the reasons I generally dislike any of the "X IS BETTER THAN Y" bakeoffs is that performance is now so implementation dependent that these comparisons are pretty much moot. Given that basically any non-trivial implementation can be improved, it's difficult to say that anything is faster, especially when one considers developer skill.

Developers should not be chasing the abstract, absolute best performance. Instead, the language used should be the one that delivers performance that is good enough for their client's needs. If they can get it with something that we're familiar with, that's great. If they need to learn a new tool, that's also good. But it doesn't make much sense to throw away all the knowledge that a developer has about a certain language to chase "better performance" with a different one. Most likely, the first effort implementations on a new language won't be nearly as good as the implementations on the more familiar language.

It's generally true that optimized Java won't ever be as fast as optimized C. But for the vast majority of cases, it doesn't need to. Java's speed is enough for those cases. And in the small minority where it's not sufficient, C is still around.

[+] malkia|14 years ago|reply
I wonder if mercurial gets rewritten in "C" whether there would be any speedup.
[+] dochtman|14 years ago|reply
I'm sure there would be some speedup, the question is whether it would be worth it (and I suppose that can only be adequately be assessed by the developers, who now have to maintain C code instead of Python).

But for some perspective from a former Mercurial developer: lots of the more performance-sensitive code has already been rewritten in C. Rewriting the rest of it would simply be a question of diminishing returns. One thing that would improve is hg's startup time; starting up Python just takes a while, which kind of sucks for command-line programs like VCS clients that tend to have many short-running invocations.

[+] itmag|14 years ago|reply
I remember a post on here recently which said that sometimes a high-level language can be faster than C, because you can convey more of your algorithmic intent and thus the compiler can optimize better for you.

It gave an example where the compiler's knowledge that something is an immutable array means better optimization. Which you can't express in C.

[+] tomandersen|14 years ago|reply
Ahh, getting to the metal. Only in C do I get that feeling. For some reason even C++ just fails on that 'fresh metallic taste' test.

Most code should not be C.