saltysugar | 12 years ago

Kafka is mostly in Scala (which runs on the JVM). Scala provides many attractive features for building scalable, type-safe code. As for performance, Kafka is used at LinkedIn, where it processes hundreds of gigabytes of data, close to a billion messages per day (reported in their paper in 2011), and the engineers claim they're processing terabytes of data a day now.

Not sure on what basis you claim the choice of language to be "questionable", but keep in mind that Scala's type safety and many other features are much more difficult to achieve in C/C++. Cleaner code is sometimes more important than a tiny gain in performance.

Also in terms of scaling, Kafka's design cleverly takes advantage of several things to ensure low latency and high throughput:

* Little random I/O

* Heavy reliance on the OS pagecache for data storage

Performance-wise, Kafka can outshine some message queues that store messages in memory.

Source: http://kafka.apache.org/documentation.html#design

http://research.microsoft.com/en-us/um/people/srikanth/netdb...
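To make the two design points above concrete, here is a minimal sketch (hypothetical class and names, not Kafka's actual code) of an append-only log in Java. Every write goes to the end of the file, so the OS sees purely sequential I/O and keeps the hot pages in its pagecache; no cache lives on the JVM heap.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Append-only log: writes are strictly sequential, so the OS pagecache
// and write-behind do the caching -- no in-heap data structures needed.
public class AppendOnlyLog {
    private final FileChannel channel;

    public AppendOnlyLog(Path file) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Returns the byte offset at which the message was written.
    public long append(byte[] message) throws IOException {
        long offset = channel.position();
        channel.write(ByteBuffer.wrap(message));
        return offset;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("log", ".dat");
        AppendOnlyLog log = new AppendOnlyLog(file);
        long first = log.append("hello".getBytes(StandardCharsets.UTF_8));
        long second = log.append("world".getBytes(StandardCharsets.UTF_8));
        System.out.println(first + " " + second); // offsets grow sequentially
    }
}
```

Because consumers read the same file in offset order, reads hit pages the kernel has already cached or read ahead, which is where the "little random I/O" claim comes from.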


shmerl | 12 years ago

That document acknowledges inherent JVM limitations:

> Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.

So they had to work around that. In my view it's not a tiny issue. I'd say that instead of working around such inherent limitations, it's better not to have them to begin with when building high-performance systems. That was my main point above. Time spent dancing around such problems defeats the purpose of the supposed ease of development.

saltysugar | 12 years ago

You didn't even bother to read past the section about the limitations - the engineers were well aware of them in advance, and they re-emphasize them because they're a common concern for people who question their choice of the JVM.

The next line reads: "As a result of these factors using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure—we at least double the available cache..."

And if you don't know about the pagecache: it's an in-memory cache managed by the OS, and it has nothing to do with the JVM's heap at all.

And you forgot that C++ isn't the easiest language for designing a distributed system. Scala, as I mentioned, offers many other features that suit the needs of the team. Of course, if you're a good engineer you'll know there are trade-offs, such as compile times, but that's the same for every engineering decision.

noelwelsh | 12 years ago

They didn't "work around it". Kafka was designed from the start to avoid memory pressure.

BTW, I've talked to finance people who have JVM applications that haven't had a major GC in months. Really, it's not a big deal.

kasey_junk | 12 years ago

Having written a few low-latency systems, a few of which run on the JVM, I will say that in all of those cases allocating/freeing memory was a slowdown regardless of GC (cache coherence is almost always the deal breaker here). So in those systems, you simply do not allocate/deallocate along the critical path.

Is this difficult in Java? Yes. It's also difficult in C++. Just because it is difficult doesn't mean it is impossible.

So if I am resorting to managing my own memory anyway, why would I use the JVM? Because typically the code on the critical path is a small percentage of the entire code base, and the other advantages of the JVM (tooling, language features, libraries, etc.) outweigh the downsides.

That's not always the case, and I don't have any specific knowledge of Kafka, but just because something needs to be low/consistent latency doesn't mean it can't (or shouldn't) be written on the JVM.
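One common way to realize the "no allocation on the critical path" approach above is a preallocated object pool (a sketch with hypothetical names, not code from any particular system): all message objects are created at startup, and the hot loop only borrows and returns them, so the GC sees no garbage while messages flow.

```java
import java.util.ArrayDeque;

// Object pool: Message instances are allocated once, up front.
// The critical path only borrows and releases them, so no allocation
// (and hence no GC pressure) happens while processing messages.
public class MessagePool {
    static final class Message {
        final byte[] payload = new byte[256]; // fixed-size buffer, reused
        int length;
    }

    private final ArrayDeque<Message> free = new ArrayDeque<>();

    public MessagePool(int size) {
        for (int i = 0; i < size; i++) free.push(new Message());
    }

    public Message borrow() { return free.pop(); }   // no allocation here
    public void release(Message m) { free.push(m); } // nor here

    public static void main(String[] args) {
        MessagePool pool = new MessagePool(2);
        Message a = pool.borrow();
        a.length = 5;        // fill in the borrowed message
        pool.release(a);     // return it once processed
        Message b = pool.borrow();
        System.out.println(a == b); // same instance is recycled: true
    }
}
```

The trade-off is fixed-size buffers and manual borrow/release discipline, which is exactly the "managing my own memory anyway" point: the technique is the same whether the language has a GC or not.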

fauigerzigerk | 12 years ago

You are right that this can be a major problem with the JVM and working around it can be a lot of work. But you need to consider what kind of system we're talking about in this particular case.

This is a persistent message queue for log messages. Messages come in sequentially and subscribers read them sequentially. It makes zero sense to keep tons of messages in memory inside complex data structures, as they are not indexed, searched, or analyzed.

So in this particular case it's not actually a workaround. It's just sensible design and I wouldn't do it any differently in C++ either.
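The sequential-subscriber model described above can be sketched as follows (hypothetical names; real systems use framed, variable-length records rather than the fixed 4-byte records assumed here). Each consumer's only state is a byte offset into the log file, and because reads advance strictly forward, kernel readahead and the pagecache do all the caching:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// A subscriber just advances a byte offset through the log file.
// Reads are strictly sequential, so the OS pagecache and readahead
// do the caching; the JVM heap holds only the record in flight.
public class SequentialReader {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("log", ".dat");
        Files.write(file, "msg1msg2".getBytes(StandardCharsets.UTF_8));

        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long offset = 0;                    // the consumer's only state
            ByteBuffer buf = ByteBuffer.allocate(4);
            while (ch.read(buf, offset) == 4) { // fixed 4-byte records here
                buf.flip();
                System.out.println(new String(buf.array(), 0, 4,
                        StandardCharsets.UTF_8));
                offset += 4;                    // advance sequentially
                buf.clear();
            }
        }
    }
}
```

Note there is no in-heap index or message map at all, which is the "sensible design" point: the same structure would look the same in C++.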