top | item 2137124

Haskell improves log processing 4x over Python

114 points| jmintz | 15 years ago |devblog.bu.mp | reply

41 comments

[+] andrewcooke|15 years ago|reply
The work sounds very cool (and they are hiring), but (only) a factor of 4 speedup over Python is (to repeat a phrase from elsewhere today) like boasting that you're the tallest midget ;o)
[+] jamwt|15 years ago|reply
Hi, article author here.

It's important to note that this particular job is largely bound on a.) I/O and b.) format serialization tasks. Both Python's BSON and JSON libraries are mature and have their critical sections written in C, so a speedup of 4x is still noteworthy. The Haskell version, on the other hand, is pure Haskell.

[+] jbellis|15 years ago|reply
Agreed. Even where you can optimize the hot code in C, Python is no speed demon. Cassandra's Java stress test can push out about 10x as many ops/s as the Python one, even though the Thrift C extension for Python is quite good.

/still a Python fan

[+] Peaker|15 years ago|reply
Sounds great. I'm a very big Haskell fan.

I'd love to point people to this when trying to convey some advantages of Haskell. To make it more compelling, can you expand some on the downsides and maybe obstacles you encountered?

The thing I'm unsure about is how difficult it would be for (very) talented developers to just jump in. We have really talented developers, and everyone is super time-constrained, so many are wary of diving into a language as different as Haskell. Was it hard for your developers to figure Haskell out? Did your previous use of Scala help? How long did it take them to dive into Scala?

[+] jamwt|15 years ago|reply
I would say the two real barriers to writing effective Haskell projects are a.) "getting" monads, and b.) understanding the implications of laziness, especially with regard to space leaks and unconsumed thunks. Everything else isn't that big of a deal.

It's all much easier to digest, though, even for "really talented developers", if they have some experience with another functional language first. OCaml is a nice stepping stone before digging into the abstractions involved in understanding Haskell's powerful type system. Scala is good too, but having the object stuff mixed in there can lead you to rely on some patterns that aren't going to be available in a non-OOP language. I think the Scheme/Clojure path isn't bad either, but it's probably ideal to spend some time in the "statically typed" wing of the functional universe before going to Haskell.
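
To make the "unconsumed thunks" point concrete, here is a rough Python imitation of what a lazy accumulator does. It is only an analogy, not Haskell semantics: `lazy_sum` and `strict_sum` are hypothetical names, and closures stand in for thunks. The lazy version defers every addition, so the accumulator grows by one closure per element and the whole chain has to be unwound when the result is finally demanded.

```python
import sys

def lazy_sum(values):
    """Build a deferred sum: nothing is added until the result is called."""
    thunk = lambda: 0
    for v in values:
        # Wrap the previous thunk instead of computing a number, so the
        # chain grows by one closure per element (the "space leak").
        thunk = (lambda prev, x: lambda: prev() + x)(thunk, v)
    return thunk

def strict_sum(values):
    """Evaluate eagerly: the accumulator is always a plain integer."""
    total = 0
    for v in values:
        total += v
    return total

small = lazy_sum(range(100))
small_result = small()              # forcing the chain performs the additions
strict_result = strict_sum(range(100))

# A chain deeper than the interpreter's recursion limit cannot be forced.
big = lazy_sum(range(sys.getrecursionlimit() * 2))
try:
    big()
    overflowed = False
except RecursionError:
    overflowed = True
```

Both versions give the same answer on small inputs; the difference only shows up in memory use and stack depth as the input grows, which is why this kind of problem is easy to miss until production data arrives.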

[+] microtonal|15 years ago|reply
From personal experience: I didn't make much progress in Haskell until I stopped using Scala. The problem is that Scala allows you to mix and match different paradigms and if you come from a mostly-imperative/OO background, you tend to use Scala as an OO language with some functional constructs.

To learn to program in a purely functional style, it's best to jump into Haskell cold-turkey, since you will have to learn to think in FP.

When I was learning Haskell, optimization in a lazy world was the most difficult task. I still often have problems predicting how efficient particular code will be. The complexity of monads is somewhat overstated, though it doesn't help that some tutorials make something big and esoteric out of it. A monad is nothing more than a type class that specifies how to combine computations that result in some 'boxed value'.
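
As a loose illustration of "combining computations that result in a boxed value", here is a minimal Python sketch in the spirit of Haskell's `Maybe`. Everything in it is an assumption for the sake of the example: `Optional` plays the box, `None` plays the empty case, and `bind`, `parse_int`, `reciprocal`, and `safe_reciprocal` are made-up names. It is a simplification, since `None` cannot distinguish "no value" from a boxed `None`.

```python
from typing import Callable, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def bind(value: Optional[A], step: Callable[[A], Optional[B]]) -> Optional[B]:
    """Run `step` on the boxed value, or short-circuit if the box is empty."""
    return None if value is None else step(value)

def parse_int(text: str) -> Optional[int]:
    """Return the parsed integer, or None if `text` is not a number."""
    try:
        return int(text)
    except ValueError:
        return None

def reciprocal(n: int) -> Optional[float]:
    """Return 1/n, or None when n is zero."""
    return None if n == 0 else 1.0 / n

def safe_reciprocal(text: str) -> Optional[float]:
    # Two fallible steps chained without any explicit None checks here.
    return bind(parse_int(text), reciprocal)
```

The combining rule lives in `bind` alone; each step only has to say whether it succeeded. A type class in Haskell lets the same chaining syntax work for other kinds of boxes as well.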

[+] Locke1689|15 years ago|reply
The author is mostly right about the use cases of Haskell, but simply "systems" is a bit misleading because there are certain performance characteristics of lazy programs which make them bad choices for some systems programs. Any type of real-time system, for example, can suffer unpredictable performance in critical sections, which is pretty undesirable.
[+] awj|15 years ago|reply
Not to argue the example, but Python's garbage collection disqualifies it for real-time systems as well. In fact, I'm having a hard time finding a "system" task for which Python (as a language) is qualified but Haskell is not.
[+] jamwt|15 years ago|reply
While I agree with you that Haskell (or, really, any GC'd language) is unsuitable for real-time systems, I disagree that my statement about its excellent suitability for systems programming in general is misleading. There are many, many domains (read: most) that, in my experience, are called "systems programming" that have nothing to do with hard or soft real-time requirements.

Now, if I had stated that all conceivable systems programming domains are addressable with Haskell, that would have indeed been foolish.

[+] ynniv|15 years ago|reply
Are the logs being read from disk? In my experience, Python is highly optimized for reading (possibly compressed) files from disk. If your infrastructure keeps logs in memory, Python will lose this advantage and compete on computational performance, where Haskell has the advantage. This is important for those of us who grind logs on disk and might be considering a language switch.
[+] enneff|15 years ago|reply
What do you mean by optimized? Python makes the same read and write syscalls everyone else does.

What you're probably observing is Python's slow code generation being masked by the inherent slowness of I/O.

[+] jamwt|15 years ago|reply
Nope, this is a process that BLPOPs logs from a Redis queue, does some processing on them, then writes them to disk.
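
For readers who want the shape of that loop, here is a minimal sketch. It is not the actual job: an in-memory `queue.Queue` stands in for the Redis list so the example runs without a server, and the function name `drain_to_log`, the `processed` flag, and the JSON-lines output format are all assumptions.

```python
import json
import queue
import tempfile
from pathlib import Path

def drain_to_log(source: "queue.Queue[str]", log_path: Path, timeout: float = 0.1) -> int:
    """Pop JSON events until the queue stays empty for `timeout` seconds,
    appending each processed event to `log_path`. Returns the count written."""
    written = 0
    with log_path.open("a", encoding="utf-8") as log:
        while True:
            try:
                # Blocking pop with a timeout, playing the role of BLPOP.
                raw = source.get(timeout=timeout)
            except queue.Empty:
                break
            event = json.loads(raw)        # decode the serialized event
            event["processed"] = True      # hypothetical processing step
            log.write(json.dumps(event, sort_keys=True) + "\n")
            written += 1
    return written

# Usage: enqueue three fake events and drain them to a temporary file.
events = queue.Queue()
for i in range(3):
    events.put(json.dumps({"id": i}))

out = Path(tempfile.mkdtemp()) / "events.log"
count = drain_to_log(events, out)
lines = out.read_text(encoding="utf-8").splitlines()
```

In a loop like this, most of the time goes to the blocking pop, the decode and encode steps, and the file write, which matches the earlier remark that the job is bound on I/O and serialization.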
[+] kordless|15 years ago|reply
I'd be interested in hearing more about how the author is using the resulting data set. Doing extractions at event generation time can be very useful if you know what you are after in advance, but not so good for ad hoc analysis.

Any reason why you didn't use Hadoop for this, then run batch jobs to extract summaries?

[+] jamwt|15 years ago|reply
Yeah, the whole pipeline has quite a few more facets than can be deduced from this summary. This stage just persists the events into a consolidated transaction log. Then there are secondary processes that scan these transaction logs (in batch) and distribute data into various databases for system, business, and user analytics. I can't go into too much detail there, but the actual digesting and reporting side is more involved.