I found this super interesting - especially as all the data I've written code to manipulate has been small enough that I haven't needed to optimize my code, so I've never had to think in this direction.
I think my favorite part was the very first section, where he got baseline measurements with `cat`, `wc`, and friends. I wouldn't have thought to do that and it's such an easy way to get a perspective on what's "reasonable".
It also underscores just how insane raw disk bandwidth is on modern SSDs. Most software is not designed around a world where you have gigabytes a second of sequential scan on a laptop.
A few months ago, I had to quickly bang out a script to output about 20 million lines of text, each the output of a hash function. My naive solution took more than a few minutes; simple optimizations such as writing every 10k lines cut the time significantly. Threading would have helped quite a bit as well.
I’m kind of interested in the opposite problem, what is the simplest solution using a well known library/db that approaches the fastest hand optimized solution to this problem?
Sounds very reasonable. In the blog post, about 20s were shaved off by assuming we don't need complicated string parsing. An off-the-shelf library can't make that assumption, so it will always have to pay the extra cost.
For anyone looking for more examples of 1BRC in Go, we had a friendly competition at work and collected the results here: https://github.com/dhartunian/1brcgo/
In addition to the loop-unrolling and bit-twiddling tricks that also show up in the fastest Java and C++ versions, some Go-specific things I learned were:
- unsafe.Pointer can be used to read memory without bounds checks
- many functions in the bytes and bits packages in the standard library are written in assembly
- debug.SetGCPercent and SetMemoryLimit to turn off GC
- runtime.LockOSThread to lock a goroutine to a thread
- print is slightly faster than fmt.Printf (but writes to stderr)
Oh, I'd missed those solutions, thanks. You guys got way more hard core than I did -- nice work! Looking forward to reading the code for those solutions this week.
> I find this report confusing. Why does if items[hashIndex].key == nil show as taking 5.01s, but the call to bytes.Equal shows as only 390ms. Surely a slice lookup is much cheaper than a function call? If you are a Go performance expert and can help me interpret it, I’m all ears!
These two lines are both conditionals, so the time reported is sensitive to branch mispredictions. If the timings are not intuitive based on the complexity of the associated lines, then it may be explained by the data being not very predictable and the branch predictor having a bad time.
Yeah, someone emailed me privately after they'd dug into this. They mentioned that "items[hashIndex]" was a significant source of cache misses. They used "perf record -e cache-misses" and found it was the largest source of cache misses. They also found (by digging into the assembly) that the "bytes.Equal" line has some sort of source-line attribution issue.
I love the nerdery around 1BRC. My axe to grind is that unless you do dangerous stuff, DBs are just as fast, less complicated, and more resilient to data updates than application code [0]. Do more in the database!
0: https://geraldonit.com/2024/01/31/1-billion-row-challenge-in...
https://www.youtube.com/watch?v=10KEr3sEG80
I wanted to see how the temperature was changing over time for specific regions using a map-based interface. The following chart was particularly eye-opening:
https://www.youtube.com/watch?v=iEtvf9xzRB4&t=164s
The software won a couple of awards and was heavily optimized to produce reports in under a minute. Kudos to the author for getting the parse time of a billion records down to mere seconds.
It's worth noting that if you're messing around with large text files from the CLI, awk, grep, etc. will be an order of magnitude faster if you opt out of unicode parsing.
I'm pretty confident adding LC_ALL=C to the awk solution would get it easily under a minute.
I just want to go on record that, given the simplistic (i.e. fun) problem, a shell developer would have had the answer to a first, specific set of a billion rows done while all the other languages were still putting on their shoes.
That's only a valid comparison if the "fastest Java" and "fastest Go" implementations are either the same or at the limit of what each language allows.
The more interesting comparison anyway is performance of the straightforward, idiomatic code, since that's what we all write 99% of the time.
Here's the key insight from the article: "Processing the input file in parallel provides a huge win over r1, taking the time from 1 minute 45 seconds to 24.3 seconds. For comparison, the previous “optimised non-parallel” version, solution 7, took 25.8 seconds. So for this case, parallelisation is a bit faster than optimisation – and quite a bit simpler."
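The structure behind that quote is simple: cut the input at line boundaries, aggregate each chunk in its own goroutine, then merge the per-chunk maps. A condensed sketch (not the article's code; it skips the custom hashing and other per-chunk optimisations):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
	"sync"
)

type stats struct {
	min, max, sum float64
	count         int
}

// aggregate processes "station;temperature" lines from one chunk.
func aggregate(chunk []byte) map[string]*stats {
	m := make(map[string]*stats)
	for _, line := range bytes.Split(chunk, []byte{'\n'}) {
		semi := bytes.IndexByte(line, ';')
		if semi < 0 {
			continue // empty or malformed line
		}
		temp, err := strconv.ParseFloat(string(line[semi+1:]), 64)
		if err != nil {
			continue
		}
		name := string(line[:semi])
		s := m[name]
		if s == nil {
			s = &stats{min: temp, max: temp}
			m[name] = s
		}
		if temp < s.min {
			s.min = temp
		}
		if temp > s.max {
			s.max = temp
		}
		s.sum += temp
		s.count++
	}
	return m
}

// parallelAggregate splits data into roughly per-CPU chunks on line
// boundaries, aggregates each in its own goroutine, and merges.
func parallelAggregate(data []byte) map[string]*stats {
	n := runtime.NumCPU()
	chunkSize := len(data) / n
	var mu sync.Mutex
	var wg sync.WaitGroup
	results := make([]map[string]*stats, 0, n)
	start := 0
	for i := 0; i < n && start < len(data); i++ {
		end := len(data)
		if i < n-1 && start+chunkSize < len(data) {
			// Extend to the next newline so no line is split.
			end = start + chunkSize
			if nl := bytes.IndexByte(data[end:], '\n'); nl >= 0 {
				end += nl + 1
			} else {
				end = len(data)
			}
		}
		chunk := data[start:end]
		start = end
		wg.Add(1)
		go func() {
			defer wg.Done()
			r := aggregate(chunk)
			mu.Lock()
			results = append(results, r)
			mu.Unlock()
		}()
	}
	wg.Wait()

	merged := make(map[string]*stats)
	for _, r := range results {
		for k, v := range r {
			s := merged[k]
			if s == nil {
				merged[k] = v
				continue
			}
			if v.min < s.min {
				s.min = v.min
			}
			if v.max > s.max {
				s.max = v.max
			}
			s.sum += v.sum
			s.count += v.count
		}
	}
	return merged
}

func main() {
	// Inline demo data; the real input is the 13GB measurements file.
	data := []byte("Hamburg;12.0\nBulawayo;8.9\nHamburg;34.2\n")
	for name, s := range parallelAggregate(data) {
		fmt.Printf("%s: min=%.1f mean=%.1f max=%.1f\n",
			name, s.min, s.sum/float64(s.count), s.max)
	}
}
```

The merge step is cheap (one entry per station per chunk), which is why parallelising scales nearly linearly with cores here.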
The Java version is written by the lead researcher and founder of GraalVM at Oracle Labs. It is really native AOT-compiled code, comparable to the best in C++/Rust.
It is a Java entry because the language used is Java, but the final compiled artifact is far, far away from a typical compiled Java artifact.
The Java version does additional optimizations that his Go version doesn't do and he mentions that at the end of the post. The Java version is really optimized and is an interesting read.
Still loses to .NET. On the reference host, Java is still closer to the 1.7-2s ballpark (and has to use awkward SWAR to get there), while the fastest solution in C# is 1.2s, beating C++ (though the code could be ported).
But yes, "I expected Go to win..." is exactly the core of the problem here. Same as with e.g. Swift, which people expect to perform on the level of Rust, when it is even slower than Go. The intuition caused by common misconceptions just does not correspond to reality sadly.
Your expectation is correct. The Java version is more tuned. Here https://github.com/dhartunian/1brcgo/?tab=readme-ov-file#lea... you can find a version that runs in 1.108s and is almost 3x better than the one you quoted. I think one can reduce it even further. In the end, it depends on how fast the executable can boot up and execute the code. At some point, the JVM will lose because it takes quite some time just to initialize, whereas Go executables can boot up very fast. Here you can see a comparison of Hello World programs: https://github.com/qznc/hello-benchmark?tab=readme-ov-file#h... The JVM takes a whopping 596 ms to boot up and print Hello World, whereas Go requires just 149 ms.
> the fastest, heavily-optimised Java solution runs in just under a second on my machine
I don’t understand how this is possible. The file in question is 13GB, while the fastest commonly available SSDs do 12400 MB/s. Am I missing something?
I love the author’s step-by-step approach, as it very often happens that a hyper-optimized solution is overfitted to the exact dataset it’s operating on. At each step, the tradeoffs are explained: what we gain, but also what we lose by stripping functionality away.
Second article I'm reading on implementing this in Go. Since the temperatures are in the range [-99.9, 99.9] with a tenth of precision (~2k values), I am surprised that no one has implemented parsing of the numbers using a prepopulated lookup table. It should probably speed things up.
I submitted a github issue on this for the other implementation I looked at here[1].
[1] https://github.com/shraddhaag/1brc/issues/2
I did it with custom parsing[0] and treated the numbers as 16-bit integers. The representation in the file is not a constant number of bytes, which complicates the table approach. If you end up computing a hash, I think it might be slower than just doing the equivalent parsing I do, and a table keyed on four bytes would be very large and mostly empty. Maybe a trie would be good.
0: https://github.com/k0nserv/brc/blob/main/src/main.rs#L279
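For context, this style of parsing leans on the 1BRC format guaranteeing exactly one decimal digit, so the whole number fits a few byte operations and an int16 of tenths. A sketch of the general idea (not the exact code from the linked repo):

```go
package main

import "fmt"

// parseTemp converts a 1BRC temperature like "-12.3" or "4.5"
// (always exactly one decimal digit) into tenths of a degree.
// The range [-99.9, 99.9] maps to [-999, 999], which fits an int16.
func parseTemp(b []byte) int16 {
	neg := b[0] == '-'
	if neg {
		b = b[1:]
	}
	var n int16
	for _, c := range b {
		if c == '.' {
			continue // skip the single decimal point
		}
		n = n*10 + int16(c-'0')
	}
	if neg {
		return -n
	}
	return n
}

func main() {
	fmt.Println(parseTemp([]byte("-12.3"))) // prints -123
	fmt.Println(parseTemp([]byte("4.5")))   // prints 45
}
```

Because the digits are consumed directly, there is no table lookup or hash at all, which is the point being made above.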
“Very memory constrained” would be a massive factor here, 1BRC is not really constrained (let alone very much so), it has 1 billion rows on a 32GB machine.
People keep forgetting Java is like C and C++, plenty of implementations to choose from, each with its own approach to JIT, AOT, GC and escape analysis.
Regarding Java: it probably could be done with arrays and object reuse (arenas), but it's slightly less ergonomic, and the ecosystem isn't designed for it, so you'd have to implement your own memory-efficient parser.
> I’m in the same ballpark as Alexander Yastrebov’s Go version. His solution looks similar to mine: break the file into chunks, use a custom hash table (he even uses FNV hashing), and parse temperatures as integers. However, he uses memory-mapped files, which I’d ruled out for portability reasons – I’m guessing that’s why his is a bit faster.
I am curious, can it be made even faster than this?
Is there a collection of all the languages that have attempted this challenge? I know comparing and competing between languages is somewhat useless, but it is still interesting to me.
My first instinct would be to spin up a local Postgres and keep station data there. A lot of the solutions assume we have enough RAM to keep the stats per station; however, that's a bad assumption when dealing with a lot of data.
Update: for reference, Jason Chu's solution (https://github.com/dhartunian/1brcgo/blob/494eabd6ea958cc193...) seems to be the fastest on my machine, and runs in about 1.3s!
Go's advantage has always been that it is good enough at a lot of things.
How do you search in the lookup table? If you are thinking of a hash map it will be slower than the few operations of his custom parser.
Had to parse CSVs in Java on a very memory-constrained system once... we ended up cutting out a feature because it wasn't worth it.
~~Using LLVM isn’t a magic solution to perform better than something relying on the JVM.~~
Here is a source: https://sites.google.com/view/energy-efficiency-languages
https://github.com/1brc/nodejs
However, in real life I would never assume a million rows of text all have valid data in a specific format, much less a billion rows. Thus a slower but more robust solution would be more realistic.