I found this super interesting - especially as all the data I've written code to manipulate has been small enough that I haven't needed to optimize my code, so I've never had to think in this direction.
I think my favorite part was the very first section, where he got baseline measurements with `cat`, `wc`, and friends. I wouldn't have thought to do that and it's such an easy way to get a perspective on what's "reasonable".
It also underscores just how insane raw disk bandwidth is on modern SSDs. Most software is not designed around a world where you have gigabytes a second of sequential scan on a laptop.
A few months ago, I had to quickly bang out a script to output about 20 million lines of text, each the output of a hash function. My naive solution took more than a few minutes; simple optimizations such as writing every 10k lines cut the time significantly. Threading would have helped quite a bit as well.
I’m kind of interested in the opposite problem, what is the simplest solution using a well known library/db that approaches the fastest hand optimized solution to this problem?
Sounds very reasonable. In the blog post, about 20s were shaved off by assuming we don't need complicated string parsing. An off-the-shelf library can't make that assumption, so it will always have to pay the extra cost.
For anyone looking for more examples of 1BRC in Go, we had a friendly competition at work and collected the results here: https://github.com/dhartunian/1brcgo/
In addition to the loop-unrolling and bit-twiddling tricks that also show up in the fastest Java and C++ versions, some Go-specific things I learned were:
- unsafe.Pointer can be used to read memory without bounds checks
- many functions in the bytes and bits packages in the standard library are written in assembly
- debug.SetGCPercent and SetMemoryLimit to turn off GC
- runtime.LockOSThread to lock a goroutine to a thread
- print is slightly faster than fmt.Printf (but writes to stderr)
Oh, I'd missed those solutions, thanks. You guys got way more hard core than I did -- nice work! Looking forward to reading the code for those solutions this week.
> I find this report confusing. Why does if items[hashIndex].key == nil show as taking 5.01s, but the call to bytes.Equal shows as only 390ms. Surely a slice lookup is much cheaper than a function call? If you are a Go performance expert and can help me interpret it, I’m all ears!
These two lines are both conditionals, so the time reported is sensitive to branch mispredictions. If the timings are not intuitive based on the complexity of the associated lines, then it may be explained by the data being not very predictable and the branch predictor having a bad time.
Yeah, someone emailed me privately after they'd dug into this. They mentioned that "items[hashIndex]" was a significant source of cache misses. They used "perf record -e cache-misses" and found it was the largest source of cache misses. They also found (by digging into the assembly) that the "bytes.Equal" line has some sort of source-line attribution issue.
I love the nerdery around 1BRC. My axe to grind is that unless you do dangerous stuff, DBs are just as fast, less complicated, and more resilient to data updates than application code [0]. Do more in the database!
0: https://geraldonit.com/2024/01/31/1-billion-row-challenge-in...
https://www.youtube.com/watch?v=10KEr3sEG80
I wanted to see how the temperature was changing over time for specific regions using a map-based interface. The following chart was particularly eye-opening:
https://www.youtube.com/watch?v=iEtvf9xzRB4&t=164s
The software won a couple of awards and was heavily optimized to produce reports in under a minute. Kudos to the author for getting the parse time of a billion records down to mere seconds.
It's worth noting that if you're messing around with large text files from the CLI, awk, grep, etc. will be an order of magnitude faster if you opt out of unicode parsing.
I'm pretty confident adding LC_ALL=C to the awk solution would get it easily under a minute.
I just want to go on record that, given the simplistic (i.e. fun) problem, a shell developer would have had the answer to a first, specific set of a billion rows done while all the other languages were still putting on their shoes.
That's only a valid comparison if the "fastest Java" and "fastest Go" implementations are either the same or at the limit of what each language allows.
The more interesting comparison anyway is performance of the straightforward, idiomatic code, since that's what we all write 99% of the time.
Here's the key insight from the article: "Processing the input file in parallel provides a huge win over r1, taking the time from 1 minute 45 seconds to 24.3 seconds. For comparison, the previous “optimised non-parallel” version, solution 7, took 25.8 seconds. So for this case, parallelisation is a bit faster than optimisation – and quite a bit simpler."
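The structure behind that quote is simple: cut the input at line boundaries, aggregate each chunk in its own goroutine, then merge the per-chunk maps. A condensed sketch (not the article's code; it skips the custom hashing and other per-chunk optimisations):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
	"sync"
)

type stats struct {
	min, max, sum float64
	count         int
}

// aggregate processes "station;temperature" lines from one chunk.
func aggregate(chunk []byte) map[string]*stats {
	m := make(map[string]*stats)
	for _, line := range bytes.Split(chunk, []byte{'\n'}) {
		semi := bytes.IndexByte(line, ';')
		if semi < 0 {
			continue // empty or malformed line
		}
		temp, err := strconv.ParseFloat(string(line[semi+1:]), 64)
		if err != nil {
			continue
		}
		name := string(line[:semi])
		s := m[name]
		if s == nil {
			s = &stats{min: temp, max: temp}
			m[name] = s
		}
		if temp < s.min {
			s.min = temp
		}
		if temp > s.max {
			s.max = temp
		}
		s.sum += temp
		s.count++
	}
	return m
}

// parallelAggregate splits data into roughly per-CPU chunks on line
// boundaries, aggregates each in its own goroutine, and merges.
func parallelAggregate(data []byte) map[string]*stats {
	n := runtime.NumCPU()
	chunkSize := len(data) / n
	var mu sync.Mutex
	var wg sync.WaitGroup
	results := make([]map[string]*stats, 0, n)
	start := 0
	for i := 0; i < n && start < len(data); i++ {
		end := len(data)
		if i < n-1 && start+chunkSize < len(data) {
			// Extend to the next newline so no line is split.
			end = start + chunkSize
			if nl := bytes.IndexByte(data[end:], '\n'); nl >= 0 {
				end += nl + 1
			} else {
				end = len(data)
			}
		}
		chunk := data[start:end]
		start = end
		wg.Add(1)
		go func() {
			defer wg.Done()
			r := aggregate(chunk)
			mu.Lock()
			results = append(results, r)
			mu.Unlock()
		}()
	}
	wg.Wait()

	merged := make(map[string]*stats)
	for _, r := range results {
		for k, v := range r {
			s := merged[k]
			if s == nil {
				merged[k] = v
				continue
			}
			if v.min < s.min {
				s.min = v.min
			}
			if v.max > s.max {
				s.max = v.max
			}
			s.sum += v.sum
			s.count += v.count
		}
	}
	return merged
}

func main() {
	// Inline demo data; the real input is the 13GB measurements file.
	data := []byte("Hamburg;12.0\nBulawayo;8.9\nHamburg;34.2\n")
	for name, s := range parallelAggregate(data) {
		fmt.Printf("%s: min=%.1f mean=%.1f max=%.1f\n",
			name, s.min, s.sum/float64(s.count), s.max)
	}
}
```

The merge step is cheap (one entry per station per chunk), which is why parallelising scales nearly linearly with cores here.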
The Java version is written by the lead researcher and founder of GraalVM at Oracle Labs. It is really native AOT-compiled code, comparable to the best in C++/Rust.
It is a Java entry because the language used is Java, but the final compiled artifact is far, far away from a typical compiled Java artifact.
The Java version does additional optimizations that his Go version doesn't do and he mentions that at the end of the post. The Java version is really optimized and is an interesting read.
Still loses to .NET. On the reference host, Java is still closer to the 1.7-2s ballpark (and has to use awkward SWAR to get there), while the fastest solution in C# is 1.2s, beating C++ (though the code could be ported).
But yes, "I expected Go to win..." is exactly the core of the problem here. Same as with e.g. Swift, which people expect to perform on the level of Rust, when it is even slower than Go. The intuition caused by common misconceptions just does not correspond to reality sadly.
Your expectation is correct. The Java version is more tuned. Here https://github.com/dhartunian/1brcgo/?tab=readme-ov-file#lea... you can find a version that runs in 1.108s and is almost 3x better than the one you quoted. I think one can reduce it even further. In the end, it depends on how fast the executable can boot up and execute the code. At some point, the JVM will lose because it takes quite some time just to initialize, whereas Go executables can boot up very fast. Here you can see a comparison of Hello World programs: https://github.com/qznc/hello-benchmark?tab=readme-ov-file#h... The JVM takes a whopping 596 ms to boot up and print Hello World, whereas Go requires just 149 ms.
> the fastest, heavily-optimised Java solution runs in just under a second on my machine
I don’t understand how this is possible. The file in question is 13GB, while the fastest commonly available SSDs do 12400 MB/s. Am I missing something?
I love the author’s step-by-step approach, as it very often happens that a hyper-optimized solution is overfitted to the exact dataset it’s operating on. At each step, the tradeoffs are explained: what we gain, but also what we lose by stripping functionality away.
Second article I'm reading on implementing this in Go. Since the temperatures are in the range [-99.9, 99.9] with a tenth of precision (~2k values), I am surprised that no one has implemented parsing of the numbers using a prepopulated lookup table. It should probably speed things up.
I submitted a github issue on this for the other implementation I looked at here[1].
[1] https://github.com/shraddhaag/1brc/issues/2
I did it with custom parsing[0] and treated the numbers as 16-bit integers. The representation in the file is not a constant number of bytes, which complicates the table approach. If you end up computing a hash, I think it might be slower than just doing the equivalent parsing I do, and a table keyed on four bytes would be very large and mostly empty. Maybe a trie would be good.
0: https://github.com/k0nserv/brc/blob/main/src/main.rs#L279
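For context, this style of parsing leans on the 1BRC format guaranteeing exactly one decimal digit, so the whole number fits a few byte operations and an int16 of tenths. A sketch of the general idea (not the exact code from the linked repo):

```go
package main

import "fmt"

// parseTemp converts a 1BRC temperature like "-12.3" or "4.5"
// (always exactly one decimal digit) into tenths of a degree.
// The range [-99.9, 99.9] maps to [-999, 999], which fits an int16.
func parseTemp(b []byte) int16 {
	neg := b[0] == '-'
	if neg {
		b = b[1:]
	}
	var n int16
	for _, c := range b {
		if c == '.' {
			continue // skip the single decimal point
		}
		n = n*10 + int16(c-'0')
	}
	if neg {
		return -n
	}
	return n
}

func main() {
	fmt.Println(parseTemp([]byte("-12.3"))) // prints -123
	fmt.Println(parseTemp([]byte("4.5")))   // prints 45
}
```

Because the digits are consumed directly, there is no table lookup or hash at all, which is the point being made above.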
“Very memory constrained” would be a massive factor here, 1BRC is not really constrained (let alone very much so), it has 1 billion rows on a 32GB machine.
People keep forgetting Java is like C and C++, plenty of implementations to choose from, each with its own approach to JIT, AOT, GC and escape analysis.
Regarding Java: it probably could be done with arrays and object reuse (arenas), but it's slightly less ergonomic, and the ecosystem isn't designed for it, so you'd have to implement your own memory-efficient parser.
> I’m in the same ballpark as Alexander Yastrebov’s Go version. His solution looks similar to mine: break the file into chunks, use a custom hash table (he even uses FNV hashing), and parse temperatures as integers. However, he uses memory-mapped files, which I’d ruled out for portability reasons – I’m guessing that’s why his is a bit faster.
I am curious, can it be made even faster than this?
Is there a collection of all the languages that have attempted this challenge? I know comparing and competing between languages is somewhat useless, but it is still interesting to me.
My first instinct would be to spin up a local Postgres and keep station data there. A lot of the solutions assume we have enough RAM to keep the stats per station; however, that's a bad assumption when dealing with a lot of data.
Update: for reference, Jason Chu's solution (https://github.com/dhartunian/1brcgo/blob/494eabd6ea958cc193...) seems to be the fastest on my machine, and runs in about 1.3s!
Go's advantage has always been that it is good enough at a lot of things.
How do you search in the lookup table? If you are thinking of a hash map it will be slower than the few operations of his custom parser.
Had to parse CSVs in Java on a very memory-constrained system once... we ended up cutting out a feature because it wasn't worth it.
~~Using LLVM isn’t a magic solution to perform better than something relying on the JVM.~~
Here is a source: https://sites.google.com/view/energy-efficiency-languages
https://github.com/1brc/nodejs
However, in real life I would never assume a million rows of text all have valid data in a specific format, much less a billion rows. Thus a slower but more robust solution would be more realistic.