top | item 40032973


emcq | 1 year ago

The 1.4s is _after_ having the file loaded into RAM by the kernel. Because this is mostly I/O bound, it's not a fair comparison to skip the read time. If you were running on an M3 Mac, you might get less than 100ms if the dataset was stored in RAM.

If you account for the time spent loading from disk, the C implementation would be more like ~5s, as reported in the blog post [1]. Speculating that their laptop's SSD may be in the 3GB/s range, there is perhaps another second or so of optimization left (which would roughly line up with the 1.4s in-memory time).

Because you have a lot of variable-width row reads, this will be more difficult on a GPU than on a CPU.

[1] https://www.dannyvankooten.com/blog/2024/1brc/


pama|1 year ago

The performance report followed the initial request: run 6 times and remove the best and worst outliers, so the mmap optimization is fair game. Agreed that the C code has room left for some additional optimization.

emcq|1 year ago

If we are going to count prior runs of the program, which leave the file loaded in RAM by the kernel, as fair, why stop there?

Let's say I create a "cache" where I store the min/mean/max output for each city, mmap it, and read it at least once to make sure it is in RAM. If the cache is available, I simply write it to standard out. On the first run I compute the result by whatever method, persist it to disk, and then mmap it. The first run could take 20 hours and gets discarded.

On a technicality it might fit the rules of the original request, but it isn't an interesting solution. Feel free to submit it :)

ww520|1 year ago

Also, this uses 16 threads while the contest restricts runs to 8 cores. The benchmarks need to be run in the same environment for a fair comparison.

pama|1 year ago

The AMD Ryzen 4800U has 8 cores total so the author follows the contest restriction. This CPU supports hyperthreading. (I’d be very interested in seeing hyperoptimized CUDA code using unlimited GPU cores FWIW.)