(no title)
antonhag | 1 year ago
I tried recreating your DataInputStream + BufferedInputStream (wrote the 1brc data to separate output files, read using your code - I had to guess at ResultObserver implementation though). On my machine it roughly in the same time frame as yours - ~1min.
According to Flight Recorder:
- ~49% of the time is spent in reading the strings (city names). Almost all of it in the DataInputStream.readUTF/readFully methods.
- ~5% of the time is spent reading temperature (readShort)
- ~41% of the time is spent doing hashmap look-ups for computeIfAbsent()
- About 50GB of memory is allocated - %99.9 of it for Strings (and the wrapped byte[] array in them). This likely causes quite a bit of GC pressure.
Hash-map lookups are not de-serialization, yet the lookup likely affected the benchmarks quite a bit. The rest of the time is mostly spent in reading and allocating strings. I would guess that that is true for some of the other implementations in the original post as well.[1] https://github.com/openjdk/jmc
edit: better link to JMC
marginalia_nu|1 year ago
It doesn't seem like it can be true that 90% of the time is spent in string parsing and hash lookups if the same operation takes 10% of the time when reading from a filechannel and bytebuffer.
antonhag|1 year ago
haglin|1 year ago
I would guess at least some of the bottlenecks are in hardware, the operating system or in native code (including the JVM) in this case.