top | item 42471015

(no title)

ElPeque | 1 year ago

Yeah, exactly.

I basically scrape all the grib index files to know all the offsets into all variables for all time. I store that in clickhouse.

When the API gets a request for a time range, set of coordinates and a set of weather parameters, first I pre-compute the mapping of (lat,lon) into the 1 dimensional index in the gridded data. That is a constant across the whole dataset. Then I query the clickhouse table to find out all the files+offset that need to be processed and all of them are queued into a multi-processing pool. And then processing each parameter implies parsing a grib file. I wrote a grib2 parser from scratch in golang so as to extract the data in a streaming fashion. As in... I don't extract the whole grid only to lookup the coordinates in it. I already pre-computed the index, so I can just decode every value in the grid in order and when I hit and index that I'm looking for, I copy it to a fixed size buffer with the extracted data. When You have all the pre-computed indexes then you don't even need to finish downloading the file, I just drop the connection immediately.

It is pretty cool. It is running in very humble hardware so I'm hoping I'll get some traction so I can throw more money into it. It should scale pretty linearly.

I've tested doing multi-year requests and the golang program never goes over 80Mb of memory usage. The CPUs get pegged so that is the limiting factor.

Grib2 complex packing (what the NBM dataset uses) implies lots of bit-packing. So there is a ton more to optimize using SIMD instructions. I've been toying with it a bit but I don't want to mission creep into that yet (fascinating though!).

I'm tempted to port this https://github.com/fast-pack/simdcomp to native go ASM.

discuss

tomnicholas1|1 year ago

That's pretty cool! Quite specific to this file format/workload, but this is an important enough problem that people might well be interested in a tailored solution like this :)

ElPeque|1 year ago

Thank you!