mnkmnk | 3 years ago

Unlike JSON, ORC requires batching rows before writing to disk, because the writer does a lot of computation: maintaining indexes, encoding columns (run-length, dictionary), calculating statistics, maintaining bloom filters, compressing columns, etc. Doing this at the source, where you care more about serving each individual request as quickly as possible, doesn't look like a good idea. For the ORC files to be useful, you need to batch a lot of rows together; otherwise you don't get the benefits of columnar storage. So in the happy path logs are delayed, and in the unhappy path, if the process crashes, the recent logs are gone. JSON isn't really bad as a logging format, and it can be stored temporarily and then converted to a columnar format asynchronously.
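To make that last point concrete, here is a rough sketch of the split I have in mind, using only the Python standard library. The names `log_event`, `rows_to_columns` and `demo` are made up for illustration, and the "columnar" output is just a dict of lists standing in for what a real ORC writer library would be given; no actual ORC encoding happens here.

```python
import json
import os
import tempfile


def log_event(path, record):
    """Hot path: append one record as a JSON line and push it to disk at once."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # a crash right after this call does not lose the record


def rows_to_columns(path):
    """Offline path: read every JSON line and pivot the rows into column lists.

    A real converter would pass these columns to an ORC writer, which is
    where the indexing, encoding, statistics and compression would happen.
    """
    columns = {}
    n_rows = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            # A key seen for the first time gets None for all earlier rows.
            for key in row:
                if key not in columns:
                    columns[key] = [None] * n_rows
            # Every known column gets a value (or None) for this row.
            for key, values in columns.items():
                values.append(row.get(key))
            n_rows += 1
    return columns


def demo():
    """Write two records on the hot path, then convert them in one batch."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.log.jsonl")
        log_event(path, {"ts": 1, "level": "info", "msg": "start"})
        log_event(path, {"ts": 2, "level": "error", "msg": "boom", "code": 500})
        return rows_to_columns(path)


if __name__ == "__main__":
    print(demo())
```

The request path only pays for an append and an fsync per record, while the expensive pivot runs later over as many rows as you like, ideally in a separate process so a crash in the service doesn't take the buffered batch with it.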

I'm looking forward to the next post.
