top | item 46667979


khaledh | 1 month ago

Several years ago we had a data processing framework that let teams process data incrementally, since most datasets were in the range of terabytes/day. The drawback is that it's append-only; i.e., you can't update previously processed output, only append to it. One team had a pipeline that needed to update older records, and there was a long discussion of proposals and convoluted solutions. I looked at the total size of the input dataset, and it was only a few gigabytes. I dropped into the discussion and said, "This dataset is only a few gigabytes, why don't you just read it in full and overwrite the output every time?" The discussion went quiet for a minute, then someone said, "That's brilliant!" They only needed to change a few lines of code to make it happen.
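A minimal sketch of the "just rewrite everything" approach the commenter describes. The function name, the CSV format, and the toy transform are all illustrative assumptions, not details from the original framework; the one idea taken from the comment is reprocessing the full (small) input and overwriting the output wholesale, rather than appending deltas:

```python
import csv
import os

def full_refresh(input_rows, output_path):
    """Reprocess the entire input and atomically overwrite the output.

    Hypothetical sketch: instead of computing incremental deltas and
    appending, rewrite the whole output file from scratch on every run.
    Viable precisely because the input is only a few gigabytes (or here,
    a few rows).
    """
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in input_rows:
            # Toy transform standing in for the real pipeline logic.
            writer.writerow([row["id"], row["value"] * 2])
    # Atomic swap so readers never observe a half-written output.
    os.replace(tmp_path, output_path)
```

Running it twice with updated input simply replaces the old output, so "updating older records" falls out for free; no append-only bookkeeping is needed.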
