item 24746835


karbarcca | 5 years ago

It can make a huge difference in a production system. Where I work, we process terabytes of CSV data every day; saving minutes per file adds up to enormous differences in CPU cost and time for a system running 24/7.

I agree that for a data scientist doing exploratory analysis locally on their computer, it doesn't make nearly as much of a difference (also because they're usually not working on crazy large files).

The performance work in the CSV.jl package (that the article is about) was very much geared towards these kinds of production scenarios.


mcrad | 5 years ago

Right - sounds like you have more of a production-support role than a data-analysis workflow. Tacking on "exploratory" is helpful, but I'm still concerned that you're misusing the overall concept of analysis. It's a decision-making task, which is practically the opposite of production support.

ChrisRackauckas | 5 years ago

> Tacking on "exploratory" is helpful, but I'm still concerned that you're misusing the overall concept of analysis. It's a decision-making task, which is practically the opposite of production support.

Why should the exploratory and production teams be using completely different tools? That seems like it would create friction, and leave gaps where translation errors creep in. I would venture to say that just having the exploratory and production teams working in the same code base is a very strong productivity gain, and we've seen this to be true in many companies.

mr_toad | 5 years ago

> process terabytes of csv data every day

All stored on NVMe SSDs? Because unless you have really fast IO the CSV parser isn’t going to be the bottleneck.
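One rough way to check that claim empirically is to compare a raw scan over the bytes against full field-by-field parsing. A minimal Python stdlib sketch (the row count and schema here are invented for illustration, not taken from the thread):

```python
import csv
import io
import time

# Generate a synthetic CSV in memory (made-up schema and size).
rows = 200_000
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "name", "value"])
for i in range(rows):
    writer.writerow([i, f"name{i}", i * 0.5])
data = buf.getvalue()

# Cost of merely "reading" the data: one pass over the bytes.
t0 = time.perf_counter()
n_lines = data.count("\n")
t_scan = time.perf_counter() - t0

# Cost of actually parsing every field out of every row.
t0 = time.perf_counter()
parsed = list(csv.reader(io.StringIO(data)))
t_parse = time.perf_counter() - t0

print(f"scan:  {t_scan:.4f}s for {n_lines} lines")
print(f"parse: {t_parse:.4f}s for {len(parsed) - 1} records")
```

On most machines the parse pass is many times slower than the scan, which is the gap a fast parser like CSV.jl tries to close; whether IO or parsing wins in practice depends on the storage and the parser, which is exactly the question being raised here.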

andi999 | 5 years ago

I am curious: how do you transfer these amounts of data fast?

karbarcca | 5 years ago

Unfortunately, almost exclusively via HTTP REST APIs. It's not great, but it's the lowest common denominator across the vast "ingestion" service we've built (connectors to web APIs, a local application for local file uploads, raw API endpoints, etc.).

We've started exploring the Apache Arrow format, a compressible binary format with a dedicated wire protocol, just to cut down on parsing and processing costs.