top | item 43649470

dajtxx | 10 months ago

I am working on a system at present where the data scientist has done the calculations in an R script. We agreed upon an input data.frame and an output csv as our 'interface'.

I added the SQL query to the top of the R script to generate the input data.frame and my Python code reads the output CSV to do subsequent processing and storage into Django models.

I use a subprocess running Rscript to run the script.

It's not elegant but it is simple. This part of the system only has to run daily so efficiency isn't a big deal.
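
For anyone curious, the Python side of that kind of setup could look roughly like the sketch below. It is not the actual code from this system: the function name, the timeout, and the `calc.R` script name in the comment are made up for illustration, and a Python one-liner stands in for the R script so the example runs without R installed.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script_and_read_csv(command, output_csv, timeout=600):
    """Run an external script, then load the CSV it wrote as a list of dicts."""
    # check=True raises CalledProcessError on a non-zero exit status,
    # so a failed run is never mistaken for an empty result.
    subprocess.run(command, check=True, capture_output=True, text=True, timeout=timeout)
    with open(output_csv, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def demo():
    """Exercise the helper with a stand-in for the R script."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "output.csv"
        # With R installed, the command would be something like
        # ["Rscript", "calc.R", str(out)]  (hypothetical script name).
        stand_in = "import sys; open(sys.argv[1], 'w').write('id,score\\n1,0.5\\n')"
        return run_script_and_read_csv([sys.executable, "-c", stand_in, str(out)], out)


if __name__ == "__main__":
    print(demo())
```

Every value comes back as a string, since CSV carries no type information, so any conversion has to happen before the rows go into the Django models.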

shoemakersteve|10 months ago

Any reason you're using CSV instead of parquet?

epistasis|10 months ago

CSV seems to be a natural and easy fit. What advantage could parquet bring that would outweigh the disadvantage of adding two new dependencies? (One in Python and one in R)

pletnes|10 months ago

Many of the problems with CSV arise because you don't control both the reader and the writer. Here, if you're two people who collaborate OK, you should be fine.
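
When you do control both sides, the agreement can be written down in the reader itself. A rough sketch, assuming a made-up contract of an integer `id` column and a float `score` column (neither name comes from the system described above):

```python
import csv
import io

# Hypothetical contract between writer and reader: column name -> converter.
EXPECTED_COLUMNS = {"id": int, "score": float}


def parse_agreed_csv(handle, expected=EXPECTED_COLUMNS):
    """Read a CSV, rejecting any header that differs from the agreed columns."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or set(reader.fieldnames) != set(expected):
        raise ValueError(f"unexpected header: {reader.fieldnames!r}")
    # Convert each field with its agreed type; a bad or missing value raises
    # ValueError or TypeError instead of slipping through as a string.
    return [
        {name: convert(row[name]) for name, convert in expected.items()}
        for row in reader
    ]


if __name__ == "__main__":
    print(parse_agreed_csv(io.StringIO("id,score\n1,0.5\n")))
```

This uses only the standard library, so it adds no dependencies on either side.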