top | item 34478295

jasonpbecker | 3 years ago

What you're saying is I should build and maintain my own delimited file parser with my own logic for healing these files and logging, rather than using any number of hugely popular delimited file parsers used by hundreds of thousands (millions?) of people with many strong programmers maintaining a well-tested code base.

The best way to do exactly what you're saying is just use R and do:

```
data.table::fread('my file.txt') |> arrow::write_parquet('new_file.parquet')
```

That will do exactly what you're describing: sanitize the file, parse and transform the data correctly, log questionable lines, and output a binary file that other systems can use later.
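The same shape of pipeline can be sketched in Python with only the standard library (a minimal sketch: SQLite stands in for parquet as the binary output, and the sample data and column layout are assumptions, not anything from the thread):

```python
import csv
import io
import sqlite3

def csv_to_sqlite(text, db_path=":memory:"):
    """Parse delimited text, log rows with the wrong field count
    instead of failing silently, and load the clean rows into a
    binary store (SQLite here, standing in for parquet)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    questionable = []  # (line number, raw row) for every bad line
    conn = sqlite3.connect(db_path)
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE data ({cols})")
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            questionable.append((lineno, row))
        else:
            conn.execute(f"INSERT INTO data VALUES ({placeholders})", row)
    conn.commit()
    return conn, questionable

# Hypothetical input: one row has an extra field and gets logged, not dropped silently.
sample = "id,amount\n1,100\n2,200,EXTRA\n3,300\n"
conn, bad = csv_to_sqlite(sample)
```

The point of the sketch is the `questionable` list: every healed or skipped line is recorded with its line number, so "off by $100,000" problems are traceable rather than invisible.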

When you're working with thousands of files and hundreds of millions of lines every day, your client will be rightfully pissed if their data is off by $100,000, and my only recourse is to wait two weeks for someone upstream in their IT department to _maybe_ fix the file, hopefully without introducing a new error.

Writing my own delimited file parser instead of building on a huge amount of community effort sounds like the worst case of not-invented-here syndrome ever. What stinks is how willing most of those projects are to fail silently.

nuc1e0n | 3 years ago

These popular file parsers should have better validation mechanisms. Or better yet, have separate validation prior to the rest of any workflow.
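A separate up-front validation pass of the kind suggested here can be sketched in Python with the standard library (the pass/fail rule, that every row must match the header's field count, is an assumption for illustration):

```python
import csv
import io

def prevalidate(text, delimiter=","):
    """Check a delimited file before any downstream workflow runs.
    Returns (ok, report), where report lists every line whose field
    count disagrees with the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    report = [
        (lineno, len(row))
        for lineno, row in enumerate(reader, start=2)
        if len(row) != len(header)
    ]
    return len(report) == 0, report

# Hypothetical input: line 3 has three fields against a two-field header.
ok, report = prevalidate("a,b\n1,2\n3,4,5\n")
```

Running validation as its own step means a bad file is rejected with line-level detail before it ever touches the parsing and transformation stages.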

jasonpbecker | 3 years ago

Yes, that's precisely what I'm saying. They should be better at validation, and in my experience `data.table::fread` in R is best in class.