jasonpbecker | 3 years ago
The best way to do exactly what you're saying is just use R and do:
```r
data.table::fread('my file.txt') |> arrow::write_parquet('new_file.parquet')
```
That will do the exact same thing: sanitize the file, parse and transform the data correctly, log questionable lines, and output a binary file that other systems can use later.
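A slightly more explicit version of the same pipeline, for anyone curious what knobs are available (the `na.strings` values and compression choice here are assumptions; adjust them for your data):

```r
library(data.table)
library(arrow)

# fread() auto-detects the separator and quoting, coerces column types,
# and warns about malformed lines instead of failing silently.
dt <- fread("my file.txt", na.strings = c("", "NA"))

# write_parquet() produces a typed, compressed binary file that
# downstream systems (Python, Spark, DuckDB, etc.) can read directly.
write_parquet(dt, "new_file.parquet", compression = "snappy")
```

The point is the same as the one-liner: let two heavily battle-tested parsers handle the messy edges of delimited text rather than rolling your own.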
When you're working with thousands of files and hundreds of millions of lines every day, and your client will be rightfully pissed if their data is off by $100,000, my only resolution is to wait 2 weeks for someone in IT upstream on their end to _maybe_ fix the file, hopefully without introducing a new error...
Writing my own delimited file parser over a huge amount of community effort sounds like the worst case of not-invented-here syndrome ever. What stinks is how willing most of those projects are to fail silently.