top | item 34492468


jasonpbecker | 3 years ago

One way to think about this is that CSVs are poorly specified, and what specifications do exist are ignored by CSV generating processes from popular RDBMSs. The advantage of a binary format with a specification and approved standard that is adopted would be having robust test suites that are followed that makes data generation unlikely to produce a bad file.

In the case of text-delimited files, it is simply too easy and too common to generate a malformed file that other systems cannot read. Because data loss is inherent in a text-based format, folks don't even bother to check whether the files they generate can be successfully interpreted by their own system. PostgreSQL, Oracle, and MS SQL will all gladly produce CSV files that cannot be read back successfully. I'm not talking about some loss of metadata; I'm talking about files that cannot be read at all.

In the "real world", of course I run validations on the data I accept. A common one for me, since the files are essentially "append only" when they're updated, is to check for meaningfully fewer records than in previous data loads. That's my best way of detecting that records were dropped when the file was read, whether because quoting was messed up or because the file transfer was incomplete.
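The count-based check described above can be sketched in a few lines. The function name and the 2% tolerance are illustrative assumptions, not the commenter's actual code:

```python
# Hypothetical sketch of the sanity check described above: flag a new
# load of an append-only feed whose record count dropped meaningfully
# relative to the previous successful load.
def looks_truncated(new_count: int, previous_count: int,
                    tolerance: float = 0.02) -> bool:
    """Return True when the new load has shrunk by more than `tolerance`."""
    if previous_count == 0:
        return False  # nothing to compare against on the first load
    return new_count < previous_count * (1 - tolerance)

print(looks_truncated(9_500, 10_000))   # 5% drop -> suspicious: True
print(looks_truncated(10_050, 10_000))  # grew as expected: False
```

A threshold like this catches silent truncation from bad quoting or a cut-off transfer, at the cost of missing small losses that stay within the tolerance.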

It's still not great that a mismatched quote, which is quite common, doesn't even trigger a warning in the validation methods of these parsers.
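The silent-failure mode described above is easy to reproduce with Python's standard `csv` module (the payload here is an invented example): an unclosed quote swallows everything that follows, including the next record, and the parser raises no error or warning.

```python
import csv
import io

# The opening quote before "Acme" is never closed, so the parser
# consumes the rest of the input (newline and third record included)
# as one field -- silently.
payload = 'id,name\n1,Widget\n2,"Acme Widget\n3,Gadget\n'

rows = list(csv.reader(io.StringIO(payload)))
for row in rows:
    print(row)
# The third data record ("3,Gadget") disappears into the previous
# field; only three rows come back, with no error raised.
```

Passing `strict=True` to `csv.reader` makes it stricter about quoting, but it still does not flag a missing closing quote at end of input.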



nuc1e0n | 3 years ago

CSV isn't poorly specified, but it is mostly poorly implemented relative to the specifications that exist (Microsoft's products are some of the worst offenders here). Data loss is no more inherent in a text-based format than it is in a binary one. The only difference is that it's harder to produce or extract data from binary files. That's detrimental to information transfer, and it's why I would specifically recommend against binary formats.

JSON is much easier to validate and is about as easy to transfer as CSV. It can have minimal overhead as well if the data is stored as arrays (or an array of arrays).
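The array-of-arrays layout mentioned above can be sketched with the standard `json` module (the table contents are invented for illustration): a header row followed by data rows keeps overhead close to CSV, while a truncated or malformed payload fails loudly instead of being silently misread.

```python
import json

# Header row followed by data rows -- commas and embedded quotes
# need no special casing, unlike in CSV.
table = [
    ["id", "name"],
    [1, "Widget"],
    [2, 'Acme, "deluxe" Widget'],
]

encoded = json.dumps(table)
decoded = json.loads(encoded)
assert decoded == table  # lossless round trip

# A cut-off transfer is rejected outright rather than quietly
# dropping or merging records:
try:
    json.loads(encoded[:-5])
except json.JSONDecodeError:
    print("truncated payload rejected")
```

The trade-off is that plain JSON must be parsed as a whole document, so very large tables are often shipped as newline-delimited JSON instead.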