top | item 7801370

(no title)

dbro | 11 years ago

Usually the person parsing the CSV data doesn't have control over the way the data gets written. If he did, he would probably prefer something like protocol buffers. CSV is the lowest common denominator, so it's a useful format for exchanging data between different organizations that are producing and consuming the data.

https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it to use unix shell tools like cut, awk, ... with CSV files containing millions of records.

discuss

zAy0LfpBZLC8mAC|11 years ago

You tend to have more control over the way the data is produced than you think, and you should make use of it. It's idiotic to work around broken producers over and over and over again, each time with a high risk of introducing some bugs, instead of pushing back and getting the producer fixed once and for all. Often the problem is simply in the perception that somehow broken output is just "not quite right", and therefore nothing to make a fuss about. That's not how reliable data processing works. You have a formal grammar, and either your data conforms to it or it does not, and if it doesn't, good software should simply reject it.

Your csvquote is something completely different, though it seems like you yourself might be confused about what it actually is when you use the word "ambiguous". There is nothing ambiguous about commas and newlines in CSV fields. If it were, that would be a bug in the grammar. It just so happens that many unix shell tools cannot handle CSV files in any meaningful way, because that is not their input grammar. Now, what your csvquote actually does is that it translates between CSV and a format that is compatible with that input grammar on some level, in a reversible manner. The thing to recognize is that that format is _not_ CSV and that you are actually parsing the input according to CSV grammar, so that the translation is actually reversible. Such a conversion between formats is obviously perfectly fine - as long as you can prove that the conversion is reversible, that the round-trip is the identity function, that the processing you do on the converted data is actually isomorphic to what you conceptually want to do, and so on.

BTW, I suspect that that code would be quite a bit faster if you didn't use a function pointer in that way and/or made the functions static. I haven't tried what compilers do with it, but chances are they keep that pointer call in the inner loop, which would be terribly slow. Also, you might want to review your error checking, there are quite a few opportunities for errors to go undetected, thus silently corrupting data.