I disagree with this list of "falsehoods programmers believe".
Most of those lists describe objectively wrong assumptions programmers make about some real-world phenomenon. But in this case I'd argue that a lot of these points are simply not correctly formatted CSV files. Just because someone handed you a file with the extension .csv does not mean it's a proper CSV file, and it certainly does not mean that you have to guess at what it intends to encode without assuming any of the things on the list.
For example, "All CSVs contains a single consistent encoding". If this is not the case I'd (rightfully) reject the file as being a proper CSV file.
You can certainly reject the file, but that's not going to make your user happy. The (sad) reality is that you often have to deal with these edge cases in software.
The user does not care that the CSV is not proper. They just want to open it, and if it doesn't work, they will blame your program, not the source of the CSV.
And you only control the former. I have had to deal with a CSV export from a web service that was "not proper". I notified their support, but that problem was never fixed.
Luckily, the user in that case was just me. But if I was making software for someone else, blaming the right source of the problem would have not taken me anywhere.
True, but if your job involves data transformations or migrations, then bad files, dirty data and such things are just part of the job. You deal with them. Rejecting incoming data because it isn't perfect is a path to bad customer service, and a poor reputation.
> For example, "All CSVs contains a single consistent encoding". If this is not the case I'd (rightfully) reject the file as being a proper CSV file.
"CSV" only specifies the metacharacters (at best); unless it's explicitly stated that the CSV is in a single consistent encoding, it's no more improper than inconsistently encoded Linux paths. It's stupid and a pain in the ass to be sure, but it's not improper.
And when you don't control the entire pipeline, you will eventually hit this issue.
In fact the first falsehood programmers believe about CSVs is that it's a good serialisation, interchange (between separate systems) or human-editing format. While it's routinely used for all of those, it's absolute garbage at all of them.
> For example, "All CSVs contains a single consistent encoding". If this is not the case I'd (rightfully) reject the file as being a proper CSV file.
That's literally the only one on there I felt that way about. All the other ones can and should be accounted for in your application if it accepts arbitrary CSVs.
Encoding is a special case because it's impossible to do anything other than guess at a plaintext document's encoding. If you're getting byte-decoding errors, the code should try a few different encodings before rejecting the document.
This won't help if there's a mix of encodings in the document. If this is the case, nothing can help and you will have garbage going in.
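That try-a-few-encodings approach might look like this in Python. A minimal sketch; the candidate list and its order are assumptions you would tailor to your own data sources:

```python
# Try a few likely encodings before rejecting the document outright.
# The candidate list is an assumption -- adjust it to your sources.
# latin-1 maps every byte, so it acts as a never-failing last resort.
CANDIDATES = ["utf-8-sig", "utf-8", "cp1252", "latin-1"]

def decode_csv_bytes(raw: bytes) -> str:
    for enc in CANDIDATES:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode this file")

# cp1252 bytes are invalid UTF-8 here, so the loop falls through to cp1252.
text = decode_csv_bytes("naïve,café\n".encode("cp1252"))
```

Note that a successful decode is still only a guess: many byte sequences are valid in several encodings at once.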
Not surprisingly, most lists of "falsehoods" are like that.
I believe it is practically impossible, or at least intractable, to consider every item on those lists. Rather, they are convenient checklists for consideration when you are delving into the realm of those complicated subjects. Do you reject, say, 30% of those items? That's fine. Just make sure that you are explicitly making tradeoffs and will consult the lists again when your initial tradeoffs wear off.

If you're producing CSV files, then you can be more strict.

https://en.wikipedia.org/wiki/Robustness_principle
A colleague got stuck trying to parse a 1-million-line CSV for a while; then he found out that the last line was a PHPMyAdmin timeout error.
Maybe they should add "All CSVs contain no error messages" to the list :)
Lol, "reject" the file. The author in this story could be me, and the reason we both have so much experience working with that long list of fucked-up CSV files is that we don't have the power to 'reject' anything. Our client gives us the data and tells us to fuck off until we solve their problem. Professional services != consumer services in the servicer/servicee power disparity.

I mean if an API sends invalid JSON you can't parse it...
https://news.ycombinator.com/item?id=13260082

> I honestly think this genre is horrible and counterproductive, even though the writer's intentions are good. It gives no examples, no explanations, no guidelines for proper implementations - just a list of condescending gotchas, showing off the superior intellect and perception of the author.
Like, OK, I shouldn't use sep. Good to know. What should I use instead? Why tell people that \ isn't the escape character without explaining how the quoting system works?
And frankly, the stuff about Excel is divorced from reality. More than 90% of the time, the reason you're making a CSV is because somebody wants to look at the data in Excel and you don't want to deal with xlsx. If your concern is something else CSV is probably the wrong choice. Thus, for most programmers, Excel is the reference implementation of CSV.
I like this genre because many developers (and non-developers who are imagining features) tend to trivialize things that are inherently complex. “Just add N seconds and you’ll have the target date!” “Just parse the user’s address and return the street name” “Just add a radio button for the user’s gender!” Resulting in underestimates and blown schedules, database designs that are not future proof, and at the end of the day demonstrably wrong software.
Sure, good examples of how to handle each edge case would be ideal, but merely pointing out all the bad assumptions to someone is a valuable first step.
I think there's more going on than condescending gotchas here. I see the whole genre as fairly tongue-in-cheek enumerations of pitfalls that, all too often, are baked into projects as unexamined assumptions.
When we read them, we have the opportunity to check our own assumptions -- about the subject at hand (today, CSVs) and also (hopefully!) about other subjects we may encounter later.
I think a `.csv` is very useful for any kind of one- or two-dimensional numerical data, and even a lot of non-numerical data. It's very simple and human-readable. E.g. if you have an API, a `.csv` backend in addition to others (e.g. `.json` or `.mat` or similar) makes it much easier for a human to inspect the data.
I've had to deal with CSV data and Excel a lot in my career, and I learned one trick (sorry) a few years back that has made my life so much better:
Here's a scenario I bet many people have faced: Export an Excel sheet to CSV that has high-ASCII characters in it like accents. The export gets mangled when you load it into code or a text editor after. You eventually just upload it to Google Sheets and export it from there instead. It works but it's a pain.
Instead of exporting it as a CSV from Excel, export it as a UTF-16 TXT, which is basically a TSV file.
That will correctly preserve all the character encoding.
I can't promise this will work 100% of the time but it has resolved many many encoding issues going to/from Excel.
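If you use that trick, reading the export back is straightforward, since Python's utf-16 codec consumes the leading BOM for you. A small sketch with made-up sample data standing in for the exported file:

```python
import csv
import io

# Excel's "Unicode Text" export is UTF-16 (with BOM), tab-delimited.
# Python's "utf-16" codec writes/strips the BOM automatically.
sample = "name\tcity\nRenée\tSão Paulo\n".encode("utf-16")

# newline="" is the recommended mode for the csv module.
with io.TextIOWrapper(io.BytesIO(sample), encoding="utf-16", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
```

Reading a real file works the same way with `open(path, encoding="utf-16", newline="")`.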
https://donatstudios.com/CSV-An-Encoding-Nightmare

As of this writing, there exists a single usable CSV format that Microsoft Excel can both read and write safely across platforms: tab-delimited UTF-16LE with leading Byte Order Mark.
I work with CSVs in R, but I don't work with Excel that much. Thanks for the useful tip.
My one trick there is to just not ask for CSV, and to import Excel files directly. Most languages have libraries to read basic (non-formula) Excel files; they work fine, and provide richer data models than CSV (though somewhat risky, as you have to deal with formatting and with finding e.g. numbers where you expected strings, and the other way around).
I've spent a lot of time thinking about a better format that is close enough to CSV to be practical, but has more precisely defined semantics and structure, also to support better usability (decreasing the need for manual integrity checks after parsing). I wanted at least a defined encoding and defined schema (table definitions with fixed number of typed columns). Optionally Unique Keys and Foreign Keys, but that quickly leads to a situation where there are more possible features with diminishing returns to consider.
I ended up with this [1] and a python implementation [2], and it turned out not too bad. I've also done a more pragmatic C implementation (couple hundred LOC) in a toy project [3] (wsl.c and *.wsl files), and it turned out quite usable.
I think what prevents adoption of such a thing is that it's very hard to standardize on primitive types and integrity features.

[1] http://jstimpfle.de/projects/wsl/main.html [2] https://github.com/jstimpfle/python-wsl/ [3] https://github.com/jstimpfle/learn-opengl
You could come up with the most wonderful format in the world, but unless it's transparently readable and writable by Excel then it will never replace CSV. How does Excel handle your whitespace-separated files?
I went the same route but ended up adopting feather from the makers of Python's Pandas and R's tidyverse, Wes McKinney and Hadley Wickham. https://blog.rstudio.com/2016/03/29/feather/
> SUMMARY: Feather's good performance is a side effect of its design, but the primary goal of the project is to have a common memory layout (Apache Arrow) and metadata (type information) for use in multiple programming languages. http://wesmckinney.com/blog/feather-its-the-metadata/
The approach is data frames on disk storage.
> data frames are lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (_factors), or _string. Every column can have missing values.

This fits the needs of all of my CSV usage.
Many of the complaints stem from the fact that many CSV writers do not follow any spec and produce garbage data. Kind of like when the industry decided that reading badly formed HTML was beneficial.
For the sake of sanity I'd recommend splitting the logic of reading csv into two parts. One that takes garbage and produces a correctly formatted file (e.g. properly quoted values, equal number of columns, single encoding) and the second part that actually does CSV parsing adhering to strict rules.
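A sketch of that two-stage split. The normalisation rules shown (force a fixed column count, quote everything) are illustrative assumptions, not a complete cleaner:

```python
import csv
import io

def normalize(dirty_text: str, ncols: int) -> str:
    """Stage 1: emit a strictly quoted, rectangular CSV from messy input."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(io.StringIO(dirty_text)):
        # Pad short rows and truncate long ones to a fixed column count.
        row = (row + [""] * ncols)[:ncols]
        writer.writerow(row)
    return out.getvalue()

def parse_strict(clean_text: str):
    """Stage 2: parse with strict=True so genuine errors still surface."""
    return list(csv.reader(io.StringIO(clean_text), strict=True))

clean = normalize("a,b\nc\nd,e,f\n", ncols=2)
rows = parse_strict(clean)
```

The nice property is that stage 1 concentrates all the guesswork in one place, while stage 2 can stay rigid and simple.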
That's actually a very interesting idea/pattern for dealing with problems.
However, I'm not sure what it means practically speaking on a real project - most of the time, you're presenting the user with some kind of upload screen, then do processing behind the scenes, and display results. You can (and probably should) structure things such that you're internally converting the "bad" csv file to a "good" csv file, and parsing it normally afterwards, but this is all behind the scenes anyway.
Maybe if there was a professional organization of software engineers, it could settle these disputes and decide who is to blame for the wrong implementation.
The content of this article is great. But the title represents its own bad assumption & falsehood. Just because a CSV doesn't conform to the spec doesn't mean the person who produced the CSV file believes it to be correct or misunderstands that CSV is complicated. For most devs I've dealt with, the reason has been conscious and admitted laziness: trying to get something done faster, knowing it's not 100% correct. Conforming to the spec is harder and more confusing than splitting on commas or tabs in a script, while splitting on commas or tabs works 80% of the time.
It's lame and lazy to write & read CSV with one-off code, but we already knew that, CSV is just deceptively simple looking so the temptation is strong. But what we most need isn't the list of things we're doing wrong. We already know we're doing it wrong. What we need to do is to use a library that does it right. What we need to have is a list of libraries and tools that are easy to use and fix all the items in the author's list. It might also be useful to suggest simple things a hacker can do in a couple lines of code in a one-off CSV reader/writer that covers as many cases as possible.
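On the "couple of lines a hacker can do" front, Python's standard-library csv.Sniffer is one such tool: it guesses the dialect from a sample. A sketch; bear in mind the sniff is exactly that, a guess, so verify it against real data:

```python
import csv
import io

# A semicolon-delimited sample with a quoted field containing the delimiter.
sample = 'name;age\n"Doe; Jane";42\n'

# Sniffer inspects the sample and guesses delimiter/quote conventions.
dialect = csv.Sniffer().sniff(sample)
rows = list(csv.reader(io.StringIO(sample), dialect))
```

For untrusted inputs you would still wrap this in error handling, since Sniffer raises `csv.Error` when it cannot make a determination.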
Garbage data has been with us since the beginning, and will always be with us. Whether it's bad CSV, HTML, XML or JSON, half the data integrator's job will be 'sanitizing' the data. Which means dealing with exceptions in an empirical manner.
In our first effort at absorbing a Wall Street financial data satellite stream (stock trades, bond rates and all the rest), we found that every day there would be new bad data. Trades out of order; missing decimal points; badly spelled security names; clumsy attempts to 'edit in' trades after close. The world's financial data should probably have been better managed, but it was all hand-generated back then (the 90's) and mostly viewed as a stream by brokers on a terminal screen, so the human filter could understand most of the BS. But a database, not so much.
As someone mentioned in the comments on the article, I think it is very common for people to use LibreOffice Calc to work with CSV, because Excel does not handle UTF-8 all that well. In LibreOffice you can open an Excel workbook and export a CSV in UTF-8, and ask it to double-quote all of the fields too (which is a very good thing to do to CSV files).
While 33 and 34 are true (33. "Excel is a good tool for working with CSVs", 34. "Excel is an OK tool for working with CSVs"), there is one reality that makes them irrelevant: when working with data that is in any way, shape or form touched manually during its lifespan (and that includes 'looking at it'), dealing with Excel is inevitable.
I never understood why Excel doesn't work easily with Unicode CSV files. Well I know the workaround, but I still don't understand why they don't improve the handling of CSV files.
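For reference, the usual workaround (and possibly the one meant here) is to write UTF-8 with a leading BOM via Python's `utf-8-sig` codec; reasonably recent Excel versions use that BOM to detect the encoding. A sketch, with a made-up file name:

```python
import csv

# "utf-8-sig" prepends the BOM (EF BB BF) that Excel looks for;
# without it, Excel tends to assume a legacy ANSI code page.
with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["Renée", "São Paulo"])
```

The same file still reads back fine in BOM-unaware tools, since most UTF-8 readers either skip or tolerate the BOM.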
Author here, just waking up. I object to the title change in the strongest terms. It's simply not "Problems with CSVs". That's not at all what the post is.
The list isn’t problems, and if you read it as a list of problems it’s nonsense.
Also why on earth did the fact that it is 14 months old need to be noted in the title - has anything changed in the last 14 months? Not that I am aware of.

48. CSVs are useful to make machine-2-machine interfaces
49. CSVs for importing/exporting data eliminates all those pesky programmers taking up too much time with the API design
50. CSVs are real time
51. CSVs are a robust mechanism to export/import data
> All records contain a single consistent encoding
> All fields contain a single consistent encoding
What? How are you expected to handle these cases?
Generally you have to guess the encoding of a plain-text file like a CSV. I'm fairly sure the common case is that the entire file will be in a consistent encoding. If you were to guess per-record or per-field, I suspect it's more likely you'd guess some records/fields wrong than that you'd encounter a file with varying encoding.
I’d be interested to see some real world examples of CSVs with varying encoding and how existing software handles that.
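I don't have real-world samples to hand, but the least-bad handling I know of is decoding record by record with a fallback, so one legacy-encoded row doesn't force a wrong whole-file guess. A sketch; the utf-8-then-latin-1 order is an assumption:

```python
def decode_lines(raw: bytes):
    """Decode each record independently: try UTF-8, fall back to latin-1.

    This is a guess-per-record strategy -- wrong guesses are still possible,
    since many byte sequences are valid in several encodings at once.
    """
    for line in raw.splitlines():
        try:
            yield line.decode("utf-8")
        except UnicodeDecodeError:
            # latin-1 maps every byte, so this branch never fails.
            yield line.decode("latin-1")

# One UTF-8 record and one cp1252 record in the same "file".
mixed = "id,città\n".encode("utf-8") + "1,perù\n".encode("cp1252")
records = list(decode_lines(mixed))
```

The obvious caveat: splitting on newlines before decoding only works for encodings that are ASCII-compatible, so this falls apart for UTF-16 and friends.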
A lot of broken encoding is the result of various copy-pasting and OCR issues during data collection.
If there is one thing I could change about all spreadsheet software, it would be to paste without formatting by default. It would make everyone's life so much easier...
Dealing with the backscatter from CSV misunderstandings can be fairly challenging - for a lot of us, the customer experience is improved by being as accommodating as possible instead of correct. We at Intercom released a Ruby CSV parser that "is a ridiculously tolerant and liberal parser which aims to yield as much usable data as possible out of such real-world CSVs".

https://github.com/intercom/hippie_csv
The biggest problem with CSV is that it looks easier than it actually is, so people go ahead and write their own CSV "printer"/parser, which is usually just a ",".join([1,2,3]) or "a,b,c".split(",").
In reality CSV has complexity akin to JSON. You have to consider the possibility of quoting, escaping, encoding, delimiters, etc. You should always use a library to generate and parse CSV to avoid these issues.
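A two-line illustration of where the naive split falls over as soon as quoting appears:

```python
import csv
import io

line = 'widget,"Bolt, M3",100\n'

# The naive approach splits inside the quoted field.
naive = line.strip().split(",")

# The csv module respects the quoting rules.
proper = next(csv.reader(io.StringIO(line)))
```

Here `naive` yields four fragments with stray quote characters, while `proper` correctly returns the three fields, comma and all.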
CSV is a great example of what happens when you do the opposite of "be strict in what you emit, be liberal in what you accept." It doesn't matter though, because it's just about good enough, and you almost never want a _general_ CSV solution, rather you need one specific to a particular problem or workflow (so you can handle your situational idiosyncrasies.)
It really depends on the purpose. For scientific computing, for instance, CSV is used as a "standard" for quick tables of numbers. Sufficiently "normal" files can be read by almost all scientific tools (speaking of libraries such as numpy, systems such as Mathematica, Maple, or ecosystems such as R, basically even C++ and Fortran allow this task to be implemented in a couple of lines).
However, also in this context CSV has major drawbacks, for instance not defining column headers, comment lines, number formatting. A proper drop-in replacement in my subject is HDF5 (https://www.hdfgroup.org/) which is mostly used for being a binary representation of n-dimensional tables with couplings to major programming systems.
However, I have never heard of HDF5 outside of science, which is why I mention it as an example here.
Excel is a horrible tool, in general. I remember that I exported a very long list of numbers to a .CSV. Excel then formatted the numbers (like 53564566934 to the form 52564E+6).
If you then copy and paste, the numbers are actually converted to 52564000000. The result: I had to do a lot of work again. Which is partly my fault, and partly horrific design.
Just saying: great timing, because just today I had this discussion covering everything on the list. Different character encodings in one file, fields which contain line feeds, fields which contain field delimiters without escaping or quotes, and so on. That's very common: people use whatever source for data, copy-paste it into Excel, and then think it's good CSV after that. I usually handle CSV as required; every file can be, and usually is, different. You write I/O code according to case-specific requirements. In many cases this means some manual fixing and a custom parser. -> Job done.
Edit: honestly, I don't even remember when I last saw an RFC 4180-compliant file. That's just the truth out there.
* CSV files will have consistent EOL markers
* CSV files will always have a trailing EOL marker
* CSV files will never have a trailing EOL marker
* Any file with a name ending .csv is a CSV file (or something close to)
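On the EOL points: Python's csv module copes with mixed and missing EOL markers as long as the stream is opened with `newline=""`, which is the mode the module's own docs recommend. A sketch:

```python
import csv
import io

# Mixed EOLs and a missing trailing newline, all in one "CSV":
data = "a,b\r\nc,d\ne,f"

# newline="" passes the raw line endings through so the csv module
# can normalise them itself.
rows = list(csv.reader(io.StringIO(data, newline="")))
```

All three records parse identically despite the inconsistent terminators.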
> Excel can losslessly save CSVs it opens
A particular problem we've had many times with some clients is Excel messing around with date/time formats: files with dates formatted YYYY-MM-DD being changed to American style, columns containing both date and time having one or the other chopped off, dates being converted to numeric representations, ...