
Why CSV is still king

122 points| boudra | 1 year ago |konbert.com | reply

144 comments

[+] thbb123|1 year ago|reply
Sad that the ASCII specification includes two codes, 30 and 31, the record separator and unit (field) separator respectively, precisely to answer cleanly the need that CSV addresses.

During the '90s I was anal about using them, pissing the hell out of my teammates and users by forcing them to use these 'standards-compliant' files.

Had to give up.

[+] viraptor|1 year ago|reply
They're hard to type though. You need to teach people how to use those -vs- just using a comma.
[+] baggy_trough|1 year ago|reply
A valuable lesson in user experience triumphing over pedantic correctness.
[+] hanche|1 year ago|reply
Sqlite still supports it: .mode ascii
[+] snthpy|1 year ago|reply
I've recently learned about them and am trying to revive their usage - .asv and .usv files.

The .usv separators make things easier to read at the expense of a bit more space.

The main point for me is that they make parsing so much simpler.

Who writes .csv files by hand anyway?

[+] cqqxo4zV46cp|1 year ago|reply
Standards-compliance and using esoteric features over catering for the realities of usability. Your coworkers were right to steamroll you.
[+] tanin|1 year ago|reply
What surprised me the most about CSVs is that:

- To escape the delimiter, we should enclose the value with double quotes. Ok, makes sense.

- To escape double quotes within the enclosing double quotes, we need to use 2 double quotes.

Many tools are getting it wrong. Meanwhile, some tools like pgAdmin, justifiably, allow you to configure the escaping character to be a double quote or a single quote, because the CSV standard is often not respected.
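The two escaping rules above can be checked with Python's standard csv module (a quick sketch; the exact output assumes the default RFC-4180-style dialect):

```python
import csv
import io

# Write one row whose fields contain a double quote and a comma.
buf = io.StringIO()
csv.writer(buf).writerow(['say "hi"', 'a,b'])

# Default dialect follows the rules described above: the fields get
# enclosed in double quotes, and embedded quotes are doubled.
print(buf.getvalue().strip())  # "say ""hi""","a,b"

# Round-trip: the reader undoes both rules.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['say "hi"', 'a,b']
```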

Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.

[+] arp242|1 year ago|reply
> Many tools are getting it wrong.

They're not getting it wrong, they're just assuming a different variant.

There is no "standard" for CSV. Yes, there's an RFC, published in 2005, about 30 years after everyone was already using CSV. That's too late. You can't expect people to drop all compatibility just because someone published some document somewhere. RFC 4180 explicitly says that "it does not specify an Internet standard of any kind", although many people do take it as a "standard". But even if it did call itself a standard: it's still just some document someone published somewhere.

They should have just created a new "Comma Separated Data" (file.csd) standard or something instead of trying to retroactively redefine something that already exists. Then applications could add that as a new option, rather than "CSV, but different from what we already support". That was always going to be an uphill battle.

Never mind that RFC 4180 is just insufficient by not specifying character encodings in the file itself, as well as some other things such as delimiters. If someone were to write a decent standard and market it a bit, then I could totally see this taking off, just as TOML "standardized INI files" took off.

[+] 999900000999|1 year ago|reply
I can't tell you how to run your business, but subscriptions for offline apps aren't going to be popular here.

Charge me more upfront for a perpetual license, or just version the software: say $40 today for v3, then charge a reasonable fee every year to upgrade, but allow me to keep using the software I purchased...

[+] AnonC|1 year ago|reply
> Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.

Looks like SQL is the main selling point for your tool. For other simpler needs, Modern CSV [1] seems suitable (and it’s cheaper too, with a one time purchase compared to a yearly subscription fee). But Modern CSV does not support SQL or other ways to create complex queries.

[1]: https://www.moderncsv.com/

[+] kawakamimoeki|1 year ago|reply
As is the case with Markdown, many parsers have prioritized ease of implementation over formal rigor.
[+] lenkite|1 year ago|reply
I wish there was a text format that used the ascii unit separator and record separator. It would have solved so many problematic edge cases.
[+] cm2187|1 year ago|reply
The one thing tools get wrong the most is that there is no escaping of the newline character.
[+] Nihilartikel|1 year ago|reply
I've found the Unicode cat emoji to be an effective delimiter to avoid escaping more common chars in my cat-separated-value artifacts.

Of course the cat emoji is escaped by the puppy emoji if it occurs in a value. The puppy emoji escapes itself when needed.
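In the spirit of the joke, a toy sketch of that escape scheme in Python (the `CAT`/`PUP` names and the exact escape logic are mine, purely illustrative):

```python
CAT, PUP = "\U0001F431", "\U0001F436"  # cat delimiter, puppy escape

def esc(value):
    # Escape the escape character first, then the delimiter.
    return value.replace(PUP, PUP + PUP).replace(CAT, PUP + CAT)

def split_row(row):
    # Walk the row: a puppy means "take the next char literally",
    # an unescaped cat is a field boundary.
    fields, cur, i = [], [], 0
    while i < len(row):
        if row[i] == PUP:
            cur.append(row[i + 1])
            i += 2
        elif row[i] == CAT:
            fields.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(row[i])
            i += 1
    fields.append("".join(cur))
    return fields

values = ["plain", "has a \U0001F431 cat", "has a \U0001F436 puppy"]
row = CAT.join(esc(v) for v in values)
assert split_row(row) == values
```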

[+] theendisney4|1 year ago|reply
In the 80's i thought we should have an entire character set just for code. While never implemented the idea arguably aged well.

I also considered a dedicated keyboard like apl just to be dense about it.

Have each character signed by the keyboard so that we have proof by whoem it was typed and when.

People who dont work here don't get to write code. It just wont happen. haha

[+] Hackbraten|1 year ago|reply
Instructions unclear, my puppy emoji is now chasing its own tail
[+] geekodour|1 year ago|reply
Last line unclear; an example would be great!
[+] zarzavat|1 year ago|reply
Just use TSV. Commas are a terrible delimiter because many human strings have commas in them. This means that CSV needs quoting of fields and nobody can agree on how exactly that should work.

TSV doesn't have this problem. It can represent any string that doesn't contain a tab or a newline, which is many more than CSV can handle without quoting.
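A sketch of why that simplifies things: with tab-separated values, writing and parsing are plain `join`/`split`, with no quoting state machine (assuming, as the parent says, that no value contains a tab or newline):

```python
rows = [["name", "comment"], ["Ada", 'said "hi", twice']]

# Serialize: tabs between fields, newlines between records.
tsv = "\n".join("\t".join(fields) for fields in rows)

# Parse: a split is all you need; commas and quotes pass through untouched.
parsed = [line.split("\t") for line in tsv.split("\n")]
assert parsed == rows
```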

[+] uncharted9|1 year ago|reply
It's 2024 and Excel still doesn't natively parse CSV files with tabs as delimiters. When I send such files to my colleagues, they complain about not being able to open them directly in Excel. I wish Excel would pop up a window, like LibreOffice does, to confirm the delimiter before opening a CSV file.
[+] didntcheck|1 year ago|reply
The one flaw I do see with them is blank cells

If you go with the CSV convention of two adjacent tabs => blank cell in the middle, then rows of different length will not line up properly in most text editors. And "different length" depends on the client's tab width too

If you allow any amount of tabs between columns, then you need a special way to signify an actually-blank column. And escaping for when you want to quote that

If you say "use tabs for columns and spaces for alignment", then you've got to trim all values, which may not be desirable

[+] MattPalmer1086|1 year ago|reply
That's what I always use when I need to write out some tabular data. Haven't had any problem importing them into anything.
[+] btreecat|1 year ago|reply
In that case, why not use "|" (pipe character)?
[+] calibas|1 year ago|reply
Why don't we use 0x1F (␟) instead of "," or TAB to separate units and 0x1E (␞) to separate records?

It seems like half the problems with CSV were solved back in the 70s with ASCII codes.
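A quick sketch of that idea (0x1F between fields, 0x1E between records); since the separators can't appear in normal text, no escaping is needed:

```python
US, RS = "\x1f", "\x1e"  # ASCII unit (field) and record separators

rows = [["name", "note"], ["Ada", 'commas, tabs\tand "quotes" are all fine']]

# Serialize and parse with plain join/split; no quoting rules at all.
data = RS.join(US.join(fields) for fields in rows)
parsed = [record.split(US) for record in data.split(RS)]
assert parsed == rows
```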

[+] NikkiA|1 year ago|reply
Because nobody made keyboards with those keys. Had they put a 'next unit' and 'next record' key pair on keyboards that sent them, we'd all be fine, but instead we got overly redundant text-editing keys rather than keyboards better suited to data entry.
[+] impure|1 year ago|reply
I switched to TSV files for my app. None of my values contain tabs so I don't have to escape anything.
[+] jillesvangurp|1 year ago|reply
I tend to prefer that over CSV as well. But usually I go for ndjson files since that's a bit more flexible for more complex data and easier to deal with when parsing. But it depends on the context what I use.

However, a good reason to use TSV/CSV is that import/export in spreadsheets is really easy. TSV used to have an obscure advantage: Google Sheets could export it but not CSV. They've since fixed that and you can do both now.

And of course, getting CSV out of a database is straightforward as well. Both databases and spreadsheets are of course tabular data; so the format is a good fit for that.

Spreadsheets are nice when you are dealing with non-technical people. They make it easier to involve them in editing / managing content. Also, a spreadsheet is a great substitute for admin tools to edit this data. I was once on a project where we paid some poor freelancer to work on a convoluted tool to edit data. In the end the customer hated it, and we unceremoniously replaced it with a spreadsheet (my suggestion). Much easier to edit stuff with those. They loved it. The poor guy had worked for months on that tool with the help of a lot of misguided UX, design and product management. It got super complicated and was tedious to use. A complete waste of time. All they needed was a simple spreadsheet and some way to deploy the data inside it. They already knew how to use spreadsheets, so they were all over it.

[+] SoftTalker|1 year ago|reply
The ASCII specification defines characters for separating fields, groups, records, and files, but I've rarely seen them used.
[+] cm2187|1 year ago|reply
How do you escape newline characters?
[+] userbinator|1 year ago|reply
I wish binary length-prefixed formats would've become more common. Parsing text, and especially escaping, seems to be a continual source of bugs and confusion. Then again, those who don't implement escaping correctly may also overlap with those who can't be bothered learning how to use a hex editor.
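A minimal sketch of the idea (the 4-byte little-endian length prefix is an arbitrary choice of mine): because each field's length is stated up front, the payload bytes never need escaping.

```python
import struct

def encode_fields(fields):
    # Each field: 4-byte little-endian byte length, then the raw bytes.
    out = bytearray()
    for f in fields:
        raw = f.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw
    return bytes(out)

def decode_fields(data):
    # Read a length, slice exactly that many bytes, repeat until done.
    fields, i = [], 0
    while i < len(data):
        (n,) = struct.unpack_from("<I", data, i)
        i += 4
        fields.append(data[i:i + n].decode("utf-8"))
        i += n
    return fields

tricky = ['a,"b"', "line\nbreak", "tab\there"]
assert decode_fields(encode_fields(tricky)) == tricky
```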
[+] Kon-Peki|1 year ago|reply
CSV comes from a world in which the producer and consumer know each other; if there are problems they talk to each other and work it out.

There is still plenty of this kind of data exchange happening, and CSV is perfectly fine for it.

If I'm consuming data produced by some giant tech company or mega bank or whatever, there is no chance I'll be able to get them to fix some issue I have processing it. From these kinds of folks, I'd like something other than CSV.

[+] LorenPechtel|1 year ago|reply
But the big guy most likely exports the .csv correctly in the first place, so you don't *need* to work with them.

Only once have I seen a bad .csv from a "big" company--big fish in a small pond type big. We were looking to get data out; hey, great, .csv is a valid export format. I'm not sure exactly what was in that file, but it appeared to be the printout with some field info attached to each field (put this at that location on the paper, etc., one field per line). Every output format that product had was bugged in some scenario.

[+] theanonymousone|1 year ago|reply
I fully agree that CSV is king and am quite happy about it. But the comma character was probably one of the worst choices they could have made for the "standard", IMHO of course.

Tab makes far more sense here, because you are very likely able to just convert non-delimiter tabs to spaces without losing semantics.

Even considering how editors tend to mess with the tab character, there are still better choices based on frequency in typical text: |, ~, or even ;.

All IMHO, again.

[+] endgame|1 year ago|reply
I wasn't around at the time, but surely ASCII was (even if not ubiquitous)? Is there any particular reason that the FS/GS/RS/US (file/group/record/unit separator) characters didn't catch on in this role?
[+] EvanAnderson|1 year ago|reply
I did an ETL project years ago from a legacy app that used these delimiters. It was gloriously easy. No need to worry about escaping (as these characters were illegal in the input). It's a shame they didn't catch on.
[+] al_borland|1 year ago|reply
If I had to take a guess, I’d say the answer is as simple as there is no key for them on the keyboard.
[+] jeff-hykin|1 year ago|reply
> Efforts to standardize them

I actually just finished a library that adds proper typed parsing and works with existing CSV files. It's designed to be as compatible as possible with existing spreadsheets, while allowing for perfect escaping and infinite nesting of complex data structures and strings. I think it's an ideal compromise, as most CSV files won't change at all.

https://github.com/jeff-hykin/typed_csv

[+] mannyv|1 year ago|reply
CSV is king because most ETL department programmers suck. Half the time they can't generate a CSV correctly. Anything more complicated would cause their tiny brains to explode.

I'm not bitter, I just hate working with ETL 'teams' that struggle to output the data in a specified format - even when you specify it in the way they want you to.

[+] fragmede|1 year ago|reply
> Why CSV Will Remain King

it'll only remain king as long as we let it.

move to using SQLite db files as your interchange format
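A sketch of what that looks like with Python's built-in sqlite3 module (the table and column names are made up): the file carries schema and types, and delimiter quoting never enters the picture.

```python
import sqlite3

# ":memory:" for the demo; a real exchange would use a shared .db file path.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, note TEXT, score REAL)")
con.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ada", 'commas, "quotes",\nand newlines survive intact', 9.5)],
)
con.commit()

rows = con.execute("SELECT name, note, score FROM people").fetchall()
assert rows == [("Ada", 'commas, "quotes",\nand newlines survive intact', 9.5)]
```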

[+] __mharrison__|1 year ago|reply
CSV is the VHS of data formats. Or to reference our discussion from yesterday, the markdown of data formats. It gets the job done.

I help clients deal with them frequently. For many cases they are sufficient, for other cases moving to something like parquet makes a lot of sense.

[+] EvanAnderson|1 year ago|reply
A lot of data that I see in CSV "format" would work fine as tab-delimited and wouldn't need any escaping (because most of the data I see doesn't allow literal tabs anyway). That would be a simple improvement over CSV.
[+] valiant55|1 year ago|reply
I'm surprised that the article and the comments failed to mention pipe-delimited files. I work with almost two dozen different vendors (in healthcare) and 90% use pipes. Doing data exchange with a variety of delimiters is so common that I built out a bespoke system that takes in a set of common configurations and parses the information. Other settings include line endings, encoding, escape characters, whether the header is included, etc.
[+] maerF0x0|1 year ago|reply
I prefer ndjson for the systems I build (with only JSON objects at the top level). It's much safer for a lot of edge cases. If there's significant repetition in the keys, the files end up zipping well.
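A sketch of the format (one JSON object per line; JSON's own string escaping handles the awkward characters):

```python
import json

records = [
    {"id": 1, "note": 'commas, "quotes",\nand newlines'},
    {"id": 2, "note": "plain"},
]

# Serialize: one compact JSON object per line. json.dumps escapes the
# newline inside the value, so each record stays on exactly one line.
ndjson = "\n".join(json.dumps(r) for r in records)

parsed = [json.loads(line) for line in ndjson.splitlines()]
assert parsed == records
```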
[+] deafpolygon|1 year ago|reply
CSV is still king because of one thing: inertia

It's just much easier to keep using it, since you're already doing it.

In the meantime, how about XML? /awaits the pack of raving mad HNers

[+] nuc1e0n|1 year ago|reply
As the article says, it will be interesting to see if NDJSON becomes more popular. Although it's a bit more difficult to parse and makes for larger files than CSV, it is less ambiguous.