It's sad that the ASCII specification includes two codes, 30 and 31 (record separator and unit separator), precisely to answer cleanly the need that CSV addresses.
During the '90s I was anal about using them, pissing the hell out of my teammates and users by forcing them to use these 'standard compliant' files.
- To escape the delimiter, we should enclose the value with double quotes. Ok, makes sense.
- To escape double quotes within the enclosing double quotes, we need to use 2 double quotes.
Many tools get it wrong. Meanwhile, some tools like pgAdmin justifiably allow you to configure the escape character to be a double quote or a single quote, because the CSV standard is often not respected.
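For what it's worth, those two rules are what Python's csv module implements by default (a quick sketch with made-up values, not tied to any particular tool):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # RFC 4180-style defaults
# A value containing the delimiter gets enclosed in double quotes;
# an embedded double quote is escaped by doubling it.
writer.writerow(["plain", "has, comma", 'say "hi"'])
print(buf.getvalue())  # plain,"has, comma","say ""hi"""
```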
Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
They're not getting it wrong, they're just assuming a different variant.
There is no "standard" for CSV. Yes, there's an RFC, published in 2005, about 30 years after everyone was already using CSV. That's too late. You can't expect people to drop all compatibility just because someone published some document somewhere. RFC 4180 explicitly says that "it does not specify an Internet standard of any kind", although many people do take it as a "standard". But even if it did call itself a standard: it's still just some document someone published somewhere.
They should have just created a new "Comma Separated Data" (file.csd) standard or something instead of trying to retroactively redefine something that already exists. Then applications could add that as a new option, rather than "CSV, but different from what we already support". That was always going to be an uphill battle.
Never mind that RFC 4180 is simply insufficient: it doesn't specify character encodings in the file itself, nor some other things such as delimiters. If someone were to write a decent standard and market it a bit, I could totally see this taking off, just as TOML "standardized INI files" took off.
I can't tell you how to run your business, but subscriptions for offline apps aren't going to be popular here.
Charge me more upfront for a perpetual license, or just version the software. Say 40$ today for V3, and every year charge a reasonable fee to upgrade, but allow me to use the software I purchased...
> Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
Looks like SQL is the main selling point for your tool. For other, simpler needs, Modern CSV [1] seems suitable (and it's cheaper too, with a one-time purchase compared to a yearly subscription fee). But Modern CSV does not support SQL or other ways to create complex queries.
Just use TSV. Commas are a terrible delimiter because many human strings have commas in them. This means that CSV needs quoting of fields and nobody can agree on how exactly that should work.
TSV doesn’t have this problem. It can represent any string that doesn’t have either a tab or a newline, which is many more than CSV can.
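As a sketch of that point: reading and writing TSV needs no quoting machinery at all, just join and split (plain Python, with hypothetical field values):

```python
rows = [["name", "note"], ["Ann", 'says "hi", loudly']]

# Writing: no quoting rules at all -- commas and quotes are just data.
tsv = "\n".join("\t".join(fields) for fields in rows)

# Reading back is a plain split; the only illegal characters in a value
# are the tab and newline themselves.
parsed = [line.split("\t") for line in tsv.split("\n")]
assert parsed == rows
```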
It's 2024 and Excel still doesn't natively parse CSV with tabs as delimiters. When I send such csv files to my colleagues, they complain about not being able to open them directly in Excel. I wish Excel could pop up a window like LibreOffice does to confirm the delimiter before opening a csv file.
- If you go with the CSV convention of two adjacent tabs => blank cell in the middle, then rows of different lengths will not line up properly in most text editors. And "different length" depends on the client's tab width too.
- If you allow any amount of tabs between columns, then you need a special way to signify an actually-blank column. And escaping for when you want to quote that.
- If you say "use tabs for columns and spaces for alignment", then you've got to trim all values, which may not be desirable.
Because nobody made keyboards with those keys. Had they stuck on a 'next unit' and 'next record' key pair that sent them, we'd all be fine, but instead we got overly redundant text editing keys rather than keyboards more suited to data entry.
I tend to prefer that over CSV as well. But usually I go for ndjson files since that's a bit more flexible for more complex data and easier to deal with when parsing. But it depends on the context what I use.
However, a good reason to use TSV/CSV is that import/export in spreadsheets is really easy. TSV used to have an obscure advantage: Google Sheets could export TSV but not CSV. They've since fixed that, and you can do both now.
And of course, getting CSV out of a database is straightforward as well. Both databases and spreadsheets are of course tabular data; so the format is a good fit for that.
Spreadsheets are nice when you are dealing with non-technical people. They make it easier to involve them in editing / managing content. Also, a spreadsheet is a great substitute for admin tools to edit this data. I was once on a project where we paid some poor freelancer to work on some convoluted tool to edit data. In the end, the customer hated it and we unceremoniously replaced it with a spreadsheet (my suggestion). Much easier to edit stuff with those. They loved it. The poor guy had worked for months on that tool with the help of a lot of misguided UX, design, and product management. It got super complicated and was tedious to use. A complete waste of time. All they needed was a simple spreadsheet and some way to get the data inside deployed. They already knew how to use spreadsheets, so they were all over that.
I wish binary length-prefixed formats would've become more common. Parsing text, and especially escaping, seems to be a continual source of bugs and confusion. Then again, those who don't implement escaping correctly may also overlap with those who can't be bothered learning how to use a hex editor.
CSV comes from a world in which the producer and consumer know each other; if there are problems they talk to each other and work it out.
There is still plenty of this kind of data exchange happening, and CSV is perfectly fine for it.
If I'm consuming data produced by some giant tech company or mega bank or whatever, there is no chance I'll be able to get them to fix some issue I have processing it. From those kinds of folks, I'd like something other than CSV.
But the big guy most likely exports the .csv correctly in the first place; you don't *need* to work with them.
Only once have I seen a bad .csv from a "big" company--big fish in a small pond type big. We were looking to get data out--hey, great, .csv is a valid export format. I'm not sure exactly what was in that file, but it appeared to be the printout with some field info attached to each field. (Put this at that location on the paper, etc., one field per line.) Every output format it has is bugged in some scenario.
I fully agree that CSV is king and am quite happy about it. But the comma character was probably one of the worst choices they could have made for the "standard", IMHO of course.
Tab makes far more sense here, because you are very likely able to just convert non-delimiter tabs to spaces without losing semantics.
Even considering how editors tend to mess with the tab character, there are still better choices based on frequency in typical text: |, ~, or even ;.
I wasn't around at the time, but surely ASCII was (even if not ubiquitous)? Is there any particular reason that the FS/GS/RS/US (file/group/record/unit separator) characters didn't catch on in this role?
I did an ETL project years ago from a legacy app that used these delimiters. It was gloriously easy. No need to worry about escaping (as these characters were illegal in the input). It's a shame they didn't catch on.
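For anyone curious, a minimal sketch of that style of delimiting with the ASCII control characters (record separator 0x1E between rows, unit separator 0x1F between fields; it assumes, as in that legacy app, that values can never contain those bytes):

```python
RS, US = "\x1e", "\x1f"  # ASCII record separator, unit separator

def encode(rows):
    # No quoting or escaping at all: the separators are assumed
    # to be illegal inside values.
    return RS.join(US.join(fields) for fields in rows)

def decode(blob):
    return [record.split(US) for record in blob.split(RS)]

rows = [["a", "b,c", 'say "hi"'], ["1", "2", "3"]]
assert decode(encode(rows)) == rows  # commas and quotes round-trip untouched
```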
I actually just finished a library to add proper typed parsing that works with existing CSV files. It's designed to be as compatible as possible with existing spreadsheets, while allowing for perfect escaping and infinite nesting of complex data structures and strings. I think it's an ideal compromise, as most CSV files won't change at all.
CSV is king because most ETL department programmers suck. Half the time they can't generate a CSV correctly. Anything more complicated would cause their tiny brains to explode.
I'm not bitter, I just hate working with ETL 'teams' that struggle to output the data in a specified format - even when you specify it in the way they want you to.
A lot of data that I see in CSV "format" would work fine as tab-delimited and wouldn't need any escaping (because most of the data I see doesn't allow literal tabs anyway). That would be a simple improvement over CSV.
I'm surprised that the article and the comments failed to mention pipe-delimited files. I work with almost two dozen different vendors (in healthcare) and 90% use pipes. Doing data exchange with a variety of delimiters is so common that I just built a bespoke system for taking in a set of common configurations and parsing the information. Other settings include line endings, encoding, escape characters, whether the header is included, etc.
I prefer ndjson for systems I build (with only JSON objects at the top level). It's much safer for a lot of edge cases. If there's significant repetition in the keys, the files end up zipping well.
As the article says, it will be interesting to see if NDJSON becomes more popular. Although it's a bit more difficult to parse and makes for larger files than CSV, it is less ambiguous.
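For concreteness, a tiny sketch of NDJSON in Python (hypothetical records): each line is an ordinary JSON document, so a parser is just a loop over lines.

```python
import json

lines = '{"id": 1, "name": "a,b"}\n{"id": 2, "name": "say \\"hi\\""}\n'

# One json.loads call per line; blank lines are skipped defensively.
records = [json.loads(line) for line in lines.splitlines() if line.strip()]

# Commas and quotes inside values need no format-level escaping
# beyond ordinary JSON string escaping.
assert records[0]["name"] == "a,b"
assert records[1]["name"] == 'say "hi"'
```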
thbb123 | 1 year ago
Had to give up.
viraptor | 1 year ago
baggy_trough | 1 year ago
hanche | 1 year ago
snthpy | 1 year ago
The .usv separators make things easier to read at the expense of a bit more space.
The main point for me is that it makes the parsing so much simpler.
Who writes .csv files by hand anyway?
cqqxo4zV46cp | 1 year ago
tanin | 1 year ago
arp242 | 1 year ago
999900000999 | 1 year ago
AnonC | 1 year ago
[1]: https://www.moderncsv.com/
1vuio0pswjnm7 | 1 year ago
Works for SQLite at least, but not sure about other software.
kawakamimoeki | 1 year ago
lenkite | 1 year ago
cm2187 | 1 year ago
Nihilartikel | 1 year ago
Of course the cat emoji is escaped by the puppy emoji if it occurs in a value. The puppy emoji escapes itself when needed.
exidex | 1 year ago
theendisney4 | 1 year ago
I also considered a dedicated keyboard, like APL has, just to be dense about it.
Have each character signed by the keyboard so that we have proof of by whom it was typed and when.
People who don't work here don't get to write code. It just won't happen. Haha.
Hackbraten | 1 year ago
geekodour | 1 year ago
zarzavat | 1 year ago
uncharted9 | 1 year ago
didntcheck | 1 year ago
MattPalmer1086 | 1 year ago
btreecat | 1 year ago
calibas | 1 year ago
It seems like half the problems with CSV were solved back in the 70s with ASCII codes.
gaganyaan | 1 year ago
https://github.com/SixArm/usv
NikkiA | 1 year ago
impure | 1 year ago
jillesvangurp | 1 year ago
SoftTalker | 1 year ago
cm2187 | 1 year ago
userbinator | 1 year ago
Kon-Peki | 1 year ago
LorenPechtel | 1 year ago
theanonymousone | 1 year ago
All IMHO, again.
endgame | 1 year ago
EvanAnderson | 1 year ago
al_borland | 1 year ago
breck | 1 year ago
I made ScrollSets, a language that compiles to CSVs! (https://scroll.pub/blog/scrollsets.html)
Here's a simple tool to turn your CSV into ScrollSet (https://scroll.pub/blog/csvToScrollSet.html)
This is what powers the CSV download on PLDB.io and how so many people collaborate on building a single CSV (https://pldb.io/csv.html)
jeff-hykin | 1 year ago
https://github.com/jeff-hykin/typed_csv
mannyv | 1 year ago
fragmede | 1 year ago
It'll only remain king as long as we let it.
Move to using SQLite db files as your interchange format.
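A sketch of what that kind of exchange looks like with Python's built-in sqlite3 module (the table and values here are made up for illustration):

```python
import sqlite3

# A real exchange would write to a .db file instead of :memory:.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, note TEXT)")
# Values keep their types, and embedded commas/quotes need no escaping rules.
con.execute("INSERT INTO people VALUES (?, ?)", ("Ann", 'says "hi", loudly'))
con.commit()
rows = con.execute("SELECT name, note FROM people").fetchall()
print(rows)  # [('Ann', 'says "hi", loudly')]
```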
__mharrison__ | 1 year ago
I help clients deal with them frequently. For many cases they are sufficient; for other cases, moving to something like Parquet makes a lot of sense.
EvanAnderson | 1 year ago
valiant55 | 1 year ago
maerF0x0 | 1 year ago
deafpolygon | 1 year ago
It's just much easier to keep using it, since you're already doing it.
In the meantime, how about XML? /awaits the pack of raving mad HNers
nuc1e0n | 1 year ago