I have grown fond of using miller[0] to handle command-line data processing. It handles the standard tabular formats (CSV, TSV, JSON) and has all of the standard data-cleanup options. It works on streams, so most operations are not limited by memory.
> First: there are tools like xsv which handles CSV marvelously and jq which handles JSON marvelously, and so on -- but I over the years of my career in the software industry I've found myself, and others, doing a lot of ad-hoc things which really were fundamentally the same except for format. So the number one thing about Miller is doing common things while supporting multiple formats: (a) ingest a list of records where a record is a list of key-value pairs (however represented in the input files); (b) transform that stream of records; (c) emit the transformed stream -- either in the same format as input, or in a different format.
See the "Conventions for lossless conversion to TSV" on the current version of https://en.wikipedia.org/wiki/Tab-separated_values . There are really only 3 chars to escape - the escape char, and 2 delimiters - to make everything easy to deal with (even binary data in fields - what "lossless" means here).
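A minimal sketch of that convention (my own illustration, not code from the Wikipedia page): only backslash, tab, and newline need translating, and the round trip is lossless for arbitrary field content.

```python
# Sketch of the lossless-TSV convention: only the escape char (backslash)
# and the two delimiters (tab, newline) need escaping, so any field
# content - even binary - survives a round trip.
ESC = {"\\": "\\\\", "\t": "\\t", "\n": "\\n"}   # order matters: backslash first
UNESC = {"\\\\": "\\", "\\t": "\t", "\\n": "\n"}

def escape(field):
    for raw, esc in ESC.items():
        field = field.replace(raw, esc)
    return field

def unescape(field):
    # Assumes well-formed input (only \\, \t, \n escapes appear).
    out, i = [], 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            out.append(UNESC["\\" + field[i + 1]])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)
```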
(Python code for ASCII-separated values. TL;DR: it's so stupid simple that I feel any objection to using this format has just got to be wrong. It's like "Hey, we don't have to keep stabbing ourselves in the face." ... "But how will we use our knives?" See? It's like that.)
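For flavor, here is a from-scratch sketch (not sv.py itself) of how little machinery the format needs, using the ASCII US/RS control characters:

```python
# Not sforman's sv.py - just a sketch of why ASCII-separated values are
# "stupid simple": US (0x1F) separates fields, RS (0x1E) separates
# records, and no quoting or escaping is defined at all.
US, RS = "\x1f", "\x1e"

def dumps(rows):
    return RS.join(US.join(row) for row in rows) + RS

def loads(text):
    return [rec.split(US) for rec in text.split(RS) if rec]
```

Commas, tabs, and newlines in field data need no special treatment, because they are not delimiters here.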
Closely related: I remain mystified as to why the FIX protocol chose control characters to separate fields, but used SOH rather than RS/US or the like.
USV is like CSV but simpler, because there is no escaping and no quoting. I can donate $50 to you or your charity of choice as a token of thanks and encouragement.
That makes no sense. Sure, they've chosen significantly less common separator characters than something like ',', but those are still characters that may appear in the data. How do you represent a value containing ␟ (Unit Separator) in USV?
In-band signalling is always going to remain in-band signalling. And in-band signalling will need escaping.
Thanks. I haven't tried this, but it should actually work already in standard AWK input mode with FS and RS set to those Unicode separators. I'll test it tomorrow when back at my laptop.
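An untested sketch of what that invocation might look like, using the ASCII control codes (octal escapes) rather than the Unicode picture characters:

```shell
# Sketch: treat US (0x1F) as the field separator and RS (0x1E) as the
# record separator in plain awk - no special CSV mode needed.
# (NF skips the empty record after the trailing record separator.)
printf 'a\037b\036c\037d\036' |
  awk 'BEGIN { FS = "\037"; RS = "\036" } NF { print $1, $2 }'
```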
I've been toying with the idea of starting a change.org campaign, asking Microsoft to add support to Excel for importing and exporting such files. I don't see any way to make any progress if Microsoft won't push the industry forward.
When you have "format wars", the best idea is usually to have a converter program translate to the easiest-to-work-with format - unless this incurs a space explosion, as with some image/video formats.
With CSV-like data, bulk conversion from quoted/escaped RFC 4180 CSV to a simpler-to-parse format is the best plan, for several reasons. First, it may "catch on" and help Microsoft/R/whoever embrace the format, and in doing so squash many bugs written by "data analyst/scientist coders". Second, in a shell, "a|b" runs programs a & b in parallel on multi-core, and allows things like csv2x|head -n10000|b or popen("csv2x foo.csv"). Third, bulk conversion to a random-access file where literal delimiters cannot occur as non-delimiters allows trivial file segmentation, making things nCores times faster (under often-satisfied assumptions). There are some D tools for this bulk conversion in https://github.com/eBay/tsv-utils and a much smaller stand-alone Nim tool at https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim . Optional quoting was always going to be a PITA due to its non-locality: what if there is no quote anywhere? Fourth, by using a program as the unit of modularity, you make things programming-language agnostic. Someone could go to town and write a pure SIMD/AVX-512 converter in assembly and solve the problem "once and for all" on a given CPU; the problem is just simple enough that this smells possible.
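As a toy stand-in for tools like c2tsv (my sketch, not their code), the whole conversion fits in a few lines of Python; the performance argument above is about doing this fast, in a stream:

```python
import csv

# Toy sketch of RFC 4180 CSV -> "lossless" TSV: after conversion, a tab
# or newline byte in the output is always a real delimiter, so downstream
# tools can split on it blindly (and segment the file in parallel).
def escape(field):
    return (field.replace("\\", "\\\\")   # escape char first, so order matters
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))

def csv_to_tsv(src, dst):
    for row in csv.reader(src):
        dst.write("\t".join(escape(f) for f in row) + "\n")
```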
I am unaware of any "document" that "standardizes" this escaped/lossless TSV format. { Maybe call it "DSV" for delimiter separated values where "delimiters actually separate"? Ironically redundant. ;-) } Someone want to write an RFC or point to one? It can be just as "general/lossless" (see https://news.ycombinator.com/item?id=31352170).
Of course, if you are going to do a lot of data processing against some data, it is even better to parse all the way down to binary so that you never have to parse again (well, unless you call CPUs loading registers "parsing"), which is what database systems have been doing since the 1960s.
Someone linked a Wikipedia format guide for TSV, but the world seems to have settled on using the escape codes \\, \t, \n with their obvious meanings, then allowing arbitrary binary.
That should be parallelism-friendly, even with UTF-8, where an ASCII tab or newline byte always means a tab or newline.
That's written in Python and uses the "agate" library which uses Python's built-in "csv" module. I did a couple of simple benchmarks against Python's csv module in the article: https://benhoyt.com/writings/goawk-csv/#performance (Go/GoAWK is a bit more than 3x as fast)
I also did a quick test using csvcut to pull out a single field, and compared it to GoAWK. Looks like GoAWK is about 4x as fast here:
$ time csvcut -c agency_id huge.csv >/dev/null
real 0m25.977s
user 0m25.240s
sys 0m0.424s
$ time goawk -i csv -H -o csv '{ print @"agency_id" }' huge.csv >/dev/null
real 0m6.584s
user 0m7.434s
sys 0m0.480s
The above works perfectly well: it handles quoted fields, or even just unquoted fields... This snippet is taken from a presentation I give on AWK and Bash scripting.
That's the thing about AWK: it already does everything. No need to extend it much at all.
In my country (Spain) we traditionally use commas as a decimal separator, but I think CSV should not support this.
The way I see it, CSV's purpose is information storage and transfer, not presentation.
Presentation is where you decide the font face, the font size, cell background color (if you are viewing the file through a spreadsheet editor) etc. Number formatting belongs here.
In information transfer and storage it's much more important that the parsing/generation functions are as simple as possible. So let's go with one numeric format only. I think we should use the dot as the decimal separator, since it's the most common one across programming languages. Maybe extend it to include exponential notation as well, because that is what other formats like JSON support. But that's it.
I hold the same opinion about dates tbh.
(The same goes for dates, btw - yyyy-mm-dd or death)
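Concretely, the "one numeric format" above could just be the JSON number grammar; a sketch using Python's re:

```python
import re

# JSON-style numbers only: optional sign, no leading zeros, dot as the
# decimal separator, optional exponent. Anything else (e.g. "3,14") is
# rejected.
NUMBER = re.compile(r'-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?')

def is_valid_number(s):
    return NUMBER.fullmatch(s) is not None
```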
Fair comeback. I think of CSV as modern, but Wikipedia tells me it's almost as old as AWK (depending on how you count). It seems to me it's used more heavily now as an exchange format, compared to say 15-20 years ago, but I could be wrong.
It is a nice addition, but I would like to see this taken further: structural-regular-expression AWK. It has been waiting to be implemented for 35 years now.
>A big thank-you to the library of the University of Antwerp, who sponsored this feature. They’re one of two major teams or projects I know of that use GoAWK – the other one is the Benthos stream processor.
That's great to hear.
Are you planning to add support for xml, json, etc next? Something like Python's `json` module that gives you a dictionary object.
I'm not considering adding general structured formats like XML or JSON, as they don't fit AWK's "fields and records" model or its simplistic data structure (the associative array) very well. However, I have considered adding JSON Lines support, where each line is a record, and fields are indexable using the new @"named-field" syntax (possibly nested like @"foo.bar").
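The JSON Lines idea sketches easily (hypothetical helper names, and the nested-path syntax is speculative):

```python
import json

# Each input line is one record; a dotted path like "foo.bar" addresses
# a (possibly nested) field - roughly the @"foo.bar" idea above.
def get_field(record, path):
    for key in path.split("."):
        record = record[key]
    return record

def fields(lines, path):
    return [get_field(json.loads(line), path) for line in lines if line.strip()]
```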
For non-awk tools, csvformat (from csvkit) will unquote and re-delimit a CSV file (-D\034 -U -B) into something that UNIX pipes can handle (cut -d\034, etc.). It's worth setting up as an alias, and you can store \034 in $D or whatever for convenience.
During a recent HN discussion on pipes and text versus structured objects to transfer data between programs, I started wondering if CSV wouldn't be a nice middle ground.
I phrase that carefully. "Better"? "Worse"? Very subjective. But in the current environment, "likely to beat out CSV"? Oh, most definitely yes.
A solid upside is a single encoding story for JSON. CSV is a mess and can't be un-messed now. Size bloat from endless repetition of the object keys is a significant disadvantage, though.
I can't tell whether the UNIX people have lost their way, or whether the demands of modern shell scripts simply cannot be met by the typical shell philosophy - that is, piping together the output of small, orthogonal utilities.
The emergence and constantly increasing complexity of these small, bespoke DSLs like this or jq does not inspire confidence in me.
> demands of modern shell scripts cannot be met by typical shell philosophy
That. Pipes of unstructured binary data aren't compositional enough: the divide between what you can express in the language you use to write a pipeline stage and what you can express by building a pipeline is too large.
>In general, using FPAT to do your own CSV parsing is like having a bed with a blanket that’s not quite big enough. There’s always a corner that isn’t covered. We recommend, instead, that you use Manuel Collado’s CSVMODE library for gawk.
jeroenjanssens | 3 years ago
kazinator | 3 years ago
https://github.com/adamgordonbell/csvquote
The History section of the readme explains that they are historically related.
throwawayboise | 3 years ago
db65edfc7996 | 3 years ago
[0]: https://github.com/johnkerl/miller
llimllib | 3 years ago
https://miller.readthedocs.io/en/latest/why/ has a nice section on "why miller":
pstuart | 3 years ago
mro_name | 3 years ago
No improvement of CSV handling will ever improve on that.
belter | 3 years ago
"ASCII Delimited Text – Not CSV or TAB delimited text": https://news.ycombinator.com/item?id=7474600
cb321 | 3 years ago
carapace | 3 years ago
https://git.sr.ht/~sforman/sv/tree/master/item/sv.py
RhysU | 3 years ago
andylynch | 3 years ago
adamgordonbell | 3 years ago
jph | 3 years ago
https://github.com/sixarm/usv
tomsmeding | 3 years ago
benhoyt | 3 years ago
cylinder714 | 3 years ago
https://www.ccsv.io/
And this perennial favorite here at HN:
https://ronaldduncan.wordpress.com/2009/10/31/text-file-form...
cb321 | 3 years ago
hedora | 3 years ago
rgoulter | 3 years ago
benhoyt | 3 years ago
parasense | 3 years ago
CRConrad | 3 years ago
andi999 | 3 years ago
otikik | 3 years ago
ta8645 | 3 years ago
malkocoglu | 3 years ago
benhoyt | 3 years ago
unknown | 3 years ago
[deleted]
hawski | 3 years ago
asicsp | 3 years ago
benhoyt | 3 years ago
rufugee | 3 years ago
chasil | 3 years ago
Chapter 2 is the complete description of the language in 40 pages.
I think this link works...
https://ia803404.us.archive.org/0/items/pdfy-MgN0H1joIoDVoIC...
altairprime | 3 years ago
adolph | 3 years ago
eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
includes csv to tsv: https://github.com/eBay/tsv-utils
HT: https://simonwillison.net/
gpvos | 3 years ago
jerf | 3 years ago
carlmr | 3 years ago
I think F# Interactive (FSI), with its Hindley-Milner type inference, would have been a much better base for a shell.
ognyankulev | 3 years ago
junon | 3 years ago
tyingq | 3 years ago
torginus | 3 years ago
tonyg | 3 years ago
anthk | 3 years ago
jjtheblunt | 3 years ago
uhtred | 3 years ago
forgotpwd16 | 3 years ago