top | item 31350550

Modernizing AWK, a 45-year old language, by adding CSV support

258 points | benhoyt | 3 years ago | benhoyt.com

143 comments

[+] jeroenjanssens|3 years ago|reply
I often use csvquote [1] whenever I need to process CSV with a command-line tool that doesn't support it. For example:

    csvquote test.csv | awk '{print $1, $2}' | csvquote -u
[1] https://github.com/dbro/csvquote
[+] throwawayboise|3 years ago|reply
Yep, was going to post the same if someone hadn't already. This is the unix way. Small tools with narrow focus, strung together in pipelines.
[+] db65edfc7996|3 years ago|reply
I have grown fond of using miller [0] to handle command-line data processing. It handles the standard tabular formats (CSV, TSV, JSON) and has all of the standard data-cleanup options. It works on streams, so most operations are not limited by memory.

[0]: https://github.com/johnkerl/miller

[+] llimllib|3 years ago|reply
xsv does similar stuff for CSV, and very rapidly: https://github.com/BurntSushi/xsv

https://miller.readthedocs.io/en/latest/why/ has a nice section on "why miller":

> First: there are tools like xsv which handles CSV marvelously and jq which handles JSON marvelously, and so on -- but over the years of my career in the software industry I've found myself, and others, doing a lot of ad-hoc things which really were fundamentally the same except for format. So the number one thing about Miller is doing common things while supporting multiple formats: (a) ingest a list of records, where a record is a list of key-value pairs (however represented in the input files); (b) transform that stream of records; (c) emit the transformed stream -- either in the same format as the input, or in a different format.

[+] pstuart|3 years ago|reply
Just starred the repo -- thanks for the tip!
[+] mro_name|3 years ago|reply
I recently learned via https://news.ycombinator.com/item?id=31257248 that ASCII has had record and field separator characters from the beginning. They're just not used; instead everyone works around it with CSV.

No improvement to CSV handling will ever improve on that.

[+] cb321|3 years ago|reply
See the "Conventions for lossless conversion to TSV" on the current version of https://en.wikipedia.org/wiki/Tab-separated_values . There are really only 3 chars to escape - the escape char, and 2 delimiters - to make everything easy to deal with (even binary data in fields - what "lossless" means here).
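A minimal sketch of the escaping direction of that convention (the function name esc is mine): only the escape char and the two delimiters need encoding, as \\, \t, and \n.

```shell
# esc: encode one field value per the lossless-TSV convention.
esc() {
  printf '%s' "$1" | awk '
    BEGIN { ORS = "" }
    NR > 1 { print "\\n" }        # each record break was an embedded newline
    {
      gsub(/\\/, "\\\\&")         # \   -> \\   (must come first)
      gsub(/\t/, "\\\\t")         # TAB -> \t
      print
    }'
}
esc "$(printf 'a\tb')"            # emits: a\tb
```

Decoding is the mirror image; since the escaped form contains no literal tabs or newlines, the delimiters really do always separate.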
[+] carapace|3 years ago|reply
I'm just gonna leave this here.

https://git.sr.ht/~sforman/sv/tree/master/item/sv.py

(Python code for ASCII-separated values. TL;DR: it's so stupid simple that I feel any objection to using this format has just got to be wrong. It's like "Hey, we don't have to keep stabbing ourselves in the face." ... "But how will we use our knives?" See? It's like that.)

[+] RhysU|3 years ago|reply
Could one get away with regular AWK using these combined with a streaming to/from CSV converter?
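For the reading side, at least, plain awk already copes; a sketch using the ASCII record (0x1E) and unit (0x1F) separators directly:

```shell
# Set RS to the ASCII record separator (036 octal) and FS to the unit
# separator (037 octal); no converter needed for reading such data.
printf 'a\037b\036c\037d\036' |
  awk 'BEGIN { RS = "\036"; FS = "\037"; OFS = "," } NF { print $1, $2 }'
# prints:
#   a,b
#   c,d
```

Writing is symmetric via ORS/OFS; a streaming converter would only be needed at the CSV boundary.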
[+] andylynch|3 years ago|reply
Closely related: I remain mystified as to why the FIX Protocol chose control characters to separate fields, but used SOH rather than RS/US or the like.
[+] jph|3 years ago|reply
Ben this is great, thank you. Would you consider adding Unicode Separated Values (USV)?

https://github.com/sixarm/usv

USV is like CSV and simpler because of no escaping and no quoting. I can donate $50 to you or your charity of choice as a token of thanks and encouragement.

[+] tomsmeding|3 years ago|reply
> no escaping

That makes no sense. Sure, they've chosen significantly less common separator characters than something like ',', but they are still characters that may appear in the data. How do you represent a value containing ␟ Unit Separator in USV?

In-band signalling is always going to remain in-band signalling. And in-band signalling will need escaping.

[+] benhoyt|3 years ago|reply
Thanks. I haven't tried this, but it should actually work already in standard AWK input mode with FS and RS set to those Unicode separators. I'll test it tomorrow when back at my laptop.
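An untested sketch of that idea: most awks treat a multi-character FS or RS as a regex, so the USV separators should be matchable as their literal UTF-8 byte sequences (older one-true-awk builds, which only allow a single-character RS, would not work here).

```shell
# ␞ (U+241E) as record separator, ␟ (U+241F) as unit separator.
printf 'a␟b␞c␟d␞' |
  awk 'BEGIN { FS = "␟"; RS = "␞"; OFS = "," } NF { print $1, $2 }'
```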
[+] cb321|3 years ago|reply
When you have "format wars", the best idea is usually to have a converter program change to the easiest-to-work-with format - unless this incurs a space explosion, as with some image/video formats.

With CSV-like data, bulk conversion from quoted-escaped RFC 4180 CSV to a simpler-to-parse format is the best plan for several reasons.

First, it may "catch on" and help Microsoft/R/whoever embrace the format, and in doing so squash many bugs written by "data analyst/scientist coders".

Second, in a shell "a|b" runs programs a & b in parallel on multi-core, and allows things like csv2x|head -n10000|b or popen("csv2x foo.csv").

Third, bulk conversion to a random-access file where literal delimiters cannot occur as non-delimiters allows trivial file segmentation to be nCores times faster (under often-satisfied assumptions). There are some D tools for this bulk convert in https://github.com/eBay/tsv-utils and a much smaller stand-alone Nim tool https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim . Optional quoting was always going to be a PITA due to its non-locality: what if there is no quote anywhere?

Fourth, by using a program as the unit of modularity in this case, you make things programming-language agnostic. Someone could go to town and write a pure SIMD/AVX512 converter, in assembly even, and solve the problem "once and for all" on a given CPU. The problem is actually just simple enough that this smells possible.

I am unaware of any "document" that "standardizes" this escaped/lossless TSV format. { Maybe call it "DSV" for delimiter separated values where "delimiters actually separate"? Ironically redundant. ;-) } Someone want to write an RFC or point to one? It can be just as "general/lossless" (see https://news.ycombinator.com/item?id=31352170).

Of course, if you are going to do a lot of data processing against some data, it is even better to parse all the way down to binary so that you never have to parse again (well, unless you call CPUs loading registers "parsing"), which is what database systems have been doing since the 1960s.

[+] hedora|3 years ago|reply
Someone linked a Wikipedia format guide for TSV, but the world seems to have settled on using the escape codes \\, \t, \n with their obvious meanings, then allowing arbitrary binary.

That should be parallelism friendly, even with UTF-8, where an ASCII tab or newline byte always means a literal tab or newline.

[+] rgoulter|3 years ago|reply
I'd be curious to see a comparison with the csvkit suite. https://csvkit.readthedocs.io/en/latest/index.html
[+] benhoyt|3 years ago|reply
That's written in Python and uses the "agate" library which uses Python's built-in "csv" module. I did a couple of simple benchmarks against Python's csv module in the article: https://benhoyt.com/writings/goawk-csv/#performance (Go/GoAWK is a bit more than 3x as fast)

I also did a quick test using csvcut to pull out a single field, and compared it to GoAWK. Looks like GoAWK is about 4x as fast here:

  $ time csvcut -c agency_id huge.csv >/dev/null

  real 0m25.977s
  user 0m25.240s
  sys 0m0.424s
  $ time goawk -i csv -H -o csv '{ print @"agency_id" }' huge.csv >/dev/null

  real 0m6.584s
  user 0m7.434s
  sys 0m0.480s
[+] parasense|3 years ago|reply
I always just use awk to process CSV files.

    awk -F '^"|","|"$|,' '{print $2,$3}' whatever.csv
The above works perfectly well; it handles quoted fields, or even just unquoted fields. This snippet is taken from a presentation I give on AWK and Bash scripting.

That's the thing about AWK: it already does everything. No need to extend it much at all.
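A quick check of that snippet's behavior. Note that the leading ^" alternative makes $1 empty on quoted rows (which is why it prints $2 and $3), and that a comma inside a quoted field still splits:

```shell
# Fully quoted row, no embedded commas: works as advertised.
printf '"alpha","beta","gamma"\n' |
  awk -F '^"|","|"$|,' '{ print $2, $3 }'     # prints: alpha beta

# Embedded comma in a quoted field: the field is split anyway.
printf '"a,b","c","d"\n' |
  awk -F '^"|","|"$|,' '{ print $2, $3 }'     # prints: a b
```

So the trick is handy for down-and-dirty work, but it is not a full RFC 4180 parser.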

[+] CRConrad|3 years ago|reply
What about fields with "," in them?
[+] andi999|3 years ago|reply
Can you also set the decimal separator? Some countries use ',' in numbers, like 10,5.
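awk itself only understands '.' as the decimal point, but you can normalize comma decimals before using a field numerically; a sketch (which assumes the data has no thousands separators):

```shell
# Rewrite ',' to '.' in each record, then treat $1 as a number.
printf '10,5\n3,25\n' |
  awk '{ gsub(/,/, "."); sum += $1 } END { print sum }'
# prints: 13.75
```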
[+] otikik|3 years ago|reply
In my country (Spain) we traditionally use commas as a decimal separator, but I think CSV should not support this.

The way I see it, CSV's purpose is information storage and transfer, not presentation.

Presentation is where you decide the font face, the font size, cell background color (if you are viewing the file through a spreadsheet editor) etc. Number formatting belongs here.

In information transfer and storage it's much more important that the parsing/generation functions are as simple as possible. So let's go with one numeric format only. I think we should use the dot as a decimal separator since it's the most common separator in all programming languages. Maybe extend it to include exponential notation as well, because that is what formats like JSON support. But that's it.

I hold the same opinion about dates, btw: yyyy-mm-dd or death.

[+] ta8645|3 years ago|reply
Can we please, please, stop that? If we go Metric, will you please standardize on period as decimal point? Shake on it. Let's make it happen.
[+] malkocoglu|3 years ago|reply
Modernizing ... by adding CSV support !?!
[+] benhoyt|3 years ago|reply
Fair comeback. I think of CSV as modern, but Wikipedia tells me it's almost as old as AWK (depending on how you count). It seems to me it's used more heavily now as an exchange format, compared to say 15-20 years ago, but I could be wrong.
[+] hawski|3 years ago|reply
It is a nice addition, but I would like to see this taken further: structural regular expression awk. It has been waiting to be implemented for 35 years now.
[+] asicsp|3 years ago|reply
>A big thank-you to the library of the University of Antwerp, who sponsored this feature. They’re one of two major teams or projects I know of that use GoAWK – the other one is the Benthos stream processor.

That's great to hear.

Are you planning to add support for xml, json, etc next? Something like Python's `json` module that gives you a dictionary object.

[+] benhoyt|3 years ago|reply
I'm not considering adding general structured formats like XML or JSON, as they don't fit AWK's "fields and records" model or its simplistic data structure (the associative array) very well. However, I have considered adding JSON Lines support, where each line is a record, and fields are indexable using the new @"named-field" syntax (possibly nested like @"foo.bar").
[+] altairprime|3 years ago|reply
For non-awk tools, csvformat (from csvkit) will unquote and re-delimit a CSV file (-D\034 -U -B) into something that UNIX pipes can handle (cut -d\034, etc). It's worth setting up as an alias, and you can store \034 in $D or whatever for convenience.
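The downstream half of that pipeline, for reference: since the 0x1C byte (\034, ASCII File Separator) can't be typed directly, printf supplies it to cut, as the alias suggestion above implies.

```shell
D=$(printf '\034')                    # the \034 delimiter byte, kept in $D
printf 'a\034b\034c\n' | cut -d "$D" -f 2
# prints: b
```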
[+] adolph|3 years ago|reply
For anything down and dirty, what's wrong with -F'"'? For anything fancy there are plenty of things like the below.

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

includes csv to tsv: https://github.com/eBay/tsv-utils

HT: https://simonwillison.net/

[+] gpvos|3 years ago|reply
During a recent HN discussion on pipes and text versus structured objects to transfer data between programs, I started wondering if CSV wouldn't be a nice middle ground.
[+] jerf|3 years ago|reply
JSON Lines would probably beat that out: https://jsonlines.org/

I phrase that carefully. "Better"? "Worse"? Very subjective. But in the current environment, "likely to beat out CSV"? Oh, most definitely yes.

A solid upside is a single encoding story for JSON. CSV is a mess and can't be un-messed now. Size bloat from endless repetition of the object keys is a significant disadvantage, though.

[+] carlmr|3 years ago|reply
I still think objects are great, but PowerShell makes it so hard to deal with them.

I think F#-interactive (FSI) with its Hindley-Milner type-inference, would have been a much better base for a shell.

[+] tyingq|3 years ago|reply
Gnu awk also has a csv extension that comes with gawkextlib. I think it may even be installed on many Linux distros by default.
[+] torginus|3 years ago|reply
I can't tell whether the UNIX people have lost their way, or whether the demands of modern shell scripts just cannot be met by the typical shell philosophy - that is, piping together the output of small, orthogonal utilities.

The emergence and constantly increasing complexity of these small, bespoke DSLs like this or jq does not inspire confidence in me.

[+] tonyg|3 years ago|reply
> demands of modern shell scripts cannot be met by typical shell philosophy

That. Pipes and unstructured binary data aren't compositional enough: the divide between the kinds of things you can express in the language used to write a pipeline stage and the kinds of things you can express by building a pipeline is too large.

[+] anthk|3 years ago|reply
A single tool to parse JSON and pipe it? That's Unix too.
[+] jjtheblunt|3 years ago|reply
You made me think that a possible corollary (?) question would be whether the JSON people don't perfectly overlap with the Unix people.