I recommend using clickhouse-local[1] for these tasks.
It does SQL; it supports all imaginable data formats, streaming processing, and connecting to external data sources. It also outperforms every other tool[2].
Benthos is a really cool tool when you want to take this idea to the next level. It can do the usual text/CSV/JSON/YAML mangling you'd do with awk or jq (since it includes a version of those in addition to its own processing language), but it also has decoders and encoders for a bunch of different binary formats like Protobuf or Avro. And in addition to stdin/stdout and files it can talk to Kafka, MQTT, HTTP, ES and a bunch of other stuff. I was able to put together a log processor that consumes from Kafka, does a bunch of field mangling and then ingests into Elasticsearch in a couple dozen lines of yaml.
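To give a flavor, here is a minimal sketch of such a pipeline, written from memory against the documented config shape — not the poster's actual config; the broker address, topic, index name and the Bloblang mapping are all invented:

```yaml
input:
  kafka:
    addresses: [ "localhost:9092" ]   # hypothetical broker
    topics: [ "app-logs" ]
    consumer_group: "log-processor"

pipeline:
  processors:
    - bloblang: |
        # normalize one field and drop the original
        root = this
        root.level = this.severity.uppercase()
        root.severity = deleted()

output:
  elasticsearch:
    urls: [ "http://localhost:9200" ]
    index: "app-logs"
```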
No thank you. I appreciate the power, speed, simplicity and flexibility of UNIX/GNU style text tools. I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare.
Ever try this on large files? A lot of PowerShell commands I make like this can take minutes to run when a combination of Linux commands and Awk might take a couple of seconds.
I agree that this is something missing on classic UNIX shells: typed output streams.
I had this discussion a while back on HN, though I can't find it ATM (I wish there was a comment history search function). I am far from the first one who thought of that, and there are a few implementations of this idea.
This is nice. I use `column` for pretty printing CSV/TSV, but this fixes two tiny gaps: `sort` can't easily skip header lines, and `jq` can't parse CSV input (it supports `@csv` for output conversion but not input).
$ cat example.csv
color,shape,flag,index
yellow,triangle,1,11
red,square,1,15
red,circle,1,16
red,square,0,48
purple,triangle,0,51
red,square,0,77
# pretty printing
$ column -ts',' example.csv
color shape flag index
yellow triangle 1 11
red square 1 15
red circle 1 16
red square 0 48
purple triangle 0 51
red square 0 77
# sorting with skipped headers is a mess.
$ (head -n 1 example.csv && tail -n +2 example.csv | sort -r -k4 -t',') | column -ts','
color shape flag index
red square 0 77
purple triangle 0 51
red square 0 48
red circle 1 16
red square 1 15
yellow triangle 1 11
> `jq` supports `@csv` for output conversion but not input
Actually, `jq` can cope with trivial CSV input like your example: `jq -R 'split(",")'` will turn a CSV into an array of arrays. To then sort it in reverse order by 3rd column and retain the header, the following fell out of my fingers (I'm beyond certain that a more skilled `jq` user than me could improve it):
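The program itself didn't survive the copy; here is one reconstruction (mine, sorting on the numeric `index` column — not necessarily what was originally posted):

```shell
# Reverse-sort the body by the numeric "index" column, keeping the header row
jq -R -s -r '
  split("\n")
  | map(select(length > 0) | split(","))
  | .[:1] + (.[1:] | sort_by(.[3] | tonumber) | reverse)
  | .[]
  | @csv' example.csv
```

`-R -s` slurps the whole file as one raw string; `@csv` re-quotes each row on the way out.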
This is a cool tool. And it works well with GPT-4. I asked ChatGPT the following:
```
I'd like to use Miller from John Kerl on the CSV called test.csv. It has 3 columns, Name, City and State. I'd like to filter City where it's either Chicago or Boston. I want to remove the State column and add a new column called NewColumn. This new column shall be filled with the value of the Name column appended with the letter `_new`. Finally the output shall be saved into the file called myfile.csv. Write the command to do that
```
And it replied with
mlr --csv filter '$City == "Chicago" || $City == "Boston"' then cut -x -f State then put '$NewColumn = $Name . "_new"' test.csv > myfile.csv
I'm sure that someone familiar with Miller would write this command faster than it takes to write the prompt to GPT, but as a newbie I'd have spent much longer. Also, each step was described perfectly.
Looks cool! It'd help to have some more full examples (including output) of what some common use-cases do in the README. After looking through this, I'm still scratching my head trying to think of what problem this could solve in my own practice, and I work with a lot of csv files daily
In some sense, there's no answer to your question because everyone's tool workflow is their own.
That said, the most recent invocation for me was `mlr --icsv --ojson cat < a515b308-9a0e-4e4e-99a2-eafaa6159762.csv` to fix up a CloudTrail CSV. After that I happen to have more muscle memory with jq, but conceptually the next step could have been `filter '$eventname == "CreateNodegroup"'`.
I was wondering what they meant by that term as well. Does every JSON document in the array have to have the exact same structure (including the ordering of key-value pairs)? What happens if row #1000 introduces a new key-value pair not seen before? What if the value for a key is an array?
>How is it better than SQL for these tasks on tabular data?

It has far better handling of CSV/TSV files on the command line directly and is composable in shell pipelines.
>My point, if one wants cat, sort, sed, join on tabular data, SQL is exactly that.
SQL is a language for data in the form of tables in relational databases.
While it can do sorting or joining or some changes, it is meant for a different domain than these tools, with other constraints, other concerns, and other patterns of use...
You don't need to load anything to a db, for starters.
You also normally don't care for involving a DB in order to use in a shell script, or for quick shell exploration.
You also can't mix SQL and regular unix userland in a pipeline (well, with enough effort you can, but it's not something people do or need to do).
Because SQL (Structured Query Language) is the name of the language used to access and manipulate the data, not the name for that kind of data. There have been, and are, many databases that essentially use tabular data but are not SQL databases.
[1] https://clickhouse.com/blog/extracting-converting-querying-l...
[2] https://colab.research.google.com/github/dcmoura/spyql/blob/...
password4321 | 3 years ago
yq: command-line YAML, JSON, XML, CSV and properties processor
https://news.ycombinator.com/item?id=34656022
Also mentions gojq, Benthos, xsv, Damsel, a 2nd yq, htmlq, cfn-flip, csvq, zq, and zsv.
elesiuta | 3 years ago
New(ish) command line tools
https://news.ycombinator.com/item?id=31009313
rustyf | 3 years ago
Get-Content .\example.csv | ConvertFrom-Csv | Where-Object -Property color -eq red
unxdfa | 3 years ago
Unfortunately after working with it for a few years I utterly despise PowerShell for many other reasons.
MayeulC | 3 years ago
Searching for that comment, I came across relevant stories:
https://news.ycombinator.com/item?id=27535357
https://news.ycombinator.com/item?id=25225113
snidane | 3 years ago
BUT, leaks memory like crazy.
Despite documentation stating the verbs are fully streaming.
> Fully streaming verbs
> These don't retain any state from one record to the next. They are memory-friendly, and they don't wait for end of input to produce their output.
https://miller.readthedocs.io/en/6.7.0/streaming-and-memory/...
coldtea | 3 years ago
Huh?
a) Isn't it written in Golang, which has a GC? Does it do custom buffer based management?
b) Isn't it supposed to be run on a file and get some output - as opposed to an interactive session? Why would it matter if it leaks, then, and how could it leak, as the memory is returned to the OS when it ends?
dang | 3 years ago
Miller – tool for querying, shaping, reformatting data in CSV, TSV, and JSON - https://news.ycombinator.com/item?id=29651871 - Dec 2021 (33 comments)
Miller CLI – Like Awk, sed, cut, join, and sort for CSV, TSV and JSON - https://news.ycombinator.com/item?id=28298729 - Aug 2021 (66 comments)
Miller v5.0.0: Autodetected line-endings, in-place mode, user-defined functions - https://news.ycombinator.com/item?id=13751389 - Feb 2017 (20 comments)
Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV - https://news.ycombinator.com/item?id=10066742 - Aug 2015 (76 comments)
darrenf | 3 years ago

NB. there is also an entry in the `jq` cookbook for parsing CSVs into arrays of objects (and keeping numbers as numbers, dealing with nulls, etc): https://github.com/stedolan/jq/wiki/Cookbook#convert-a-csv-f...
2h | 3 years ago

I like these command line tools, but I think they can cripple someone actually learning a programming language. For example, here is a short program that does your last example:
https://go.dev/play/p/9bASZ97lLWv
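In that spirit, a sketch of a small Go program doing the same reverse sort on the example CSV (my own version, not necessarily the one behind the playground link):

```go
package main

import (
	"encoding/csv"
	"os"
	"sort"
	"strconv"
	"strings"
)

// example.csv from upthread, inlined so the program is self-contained
const input = `color,shape,flag,index
yellow,triangle,1,11
red,square,1,15
red,circle,1,16
red,square,0,48
purple,triangle,0,51
red,square,0,77
`

// sortBody sorts data rows in descending numeric order of the
// fourth column ("index"), mirroring `sort -r -k4 -t','`.
func sortBody(body [][]string) {
	sort.Slice(body, func(i, j int) bool {
		a, _ := strconv.Atoi(body[i][3])
		b, _ := strconv.Atoi(body[j][3])
		return a > b
	})
}

func main() {
	rows, err := csv.NewReader(strings.NewReader(input)).ReadAll()
	if err != nil {
		panic(err)
	}
	header, body := rows[0], rows[1:]
	sortBody(body)
	w := csv.NewWriter(os.Stdout)
	w.Write(header)
	w.WriteAll(body) // WriteAll flushes the writer
}
```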
melx | 3 years ago
But seems like it cannot handle a simple use case: CSV without header.
$ mlr --csv head -n 20 pp-2002.csv
mlr: unacceptable empty CSV key at file "pp-2002.csv" line 1.
You have to pass it explicitly (FYI `implicit-csv-header` is a terrible arg name):
$ mlr --csv --implicit-csv-header head -n 20 pp-2002.csv
While `head` obliges rightly
$ head -n 20 pp-2002.csv
0xbadcafebee | 3 years ago
brew install csvkit and enjoy
avodonosov | 3 years ago
That's the first question I have after reading the title. Haven't read the article.
Edited the first sentence. Originally it was "Why it is not called SQL if works on tabular data?".
My point, if one wants cat, sort, sed, join on tabular data, SQL is exactly that. Awk is too powerful, not sure about it.
garciasn | 3 years ago
2. Because it's closer to an amalgamation of the standard shell scripting tools (cut, sort, jq, etc) than it is to a SQL variant.