I recommend using clickhouse-local[1] for these tasks.
It does SQL; it supports all imaginable data formats, streaming processing, and connecting to external data sources. It also outperforms every other tool[2].
Benthos is a really cool tool when you want to take this idea to the next level. It can do the usual text/CSV/JSON/YAML mangling you'd do with awk or jq (since it includes a version of those in addition to its own processing language), but it also has decoders and encoders for a bunch of different binary formats like Protobuf or Avro. And in addition to stdin/stdout and files it can talk to Kafka, MQTT, HTTP, ES and a bunch of other stuff. I was able to put together a log processor that consumes from Kafka, does a bunch of field mangling and then ingests into Elasticsearch in a couple dozen lines of yaml.
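To give a flavor, here is a minimal sketch of such a pipeline, written from memory against the documented config shape — not the poster's actual config; the broker address, topic, index name and the Bloblang mapping are all invented:

```yaml
input:
  kafka:
    addresses: [ "localhost:9092" ]   # hypothetical broker
    topics: [ "app-logs" ]
    consumer_group: "log-processor"

pipeline:
  processors:
    - bloblang: |
        # normalize one field and drop the original
        root = this
        root.level = this.severity.uppercase()
        root.severity = deleted()

output:
  elasticsearch:
    urls: [ "http://localhost:9200" ]
    index: "app-logs"
```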
No thank you. I appreciate the power, speed, simplicity and flexibility of UNIX/GNU style text tools. I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare.
Ever try this on large files? A lot of PowerShell commands I make like this can take minutes to run when a combination of Linux commands and Awk might take a couple of seconds.
I agree that this is something missing on classic UNIX shells: typed output streams.
I had this discussion a while back on HN, though I can't find it ATM (I wish there was a comment history search function). I am far from the first one who thought of that, and there are a few implementations of this idea.
This is nice. I use `column` for pretty printing CSV/TSV, but this fixes two tiny gaps: `sort` can't easily skip header lines, and `jq` can't parse CSV input (it supports `@csv` for output conversion but not input).
$ cat example.csv
color,shape,flag,index
yellow,triangle,1,11
red,square,1,15
red,circle,1,16
red,square,0,48
purple,triangle,0,51
red,square,0,77
# pretty printing
$ column -ts',' example.csv
color shape flag index
yellow triangle 1 11
red square 1 15
red circle 1 16
red square 0 48
purple triangle 0 51
red square 0 77
# sorting with skipped headers is a mess.
$ (head -n 1 example.csv && tail -n +2 example.csv | sort -r -k4 -t',') | column -ts','
color shape flag index
red square 0 77
purple triangle 0 51
red square 0 48
red circle 1 16
red square 1 15
yellow triangle 1 11
> `jq` supports `@csv` for output conversion but not input
Actually, `jq` can cope with trivial CSV input like your example: `jq -R 'split(",")'` will turn a CSV into an array of arrays. To then sort it in reverse order by 3rd column and retain the header, the following fell out of my fingers (I'm beyond certain that a more skilled `jq` user than me could improve it):
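The program itself didn't survive the copy; here is one reconstruction (mine, sorting on the numeric `index` column — not necessarily what was originally posted):

```shell
# Reverse-sort the body by the numeric "index" column, keeping the header row
jq -R -s -r '
  split("\n")
  | map(select(length > 0) | split(","))
  | .[:1] + (.[1:] | sort_by(.[3] | tonumber) | reverse)
  | .[]
  | @csv' example.csv
```

`-R -s` slurps the whole file as one raw string; `@csv` re-quotes each row on the way out.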
This is a cool tool. And it works well with GPT-4. I asked ChatGPT the following:
```
I'd like to use Miller from John Kerl on the CSV called test.csv. It has 3 columns, Name, City and State. I'd like to filter City where it's either Chicago or Boston. I want to remove the State column and add a new column called NewColumn. This new column shall be filled with the value of the Name column appended with the letter `_new`. Finally the output shall be saved into the file called myfile.csv. Write the command to do that
```
And it replied with
mlr --csv filter '$City == "Chicago" || $City == "Boston"' then cut -x -f State then put '$NewColumn = $Name . "_new"' test.csv > myfile.csv
I'm sure that someone familiar with Miller would write this command faster than it takes to write the prompt to GPT, but as a newbie I'd have spent much longer. Also, each step was described perfectly.
Looks cool! It'd help to have some more full examples (including output) of what some common use-cases do in the README. After looking through this, I'm still scratching my head trying to think of what problem this could solve in my own practice, and I work with a lot of csv files daily
In some sense, there's no answer to your question because everyone's tool workflow is their own.
That said, the most recent invocation for me was `mlr --icsv --ojson cat < a515b308-9a0e-4e4e-99a2-eafaa6159762.csv` to fix up a CloudTrail CSV. After that I happen to have more muscle memory with jq, but conceptually the next step could have been `filter '$eventname == "CreateNodegroup"'`.
I was wondering what they meant by that term as well. Does every JSON document in the array have to have the exact same structure (including the ordering of key-value pairs)? What happens if row #1000 introduces a new key-value pair not seen before? What if the value for a key is an array?
>How is it better than SQL for these tasks on tabular data?

It has far better handling of CSV/TSV files on the command line directly and is composable in shell pipelines.
>My point, if one wants cat, sort, sed, join on tabular data, SQL is exactly that.
SQL is a language for data in the form of tables in relational databases.
While it can do sorting or joining or some changes, it is meant for a different domain than these tools, with other constraints, other concerns, and other patterns of use...
You don't need to load anything to a db, for starters.
You also normally don't care for involving a DB in order to use in a shell script, or for quick shell exploration.
You also can't mix SQL and regular unix userland in a pipeline (well, with enough effort you can, but it's not something people do or need to do).
Because SQL (Structured Query Language) is the name of the language used to access and manipulate the data, not the name for that kind of data. There have been, and are, many databases that essentially use tabular data but are not SQL databases.
[1] https://clickhouse.com/blog/extracting-converting-querying-l...
[2] https://colab.research.google.com/github/dcmoura/spyql/blob/...
password4321 | 3 years ago
yq: command-line YAML, JSON, XML, CSV and properties processor
https://news.ycombinator.com/item?id=34656022
Also mentions gojq, Benthos, xsv, Damsel, a 2nd yq, htmlq, cfn-flip, csvq, zq, and zsv.
elesiuta | 3 years ago
New(ish) command line tools
https://news.ycombinator.com/item?id=31009313
rustyf | 3 years ago
Get-Content .\example.csv | ConvertFrom-Csv | Where-Object -Property color -eq red
unxdfa | 3 years ago
Unfortunately after working with it for a few years I utterly despise PowerShell for many other reasons.
MayeulC | 3 years ago
Searching for that comment, I came across relevant stories:
https://news.ycombinator.com/item?id=27535357
https://news.ycombinator.com/item?id=25225113
snidane | 3 years ago
BUT, leaks memory like crazy.
Despite documentation stating the verbs are fully streaming.
> Fully streaming verbs
> These don't retain any state from one record to the next. They are memory-friendly, and they don't wait for end of input to produce their output.
https://miller.readthedocs.io/en/6.7.0/streaming-and-memory/...
coldtea | 3 years ago
Huh?
a) Isn't it written in Golang, which has a GC? Does it do custom buffer based management?
b) Isn't it supposed to be run on a file and get some output - as opposed to an interactive session? Why would it matter if it leaks, then, and how could it leak, as the memory is returned to the OS when it ends?
dang | 3 years ago
Miller – tool for querying, shaping, reformatting data in CSV, TSV, and JSON - https://news.ycombinator.com/item?id=29651871 - Dec 2021 (33 comments)
Miller CLI – Like Awk, sed, cut, join, and sort for CSV, TSV and JSON - https://news.ycombinator.com/item?id=28298729 - Aug 2021 (66 comments)
Miller v5.0.0: Autodetected line-endings, in-place mode, user-defined functions - https://news.ycombinator.com/item?id=13751389 - Feb 2017 (20 comments)
Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV - https://news.ycombinator.com/item?id=10066742 - Aug 2015 (76 comments)
darrenf | 3 years ago

NB. there is also an entry in the `jq` cookbook for parsing CSVs into arrays of objects (and keeping numbers as numbers, dealing with nulls, etc): https://github.com/stedolan/jq/wiki/Cookbook#convert-a-csv-f...
2h | 3 years ago

I like these command line tools, but I think they can cripple someone actually learning a programming language. For example, here is a short program that does your last example:
https://go.dev/play/p/9bASZ97lLWv
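In that spirit, a sketch of a small Go program doing the same reverse sort on the example CSV (my own version, not necessarily the one behind the playground link):

```go
package main

import (
	"encoding/csv"
	"os"
	"sort"
	"strconv"
	"strings"
)

// example.csv from upthread, inlined so the program is self-contained
const input = `color,shape,flag,index
yellow,triangle,1,11
red,square,1,15
red,circle,1,16
red,square,0,48
purple,triangle,0,51
red,square,0,77
`

// sortBody sorts data rows in descending numeric order of the
// fourth column ("index"), mirroring `sort -r -k4 -t','`.
func sortBody(body [][]string) {
	sort.Slice(body, func(i, j int) bool {
		a, _ := strconv.Atoi(body[i][3])
		b, _ := strconv.Atoi(body[j][3])
		return a > b
	})
}

func main() {
	rows, err := csv.NewReader(strings.NewReader(input)).ReadAll()
	if err != nil {
		panic(err)
	}
	header, body := rows[0], rows[1:]
	sortBody(body)
	w := csv.NewWriter(os.Stdout)
	w.Write(header)
	w.WriteAll(body) // WriteAll flushes the writer
}
```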
melx | 3 years ago
But seems like it cannot handle a simple use case: CSV without header.
$ mlr --csv head -n 20 pp-2002.csv
mlr: unacceptable empty CSV key at file "pp-2002.csv" line 1.
You have to pass it explicitly (FYI `implicit-csv-header` is a terrible arg name):
$ mlr --csv --implicit-csv-header head -n 20 pp-2002.csv
While `head` obliges rightly
$ head -n 20 pp-2002.csv
0xbadcafebee | 3 years ago
brew install csvkit and enjoy
avodonosov | 3 years ago
That's the first question I have after reading the title. Haven't read the article.
Edited the first sentence. Originally it was "Why it is not called SQL if works on tabular data?".
My point, if one wants cat, sort, sed, join on tabular data, SQL is exactly that. Awk is too powerful, not sure about it.
garciasn | 3 years ago
2. Because it's closer to an amalgamation of the standard shell scripting tools (cut, sort, jq, etc) than it is to a SQL variant.