item 11649142

A list of command line tools for manipulating structured text data

177 points | networked | 10 years ago | github.com

55 comments

[+] chishaku|10 years ago|reply
Also csvkit.

  in2csv data.xls > data.csv
  in2csv data.json > data.csv
  csvcut -c column_a,column_c data.csv > new.csv
  csvcut -c column_c,column_a data.csv > new.csv
  csvgrep -c phone_number -r "555-555-\d{4}" data.csv > matching.csv
  csvjson data.csv > data.json
  csvsql --query "select name from data where age > 30" data.csv > old_folks.csv
  csvsql --db postgresql:///database --insert data.csv
  sql2csv --db postgresql:///database --query "select * from data" > extract.csv
https://github.com/wireservice/csvkit

(Submitted a pull request.)

[+] Apofis|10 years ago|reply
Oh man, where were you a few months ago when I had to write simple apps just to do some of this stuff... also this will be so damn handy!:

    csvsql --query "select name from data where age > 30" data.csv > old_folks.csv
[+] benou|10 years ago|reply
awk is really, really powerful. It is fast, and you can do a lot very efficiently just playing with the 'FS' variable. And you can find it on all *nix boxes. And it works nicely with other CLI tools such as cut, paste, or datamash (http://www.gnu.org/software/datamash/). As soon as it becomes too complex, though, it is better to resort to a real language (be it python, perl or whatever - my favourite is python + scipy).
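A minimal sketch of the 'FS' trick being described here (the field separator and data values are made up for illustration):

```shell
# Sum the third field of colon-separated records, just by setting FS
# (via -F, which is shorthand for FS=":").
printf 'alice:2:10\nbob:3:20\n' \
  | awk -F: '{ sum += $3 } END { print sum }'
# prints 30
```

The same idea handles tabs, commas, or any regex as the separator, which is what makes plain awk go so far before you need a real language.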
[+] bpchaps|10 years ago|reply
I use awk and sed (with tr/sort/uniq doing some heavy lifting) for most of my data analysis work. It's a really great way to play around with data to get a feel for it before formalizing it in a different language.
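The tr/sort/uniq combo described above can be sketched as a quick word-frequency pipeline (sample text is made up):

```shell
# Lowercase, split on non-letters, then count and rank words --
# a classic exploratory one-liner before formalizing anything.
printf 'the cat and the dog and the bird\n' \
  | tr 'A-Z' 'a-z' \
  | tr -cs 'a-z' '\n' \
  | sort | uniq -c | sort -rn | head -3
```

The top line of the output is the most frequent word with its count ("3 the" for the sample input).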

For an interview, I wrote this guy to do a distributed-systems top-ten word count problem. It turned out to be much faster than anything else I wrote when combined with parallel. It's easier to read when split into a bash script :) [0].

  time /usr/bin/parallel -Mx -j+0  -S192.168.1.3,: --block 3.5M --pipe --fifo --bg "/usr/bin/numactl -l /usr/bin/mawk -vRS='[^a-zA-Z0-9]+' '{a[tolower(\$1)]+=1} END { for(k in a) { print a[k],k} }'" < ~/A* |  /usr/bin/mawk '{a[$2]+=$1} END {for(k in a) {if (a[k] > 1000) print a[k],k}}' | sort -nr | head -10
[0] https://github.com/red-bin/wc_fun/blob/master/wordcount.sh
[+] saturncoleus|10 years ago|reply
Awk is great at what it does, but I find myself unable to keep it cached in my brain long enough to reuse it. Using awk usually means a google search of how to use it, which defeats the point of working quickly at a terminal.
[+] baldfat|10 years ago|reply
AWK is a language. I always get upset when people call it a program or a tool. In a very BROAD and general sense all languages are also a program or a tool but it is first and foremost a language. Perl isn't called a tool or a program 1/100 as much as AWK. Maybe I am just petty?
[+] kazinator|10 years ago|reply
[+] vojvod|10 years ago|reply
I've used some very basic TXR for refactorings that were a bit beyond my IDE's capabilities, which gave me a taste of how powerful it could be. One thing that's slowed me down in experimenting with it is having to save the script, rerun TXR and refresh the output file each time I make a change. Do you have any tips for quickly and interactively building complex scripts?
[+] GhotiFish|10 years ago|reply
Check out pup for parsing HTML. https://github.com/ericchiang/pup

pup uses CSS selectors to select elements from HTML documents. Used in conjunction with curl, it gives you a very simple and low friction way to scrape data in scripts.

[+] crb002|10 years ago|reply
I would add to that list Nokogiri, "The Chainsaw". xsltproc is ubiquitous, but writing xslt is akin to having a pack of wild monkeys compose a mural with their excrement.
[+] fiatjaf|10 years ago|reply
It isn't easy (perhaps not even possible) to get the name of the fruit Bob grows in his farm using any of these tools and the following data:

  {
    "models": [{
      "title": "fruits",
      "fields": [
        {"name": "Name", "key": "3746"},
        {"name": "Colour", "key": "4867"}
      ],
      "entities": [{
        "_id": "372612",
        "3746": "Orange",
        "4867": "orange"
      }]
    }, {
      "title": "farmers",
      "fields": [
        {"name": "Full name", "key": "8367"},
        {"name": "Address", "key": "3947"},
        {"name": "Fruits", "key": "5243"}
      ],
      "entities": [{
        "_id": "747463",
        "8367": "Bob, the farmer",
        "3947": "Farmland",
        "5243": ["372612"]
      }]
    }]
  }
[+] timmclean|10 years ago|reply
I've been meaning to learn jq, so I decided to give it a try.

    FRUITS=$(cat input.json | jq '.models | map(select(.title == "fruits")) | .[0]')
    FRUIT_NAME_KEY=$(echo "$FRUITS" | jq '.fields | map(select(.name == "Name")) | .[0].key')
    
    FARMERS=$(cat input.json | jq '.models | map(select(.title == "farmers")) | .[0]')
    FARMER_NAME_KEY=$(echo "$FARMERS" | jq '.fields | map(select(.name == "Full name")) | .[0].key')
    FARMER_FRUITS_KEY=$(echo "$FARMERS" | jq '.fields | map(select(.name == "Fruits")) | .[0].key')
    
    BOB=$(echo "$FARMERS" | jq '.entities | map(select(.['$FARMER_NAME_KEY'] == "Bob, the farmer")) | .[0]')
    BOB_FRUIT_IDS=$(echo "$BOB" | jq '.['$FARMER_FRUITS_KEY'] | .[]' -r)
    
    for BOB_FRUIT_ID in $BOB_FRUIT_IDS; do
        echo "$FRUITS" | jq '.entities | map(select(._id == "'$BOB_FRUIT_ID'")) | .[0] | .['$FRUIT_NAME_KEY']'
    done
There's a bit of bash boilerplate, but honestly it was about what I would expect, given a structure with so many layers of indirection.

Pain points:

* Switching between bash and jq's filtering language led me to use string interpolation with bash variables. Malicious inputs can probably exploit this (and it was just awkward anyway).

* A "select one" filter would be nice, instead of select + get first element.

[+] jlarocco|10 years ago|reply
First of all, I have little sympathy for people who create JSON like that. You created a messy, hard-to-use JSON "schema"; it should be little surprise to anybody that it's messy and hard to use.

FWIW though, jq can do the query, but I'm not going to spend the time doing it.

[+] junke|10 years ago|reply
For details, see http://pastebin.com/raw/1W88ywuu

The actual test:

    (defparameter *model*
      (make-model (cl-json:decode-json-from-source #P"/tmp/json.test")))

    (with-table (farmers :title "farmers"
                         :model *model*
                         :accessors ((farmer-name "Full name")
                                     (farmer-fruits "Fruits")))
      (dolist (farmer (table-entities farmers))
        (when (ppcre:scan "Bob" (farmer-name farmer))
          (with-table (fruits :title "fruits"
                              :model *model*
                              :accessors ((fruit-name "Name"))
                              :index t)
            (return
              (loop
                 for fid in (farmer-fruits farmer)
                 collect (fruit-name (iget fruits fid))))))))
The :accessors argument is used to resolve field access once and for all when visiting a table. If :index is true, we build a temporary hash-table based on identifiers.
[+] sdegutis|10 years ago|reply
That's true even in Clojure, arguably the simplest and cleanest language ever invented for complex data transformation and extraction.

The Clojure solution to this still ends up requiring temporary variables and some sort of model transformation functionality. (Will try to post my Clojure solution in 5 hours after my next noprocrast timer is up.)

If the data could first be transformed so that it doesn't require temporary variables or ad-hoc transformation function definitions, instead making use of "paths", then it would be easier with command line tools. Such a transformation could be possible as its own command line interface.

[+] Gratsby|10 years ago|reply
Don't forget cut and sed.
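Both earn their keep even alone; a couple of made-up one-liners:

```shell
# cut: pull fields 1 and 3 out of a colon-delimited record
echo 'root:x:0:0' | cut -d: -f1,3        # -> root:0

# sed: in-stream substitution
echo 'hello world' | sed 's/world/there/' # -> hello there
```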
[+] fiatjaf|10 years ago|reply
Does anyone know of a tool like ranger[1] for visualizing JSON on the terminal? There is a Chrome Extension[2], but nothing useful to browse JSON on the terminal (it doesn't have to be like ranger, I'm looking for any tool that makes it easier to take a look at a JSON file).

  [1]: https://github.com/hut/ranger
  [2]: https://chrome.google.com/webstore/detail/json-finder/flhdcaebggmmpnnaljiajhihdfconkbj
[+] xrstf|10 years ago|reply
jq can pretty print JSON files, including coloring.
[+] anonfunction|10 years ago|reply
For converting arrays of objects between formats like CSV, JSON, YAML, XML (WIP), etc., I built aoot[1], which stands for "Array of objects to". It's written in Node.js and uses upstream packages whenever possible.

1. https://github.com/montanaflynn/aoot

[+] ktRolster|10 years ago|reply
yacc, antlr, lemon, bison.....
[+] crb002|10 years ago|reply
Definitely. Most software vulnerabilities are from failure to write formal parsers for all inputs. Is there a command line YACC for compiling simple stuff?