item 11649142

A list of command line tools for manipulating structured text data

177 points | networked | 10 years ago | github.com

55 comments

[+] chishaku|10 years ago|reply
Also csvkit.

  in2csv data.xls > data.csv
  in2csv data.json > data.csv
  csvcut -c column_a,column_c data.csv > new.csv
  csvcut -c column_c,column_a data.csv > new.csv
  csvgrep -c phone_number -r "555-555-\d{4}" data.csv > matching.csv
  csvjson data.csv > data.json
  csvsql --query "select name from data where age > 30" data.csv > old_folks.csv
  csvsql --db postgresql:///database --insert data.csv
  sql2csv --db postgresql:///database --query "select * from data" > extract.csv
https://github.com/wireservice/csvkit

(Submitted a pull request.)

[+] Apofis|10 years ago|reply
Oh man, where were you a few months ago when I had to write simple apps just to do some of this stuff... also this will be so damn handy!:

    csvsql --query "select name from data where age > 30" data.csv > old_folks.csv
[+] benou|10 years ago|reply
awk is really, really powerful. It is fast, and you can do a lot very efficiently just playing with the 'FS' variable. And you can find it on all *nix boxes. And it works nicely with other CLI tools such as cut, paste, or datamash (http://www.gnu.org/software/datamash/). As soon as it becomes too complex, though, it is better to resort to a real language (be it python, perl or whatever - my favourite is python + scipy).
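A minimal sketch of the 'FS' trick being described here (the field separator and data values are made up for illustration):

```shell
# Sum the third field of colon-separated records, just by setting FS
# (via -F, which is shorthand for FS=":").
printf 'alice:2:10\nbob:3:20\n' \
  | awk -F: '{ sum += $3 } END { print sum }'
# prints 30
```

The same idea handles tabs, commas, or any regex as the separator, which is what makes plain awk go so far before you need a real language.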
[+] bpchaps|10 years ago|reply
I use awk and sed (with tr/sort/uniq doing some heavy lifting) for most of my data analysis work. It's a really great way to play around with data to get a feel for it before formalizing it in a different language.
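The tr/sort/uniq combo described above can be sketched as a quick word-frequency pipeline (sample text is made up):

```shell
# Lowercase, split on non-letters, then count and rank words --
# a classic exploratory one-liner before formalizing anything.
printf 'the cat and the dog and the bird\n' \
  | tr 'A-Z' 'a-z' \
  | tr -cs 'a-z' '\n' \
  | sort | uniq -c | sort -rn | head -3
```

The top line of the output is the most frequent word with its count ("3 the" for the sample input).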

For an interview, I wrote this guy to do a distributed-systems top-ten word count problem. It turned out to be much faster than anything else I wrote when combined with parallel. It's easier to read when split into a bash script :) [0].

  time /usr/bin/parallel -Mx -j+0  -S192.168.1.3,: --block 3.5M --pipe --fifo --bg "/usr/bin/numactl -l /usr/bin/mawk -vRS='[^a-zA-Z0-9]+' '{a[tolower(\$1)]+=1} END { for(k in a) { print a[k],k} }'" < ~/A* |  /usr/bin/mawk '{a[$2]+=$1} END {for(k in a) {if (a[k] > 1000) print a[k],k}}' | sort -nr | head -10
[0] https://github.com/red-bin/wc_fun/blob/master/wordcount.sh
[+] saturncoleus|10 years ago|reply
Awk is great at what it does, but I find myself unable to keep it cached in my brain long enough to reuse it. Using awk usually means a google search of how to use it, which defeats the point of working quickly at a terminal.
[+] baldfat|10 years ago|reply
AWK is a language. I always get upset when people call it a program or a tool. In a very BROAD and general sense all languages are also a program or a tool but it is first and foremost a language. Perl isn't called a tool or a program 1/100 as much as AWK. Maybe I am just petty?
[+] kazinator|10 years ago|reply
[+] vojvod|10 years ago|reply
I've used some very basic TXR for refactorings that were a bit beyond my IDE's capabilities, which gave me a taste of how powerful it could be. One thing that's slowed me down in experimenting with it is having to save the script, rerun TXR and refresh the output file each time I make a change. Do you have any tips for quickly and interactively building complex scripts?
[+] GhotiFish|10 years ago|reply
Check out pup for parsing HTML. https://github.com/ericchiang/pup

pup uses CSS selectors to select elements from HTML documents. Used in conjunction with curl, it gives you a very simple and low friction way to scrape data in scripts.

[+] crb002|10 years ago|reply
I would add to that list Nokogiri, "The Chainsaw". xsltproc is ubiquitous, but writing xslt is akin to having a pack of wild monkeys compose a mural with their excrement.
[+] fiatjaf|10 years ago|reply
It isn't easy (perhaps not even possible) to get the name of the fruit Bob grows in his farm using any of these tools and the following data:

  {
    "models": [{
      "title": "fruits",
      "fields": [
        {"name": "Name", "key": "3746"},
        {"name": "Colour", "key": "4867"}
      ],
      "entities": [{
        "_id": "372612",
        "3746": "Orange",
        "4867": "orange"
      }]
    }, {
      "title": "farmers",
      "fields": [
        {"name": "Full name", "key": "8367"},
        {"name": "Address", "key": "3947"},
        {"name": "Fruits", "key": "5243"}
      ],
      "entities": [{
        "_id": "747463",
        "8367": "Bob, the farmer",
        "3947": "Farmland",
        "5243": ["372612"]
      }]
    }]
  }
[+] timmclean|10 years ago|reply
I've been meaning to learn jq, so I decided to give it a try.

    FRUITS=$(cat input.json | jq '.models | map(select(.title == "fruits")) | .[0]')
    FRUIT_NAME_KEY=$(echo "$FRUITS" | jq '.fields | map(select(.name == "Name")) | .[0].key')
    
    FARMERS=$(cat input.json | jq '.models | map(select(.title == "farmers")) | .[0]')
    FARMER_NAME_KEY=$(echo "$FARMERS" | jq '.fields | map(select(.name == "Full name")) | .[0].key')
    FARMER_FRUITS_KEY=$(echo "$FARMERS" | jq '.fields | map(select(.name == "Fruits")) | .[0].key')
    
    BOB=$(echo "$FARMERS" | jq '.entities | map(select(.['$FARMER_NAME_KEY'] == "Bob, the farmer")) | .[0]')
    BOB_FRUIT_IDS=$(echo "$BOB" | jq '.['$FARMER_FRUITS_KEY'] | .[]' -r)
    
    for BOB_FRUIT_ID in $BOB_FRUIT_IDS; do
        echo "$FRUITS" | jq '.entities | map(select(._id == "'$BOB_FRUIT_ID'")) | .[0] | .['$FRUIT_NAME_KEY']'
    done
There's a bit of bash boilerplate, but honestly it was about what I would expect, given a structure with so many layers of indirection.

Pain points:

* Switching between bash and jq's filtering language led me to use string interpolation with bash variables. Malicious inputs can probably exploit this (and it was just awkward anyway).

* A "select one" filter would be nice, instead of select + get first element.

[+] jlarocco|10 years ago|reply
First of all, I have little sympathy for people who create JSON like that. You created a messy, hard-to-use JSON "schema"; it should be little surprise to anybody that it's messy and hard to use.

FWIW though, jq can do the query, but I'm not going to spend the time doing it.

[+] junke|10 years ago|reply
For details, see http://pastebin.com/raw/1W88ywuu

The actual test:

    (defparameter *model*
      (make-model (cl-json:decode-json-from-source #P"/tmp/json.test")))

    (with-table (farmers :title "farmers"
                         :model *model*
                         :accessors ((farmer-name "Full name")
                                     (farmer-fruits "Fruits")))
      (dolist (farmer (table-entities farmers))
        (when (ppcre:scan "Bob" (farmer-name farmer))
          (with-table (fruits :title "fruits"
                              :model *model*
                              :accessors ((fruit-name "Name"))
                              :index t)
            (return
              (loop
                 for fid in (farmer-fruits farmer)
                 collect (fruit-name (iget fruits fid))))))))
The :accessors argument is used to resolve field access once and for all when visiting a table. If :index is true, we build a temporary hash-table based on identifiers.
[+] sdegutis|10 years ago|reply
That's true even in Clojure, arguably the simplest and cleanest language ever invented for complex data transformation and extraction.

The Clojure solution to this still ends up requiring temporary variables and some sort of model transformation functionality. (Will try to post my Clojure solution in 5 hours after my next noprocrast timer is up.)

If the data could first be transformed so that it doesn't require temporary variables or ad-hoc transformation function definitions, instead making use of "paths", then it would be easier with command line tools. Such a transformation could be possible as its own command line interface.

[+] Gratsby|10 years ago|reply
Don't forget cut and sed.
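Both earn their keep even alone; a couple of made-up one-liners:

```shell
# cut: pull fields 1 and 3 out of a colon-delimited record
echo 'root:x:0:0' | cut -d: -f1,3        # -> root:0

# sed: in-stream substitution
echo 'hello world' | sed 's/world/there/' # -> hello there
```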
[+] fiatjaf|10 years ago|reply
Does anyone know of a tool like ranger[1] for visualizing JSON on the terminal? There is a Chrome Extension[2], but nothing useful to browse JSON on the terminal (it doesn't have to be like ranger, I'm looking for any tool that makes it easier to take a look at a JSON file).

  [1]: https://github.com/hut/ranger
  [2]: https://chrome.google.com/webstore/detail/json-finder/flhdcaebggmmpnnaljiajhihdfconkbj
[+] xrstf|10 years ago|reply
jq can pretty print JSON files, including coloring.
[+] anonfunction|10 years ago|reply
For converting arrays of objects between formats like CSV, JSON, YAML, XML (WIP), etc., I built aoot[1], which stands for "Array of objects to". It's written in Node.js and uses upstream packages whenever possible.

1. https://github.com/montanaflynn/aoot

[+] ktRolster|10 years ago|reply
yacc, antlr, lemon, bison.....
[+] crb002|10 years ago|reply
Definitely. Most software vulnerabilities are from failure to write formal parsers for all inputs. Is there a command line YACC for compiling simple stuff?