awk is really, really powerful. It is fast, you can do a lot very efficiently just by playing with the FS variable, and you can find it on every *nix box. It also works nicely with other CLI tools such as cut, paste, or datamash (http://www.gnu.org/software/datamash/).
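A minimal sketch of the kind of FS play meant here (my own example, with a made-up passwd-style input line):

```shell
# Treat ':' as the field separator and print the first and last fields
# of a passwd-style record.
printf 'root:x:0:0:root:/root:/bin/bash\n' \
  | awk 'BEGIN { FS = ":" } { print $1, $NF }'
# prints: root /bin/bash
```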
As soon as it becomes too complex, though, it is better to resort to a real language (be it Python, Perl, or whatever - my favourite is Python + SciPy).
I use awk and sed (with tr/sort/uniq doing some heavy lifting) for most of my data analysis work. It's a really great way to play around with data to get a feel for it before formalizing it in a different language.
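As an illustration of that kind of quick-look work (my own toy example, not the commenter's): a one-liner frequency table of a column, with sort and uniq doing the heavy lifting.

```shell
# Count occurrences of the 2nd comma-separated column; sample data is made up.
printf 'a,x\nb,y\na,x\nc,x\n' \
  | cut -d, -f2 | sort | uniq -c | sort -nr
```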
For an interview, I wrote this guy to do a distributed-systems top-ten word count problem. It turned out to be much faster than anything else I wrote when combined with parallel. It's easier to read when split into a bash script :) [0].
time /usr/bin/parallel -Mx -j+0 -S192.168.1.3,: --block 3.5M --pipe --fifo --bg "/usr/bin/numactl -l /usr/bin/mawk -vRS='[^a-zA-Z0-9]+' '{a[tolower(\$1)]+=1} END { for(k in a) { print a[k],k} }'" < ~/A* | /usr/bin/mawk '{a[$2]+=$1} END {for(k in a) {if (a[k] > 1000) print a[k],k}}' | sort -nr | head -10
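A stripped-down, single-machine version of that pipeline (my simplification: no parallel or numactl, plain awk in place of mawk, and a made-up sample sentence; the regex RS needs gawk or mawk):

```shell
# Each run of non-alphanumerics ends a record, so every record is one word;
# tally the words, then rank by count.
printf 'the cat and the dog and the bird\n' \
  | awk -v RS='[^a-zA-Z0-9]+' '$0 != "" { a[tolower($0)] += 1 }
      END { for (k in a) print a[k], k }' \
  | sort -nr | head -10
# first line: 3 the
```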
Awk is great at what it does, but I find myself unable to keep it cached in my brain long enough to reuse it. Using awk usually means a Google search for how to use it, which defeats the point of working quickly at a terminal.
AWK is a language. I always get upset when people call it a program or a tool. In a very broad and general sense all languages are also programs or tools, but AWK is first and foremost a language. Perl isn't called a tool or a program a hundredth as much as AWK is. Maybe I am just petty?
I've used some very basic TXR for refactorings that were a bit beyond my IDE's capabilities, which gave me a taste of how powerful it could be. One thing that's slowed me down in experimenting with it is having to save the script, rerun TXR and refresh the output file each time I make a change. Do you have any tips for quickly and interactively building complex scripts?
pup uses CSS selectors to select elements from HTML documents. Used in conjunction with curl, it gives you a very simple and low friction way to scrape data in scripts.
I would add to that list Nokogiri, "The Chainsaw". xsltproc is ubiquitous, but writing xslt is akin to having a pack of wild monkeys compose a mural with their excrement.
There's a bit of bash boilerplate, but honestly it was about what I would expect, given a structure with so many layers of indirection.
Pain points:
* Switching between bash and jq's filtering language led me to use string interpolation with bash variables. Malicious inputs can probably exploit this (and it was just awkward anyway).
* A "select one" filter would be nice, instead of select + get first element.
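On both points: jq's --arg flag passes a shell variable in as a properly quoted jq variable, which sidesteps the injection risk of string interpolation, and jq (1.5+) ships first(f) as a "select one" filter. A small sketch with made-up data:

```shell
# Pass the shell variable via --arg instead of splicing it into the program;
# first(...) stops at the first matching element.
name='O"Malley'   # a value that would break naive string interpolation
echo '[{"name":"O\"Malley","id":1},{"name":"Bob","id":2}]' \
  | jq --arg n "$name" 'first(.[] | select(.name == $n)) | .id'
# prints: 1
```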
First of all, I have little sympathy for people who create JSON like that. If you create a messy, hard-to-use JSON "schema", it should be little surprise to anybody that it's messy and hard to use.
FWIW though, jq can do the query, but I'm not going to spend the time doing it.
The :accessors argument is used to resolve field access once and for all when visiting a table. If :index is true, we build a temporary hash table keyed on identifiers.
That's true even in Clojure, arguably the simplest and cleanest language ever invented for complex data transformation and extraction.
The Clojure solution to this still ends up requiring temporary variables and some sort of model transformation functionality. (Will try to post my Clojure solution in 5 hours after my next noprocrast timer is up.)
If the data could first be transformed so that it doesn't require temporary variables or ad-hoc transformation function definitions, instead making use of "paths", then it would be easier with command line tools. Such a transformation could be possible as its own command line interface.
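For what it's worth, jq already exposes a path-based view of a document through its paths and getpath builtins, which is close to the transformation imagined here (my analogy, made-up data):

```shell
# Address a nested value by its path rather than by chained accessors.
echo '{"a":{"b":[{"c":42}]}}' | jq 'getpath(["a","b",0,"c"])'
# prints: 42
```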
Does anyone know of a tool like ranger[1] for visualizing JSON on the terminal? There is a Chrome Extension[2], but nothing useful to browse JSON on the terminal (it doesn't have to be like ranger, I'm looking for any tool that makes it easier to take a look at a JSON file).
Not "like ranger", but Emacs has extensive libraries to edit, highlight, and tidy various data formats (XML, HTML, YAML), and JSON is no exception: https://github.com/thorstadt/json-pretty-print.el
For converting arrays of objects between formats like CSV, JSON, YAML, and XML (WIP), I built aoot[1], which stands for "array of objects to". It's written in Node.js and uses upstream packages whenever possible.
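I haven't used aoot, but as a rough stand-in for the JSON-to-CSV direction, jq's @csv filter does the same job (made-up data; relies on jq preserving the key order of parsed objects, which it does):

```shell
# Emit a header row from the first object's keys, then one CSV row per object.
echo '[{"a":1,"b":2},{"a":3,"b":4}]' \
  | jq -r '(.[0] | keys_unsorted), (.[] | [.[]]) | @csv'
```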
Definitely. Most software vulnerabilities come from failure to write formal parsers for all inputs. Is there a command-line YACC for compiling simple stuff?
(Submitted a pull request.)
[0] https://github.com/red-bin/wc_fun/blob/master/wordcount.sh
Data Science on the Command Line
http://datascienceatthecommandline.com/
See http://pgloader.io/howto/quickstart.html
[1] https://github.com/montanaflynn/aoot