
Explorations in Unix

239 points | telemachos | 13 years ago | drbunsen.org

33 comments

[+] frankc|13 years ago|reply
I use unix in the same way and for the same purpose as described in the blog, but I have come to the opinion that once you get into the describe-and-visualize phase, it's much easier to just drop into R. Reading in the kind of file being worked on here is often as simple as

foo <- read.csv("foo.csv")

Getting summary descriptive statistics, item counts, scatter plots and histograms is often as easy as

summary(foo)

table(foo$col)

plot(foo$xcol, foo$ycol)

hist(foo$col)

I think that is a lot simpler than a four- or five-command pipeline, which can be mistake-prone to edit when you want to change column names or things like that. I still do these kinds of things in the shell sometimes, and I don't know if I can put my finger on exactly when I would drop into R vs. write out a pipeline, but there IS a line somewhere...
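For comparison, the kind of multi-command pipeline being referred to might look like the sketch below. The file name, delimiter, and column number are all stand-ins, and it assumes a headerless CSV; this is the shell counterpart of something like summary(foo$col).

```shell
# Summary stats for column 2 of a headerless CSV: count, min, max, mean.
# sort -n puts the values in order so the first and last are min and max.
cut -d, -f2 foo.csv | sort -n | awk '
  { s += $1; n++; v[n] = $1 }
  END { printf "n=%d min=%s max=%s mean=%.2f\n", n, v[1], v[n], s/n }'
```

Changing the column means editing the `-f2` and hoping nothing else in the pipeline cared, which is exactly the fragility being described.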

[+] theshadow|13 years ago|reply
Every developer is familiar with the shell and Unix tools; it's trivially simple to look up documentation and get something working quickly with Unix utilities, even if you aren't familiar with the tool to begin with. On the other hand, the idea of learning an entirely new programming language for quick-and-dirty trivial stuff is a little off-putting. That being said, I did not know how simple it is to play with data in R; I will have to check it out.
[+] chimeracoder|13 years ago|reply
I completely agree - while I love using the command-line for most tasks, R is ideally designed for this... and rightly so, since that's the whole point of the language!

R also supports atomic operations on vectors (it automatically maps the operation over the vector), and other idioms that would be liabilities in other programming languages but are hugely beneficial in this one area.

I still haven't found a language that handles reading (and then processing) a CSV file more easily than R does.

Conveniently, R should be really easy to pick up for someone familiar with working with a POSIX shell or Bash. For example, to see all variables defined in the current namespace, just type ls().

R is basically the POSIX mentality applied specifically to data processing, instead of general-purpose work.

[+] lutusp|13 years ago|reply
A quote: "... As if this wasn't enough, he [i.e. Tukey] also invented what is probably the most influential algorithm of all time." (emphasis added)

No, Tukey did not "invent" the FFT. He rediscovered it, as did a number of others over the years since -- who else? -- Gauss first created it.

http://en.wikipedia.org/wiki/Fast_Fourier_transform

A quote: "This method (and the general idea of an FFT) was popularized by a publication of J. W. Cooley and J. W. Tukey in 1965,[2] but it was later discovered (Heideman & Burrus, 1984) that those two authors had independently re-invented an algorithm known to Carl Friedrich Gauss around 1805 (and subsequently rediscovered several times in limited forms)."

[+] mpyne|13 years ago|reply
It's true the writeup didn't mention Gauss, but the PDF that the author linked did mention that.

But either way wouldn't the importance for the field of computing in this case be more on the application of the algorithm and not who published first? I.e. if no one knows about a computable algorithm used sparingly 150 years earlier then I don't think it's completely unfair to give some credit to one who later rediscovers and popularizes that algorithm.

According to the Wikipedia article the Cooley-Tukey algorithm was an independent re-discovery so it's not as if Tukey had read Gauss and then tried to steal the credit (it wasn't even noted until almost 20 years later that Cooley-Tukey was a rediscovery of a Gaussian algorithm).

It's almost unfair though... I think if we dove deeply into what mathematicians like Gauss, Euler, Cauchy, etc. came up with, we'd find other "CS" algorithms that were invented hundreds of years before computers were available to really popularize them. Every time I read about Euler and Gauss especially, I end up even more impressed.

[+] ajross|13 years ago|reply
I don't see how this correction adds to the discussion.

How are the actions of "invention" and "rediscovery" different? Is it less impressive that someone came up with a great idea simply because someone else did it first in a different context? Obviously Gauss should be celebrated too (and obviously is), but I don't see anything wrong with applauding Tukey either...

[+] mturmon|13 years ago|reply
Here's substantiation of this history:

http://www.cis.rit.edu/class/simg716/Gauss_History_FFT.pdf

Besides Gauss, many others including Runge (yes, that Runge) and Burkhardt (yes, the one on Einstein's committee) independently discovered the FFT well before the 1950s. Like so much of Gauss's work, his work on the FFT was unpublished during his lifetime.

Probably it was the conjunction of the algorithm and the emerging power of the digital computer that caused the Cooley-Tukey paper to take off at that historical moment.

[+] mpyne|13 years ago|reply
I almost skipped this because I figured it would be another introductory article on how to use bash and coreutils, but it was actually very good.
[+] fcatalan|13 years ago|reply
Hits close to home. I do a lot of data conversion, arrangement and manipulation on the CLI. When some coworker inherits any of those tasks and I explain how to do it, the answer tends to be "Aaaaallright, I'll use Excel".
[+] piqufoh|13 years ago|reply
Upvoted for unix and "EDA is the lingua franca of data science". What you can do and discard on the unix CLI takes many times longer on certain GUI-based OSes.
[+] nipunn1313|13 years ago|reply
head -3 data* | cat has the same result as head -3 data*

Pipe sends stdout to stdin of the next process. cat sends stdin back to stdout. Piping to cat is rarely eventful (unless you use a flag like cat -n).

[+] pepve|13 years ago|reply
Some tools adjust their output based on it going to a terminal or not. Try 'ls' versus 'ls | cat'.
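One way to see this (the directory and file names here are made up for illustration):

```shell
# Many tools call isatty() on stdout and change format accordingly.
# ls prints names in columns to a terminal, but one per line to a pipe,
# which is what makes it composable with wc, grep, etc.
mkdir -p /tmp/lsdemo && touch /tmp/lsdemo/a /tmp/lsdemo/b /tmp/lsdemo/c
ls /tmp/lsdemo           # in a terminal: "a  b  c" across one line
ls /tmp/lsdemo | cat     # one name per line
ls /tmp/lsdemo | wc -l   # 3
```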
[+] ralph|13 years ago|reply
He writes

    (head -5; tail -5) <data
but that's a bit misleading. These don't work:

    seq 20 | (head -5; tail -5)
    (head -5; tail -5) < <(seq 20)
Both give just the first five lines.
[+] js2|13 years ago|reply
For those who don't follow why this is: in the first case, data is dup'ed to stdin and so remains seekable. This is important since head buffers reads (typically 8k at a time) for efficiency, then uses lseek to reposition to just after the requested number of lines. If reading from a pipe, as in the second case, lseek() fails, and by the time tail runs, head has consumed all of the file.

If you use "seq 10000 | (head -5; tail -5)" you'll get the first and last lines as expected since head hasn't consumed too much of the file.

I don't think this invalidates his example, but it could mention this subtle caveat. :-)
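All three cases can be checked directly (GNU coreutils assumed; the seek-back behavior and the ~8 KiB buffer size are implementation details, so other head implementations may differ):

```shell
# Regular file: head reads a buffer, takes its 5 lines, then lseeks
# stdin back so tail starts where head logically stopped.
seq 20 > /tmp/nums.txt
(head -5; tail -5) < /tmp/nums.txt    # 1-5, then 16-20

# Small pipe: lseek fails and head has already swallowed all 20 lines,
# so tail prints nothing extra.
seq 20 | (head -5; tail -5)           # just 1-5

# Big pipe: head only consumes its first ~8K read; tail gets the rest.
seq 10000 | (head -5; tail -5)        # 1-5, then 9996-10000
```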

[+] derekp7|13 years ago|reply
Are you saying his example doesn't work, or that it can't be extended to your two examples? Because when working on a file it works fine -- just not through a pipe (which both your examples are).

    seq 1 20 >file.txt
    (head -5; tail -5) <file.txt
[+] keithpeter|13 years ago|reply
rs and lam look interesting. Are these commands really only available on BSD (i.e. 'proper' Unix derivatives)? Hoping for Linux-compilable code.
[+] p4bl0|13 years ago|reply
There is a "rs" package in Debian so it must be available on most Linux distributions. Otherwise it should be quite straightforward to port the code since it doesn't do anything complicated.
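If rs isn't installed, a rough stand-in for its simplest use (reshaping a stream into rows of N columns) is xargs -n, which is everywhere:

```shell
# rs 0 3 reshapes a column of values into 3-column rows;
# xargs -n3 approximates that for whitespace-separated input
# (its default command is echo).
seq 9 | xargs -n3
# 1 2 3
# 4 5 6
# 7 8 9
```

This only covers the basic reshape; rs's transposition and padding options have no one-liner equivalent.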
[+] mturmon|13 years ago|reply
"describe" is a nice idea. Just knowing the range, the mean, and the second moment can be helpful.
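A minimal sketch of such a "describe" as an awk one-liner (naive one-pass variance, fine for a quick look but not numerically robust for large or ill-conditioned data):

```shell
# min, max, mean, and variance (second central moment) of a
# column of numbers on stdin.
seq 1 10 | awk '
  NR == 1  { min = $1; max = $1 }
  $1 < min { min = $1 }
  $1 > max { max = $1 }
  { s += $1; ss += $1 * $1; n++ }
  END {
    m = s / n
    printf "n=%d min=%g max=%g mean=%g var=%g\n", n, min, max, m, ss/n - m*m
  }'
# n=10 min=1 max=10 mean=5.5 var=8.25
```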
[+] _hnwo|13 years ago|reply
I'd be interested in what Seth's setup/theme/os of choice is .. :)