I use Unix in the same way and for the same purpose as described in the blog post, but I have come to the opinion that once you get into the describe-and-visualize phase, it's much easier to just drop into R. Reading in the kind of file being worked on here is often as simple as
foo <- read.csv("foo.csv")
Getting summary descriptive statistics, item counts, scatter plots and histograms is often as easy as
summary(foo)
table(foo$col)
plot(foo$xcol, foo$ycol)
hist(foo$col)
I think that is a lot simpler than a four- or five-command pipeline that can be mistake-prone to edit when you want to change column names or things like that. I still do these kinds of things in the shell sometimes, and I can't quite put my finger on when exactly I would drop into R versus writing out a pipeline, but there IS a line somewhere...
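For comparison, the pipeline side of that trade-off might look like this: a rough shell equivalent of R's table(foo$col), counting the distinct values in one column. The foo.csv file and its column layout here are made up for illustration.

```shell
# Stand-in data: two comma-separated columns, the second categorical.
printf 'a,x\nb,y\nc,x\n' > foo.csv

# Extract column 2, sort, count duplicates, sort by descending frequency.
cut -d, -f2 foo.csv | sort | uniq -c | sort -rn

rm foo.csv
```

Changing which column gets counted means editing the `-f2` in the middle of the pipeline, which is exactly the kind of edit that invites mistakes.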
Every developer is familiar with the shell and Unix tools; it's trivially simple to look up documentation and get something working really quickly with Unix utilities, even if you aren't familiar with a given tool at the outset. On the other hand, the idea of learning an entirely new programming language for quick-and-dirty trivial stuff is a little off-putting. That being said, I did not know how simple it is to play with data in R; I will have to check it out.
I completely agree: while I love using the command line for most tasks, R is ideally designed for this... and rightly so, since that's the whole point of the language!
R also supports vectorized operations (it automatically maps an operation over a vector) and other idioms that would be liabilities in other programming languages but are hugely beneficial in this one area.
I still haven't found a language that makes reading (and then processing) a CSV file easier than R does.
Conveniently, R should be really easy to pick up for someone familiar with working with a POSIX shell or Bash. For example, to see all variables defined in the current namespace, just type ls().
R is basically the POSIX mentality applied specifically to data processing, instead of general-purpose work.
A quote: "This method (and the general idea of an FFT) was popularized by a publication of J. W. Cooley and J. W. Tukey in 1965,[2] but it was later discovered (Heideman & Burrus, 1984) that those two authors had independently re-invented an algorithm known to Carl Friedrich Gauss around 1805 (and subsequently rediscovered several times in limited forms)."
It's true the writeup didn't mention Gauss, but the PDF that the author linked did mention that.
But either way, wouldn't the importance for the field of computing in this case rest more on the application of the algorithm than on who published first? That is, if no one knows about a computable algorithm used sparingly 150 years earlier, then I don't think it's completely unfair to give some credit to the one who later rediscovers and popularizes that algorithm.
According to the Wikipedia article, the Cooley-Tukey algorithm was an independent rediscovery, so it's not as if Tukey had read Gauss and then tried to steal the credit (it wasn't even noted until almost 20 years later that Cooley-Tukey was a rediscovery of Gauss's algorithm).
It's almost unfair, though... I think if we dove really deeply into what mathematicians like Gauss, Euler, Cauchy, etc. came up with, we might find other "CS" algorithms that were invented hundreds of years before computers were available to really popularize them. Every time I read about Euler, and Gauss especially, I end up even more impressed.
I don't see how this correction adds to the discussion.
How are the actions of "invention" and "rediscovery" different? Is it less impressive that someone came up with a great idea simply because someone else did it first in a different context? Obviously Gauss should be celebrated too (and obviously is), but I don't see anything wrong with applauding Tukey either...
Besides Gauss, many others including Runge (yes, that Runge) and Burkhardt (yes, the one on Einstein's committee) independently discovered the FFT well before the 1950s. Like so much of Gauss's work, his work on the FFT was unpublished during his lifetime.
Probably it was the conjunction of the algorithm and the emerging power of the digital computer that caused the Cooley-Tukey paper to take off at that historical moment.
Hits close to home. I do a lot of data conversion, arrangement and manipulation on the CLI.
When some coworker inherits any of those tasks and I explain how to do it, the answer tends to be "Aaaaallright, I'll use Excel".
Upvoted for Unix and for "EDA is the lingua franca of data science". What you can do and discard on the Unix CLI takes many times longer on certain GUI-based OSes.
For those who don't follow why this is: in the first case, the file is redirected to stdin and so remains seekable. This matters because head buffers reads (typically 8 KB at a time) for efficiency, then uses lseek() to reposition to just after the requested number of lines. If reading from a pipe, as in the second case, lseek() fails, and by the time tail runs, head has consumed all of the file.
If you use "seq 10000 | (head -5; tail -5)" you'll get the first and last lines as expected since head hasn't consumed too much of the file.
I don't think this invalidates his example, but it could mention this subtle caveat. :-)
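A small demonstration of the difference. Exact buffer sizes vary between head implementations, so the pipe cases are described as they behave with GNU coreutils on typical systems:

```shell
# Through a pipe with a large input, head's first buffered read still
# leaves plenty behind, so tail sees the rest:
seq 10000 | (head -2; tail -2)    # 1 2 9999 10000 on typical systems

# Through a pipe with a small input, head's first read swallows
# everything and tail prints nothing:
seq 100 | (head -2; tail -2)      # only 1 2 with GNU coreutils

# From a regular file, head lseek()s back to just after line 2, so this
# works regardless of input size:
seq 100 > nums.txt
(head -2; tail -2) < nums.txt     # 1 2 99 100
rm nums.txt
```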
Are you saying his example doesn't work, or that it can't be extended to your two examples? Because when working on a file it works fine -- just not through a pipe (which both your examples are).
There is a "rs" package in Debian so it must be available on most Linux distributions. Otherwise it should be quite straightforward to port the code since it doesn't do anything complicated.
lutusp:
No, Tukey did not "invent" the FFT. He rediscovered it, as did a number of others over the years since -- who else? -- Gauss first created it.
http://en.wikipedia.org/wiki/Fast_Fourier_transform
mturmon:
http://www.cis.rit.edu/class/simg716/Gauss_History_FFT.pdf
nipunn1313:
Pipe sends stdout to stdin of the next process. cat sends stdin back to stdout. Piping to cat is rarely eventful (unless you use a flag like cat -n).
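A quick way to convince yourself that a bare cat is invisible in a pipeline, while a flagged one is not:

```shell
# Plain cat copies stdin to stdout unchanged, so inserting it changes
# nothing:
seq 3 | cat | head -2      # same output as: seq 3 | head -2

# With a flag it stops being a no-op; -n prefixes each line with its
# line number:
seq 3 | cat -n
```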