vsbuffalo's comments

vsbuffalo | 10 years ago | on: Why I don't use ggplot2

I really like all the plotting systems in R. First, I used base graphics for a few years—and loved it. You learn your way around par() and commit esoteric argument names to memory (oma, mar, mgp, mfrow, etc.). It feels powerful — you're just drawing on a screen; its history traces back to the original pen plotters. Second, I learned lattice. You can't help but fall in love with lattice after a year or two of creating panel plots in base graphics. The biggest learning curve with lattice is panel functions, but once you learn to throw a browser() into a panel function to inspect its stack variables, you can do anything. Somewhere on a dusty bookshelf is a well-worn lattice book I splurged on while taking an R course at UCD.

I like this article, because I think for production graphics the author has a point. If you're placing lines, points, and labels on a screen — you can create anything. You can draw polygons and arcs. It's like drawing with raw SVG. But I'd have a hard time thinking of an exploratory data analysis situation in which I wouldn't reach for ggplot2 first. Since it looks at dataframe column types (integers, factors, numerics), it automatically matches these to the appropriate type of color gradient. Coloring a scatter plot by a potential confounder is one additional argument to aes(), e.g. aes(x, y, color=other_col). More than once during EDA I've done this and seen some horrifying pattern in the data that shouldn't be there. That's a powerful tool for one extra function argument — the cost of checking for a confounder with color (or shape) is essentially zero.
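To make that concrete, here's a minimal sketch of the one-argument check (the dataframe and its `other_col` column are made up for illustration):

```r
library(ggplot2)

# toy data; other_col stands in for whatever confounder you suspect
df <- data.frame(x = rnorm(100), y = rnorm(100),
                 other_col = factor(sample(c("a", "b"), 100, replace = TRUE)))

ggplot(df, aes(x, y)) + geom_point()                      # plain scatter plot
ggplot(df, aes(x, y, color = other_col)) + geom_point()   # one extra argument to color by the confounder
```

Because `other_col` is a factor, ggplot2 picks a discrete palette and adds a legend automatically; a numeric column would get a continuous gradient instead.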

I'd make the case that this is a more costly operation in base graphics, and is thus much less likely to be done. You may already have your plots in a for loop to create panels, plus a few extra lines for adjusting margins and axes (rather than a single facet_wrap(~col)). It took a lot of code to set that up — there's already a lot of cruft when you just need a quick inspection. Then you need to create a vector of colors of the appropriate size, and map it to the data. Sure, it's easy-ish, but it takes at least double the time of color=some_col. In EDA visualization, I want every single barrier to checking a confounder to be as small as possible—which is what ggplot2 does.
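For comparison, the base-graphics version of the same check looks roughly like this (a sketch with made-up data; the palette and column names are arbitrary):

```r
# same toy data as before; other_col is the suspected confounder
df <- data.frame(x = rnorm(100), y = rnorm(100),
                 other_col = factor(sample(c("a", "b"), 100, replace = TRUE)))

# build the color mapping by hand: one color per factor level...
palette_cols <- c(a = "tomato", b = "steelblue")
point_cols <- palette_cols[as.character(df$other_col)]

# ...then pass the vector to plot() and add the legend yourself
plot(df$x, df$y, col = point_cols, pch = 19)
legend("topright", legend = names(palette_cols),
       col = palette_cols, pch = 19)
```

It's only a handful of extra lines, but it's exactly the kind of friction that makes a quick confounder check less likely during EDA.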

That said, I really liked this article because I do agree that going from EDA visualization to production is a hassle. Just after reading it, I remade some production ggplots with base graphics and love the simple aesthetic — mirroring it in ggplot takes a lot of work.

What I really long for is a lower-level data-to-visualization mapping (like d3) in R. d3 is a pain to learn, but it's really the only data abstraction (even though it is a low-level one) that is seemingly limitless in what it can do. I keep hoping for a general data-join grammar like d3's to become the norm, built on top of base plotting (analogously: SVG elements), with abstractions like ggplot for tabular data built on top of that.

vsbuffalo | 10 years ago | on: Text Mining South Park

You're treating this sample-is-the-population issue as if it's resolved in the statistics literature. It is not. Gelman has written on this [1][2], as the issue comes up frequently in political science data. As Gelman points out, the 50 states are not a sample of states—they're the entire population. Similarly, the Correlates of War [3] data is every militarized international dispute between 1816 and 2007 that fits certain criteria—it too is not a sample but the entire population.

Treating this population as a large sample from a process that's uncertain or noisy and then applying frequentist statistics is not inherently wrong in the way you say it is. It may be that there's a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point from the one you make.

[1]: http://andrewgelman.com/2009/07/03/how_does_statis/

[2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)

vsbuffalo | 10 years ago | on: Twitter Sees 6% Increase in “Like” Activity After First Week of Hearts

This isn't the most useful statistical figure. The heart feature is novel — the increase could be due entirely to folks trying it out. A more meaningful figure would come from looking at the folks with consistent favoriting habits (with stars, that is) and seeing how their behavior changed. Personally, I'm more reluctant to heart tweets, as my Twitter account is mostly professional and it feels a bit unprofessional to "heart" a colleague's tweet.

vsbuffalo | 10 years ago | on: Ben Franklin and Open Heart Surgery (1974) [pdf]

I shared this in light of the recent article by Matt Ridley, "The Myth of Basic Science"[1], which I find less than honest and filled with anecdotal examples that fit his particular narrative.

[1] www.wsj.com/articles/the-myth-of-basic-science-1445613954

vsbuffalo | 10 years ago | on: Pineapple – A standalone front end to IPython for Mac

Completely agree. I can't understand how professional developers are able to give up their text editors for IPython. Don't get me wrong, I love IPython/Jupyter, but the slick interface comes with a huge productivity drop due to the lack of a real text editor.

vsbuffalo | 10 years ago | on: How the Ballpoint Pen Killed Cursive

I love fountain pens and highly encourage everyone to try them (a good notebook helps, too). First, ballpoint pens are wasteful — 1.6 billion pens are thrown away a year[1]. Fountain pens are reusable, ink is comparatively cheap and lasts forever, and finding your ink is a fun and personal experience (I really like the "bulletproof" Noodler's inks, which are waterproof, bleach-proof, etc.). Fountain pens last forever — which is why folks still hunt around for 40+ year old used ones.

Second, it really does make writing fun. I used to hate writing — my handwriting is messy, it's slow, and it's not as easy as typing. As the article argues, a good fountain pen makes it much easier, and in my experience much more enjoyable.

Third, it doesn't need to be expensive. Get a Lamy Safari (EF), a Lamy converter, and a bottle of Noodler's ink. I also love my Faber-Castell Loom[2] (it's the smoothest pen I own), and I carry a Kaweco AL Sport[3] everywhere (it's the perfect pocket pen).

[1] http://www.epa.gov/superfund/students/clas_act/haz-ed/ff06.p...

[2] http://www.gouletpens.com/faber-castell-loom-metallic-orange...

[3] http://www.jetpens.com/Kaweco-AL-Sport-Fountain-Pen-Fine-Nib...

vsbuffalo | 10 years ago | on: Tufte CSS

Well done. Though shouldn't the sidenotes use HTML5's <aside></aside> element? And how is the vertical alignment between a sidenote callout and its sidenote handled?

vsbuffalo | 11 years ago | on: Image Kernels Explained Visually

Really awesome stuff! I think there's a very minor bug: missing pixels outside the image are treated as black, which is what adds the black border around the output image in the second example.
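A minimal sketch of that edge effect, assuming the kernel zero-pads the image (i.e. treats missing pixels as black), so a blur kernel darkens the border even of a uniformly white image:

```r
blur3 <- matrix(1/9, nrow = 3, ncol = 3)  # 3x3 box blur kernel

convolve2d <- function(img, kernel) {
  pad <- (nrow(kernel) - 1) / 2
  # zero-pad the image: pixels outside the image become 0 (black)
  padded <- matrix(0, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
  padded[(pad + 1):(pad + nrow(img)), (pad + 1):(pad + ncol(img))] <- img
  out <- matrix(0, nrow(img), ncol(img))
  for (i in seq_len(nrow(img))) {
    for (j in seq_len(ncol(img))) {
      out[i, j] <- sum(padded[i:(i + 2 * pad), j:(j + 2 * pad)] * kernel)
    }
  }
  out
}

img <- matrix(1, 5, 5)          # a uniformly white 5x5 image
out <- convolve2d(img, blur3)
out[1, 1]  # 4/9: corner pixel darkened by the implicit black border
out[3, 3]  # 1:   interior pixels are unchanged
```

Treating out-of-bounds pixels as copies of the nearest edge pixel (or mirroring) instead of zero would remove the dark border.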

vsbuffalo | 11 years ago | on: Emacs as the Ultimate LaTeX Editor

I hate the best-editor debate because I think it distracts from what's more important — what both editors can learn from each other and what both need to do to improve. I used Emacs for years, switched to Vim because of RSI, then recently switched back to Emacs+evil. Frankly, for what I do most (R, R+knitr, C++ with clang autocomplete), no single editor is great. First, there's too little ability to switch between modes within a single buffer in both Vim and Emacs. The feature is entirely lacking in Vim AFAIK, and polymode in R uses a high-level hack that (1) doesn't play well with other modes (including evil) and (2) has so thoroughly destroyed my documents in the past that I refuse to use it now (mostly because it uses many buffers behind the scenes, which destroys undo history).

In general, if you want flawless R support in certain blocks of text (as in a .Rnw file) in between LaTeX blocks fully connected to AUCTeX, well... you're out of luck. And Vim... Vim-R-plugin is useful, but it's sort of a painful hack to use tmux just to get R and Vim to talk (and I'm saying this even though I love tmux).

Vim has YouCompleteMe, which is smooth as silk compared to Emacs's options (which are painful and poorly integrated, especially with clang). But some lower-level issue in Vim causes a constant error message whenever YouCompleteMe uses clang — bloody annoying. So overall, both editors have huge issues that would require serious overhauls or tedious bug fixing in various modes. Sure, Emacs does AUCTeX better, but until it does everything better (or Vim does everything better) it's a flawed editor. Both are flawed editors. But sadly everyone thinks the best course of action is to start fresh — which usually creates a feature-poor, flawed editor on a shiny new foundation that then fails to attract developers because it's feature-poor. (Apologies for ranting — jetlag.)

vsbuffalo | 11 years ago | on: From Vim to Emacs

I thought this was the case too, but YouCompleteMe is really terrific for Vim! I wish Emacs had something similar.

vsbuffalo | 11 years ago | on: Useful Unix commands for exploring data

> If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest.

For repeated queries, this isn't efficient. This is why we have indexed, sorted BAM files compressed with BGZF (and tabix, which uses the same ideas). Many queries in genomics are with respect to position, and these are O(1) with a sorted, indexed file and O(n) with streaming. Streaming also involves reading and uncompressing the entire file from disk — accessing entries from an indexed, sorted file involves a seek() to the correct offset and decompressing only that particular block, which is far more efficient. I definitely agree that streaming is great, but sorting data is an essential trick for working with large genomics data sets.
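As a sketch of what an indexed region query looks like from R (via Bioconductor's Rsamtools; the filename and region here are hypothetical, and the BAM must be coordinate-sorted and indexed):

```r
library(Rsamtools)  # Bioconductor; also loads GenomicRanges

# restrict the query to one genomic interval; the .bai index lets
# scanBam() seek to just the BGZF blocks overlapping this region
param <- ScanBamParam(which = GRanges("chr1", IRanges(1e6, 2e6)))
hits <- scanBam("reads.bam", param = param)  # "reads.bam" is a placeholder
```

Only the blocks covering chr1:1,000,000-2,000,000 are decompressed, rather than the whole file — which is the entire point of sorting and indexing.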

vsbuffalo | 11 years ago | on: Useful Unix commands for exploring data

> Hate to sound like Steve-Jobs here, but: "You're using it wrong."

I don't quite agree — say this individual needs to sort a file by two columns. Should they really load everything into memory to call Python's sorted()? With large genomics datasets this isn't possible. Trying to reimplement sort's on-disk merge sort would be unnecessary and treacherous.

It's easy to forget how much engineering went into these core utilities — which can be very useful when working with big files.

vsbuffalo | 11 years ago | on: Statistics: Losing Ground to CS, Losing Image Among Students

> Beyond functions there is nothing good. There are three object-oriented programming systems, none of which are simple or straightforward.

I don't think this is quite fair. R was greatly inspired by Scheme and stole a lot of good ideas from it. S4 is very similar to CLOS, and does a fairly good job. If you come from Python or Java, this system looks crazy, but being unfamiliar with a language's style does not make the language bad.

> To this day I don't have a good mental picture of genomicRange (a foundational bioinformatics package in Bioconductor for NGS data analysis) data structure and how to manipulate it (one main data structure, I believe).

It's a dataframe attached to ranges. The design extends lower-level classes like IRanges, and all the accessors, setters, etc. are consistent. You can manipulate the range part with integer range operations and manipulate the dataframe part as a dataframe. I would check out the IRanges vignette — it's very helpful.
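The mental picture fits in a few lines (a small sketch using GenomicRanges; the coordinates and the `score` column are made up):

```r
library(GenomicRanges)  # Bioconductor

# a GRanges is ranges plus a dataframe of metadata columns
gr <- GRanges(seqnames = "chr1",
              ranges   = IRanges(start = c(100, 200), width = 50),
              score    = c(0.9, 0.4))   # metadata ("dataframe") column

start(gr)        # the range part: c(100, 200)
shift(gr, 10)    # integer range operations on the range part
mcols(gr)$score  # the dataframe part: c(0.9, 0.4)
```

Everything else (findOverlaps, subsetting, etc.) follows from treating those two parts consistently.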

> R, more than any other language I know, is an implementation-defined language.

This is the most severe problem for sure. I'm crossing my fingers that someone forks R's implementation and does it right (which would break a lot of packages and code).

vsbuffalo | 11 years ago | on: Best of Vim Tips

I was a long-time Emacs user, and switched to Vim due to typing pain. Even with evil, I can't see moving back. Vim + tmux + YouCompleteMe > Emacs. There's just no comparison. No Emacs shell will ever be as good as zsh, and there's always some latency, for some reason, when using shells running inside Emacs. And there's no YouCompleteMe.