vsbuffalo's comments
vsbuffalo | 10 years ago | on: Why I don't use ggplot2
vsbuffalo | 10 years ago | on: Text Mining South Park
Treating his population as a large sample from an uncertain or noisy process and then applying frequentist statistics is not inherently wrong in the way you say it is. There may be a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point from the one you're making.
[1]: http://andrewgelman.com/2009/07/03/how_does_statis/
[2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)
vsbuffalo | 10 years ago | on: Twitter Sees 6% Increase in “Like” Activity After First Week of Hearts
vsbuffalo | 10 years ago | on: Ben Franklin and Open Heart Surgery (1974) [pdf]
[1] www.wsj.com/articles/the-myth-of-basic-science-1445613954
vsbuffalo | 10 years ago | on: Pineapple – A standalone front end to IPython for Mac
vsbuffalo | 10 years ago | on: How the Ballpoint Pen Killed Cursive
[1]: http://www.gouletpens.com/noodlers-bernanke-black-3oz-bottle...
vsbuffalo | 10 years ago | on: How the Ballpoint Pen Killed Cursive
Second, it really does make writing fun. I hated writing — my handwriting is messy, it's slow, and it's not as easy as typing. As the article argues, a good fountain pen makes it much easier — and in my experience, much more enjoyable.
Third, it doesn't need to be expensive. Get a Lamy Safari (EF), a Lamy converter, and a bottle of Noodler's ink. I also love my Faber-Castell Loom[2] (it's the smoothest pen I own), and I carry a Kaweco AL Sport[3] everywhere (it's the perfect pocket pen).
[1] http://www.epa.gov/superfund/students/clas_act/haz-ed/ff06.p... [2] http://www.gouletpens.com/faber-castell-loom-metallic-orange... [3] http://www.jetpens.com/Kaweco-AL-Sport-Fountain-Pen-Fine-Nib...
vsbuffalo | 10 years ago | on: Tufte CSS
vsbuffalo | 11 years ago | on: Top carnivores increase their kill rates as a response to human-induced fear
vsbuffalo | 11 years ago | on: Decomposing the Human Palate with Matrix Factorization
vsbuffalo | 11 years ago | on: Image Kernels Explained Visually
vsbuffalo | 11 years ago | on: How many genetic ancestors do you have?
vsbuffalo | 11 years ago | on: Emacs as the Ultimate LaTeX Editor
In general, if you want flawless R support in certain blocks of text (as in a .Rnw file) in between LaTeX blocks that are fully connected to AUCTeX, well... you're out of luck. And Vim... Vim-R-Plugin is useful, but it's sort of a painful hack to use tmux just to get R and Vim to talk (and I'm saying this even though I love tmux).
Vim has YouCompleteMe, which is smooth as silk compared to Emacs's options (which are painful and poorly integrated, especially with clang). But some lower-level issue in Vim causes a constant error message whenever YouCompleteMe uses clang — bloody annoying. So overall, both editors have huge issues that would require serious overhauls or tedious bug fixing in various modes. Sure, Emacs does AUCTeX better, but until it does everything better (or Vim does everything better), both are flawed editors. Sadly, everyone thinks the best course of action is to start fresh — which usually creates a feature-poor, flawed editor on a shiny new foundation that fails to attract developers because it's feature-poor. (Apologies for ranting — jetlag.)
vsbuffalo | 11 years ago | on: From Vim to Emacs
vsbuffalo | 11 years ago | on: From Vim to Emacs
vsbuffalo | 11 years ago | on: Useful Unix commands for exploring data
For repeated queries, this isn't efficient. This is why we have sorted, indexed BAM files compressed with BGZF (and tabix, which uses the same ideas). Many queries in genomics are with respect to position; with a sorted, indexed file these are effectively O(1), versus O(n) with streaming. Streaming also involves reading and uncompressing the entire file from disk, whereas accessing entries from an indexed, sorted file involves a seek() to the correct offset and decompressing only that particular block — far more efficient. I definitely agree that streaming is great, but sorting data is an essential trick when working with large genomics data sets.
vsbuffalo | 11 years ago | on: Useful Unix commands for exploring data
I don't quite agree — say this individual needs to sort a file by two columns. Should they really load everything into memory to call Python's sorted()? With large genomics datasets that isn't possible. Reimplementing sort's on-disk merge sort would be unnecessary and treacherous.
It's easy to forget how much engineering went into these core utilities — which can be very useful when working with big files.
vsbuffalo | 11 years ago | on: Statistics: Losing Ground to CS, Losing Image Among Students
I don't think this is quite fair. R was greatly inspired by Scheme and stole a lot of good ideas from it. S4 is very similar to CLOS, and does a fairly good job. If you come from Python or Java, this system looks crazy, but being unfamiliar with a language's style does not make the language bad.
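To see the CLOS resemblance, here's a minimal S4 sketch (the class and slot names are made up for illustration): methods belong to generic functions rather than classes, and dispatch happens on argument class, just as in CLOS.

```r
# Define an S4 class with typed slots
setClass("Interval", representation(start = "numeric", end = "numeric"))

# A generic function; methods are attached to it, not to the class
setGeneric("ivWidth", function(x) standardGeneric("ivWidth"))
setMethod("ivWidth", "Interval", function(x) x@end - x@start)

iv <- new("Interval", start = 5, end = 12)
ivWidth(iv)  # 7
```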
> To this day I don't have a good mental picture of genomicRange (a foundational bioinformatics package in Bioconductor for NGS data analysis) data structure and how to manipulate it (one main data structure, I believe).
It's a dataframe attached to ranges. This design extends lower-level classes like IRanges, and all the accessors, setters, etc. are consistent. You can manipulate the range part with integer-range operations, and manipulate the dataframe part as a dataframe. I would check out the IRanges vignette — it's very helpful.
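Roughly, with made-up coordinates (assuming the GenomicRanges package is installed from Bioconductor):

```r
library(GenomicRanges)

# The "dataframe attached to ranges": range part plus metadata columns
gr <- GRanges(seqnames = "chr1",
              ranges   = IRanges(start = c(100, 200), end = c(150, 260)),
              score    = c(5L, 9L))

# The range part: integer-range operations inherited from IRanges
width(gr)         # 51 61
shift(gr, 10)     # every interval slid 10 bp

# The dataframe part: metadata columns via mcols()
mcols(gr)$score   # 5 9
gr[gr$score > 6]  # filter like dataframe rows
```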
> R, more than any other language I know, is an implementation-defined language.
This is the most severe problem, for sure. I'm crossing my fingers that someone forks R's implementation and does it right (which would break a lot of packages and code).
vsbuffalo | 11 years ago | on: Why cans of soup are shaped the way they are
vsbuffalo | 11 years ago | on: Best of Vim Tips
I like this article, because I think the author has a point for production graphics. If you're placing lines, points, and labels on a screen, you can create anything — you can draw polygons and arcs. It's like drawing with raw SVG. But I'd be hard pressed to think of an exploratory data analysis situation where I wouldn't reach for ggplot2 first. Since it looks at dataframe column types (integers, factors, numerics), it automatically matches these to the appropriate type of color gradient. Coloring a scatter plot by a potential confounder is one additional argument to aes(), e.g. aes(x, y, color=other_col). More than once during EDA I've done this and seen some horrifying pattern in data that shouldn't be there. That's a powerful tool for one extra function argument — the cost of checking for a confounder with color (or shape) is essentially zero.
I'd make the case that this is a more costly operation in base graphics, and thus much less likely to be done. You may already have your plots in a for loop to create panels, plus a few extra lines for adjusting margins and axes (rather than facet_wrap(~col)) — there's already a lot of cruft when you just need a quick inspection. Then you need to create a vector of colors of the appropriate size and map it onto the data. Sure, it's easy-ish, but it takes at least double the time of color=some_col. In EDA visualization, I want every barrier to checking a confounder to be as small as possible — which is what ggplot2 gives you.
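To make the comparison concrete, here's a rough sketch — df, batch, and treatment are hypothetical column names, not from the article:

```r
library(ggplot2)

# ggplot2: color by a potential confounder and facet, in one expression
ggplot(df, aes(x, y, color = batch)) +
  geom_point() +
  facet_wrap(~ treatment)

# Base graphics: the same inspection takes manual bookkeeping
pal <- rainbow(nlevels(df$batch))
par(mfrow = c(1, nlevels(df$treatment)), mar = c(4, 4, 2, 1))
for (tr in levels(df$treatment)) {
  sub <- df[df$treatment == tr, ]
  plot(sub$x, sub$y, col = pal[as.integer(sub$batch)], main = tr)
}
```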
That said, I really liked this article, because I do agree that going from EDA visualization to production is a hassle. Just after reading it, I remade some production ggplots with base graphics and love the simple aesthetic — which takes a lot of hassle to mirror in ggplot.
What I really long for is a lower-level data-to-visualization mapping in R, like d3's. d3 is a pain to learn, but it's the only data abstraction I know of (even though it's a low-level one) that is seemingly limitless in what it can do. I keep hoping for a general data-join grammar like d3's to become the norm, built on top of base plotting (analogous to SVG elements), with abstractions like ggplot for tabular data built on top of that.