The code that makes him say "what a mess" is, I think, beautiful:

    from itertools import groupby
    from operator import itemgetter

    def summary(data, key=itemgetter(0), value=itemgetter(1)):
        for k, group in groupby(data, key):
            yield (k, sum(value(row) for row in group))
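For context, here is how that recipe behaves on a small dataset. The sample rows are invented, and note one caveat: groupby only groups consecutive rows, so the data must already be sorted by the key.

```python
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1)):
    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group))

# Hypothetical sales data, already sorted by region.
rows = [("East", 100), ("East", 50), ("West", 75)]
print(list(summary(rows)))  # [('East', 150), ('West', 75)]
```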
Perhaps that's because I'm a programmer, and Python is a general purpose programming language. But I think that's what his complaint boils down to: the Python statistical code looks too much like Python. Which, yeah, it does. Python is a general purpose programming language, not a domain specific language for statistical programming.
However, I don't think the programming concepts one needs to understand to make effective use of a well-designed Python library are too much to ask. I've only dabbled in R, but when I did, it required me to exercise my general programming knowledge to understand lists, matrices, and functions. I think the author is also falling into the trap of assuming that what is obvious to him is obvious to everyone. I'm actually not sure what the SAS code is doing, and much prefer the Python.
> Perhaps that's because I'm a programmer, and Python is a general purpose programming language.
Exactly. You shouldn't have to be a programmer to do statistics. Just like you shouldn't have to be a network engineer to share files. What if DropBox had stuff in there about http, ports, levels of service, bandwidth, etc.? You'd probably say, "Great! I always wanted to specify that DropBox use SSL4.7 draft B over CDMA EvoX.1 -- who wouldn't?"
When you're building a DSL, make it as simple as possible. And if you have time, in v2, give it hooks to break out and do crazy stuff... but the 90% case should be simple as pi.
I also prefer this python code to the SAS example listed. I have been trying to brush up on statistics over the last couple of years and I think this article points to an issue that occurred to me. Namely that when somebody says they "know statistics" it sort of has to mean that they know one of the big stats packages out there. It doesn't appear that anybody is really doing stats from first principles anymore.
It seems like there are differences in terminology between one author and another, and now with the different programming models there is a whole new level of incompatibility.
> Python statistical code looks too much like Python. Which, yeah, it does. Python is a general purpose programming language, not a domain specific language for statistical programming.
I have to agree (with your criticism). I spend most of my day in SAS and R, and my Python is limited to tweaking code from my colleagues, but I don't see how either the SAS or Python listed is better or worse than the other.
I actually like the quote in the article regarding DropBox's simplicity, but I don't get the relationship to statistical programming languages.
Picking on Python for not having simpler built-in ways to do domain-specific statistical operations seems rather silly to me.
I've been involved over the last few years with creating better data structures and tools for doing statistics in Python -- with excellent results (http://pandas.sourceforge.net and http://statsmodels.sourceforge.net). So I think the author should take a closer look at some of the libraries and tools out there.
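To give a sense of what those tools look like, the pivot-table case from the article is roughly a one-liner with pandas' groupby (the data and column names below are invented for illustration):

```python
import pandas as pd

# Hypothetical sales data with Region and Sales columns.
df = pd.DataFrame({
    "Region": ["East", "East", "West"],
    "Sales": [100, 50, 75],
})

# Total sales per region -- the "pivot table" from the article.
totals = df.groupby("Region")["Sales"].sum()
print(totals)
```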
I think his point here is that the most visible aspects of the code are the structures built up to do the computation, rather than the computation itself. As a description of a generator loop, it reads quite nicely. But the language does not give much ground to the topic it's describing, in the way (to use the obvious example) Lisp would. I think that's what he is getting at.
Yes, I like the Python also, but you have missed the point. For MBA-types, business types, and scientists the programming concepts are too much to learn. Why should they have to learn programming when their needs are simple? It is not just "keep it simple", it is "keep it simple" for non-programmers.
To be honest I find the whole repeated "no, shut up" thing to be a bit crass and it makes me unsympathetic if anything. I hope this doesn't become a catchphrase in blogs.
I thought it was perfect in the original dropbox quora post, but I agree that it doesn't quite fit here, and there is certainly a danger of it becoming a meme.
If 90% of usage boils down to a small number of rigid patterns, then there is a simple solution: a handful of convenience functions. Often these functions are missing, because the demand for convenience functions is obscured by the fact that every experienced user defined them for himself years ago. That forces newbies to suffer through the unnecessary task of understanding the fully generalized API before they can accomplish simple tasks.
Languages that have good support for optional arguments, such as Python and Lisp, also make it possible to create APIs that are elegant and concise for experts but extremely intimidating for beginners. It may be more elegant to have a single function with a slew of optional arguments, and an experienced user may be able to accomplish any task quite concisely by specifying a few arguments, but a beginner would be better served by a handful of specific functions with specific names. API writers should consider providing those functions as simple wrappers to the general API, in order to provide a simpler learning curve for users who might never need more complex functionality. Examining how those wrapper functions are implemented can help intermediate users figure out the general API, too.
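A minimal sketch of that idea, with invented function names: one general entry point for experts, plus named wrappers that cover the common cases with nothing to configure.

```python
from collections import defaultdict

def aggregate(rows, key, value, func=sum):
    """The fully generalized API: flexible, but intimidating to a newcomer."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(value(row))
    return {k: func(v) for k, v in groups.items()}

# Convenience wrappers: the "90% case" with no options to learn.
def total_by_first_column(rows):
    return aggregate(rows, key=lambda r: r[0], value=lambda r: r[1], func=sum)

def count_by_first_column(rows):
    return aggregate(rows, key=lambda r: r[0], value=lambda r: r[1], func=len)

rows = [("East", 100), ("West", 75), ("East", 50)]
print(total_by_first_column(rows))  # {'East': 150, 'West': 75}
print(count_by_first_column(rows))  # {'East': 2, 'West': 1}
```

Reading how the wrappers are implemented doubles as a gentle introduction to the general aggregate() underneath.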
Exactly. It's trivial to write a prettied-up interface to those Python functions that would make as much sense (?) as PROC MEANS. Good luck trying to extend SAS to do anything the designers didn't implement as a procedure, though. Having had to navigate through a complex SAS macro or two in my day, I can assure you it was the single worst experience I have had in 20 years of programming.
Actually in my experience business users want one thing on that list, and if they have that thing they don't care about whether you provide the other two. They won't ask for what they want because they don't know that they can get it. But they will be happy if they get it.
They want to get at nicely organized data easily from inside of Excel. Excel is a toolbox that they already know from which they can do their own pivot tables and graphs. And they'd prefer to do that because then they can just do it instead of less efficiently having someone else do it for them.
They want it to arrive nicely aggregated and organized, since Excel is not very good at that. But they are more than happy to do the pretty reports themselves. Just get them the data.
See http://bentilly.blogspot.com/2009/12/design-of-reporting-sys... for a more detailed description of one set of experiences that taught me that.
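That workflow (aggregate first, then hand over a file Excel can open) needs nothing fancy. A sketch using only the standard library; the file and column names are hypothetical:

```python
import csv
from collections import defaultdict

def summarize_for_excel(in_path, out_path):
    """Aggregate Sales by Region and write a CSV that opens cleanly in Excel."""
    totals = defaultdict(float)
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):  # expects Region, City, Sales columns
            totals[row["Region"]] += float(row["Sales"])
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Region", "TotalSales"])
        for region, total in sorted(totals.items()):
            writer.writerow([region, total])
    return dict(totals)

# Example with made-up data:
with open("data.csv", "w", newline="") as f:
    f.write("Region,City,Sales\nEast,Boston,100\nEast,NYC,50\nWest,LA,75\n")
totals = summarize_for_excel("data.csv", "summary.csv")
print(totals)  # {'East': 150.0, 'West': 75.0}
```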
What Dropbox does is eliminate the programmer from the equation. You don't need your IT staff to do any special setup or to follow a special process.
I think the truth is business users don't want to work with you - they just want their data relationships to be discovered in a simple and intuitive way. If your data crawling is sufficiently good, maybe you can do that.
I understand the point the article is making, but I feel rather strongly that "(y)ou’ll want a third thing – to read in and parse data" translates to this: the person who builds a tool that does that nicely and automatically creates pivot tables, graphs, and other dashboard-y things will probably have to hire a team to shovel the money off so he/she can breathe.
I suppose it's an OK example for talking about statistical programming languages, but in my own experience the three requirements that preface the whole discussion (pivot tables, graphs, data parsing) are a big example of something just screaming for a new solution, not a new programming language...
import numpy as np
import tabular
# CSV with Region, City and Sales columns
data = tabular.tabarray(SVfile = 'data.csv')
# Calculate the total sales within each region
summary = data.aggregate(On = ['Region'], AggFuncDict = {'Sales':np.sum}, AggFunc = len)
summary.saveSV('summary.csv')
I can't speak for the creators of R, but I have a strong suspicion that it wasn't intended for erehweb or MBAs. It's not Microsoft Excel, it's R. It's used by PhD researchers in Mathematics, Statistics, Economics and Political Science.
Maybe I'm just a simpleton, but it seems odd to attack something designed to make statistical analysis easy for statisticians because it doesn't meet the needs of mid-level managers.
Managers and other folks who need to make pivot tables, graphs and related things without programming have a great tool to do that: Excel. For these people, Excel is Dropbox.
> People don’t use that crap.
> But they do want pivot tables, ...
That's what the cookbook recipe provides, a function called summary() that makes a pivot table. Problem solved :-)
> I should be clear that my complaint is with Python rather than the code as such.
There are plenty of ways to write the summary() function with plain, straight-forward Python code that doesn't use generators, itertools, or any other advanced feature.
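For instance, here is one plain version, sketched with an ordinary dict (and as a bonus, it does not require the input to be pre-sorted):

```python
def summary(data):
    """Same pivot-table result, no generators or itertools."""
    totals = {}
    for row in data:
        k, v = row[0], row[1]
        totals[k] = totals.get(k, 0) + v
    return list(totals.items())

print(summary([("East", 100), ("East", 50), ("West", 75)]))
# [('East', 150), ('West', 75)]
```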
So, why does the recipe author use itertools? It is because they provide a way to get C speed without having to write extension modules. Had the author used map() instead of a generator expression, the inner-loop would run entirely at C speed (with no trips around the Python eval-loop):
    for k, group in groupby(data, key):
        yield k, sum(map(value, group))
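Both spellings produce identical results; a quick check with invented data:

```python
from itertools import groupby
from operator import itemgetter

data = [("East", 100), ("East", 50), ("West", 75)]  # already sorted by key
key, value = itemgetter(0), itemgetter(1)

# Generator-expression version vs. map() version: same output,
# but map() keeps the inner loop entirely in C.
genexp = [(k, sum(value(row) for row in g)) for k, g in groupby(data, key)]
mapped = [(k, sum(map(value, g))) for k, g in groupby(data, key)]
print(genexp == mapped, genexp)  # True [('East', 150), ('West', 75)]
```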
I think it's wonderful that a two-line helper function is all it takes to implement pivot tables efficiently.
I use those every day.
There are quite a few really nice tools out there for this kind of data analysis (generally called Business Intelligence, or BI for short). None of them that I have found are procedural, they are all based on interactive dashboards. I'm sure that there is some scripting or XML formatting required behind the scenes to get the system set up to accept data, but after that it's all point-and-click.
The systems that I've seen/evaluated are Needlebase, Birst and Spotfire. None of them are particularly cheap, but if you're in a business where real-time access to data would help your team make better decisions, they could be very valuable.
For the record, the Lisp dialect that is mentioned in the comments is http://lush.sourceforge.net/
With respect to the discussion, it is on the R/Python side, i.e. a powerful general-purpose (Lisp) language with built-in statistics facilities.
Anytime I hear a discussion about designing computational tools for "non-programmers" I'm reminded of the subway in Mexico City, Mexico. The subway stops have nicely detailed pictures that are descriptive of the locations around the stops. This is because many people are illiterate. It's about time people realized that programming is literacy.
Also, leave Python alone!
Kirix Strata looks like the Dropbox equivalent for this space (http://www.kirix.com/). All it does is suck in data -> create relationships -> pivot, graph and report. The people I know who use it swear by it because it fills one small gap instead of many.
    # Group by Species (INDICES can be multivalued; see ?by):
    # sum of Sepal.Length and Sepal.Width, mean of Petal.Length and Petal.Width.
    by(data = iris, INDICES = iris$Species, FUN = function(x) {
        y <- colSums(x[, 1:2])
        z <- colMeans(x[, 3:4])
        list(y, z)
    })
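The same grouped multi-aggregation (sum some columns, average the others) is a short exercise in plain Python; the rows below are invented rather than taken from the iris data:

```python
from collections import defaultdict

# Hypothetical rows: (species, sepal_len, sepal_wid, petal_len, petal_wid)
rows = [
    ("setosa", 5, 3, 1, 0),
    ("setosa", 5, 3, 2, 0),
    ("virginica", 6, 3, 6, 2),
]

groups = defaultdict(list)
for r in rows:
    groups[r[0]].append(r[1:])

result = {}
for species, vals in groups.items():
    cols = list(zip(*vals))  # transpose rows into columns
    result[species] = (
        [sum(c) for c in cols[:2]],           # sum of the first two columns
        [sum(c) / len(c) for c in cols[2:]],  # mean of the last two columns
    )
print(result["setosa"])  # ([10, 6], [1.5, 0.0])
```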
I've been using http://tablib.org/ for a while now to read in tabular data. With this summary function and a few other functions to simplify the process of aggregating data into useful views, I think you've got a winning solution to the author's complaint.
"Dropbox uses Python on the client-side and server side as well. This talk will give an overview of the first two years of Dropbox, the team formation, our early guiding principles and philosophies, what worked for us and what we learned while building the company and engineering infrastructure. It will also cover why Python was essential to the success of the project and the rough edges we had to overcome to make it our long term programming environment and runtime."
http://us.pycon.org/2011/blog/2011/02/07/pycon-2011-announci...