The article says this particular instance of the debate started in 2011. Things have shifted a little since then, and I think Python has won more mindshare with Pandas, SciPy, NumPy, and all the rest. I've used both Python and R, and think the next debate will be between those two, as people find that R is not a very good programming language and lacks decent libraries for things like web scraping.
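For what it's worth, even Python's standard library covers simple scraping tasks without any third-party packages. A minimal, illustrative sketch (the HTML snippet is invented):

```python
# Extract all link targets from an HTML snippet using only the stdlib.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<p><a href="/r">R</a> vs <a href="/python">Python</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/r', '/python']
```

For real scraping you'd usually reach for requests plus BeautifulSoup or lxml, but the point stands: the pieces are there out of the box.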
Python can be a single tool that integrates with every part of your workflow. R right now still wins on the number of algorithms implemented in it (there are statistical methods available in R but not in Python), and R has terser syntax, which some people like for interactive use. But for really Big Data, terse syntax and an endless variety of esoteric algorithms are not as important as, say, robust error handling and debugging (a weak area in R, but a strong one in Python).
Can you please list some important tools implemented in R but not in Python? I'd like to know how big the gap is.
What are you talking about? R has some of
the most robust debugging tools I have encountered. You should look into the functions: browser, traceback, recover, trace, dump.frames, debug, debugonce, `environment<-`. What other language allows you to insert breakpoints (or arbitrary code) in the middle of a function's body without even having to re-paste the source code into the console?
If you're unsatisfied with R's debugging tools, you are probably not familiar with them: http://adv-r.had.co.nz/Exceptions-Debugging.html
Additionally, weak scraping support is a myth: http://www.theswarmlab.com/r-vs-python-round-2-22/
Disagree. I've had an R server running in production for real-time analytics under heavy load for the last six months, and I have had exactly zero problems. Literally no downtime.
> The article says this particular instance of the debate started in 2011. Things have shifted a little since then, and I think Python has won more mindshare with Pandas, SciPy, NumPy, and all the rest.
In my field this isn't at all true - R is eroding SAS's market share, but Python is a non-entity.
I use both Python (pandas) and base SAS at work in UK government.
I have lots of experience in SAS, and enjoy using it. The macro language allows for very succinct solutions to difficult data manipulation problems.
However, given SAS's huge expense it's difficult for me to identify any 'killer' areas where it's significantly better than open source tools. Indeed, I find pandas faster and easier to use for many problems.
I find it hugely frustrating that the government pays so much money for SAS licences and training when most people use it for simple use cases, where they would be better picking up transferable skills (e.g. Python, SQL, R).
My understanding is that SAS is supposed to be good at processing very large datasets because it uses RAM efficiently (only the PDV is stored in RAM). But in reality, a small minority of users are processing datasets that are too big for RAM (e.g. 16 GB+), and there are probably better tools for the job in this use case.
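On the Python side, the usual answer when a file is bigger than RAM is chunked processing. A hedged sketch with pandas (the in-memory CSV stands in for a file too big to load at once; file and column names are invented):

```python
# Out-of-core style processing: read a CSV in fixed-size chunks and
# accumulate an aggregate, so only one chunk is in memory at a time.
import io
import pandas as pd

# Stand-in for a file too large to load whole.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1000)))

total = 0
for chunk in pd.read_csv(csv_data, chunksize=100):  # 100 rows at a time
    total += chunk["value"].sum()

print(total)  # 499500
```

The same pattern works for filtering or group-wise aggregation; for genuinely huge data you'd move to a database, Dask, or Spark.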
One user here comments that SAS is like an 'improved Excel'. In fact, I find pandas much closer to Excel than SAS because (in ipython notebook at least), you get nice visual representations of your tables, and it usually isn't difficult to translate an Excel operation into a pandas one. I especially like the multi-index and pivot table based capabilities. With a background in VBA for Excel, it's also relatively easy to pick up Python.
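As an illustration of that Excel-to-pandas mapping, here is a small pivot-table sketch (the data and column names are invented):

```python
import pandas as pd

# A tiny long-format table, the shape you'd pivot in Excel.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "year":   [2013, 2014, 2013, 2014],
    "sales":  [10, 20, 30, 40],
})

# The Excel "PivotTable" operation: rows by region, columns by year.
pivot = df.pivot_table(values="sales", index="region", columns="year",
                       aggfunc="sum")
print(pivot)
```

In a notebook, `pivot` renders as a styled HTML table, which is a big part of the Excel-like feel.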
None of this is quite so obvious in SAS, which has quite an unusual data step and macro programming language. It's very powerful, but is quite unintuitive to begin with due to a complete reliance on the program data vector.
I regularly produce data sets that are too big for RAM, and not having to worry about that in SAS was a luxury.
I'd say SAS's two big "Killer Features" are the DATA step and SAS Press. I still have yet to find R or Python nearly as pleasant to work with for manipulating the data set itself when compared to SAS, and the SAS Press is excellent at putting out books detailing a given type of analysis, and how to implement it in SAS. I still turn to them for basic references even when not using SAS.
My company uses R, Shiny, and Rserve for nearly everything. R is a great programming language - if you need to quickly and efficiently develop stats-based features for medium-sized data.
R excels (get it?) at creating reproducible, fault tolerant, consistent functions that can be automated, packaged, applied to a variety of data types, and then extended later.
Our web stack is Shiny on AWS, and we call our APIs built in R (ML, images, data, etc.) from Android using Rserve.
A lot of the (programming?) criticisms of R will be 'solved' or become non-issues in the next few years. Multithreading, implicit vectorization, better memory handling, GPU functions, among other things, are all in the pipe :)
(That said, the syntax _is_ a little weird to get used to)
-----
* We're hiring for very senior positions in data science and more general R programmers. Contact me if you're interested (JasonCEC [at] Gastrograph.com)
The problem with R is that it's just not a very good programming language. It's great for interactive analysis, but dismal for building higher level abstractions. It's like the PHP or MySQL of the data analysis world. Data types get magically converted all over the place, the global namespace is just a giant playground for every module to pollute, it has something like 5 different object systems all with subtle differences. All the defaults that are set for the convenience of interactive use undermine any kind of reliable use for building on as a platform (for example, the "simplification" concept where a 1 column data frame often magically turns into a vector).
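For contrast, pandas keeps that frame-versus-vector distinction explicit in its indexing syntax rather than simplifying silently. A small illustrative sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Single brackets: a 1-D Series (the "vector").
s = df["x"]
# Double brackets: still a DataFrame, even with one column --
# nothing gets silently simplified away.
sub = df[["x"]]

print(type(s).__name__, type(sub).__name__)  # Series DataFrame
```

The R equivalent of opting out of simplification is `df[, 1, drop = FALSE]`, but the complaint above is that `drop = TRUE` is the default.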
I've forced myself to use R intensively for a couple of years now, but I must say it's still a relief every time I bail out and get back to a "real" programming language.
"""In the article, Ms. Milley said, “I think it addresses a niche market for high-end data analysts that want free, readily available code. We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”
To her credit, Ms. Milley addressed some of the critical comments head-on in a subsequent blog post."""
(Boeing uses R heavily and when you fly on their aircraft, you're flying on open source)
Since SAS is a relatively simple language, why can't someone just write a transcompiler that supports a subset of SAS and translates it to R? That way you'd have the best of both worlds (sort of).
The most difficult thing about that is how you would treat "by" statements (SAS) vs the split-apply-combine (R).
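For readers unfamiliar with the pattern, here is what split-apply-combine (the rough analogue of SAS BY-group processing) looks like in pandas; the data and column names are invented:

```python
import pandas as pd

# Processing "by group": split on a key, apply an aggregate, combine.
df = pd.DataFrame({
    "by_group": ["a", "a", "b", "b"],
    "value":    [1, 2, 3, 4],
})

# groupby does the split and combine; the sum is the apply step.
result = df.groupby("by_group")["value"].sum()
print(result.to_dict())  # {'a': 3, 'b': 7}
```

The mapping is imperfect because a SAS DATA step with a BY statement processes rows sequentially with FIRST./LAST. flags, whereas groupby hands each group to a function whole - which is exactly the translation difficulty the parent is pointing at.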
Self-plug: I sort of made a quick hack about a month ago for SAS-to-Python; I'm sure someone with more programming experience than me could produce something much better (I come from a maths background):
http://nbviewer.ipython.org/gist/chappers/8747253/stan_examp... https://github.com/chappers/Stan
SAS isn't a language. It's a suite of (semi-integrated) products featuring numerous languages. Key among these are the SAS DATA step (analogous to awk), the SAS Macro language (which shares terms but is in fact distinct), a number of other global languages, a number of domain-specific languages (TABULATE, LOGISTIC, GRAPH, etc.), and some proprietary implementations of general standards (SAS SQL).
The best model for mapping from SAS to R is not to try to support all of SAS's functionality within R, but to use multiple tools. When UNIX was first created, with the awk and S languages ('R' is iterated 'S', Splus is the proprietary extension of S), awk was seen as the data pre-processor for S. Today you'd likely use R (statistical analyses, graphics, matrix language), awk, Perl, Python, Ruby, or C (general programming / data manipulation), a database tool such as sqlite or Postgresql (both of which have their own considerable analysis capabilities), and other tools as appropriate.
Trying to do everything in a single tool is a domain/application mismatch.
BAE Systems has a product called NetReveal that includes a compiler called "DataServer" which compiles a large subset of SAS into Java. Legend has it that the original author wrote the first version of DataServer in 6 hours on a train ride to visit his mother.
As a programmer, I was constantly frustrated by it. I felt that SAS as a language was pretty restrictive, especially when your algorithm wasn't a natural fit for the dataset model that it uses. It was like trying to shove a square peg into a round hole. It wasn't uncommon for me to sneak some Java straight into the output, but that wasn't really a sustainable / maintainable way to use it.
So your suggestion is doable, but I'm not convinced it's worthwhile.
The company at http://www.teamwpc.co.uk/ produces a compiler and tools that run the SAS language. (Disclaimer: I interviewed there last year.)
I used to work for one of the largest U.S. insurance companies. They were always behind in transitioning to new technology (Excel 2003 could be found there in 2013). That being said, the entire statistical modeling team and research department made the switch to R and Python. Only a few clung to SAS, but they realized they would be forced to move to R, as any collaborative work would need to be converted to R rather than to SAS.
I believe it will be an R-versus-Python future, and SAS will not be a part of it.
I work in big insurance, too. How did you handle porting your legacy code into (R|Python)?
We had tens of thousands of lines of SAS code built up over the course of 10 years (macros, clinical programs, reporting functionality), and the higher-ups never saw switching to Python or R as feasible, especially with the Biostats team working on existing projects.
Stata is seen as a less powerful, more usable, and cheaper version of SAS; it also comes off as less flexible than R when it comes to more complex queries. Companies with large teams can afford what statisticians would see as a better tool; R is seen as the favored tool for the lone analyst with an extensive background, or the hard-science academic. Stata, on the other hand, is mainly favored by social science practitioners and some more statistics-inclined marketing people, neither of whom dwell on Hacker News much.
Stata is great as an Excel replacement and has a decent library of pre-programmed statistical models. It's GUI-based and targets applied researchers in economics etc. Unfortunately, extending any functions and defining your own models is quite a hassle unless you're familiar with Mata (quirky MATLAB-esque syntax), and R's ecosystem is much more cutting edge.
Stata is great for drag-and-drop data manipulation from what I remember - an excellent Excel replacement. Nothing beats dragging and dropping in variables and typing `reg y x1 x2` for immediate regression results.
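For comparison, that Stata one-liner is an ordinary-least-squares fit. A sketch in plain NumPy with synthetic data (statsmodels or scikit-learn would give the richer regression table Stata prints):

```python
import numpy as np

# Synthetic data with known coefficients and no noise.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([1.0, 0.0, 1.0, 0.0])
y = 2.0 * x1 + 3.0 * x2 + 1.0   # y = 1 + 2*x1 + 3*x2

# Design matrix: intercept column plus the two regressors,
# i.e. the moral equivalent of `reg y x1 x2`.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 6))  # [1. 2. 3.]
```

The trade-off the parent describes is real, though: the one-liner plus GUI is hard to beat for a quick interactive look at the data.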
Is this even a discussion? Anyone serious about analyzing data will use either R, Python (with Pandas/SciPy, etc.), or Julia. For truly immense data sets that require pipelines, you'll use tools like Spark, Hadoop, etc. - but SAS is basically a slightly improved Excel.
I'm an R person, but I used SAS in a past life, and I have to say, this is very wrong. SAS is nothing at all like an improved Excel in look, operation, or user-base. It can do much of what R can do (build forecast models, ML models, even OR models), but it just looks very different.
And the licensing model will make you pull your hair out. I ran into issues both with geography (not being allowed to use my license on a project in another country) and functionality (hey, that's a cool PROC...wait, I don't have the license to call it)
I think R and python will win the day, but it's not because SAS is anything like Excel. And there are a shit ton of "serious" data analysis people using SAS. They're just all in the enterprise. Every Fortune 500 company I've worked with used SAS except for one (who used R).
SAS is a pretty big system, and I have worked with SAS for about 10 years, and I can't see any similarity with Excel at all. Which part of the SAS system do you think resembles Excel? Here is a list of their products (http://support.sas.com/documentation/productaz/index.html)
Btw. I do think I am doing serious data analysis in a bank :-)
Being a slightly improved Excel has significant advantages as well:
It's more accessible to people who are not coders but need to do statistics on real data, i.e. almost all researchers.
Although personally, I would use R/Python (the UI is more suited to me, and it's free, and I trust the results a little more, and I can read the code if I need to)... do you actually need anything more than SAS/SPSS provides? I doubt it.