Stan: a probabilistic programming language

[+] probdist|10 years ago|reply

Stan is best viewed in my mind as a successor to BUGS (Bayesian Inference Using Gibbs Sampling) which more people may have heard of. In fact, there are lots of players in the probabilistic programming space now, personally I like the model of "Infer.NET" [1] from Microsoft Research, as I find variational and approximate variational inference a good solution to my problems and I like coding models in nearly C# more than in Stan format.

If you want to get started as fast as possible with computational Bayesian inference and don't need the performance and advanced features Stan, I'd recommend emcee [2] which is a lightweight python MCMC sampler.

Final recommendation to the HN crowd: Bayesian Data Analysis 3rd Edition [3] should be on your desk if you are thinking about using any of this software.

[1] http://research.microsoft.com/en-us/um/cambridge/projects/in...]

[2] http://dan.iel.fm/emcee/current/

[3] https://www.crcpress.com/Bayesian-Data-Analysis-Third-Editio...

[+] dustintran|10 years ago|reply

Hi, stan dev here. I think viewing Stan as a better BUGS is helpful but limiting.

The syntax is similar, but the class of models Stan fits is far more general. The class of algorithms we have available also goes beyond MCMC, e.g., variational inference, optimization, and interfaces to Stan exist on all primary programming languages. It's more helpful to think of Stan as its own probabilistic programming language, and arguably the biggest entity with the largest user base.

[+] kristjankalm|10 years ago|reply

some useful books available as a PDF:

David Barber "Bayesian Reasoning and Machine Learning" http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=... -- Thorough and systematic, accompanied by a Matlab library

Marc Steyvers "Computational Statistics with Matlab" http://psiexp.ss.uci.edu/research/teachingP205C/205C.pdf -- Somewhat unfinished but very good hands-on guide for a total novice

David Draper "Bayesian Modeling, Inference and Prediction" https://users.soe.ucsc.edu/~draper/draper-BMIP-dec2005.pdf -- Quite good if you're already familiar with a field, might be slightly daunting for a beginner

and the classic, David MacKay's "Information Theory, Inference, and Learning Algorithms" http://www.inference.phy.cam.ac.uk/itprnn/book.html

[+] dewarrn1|10 years ago|reply

Another endorsement for Gelman's books — they're some of the best instructional statistics work that I've read.

[+] chmullig|10 years ago|reply

Stan is very much a successor to BUGS, although I'm not sure BUGS is better known outside certain social science research communities these days.

[+] tel|10 years ago|reply

+1 on all of this (except I'm a fan of full MCMC, have you used ADVI?)

[+] dj-wonk|10 years ago|reply

One of my first questions was "What are the motivations for a language over a library?" After reading more, I see that Stan has both. There are integrations with R, Python, Julia, C++, Stata, the command line, and more.

To ask an intentionally over-simplified question: if the key value add of Stan in the functionality and approach -- why bother with the overhead of an external DSL? Why not use an internal DSL -- e.g. borrow data structures and/or syntax of another language?

Arguments for an external DSL are, hopefully: cross-language portability and domain-clarity for people.

But how does this play out if you want to support multiple languages and interop?

Let's say I'm using Python, for example, and want to build up a model programmatically. The example at https://github.com/stan-dev/pystan shows using a heredoc. That's simple and easy to start.

However, a big downside of a plain-text DSL is that it can be hard to manipulate programatically. To do that, each language implementation has to be 'business' of converting to and from text. This reminds me of the 'SQL' problem ... many language SQL wrappers spend considerable time munging text. Ug.

Given my bias towards data-oriented programming, I see another approach: why not use a rich data structure language (other than plain text) as the foundation?

Why not edn, for example? If the LISPy nature of that is not preferred, then why not something else? (Does the C-community have some kind of generic data description language? I wouldn't recommend JSON necessarily but that would be better than a hand-rolled DSL, in my opinion.)

Here's why I recommend starting with 'rich' data as opposed to parsing text...

I'm not saying human DSLs are bad. I'm just saying they don't have to be the foundation. If you want a human external DSL, great -- but why not convert the human DSL to a canonical data structure? Then build everything around the data, rather than build everything around the text format?

[+] chmullig|10 years ago|reply

In my mind, there's two points to Stan. The first is the actual math -- it can fit lots of models, much faster and more flexibly than other alternatives.

The second is to go out and meet the practitioners where they are. That means making Stan models look like the equations in your paper. It means connecting to existing research software ecosystems. It means making it easy to adapt. In this case, if you end up tightly coupled to, say, specific R data structure then it becomes much, much harder to make it also work well for Python, and for Stata, etc. Additionally, it means that if I start on a model in Python then I cannot easily switch to using R (or vice-versa).

Using a DSL also has the nice property of cleanly separating Stan world vs R world. That makes it easier for users to build an accurate mental model for what happens when/where, which hopefully then leads to better performance.

Most importantly though, Stan code is incredibly concise and clear for specifying models.

[+] bengHN|10 years ago|reply

As you mentioned, there are pros and cons to inventing the Stan DSL. Originally, we wanted a language that was not too different from the BUGS language because the BUGS family was what most applied Bayesians were using (if they weren't using some specialized MCMC algorithm for their problem).

And programmatic manipulation (while not impossible) of a Stan program was not high on our priority list. Since the underlying language is C++, you can bypass the Stan language and write (or generate) the C++ directly with a lot of effort. It uses standard (including Eigen) data structures.

[+] chmullig|10 years ago|reply

Stan is a fantastic language/library/tool. (Disclosure, I've taken 2 classes with Andrew Gelman[1].

For those who haven't used it, the typical usage is to use Stan in combination with other languages (most commonly R). In the Stan language you define a model, specifying the data you'll get, potentially transformations to that data, then a set of parameters you want to fit, then finally a model saying how those parameters interact with each other and the data. You can also define priors for parameters. Then typically you save that model, and using R pass in the variables to a Stan call. The resulting fit object is returned to your original environment automatically.

It actually fits the work style reasonably well. All data munging happens as before, but instead of having some complicated model expression in, say, a glm() call, you have it in Stan language.

If you're interested in this, Gelman has two great books. Gelman & Hill's Hierarchical Models[2] which is applied and geared towards social science researchers, and Bayesian Data Analysis[3].

[1] http://andrewgelman.com/

[2] http://www.stat.columbia.edu/~gelman/arm/

[3] http://www.stat.columbia.edu/~gelman/book/

[+] bengHN|10 years ago|reply

Also, the rstanarm[1] R package (disclaimer: that I co-wrote) will be released this month, which does not require the user to write any code in the Stan language. Instead, you specify the likelihood of the data (for a few popular regression-type models) using conventional R syntax and utilize Stan's algorithms and optional priors on the parameters to draw from the posterior distribution. In the demos/ directory of [1], we have replicated most of the first half of Gelman & Hill's textbook and are starting on the second half, which heavily utilizes our stan_glmer() function that is compatible with the syntax of the glmer() function in the lme4 R package.

[1] https://github.com/stan-dev/rstanarm/

[+] mailshanx|10 years ago|reply

Here is a well deserved plug for Cam Davidson Pilon's excellent book called Bayesian Methods for Hackers: https://github.com/CamDavidsonPilon/Probabilistic-Programmin...

[+] tel|10 years ago|reply

Stan is absolutely a world class general Bayesian sampling tool. It replaces things like BUGS and uses fantastic algorithms for fast convergence. One of the major people behind it, Andrew Gelman, is a leader in applied hierarchical modeling and uses it to great effect in social sciences.

If you have even vague interest in applied statistics in fields with lesser ability to design experimental controls then you should investigate this tool.

[+] ericnovik|10 years ago|reply

I found this recent paper showing how to fit and diagnose a simple non-linear model in Stan helpful. It is written from a statistician's perspective. http://www.stat.columbia.edu/~gelman/research/published/stan...

This one has more of a CS feel to it: http://www.stat.columbia.edu/~gelman/research/published/stan...

Lastly, here is my own turbulent experience with getting started with Stan: http://ericnovik.github.io/2015/08/14/Getting-Started-with-S...

[+] tristanz|10 years ago|reply

The great thing about Stan is it is geared toward practitioners doing real work in science and social science, but still manages to push the boundary of stats research (NUTS, ADVI, penalized ML). Other probabilistic programming languages are more expressive, but are typically much harder to use for day-to-day work.

That being said, there's still a huge gulf between practitioners in science and social science that are building parametric Bayesian models and practitioners in the deep learning / ML community that are focused on building more general, scalable, machine learning algorithms for tasks like machine translation and question answering. It would be amazing if somebody could reconcile these communities by showing deep learning models can be expressed and fit effectively side-by-side with more parametric models.

[+] skimpycompiler|10 years ago|reply

Does anyone have the experience with speed of convergence for some advanced models? And, defining sequence models in these probabilistic programming languages?

Seems most (if not all) probabilistic programming languages use sampling as a general algorithm for training models.

So, I guess it would be a hard job to express something like HMMs or CRFs that would fit into the framework (I guess one can easily train HMMs/CRFs with a fixed length of a sequence, but not undefined length).

[+] nextos|10 years ago|reply

You can easily encode any generative sequence model in e.g. Church (Scheme-based), Anglican (Clojure-based) or PyMC (Python-based). These are Turing-complete, so they can model any sampleable (computable) distribution.

See http://forestdb.org/ for some examples. E.g. an infinite HMM.

I have no experience with Stan. There was some controversy on whether it is Turing-complete, but I cannot elaborate on that.

The obvious disadvantage of using a language that is too expressive for something as simple as an HMM is that you loose the nice performance guarantees given by all specialized algorithms such as Viterbi. Theoretically, they can achieve good efficiency by performing program transformations (e.g. with abstract interpretation). But in practice, we are still a bit far from that.

A nice trick is to implement your generative model in something very efficient (e.g. Probabilisitic C or something GPU-based). You can then forget the burden of having to encode your own sampling procedure, but at the same time inference might be tractable.

[+] chmullig|10 years ago|reply

In my experience, Stan performance is decent with many models, particularly if all relevant operations are vectorized and the data sets are "reasonable." However it's easy to accidentally walk off a cliff, and write a model that takes days to fit. Additionally the real time output is a little lackluster, so it's hard to know how you're doing until it finishes (I hear they're working on that for ShinyStan).

I haven't done any HMMs or CRFs with Stan, but don't see why you couldn't do them. Passing in the data likely requires some tricks with arrays and indexing, but it's totally doable. Probably unlikely that you'd beat standard, custom algorithms, but if your HMM was part of a larger model, it might make sense.

[+] dustintran|10 years ago|reply

Most all probabilistic programming languages in fact treat every model as equivalent to an HMM. So certainly inference on them can be done.

[+] proditus|10 years ago|reply

hi. i'm one of the Stan devs. (i work on variational inference: ADVI).

happy to answer any questions here.

[+] chmullig|10 years ago|reply

Yay Stan! So what's the deal with BBVI? I heard it's integrated in the newest Stan, but can I use it? What for?

[+] marmaduke|10 years ago|reply

hi, funny this came up on HN; I'm just trying to get a handle on fitting a dynamical systems model of neural activity propagation with Stan.

What in general are the disadvantages of variational inference vs full MCMC? Are there in general significant advantages besides the speed up?

[+] powera|10 years ago|reply

Can anyone explain to me briefly what Stan actually does that's different from what I could do in Python without it? I don't want to have to read the entire research paper to get a description in layman's terms.

[+] bengHN|10 years ago|reply

The number one thing in my opinion is that Stan's algorithm(s) for drawing from a posterior distribution produce samples that have much less dependence among adjacent draws than other simpler MCMC algorithms like Metropolis-Hastings or Gibbs samplers. Since it is often difficult to answer the question "How much dependence is too much in practice?", it is prudent to use the algorithm that yields the least dependence because the effective sample size (from the posterior distribution) per unit of wall time will usually be greater. PyMC3 has started to incorporate some of Stan's algorithms, although their implementations are not as far along.

[+] proditus|10 years ago|reply

the Stan manual [1] is like a textbook. while it's a bit long (and a fair bit longer than a research paper), i highly encourage that you take a look. it's very informative.

[1] http://mc-stan.org/documentation/

[+] svisser|10 years ago|reply

It might give the right answers.

45 comments