Hacker News item 21700255

What’s Wrong with Bayes

131 points | luu | 6 years ago | statmodeling.stat.columbia.edu

119 comments

[+] signalsmith|6 years ago|reply
For me, I really appreciate the Bayesian approach because it makes it very explicit that you pick a prior.

Perhaps my experience is limited, but every (supposedly non-Bayesian) model I've used in practice has been possible to re-express in Bayesian terms (priors, beliefs, and so on). Then I get to look at the initial assumptions (model/prior) and use suitable human hand-wavey judgement about whether they make sense.

Bayes is a good way to _update_ models, but if you lose sight of the fact that the bottom of your chain of deduction was a hand-wavey guess, you're in trouble.

[+] madhadron|6 years ago|reply
> it makes it very explicit that you pick a prior

But you don't, in general, pick a prior. You pick a procedure that has an expected loss under various conditions. It's one player game theory.

If you happen to have a prior, then you can use it to choose a unique procedure that has minimal expected risk for that prior given the loss function, but even so that may not be what you want. For example, you may want a minimax procedure, which may be quite different from the Bayes procedure.

[+] mikorym|6 years ago|reply
Are all priors an application of Bayes's theorem?

It is confusing to me that there is talk of Bayesian statistics vs. frequentist statistics when both are often used in conjunction. The classic example of a medical test with false positives and false negatives and the prior being incidence in the general population comes to mind. To me that is not just an example of Bayes, but a combination of frequentist statistics with Bayes's theorem.

I also seem to recall that Bayes's theorem appears in a standard first year probability and statistics course.

[+] eanzenberg|6 years ago|reply
Yeah, no thanks though. I don't want every rando adding "priors" that "feel" right to their analysis. Frequentist methods are straightforward. Both can be (and are) abused to confirm a bias.
[+] syrrim|6 years ago|reply
If the goal is to avoid bankruptcy, then the probability needs to be interpreted differently. If you bet the house every time, you're guaranteed to go bankrupt eventually. Suppose instead you bet half your money on an event of 50% probability. If you take 1:1 odds on this, then when you lose, your money is divided by 2, but when you win it is only multiplied by 1.5. Your money will tend to decrease over time. You need to pick odds 1:a such that 1+a/2=2 => a=2.

We recover our regular betting odds by betting a smaller portion of our money. If we bet a portion 1/d of our money on an event of probability 1/p, we need odds 1:a such that 1+a/d=(d/(d-1))^(p-1). For large enough d we get a=p-1, as we would expect.

Assume again you're betting half your money each round, but take a probability of winning of 84%, as in the article. Plugging d = 2 and 1/p = 0.84 into the formula gives 1 + a/2 = 2^(1/0.84 - 1) ≈ 1.14, i.e. breakeven odds of about 1:0.28 per unit staked. You need a better payout than the roughly 1:0.19 implied by the article's recommended 5:1 odds.
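The breakeven condition above can be checked by computing the expected log growth of wealth directly; a small sketch with illustrative numbers:

```python
import math

def expected_log_growth(win_prob, frac, odds_a):
    """Expected per-round log growth of wealth when staking a fixed
    fraction `frac` at payout odds 1:odds_a on an event with `win_prob`."""
    win = math.log(1 + odds_a * frac)   # win: gain odds_a per unit staked
    lose = math.log(1 - frac)           # lose: the stake is gone
    return win_prob * win + (1 - win_prob) * lose

# Betting half (d = 2) on a 50% event (p = 2): the formula
# 1 + a/d = (d/(d-1))**(p-1) gives a = 2, and growth is indeed zero
# there, while plain 1:1 odds shrink wealth over time.
print(expected_log_growth(0.5, 0.5, 2.0))  # ~0 (breakeven)
print(expected_log_growth(0.5, 0.5, 1.0))  # ≈ -0.144 (losing)
```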

[+] ikeboy|6 years ago|reply
This has nothing to do with interpreting probability, but with a utility function that's not linear in terms of wealth. With decreasing marginal returns to wealth, the same bet becomes less attractive at lower wealth levels.

This can't fully explain observed risk aversion, though; see https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.15.1.219

[+] ravar|6 years ago|reply
For the curious, look up the Kelly criterion; it formalizes this line of thinking.
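The Kelly fraction has a simple closed form; a minimal sketch (the 84% figure is the article's posterior probability, used here purely as an illustration):

```python
def kelly_fraction(p, b):
    """Kelly-optimal fraction of wealth to stake on an event with win
    probability p at payout odds b-to-1 (profit of b per unit staked)."""
    return p - (1 - p) / b

print(kelly_fraction(0.5, 1.0))    # 0.0: never bet at fair even odds
print(kelly_fraction(0.84, 1.0))   # ≈ 0.68: stake 68% at 1:1 on an 84% event
```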
[+] jefft255|6 years ago|reply
In robotics, particularly in Bayesian filtering (Kalman filters and so on), I find the idea of a "prior" solid, and I don't see any frequentist alternatives. Your prior is easy to understand: whatever your posterior for the state was at the previous timestep, updated using the actions you commanded your robot to perform. Inference then refines this prior using the observation that the robot makes.

There's nothing hand-wavy about that; if you do Bayesian statistics with bad priors, of course you're going to get bad inference. I guess the author is just warning you to be careful about your assumptions, which is always good advice.
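That predict/update loop can be sketched in scalar form; a minimal 1-D Kalman-style filter with made-up noise values, not any particular robot's model:

```python
# The "prior" at each step is last step's posterior pushed through the
# motion model; the observation then refines it into a new posterior.

def predict(mean, var, control, motion_noise):
    """Prior for this timestep: previous posterior moved by the control."""
    return mean + control, var + motion_noise

def update(mean, var, obs, obs_noise):
    """Posterior: prior refined by the observation (conjugate Gaussian update)."""
    k = var / (var + obs_noise)          # Kalman gain
    return mean + k * (obs - mean), (1 - k) * var

mean, var = 0.0, 1.0                                     # initial belief
mean, var = predict(mean, var, control=1.0, motion_noise=0.5)
mean, var = update(mean, var, obs=1.2, obs_noise=0.5)
print(mean, var)                                         # ≈ 1.15, 0.375
```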

[+] skybrian|6 years ago|reply
I'm curious what happens when you reboot your robot. What's the first prior?
[+] Majromax|6 years ago|reply
> Example abridged: a draw from N(phi,1) for unknown phi is 1. Bayesian reasoning with a uniform prior gives an 86% posterior probability that phi > 0

I'm not sure I see the problem here? If it's counterintuitive, it's only because we treat N(0,1) as the normal distribution, so our true prior is that if we pick a distribution out of a hat we're more likely to have N(0,1) than anything else.

Suppose I truly know nothing but what is given in the quote. On the basis of symmetry, I'd have to conclude that P(phi<0) is the same as P(phi>2). If the blogger had phrased this as "86% posterior probability that phi < 2", I don't think it would be so surprising.

In fact, the blogger describes this draw as:

> after seeing an observation that is statistically indistinguishable from noise.

which to me presupposes a great deal of information about what 'noise' is supposed to look like.
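The symmetry argument above is easy to verify numerically; a quick sketch using the flat-prior posterior phi ~ N(1, 1):

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# With a flat prior and one draw y = 1 from N(phi, 1), the posterior
# for phi is N(1, 1).
p_gt_0 = 1 - norm_cdf(0, mu=1)   # ≈ 0.841, the article's ~84%
p_lt_0 = norm_cdf(0, mu=1)       # ≈ 0.159
p_gt_2 = 1 - norm_cdf(2, mu=1)   # ≈ 0.159: equal to p_lt_0 by symmetry
print(p_gt_0, p_lt_0, p_gt_2)
```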

[+] Akababa|6 years ago|reply
I don't know, this seems to be a really low-effort blog post. The given example is obviously contrived from the unreasonable improper (-\infty,\infty) prior and the low \sigma^2=1 likelihood. If it was really "pure noise" then you'd have \sigma^2=\infty which rightly gives you a flat posterior.

For sure Bayesian gives you more flexibility with your assumptions, so it's easier to shoot yourself in the foot. But when used correctly it can be more powerful, and often easier to interpret.

[+] contravariant|6 years ago|reply
Ironically the article that the example is from offers quite a nice rebuttal:

> None of these examples are meant to shoot down Bayes. Indeed, if posterior inferences don’t make sense, that’s another way of saying that we have external (prior) information that was not included in the model. (“Doesn’t make sense” implies some source of knowledge about which claims make sense and which don’t.) When things don’t make sense, it’s time to improve the model. Bayes is cool with that.

[+] roenxi|6 years ago|reply
There is a certain intellectual laziness in this perspective as might be expected from a short blog post - obviously Bayes' formula is theoretically sound because it is trivial to deduce and prove.

So we know that if the conclusion is not acceptable then either the method, the prior or the evidence is not acceptable. Evidence and method can be ruled out; so the prior was not reasonable.

Basically, he's saying that he doesn't believe the prior is flat. A reasonable thing to say, too: as he notes, if we suspect the data is probably random noise, then the prior should say we are probably looking at noise. So in practice the prior is heavily weighted towards 0. It isn't intellectually honest to use an uninformative prior unless you think the probability of the process being statistical noise is almost 0.

[+] 6gvONxR4sf7o|6 years ago|reply
>obviously Bayes' formula is theoretically sound because it is trivial to deduce and prove.

Quantum mechanics doesn't follow the usual probability rules, so you can't really say "obviously Bayes' formula is theoretically sound." It certainly seems like Bayes' theorem should apply universally, but apparently it doesn't. Or at least, the jury's still out.

https://en.wikipedia.org/wiki/Quantum_probability

[+] knzhou|6 years ago|reply
But this isn't actually a criticism of Bayes at all. Yes, the result depends on your prior. But the result always depends on your preconceptions -- even in frequentist statistics, where it determines which statistical tests you use and which hypotheses you test and what p-value cutoff is reasonable. It's better to have this up front.

Or, you can publish Bayesian update factors, which are prior-independent.
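A toy version of such an update factor for the post's setup, as a sketch (the two point hypotheses are chosen purely for illustration): the factor is a likelihood ratio, so each reader can multiply it into their own prior odds.

```python
import math

def normal_pdf(y, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Update factor for H1: phi = 1 versus H0: phi = 0 after observing
# y = 1 drawn from N(phi, 1). Prior odds times this factor gives
# posterior odds, whatever the reader's prior was.
update_factor = normal_pdf(1, mu=1) / normal_pdf(1, mu=0)
print(update_factor)   # e^0.5 ≈ 1.65
```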

[+] j7ake|6 years ago|reply
The example should of course ring caution bells but at least in Bayes you can figure out why your inference is doing unreasonable things by examining each of your assumptions. In this case it’s the prior that needs fixing.

Are there alternative methods that are better than the Bayes method for this toy example?

[+] TTPrograms|6 years ago|reply
Seriously, as soon as he said "flat prior on theta" I had huge alarm bells go off. Garbage in garbage out.
[+] olooney|6 years ago|reply
Just for context, Andrew Gelman is one of the creators of Stan[1], one of the most popular probabilistic programming platforms for Bayesian inference. He has also written a popular textbook on Bayesian methods, Bayesian Data Analysis[2].

Everyone hates picking priors in Bayesian analysis. If you pick an informative prior, you can always be criticized for it (in peer review, for a business decision, etc.). The usual dodge is to use a non-informative prior (like the Jeffreys prior[3]). I interpret Gelman's point as saying that this can also lead to bad decisions. Thus, Bayesian analysts must thread the needle between Scylla and Charybdis when picking priors. That's certainly a real pain point when using Bayesian methods.

However, it's pretty much the same pain point as choosing regularization parameters (or choosing not to regularize) when doing frequentist statistics. For example, sklearn was recently criticized for turning on L2 regularization by default, which could be viewed as a violation of the principle of least surprise, as well as causing practical problems when inputs are not standardized. But leaving regularization turned off is equivalent to choosing a non-informative or even improper prior (informally in many cases, and formally identical for linear regression with normally distributed errors[4]). So Scylla and Charybdis still loom on either side.
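The equivalence mentioned for linear regression fits in a few lines; a hedged sketch with made-up data (one feature, no intercept, unit noise variance):

```python
# Ridge (L2) regression and the Gaussian-prior MAP estimate coincide:
# minimizing sum((y - b*x)^2) + lam*b^2 is the same as maximizing the
# posterior under y ~ N(b*x, 1) with prior b ~ N(0, 1/lam).
xs = [1.0, 2.0, 3.0, 4.0]          # made-up data
ys = [1.1, 2.3, 2.8, 4.2]
lam = 0.5                          # regularization strength = prior precision

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

beta_ridge = sxy / (sxx + lam)     # closed-form ridge solution
beta_map = sxy / (sxx + lam)       # closed-form MAP solution: identical
print(beta_ridge, beta_map)        # both ≈ 1.013
```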

My problem with Bayesian models, completely unrelated to Gelman's criticism, is that the partition function is usually intractable and really only amenable to sampling methods (MCMC with NUTS[5], for example). This makes them computationally expensive to fit, which in turn restricts them to (relatively) small data sets. But using a lot more data is the single best way to let a model get more accurate while avoiding over-fitting! That is why I live with the following contradiction: 1) I believe Bayesian models have better theoretical foundations, and 2) I almost always use non-Bayesian methods for practical problems.

[1]: https://mc-stan.org/

[2]: https://www.amazon.com/Bayesian-Analysis-Chapman-Statistical...

[3]: https://en.wikipedia.org/wiki/Jeffreys_prior

[4]: https://stats.stackexchange.com/questions/163388/l2-regulari...

[5]: http://www.stat.columbia.edu/~gelman/research/published/nuts...

[+] perl4ever|6 years ago|reply
"Everyone hates picking priors in Bayesian analysis."

Everybody hates searching for their keys in the dark.

[+] howlin|6 years ago|reply
Bayesian modeling can be very powerful when it works but it can also be catastrophic when it fails. It helps to think about this in an adversarial decision theoretic context where you play a prediction game against an opponent (usually called Nature).

We can think of the game as discovering the best model to explain a set of observations. The Bayesian believes that Nature picks the true model that generated the observations by sampling the prior. This is actually a huge assumption to make, which is why Bayesian methods work so well when the assumption is close to the truth.

Frequentists make the assumption that Nature chooses the underlying true model from a set of possible models. Beyond restricting the set of models Nature can choose from, frequentists make no further assumptions about the selection process. This is a strictly weaker assumption than the Bayesian makes, which means frequentist methods will do better when the specified prior grossly misrepresents Nature's decision making process.

There are even weaker assumptions that can be made about how Nature chooses the data. Regret-based model inference allows for a more adversarial game with Nature where the data may not come from the class of models considered at all. If Nature truly behaves this way, then Bayesian decision making can catastrophically fail.

[+] c2471|6 years ago|reply
This ignores the main strength of a Bayesian workflow: you can straightforwardly quantify the effect of your prior choice on your inference - pick a different prior, see how much the inference changes, and so on. A good Bayesian workflow does not assume a prior to be true; it should be based on available evidence, and then stressed.

To be a bit more concrete, let's say we wish to model the height of kangaroos. We come up with a model form, say regression, and a bunch of potential features. If we are Bayesian we might say, "I think nature prefers simple, stable solutions, so I'll put a N(0,d) prior on my weights." We then compute a posterior and get a range of credible values. We can then say, "hey, what if I'm wrong and actually it's a Student t, or a flat prior, or X or Y or Z", and use principled tools like the marginal likelihood to say which family of models works best, do prior-posterior comparisons to see how the observations changed our prior, etc.

If we do this under a frequentist framework, we compute the regression coefficients and can get some confidence bounds with an appeal to asymptotics (and nobody I've ever seen actually makes any attempt to validate those assumptions). And even when we are done, we get a confidence interval with such a truly unintuitive definition that almost everyone who is not a stats PhD fundamentally misinterprets it.

To say frequentists make fewer assumptions is not true - they are just less explicit about them, and I consider it a strength, not a weakness, to highlight the choices made by the statistician.

[+] selectionbias|6 years ago|reply
My problem with the 'Bayes = rationality' type of argument is that it ignores context and isn't really a case for reporting Bayesian vs. frequentist estimates. If I am a researcher publishing results, then I have an audience who interpret my results. If my audience is Bayesian and accepts my model, then all I need to do is report sufficient statistics and they can make their own Bayesian inferences given their priors - or better yet, I can just post my whole dataset. The very reason we need to report things like credible sets or confidence intervals rather than just sufficient statistics is that real-world audiences want summary stats that are transparent and easy to interpret. The best approach to inference is the one that is most useful to its audience, and that depends on context and practicalities rather than on some underlying philosophy of subjective vs. objective probabilities.
[+] metasj|6 years ago|reply
Many analyses of the world aren't Bayesian /or/ frequentist; they use much simpler pattern-matching, with feedback loops that update the approach as well as the conclusion. Problems start with assuming you have to choose one of those approaches to estimate the future...
[+] ummonk|6 years ago|reply
>Put a flat prior on theta and you end up with an 84% posterior probability that theta is greater than 0. Step back a bit, and it’s saying that you’ll offer 5-to-1 odds that theta>0 after seeing an observation that is statistically indistinguishable from noise. That can’t make sense. Go around offering 5:1 bets based on pure noise and you’ll go bankrupt real fast.

If you think it's likely to be pure noise, why the hell would you put a flat prior on it?

Note also that nonflat priors are implicit in significance testing - e.g. p95 significance is similar to putting a 95% prior on the null hypothesis, and p99 significance is similar to putting a 99% prior on the null hypothesis.

[+] pontusrehula|6 years ago|reply
To criticize is easy, but it feels incomplete if one doesn't provide any clue of what the supposedly better alternatives would be.
[+] mycall|6 years ago|reply
84% isn't that great for predictions compared to DNNs, RNNs or other modern ML algorithms.
[+] gweinberg|6 years ago|reply
The author has a major fundamental misconception about how probability works. If I say "the probability that proposition X is true is 0.5", that means that based on the information available to me right now it's equally likely to be true as false. That's not even remotely similar to saying I would offer an even-money bet.
[+] baron_harkonnen|6 years ago|reply
Ignoring the fact that "the author" is one of the most respected statisticians in the world today... there is no debate about how to translate probabilities into odds:

odds(x) = p(x)/(1-p(x))

That's the definition of "odds". So in this case it is quite clear that the odds for X are 1, implying an even-money bet.
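That definition in one line, with the thread's two running examples:

```python
def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)

print(odds(0.5))    # 1.0: an even-money bet
print(odds(0.84))   # ≈ 5.25: roughly the article's 5-to-1
```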

[+] sunstone|6 years ago|reply
The human brain is the best Bayesian model builder that evolution has yet devised. A good place to start assessing its weaknesses is to observe your own brain messing up. This shouldn't be hard to do.
[+] madhadron|6 years ago|reply
Why do you think that the human brain is Bayesian?
[+] kylebenzle|6 years ago|reply
A good post, but here's the TL;DR.

What's wrong with Bayes? Nothing.

[+] neonate|6 years ago|reply
That is not what the article says.