It seems like most of these complaints center on the inherent subjectivity of the prior. In cases like astronomy and other hard sciences, the prior reflects actual scientific knowledge and is not really subjective at all. In cases where we don't have that kind of evidence, empirical Bayes methods work very well by just peeking at some subset of the data and finding a good point estimate for the prior.
I'm also not sure why the OP thinks that calculating the normalizing constant is a huge issue. You rarely need it, since you're likely going to end up doing MCMC or some other sampling method for the posterior, in which case you only need proportionality.
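To make the proportionality point concrete, here's a toy random-walk Metropolis sketch (the N(3, 1) target and all tuning numbers are invented for illustration). The sampler only ever sees differences of the unnormalized log-density, so any log Z cancels:

```python
import math
import random

def metropolis(log_unnorm, x0, steps=20000, scale=1.0):
    """Random-walk Metropolis. Only needs an *unnormalized* log-density:
    the acceptance ratio cancels any normalizing constant."""
    x, samples = x0, []
    for _ in range(steps):
        prop = x + random.gauss(0.0, scale)
        # Any additive constant (i.e. log Z) cancels in this difference,
        # so the normalizing constant is never computed.
        if random.random() < math.exp(min(0.0, log_unnorm(prop) - log_unnorm(x))):
            x = prop
        samples.append(x)
    return samples

# Toy unnormalized posterior: exp(-(x - 3)^2 / 2), i.e. N(3, 1)
# with its 1/sqrt(2*pi) constant deliberately left out.
random.seed(0)
draws = metropolis(lambda x: -0.5 * (x - 3.0) ** 2, x0=0.0)
tail = draws[5000:]              # discard burn-in
mean = sum(tail) / len(tail)     # close to the true mean, 3
```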
There are lots of problems with Bayesian methods in practice, but most of them revolve around the scalability of modern methods to massive data sets and very complicated models. Many Bayesians tend to think that it's absolutely crucial to quantify uncertainty and that the added computational cost and human effort is worthwhile. In practice, point estimate methods to find MAP or even just maximum likelihood values work really well for most problems. If you look at the trend in most machine learning, for instance, generally people find a cool way to solve some problem with good performance (e.g. SGD + Deep Nets), then some Bayesian lab spends a few years trying to interpret everything as a generative model and coming up with a clever way to sample everything (e.g. Lawrence Carin's lab at Duke has done a lot of this work in Deep Bayesian Nets). The end result is usually better, but by then most people have moved on to a newer problem and the appeal of getting a marginal boost in performance is harder for me to see. The Bayesian nonparametrics crowd has historically done a pretty good job of hitting a sweet spot of compromise on this by keeping a Bayesian view but still (usually) treating everything as an optimization problem first (e.g. variational inference methods).
It seems like your attitude towards statistical errors is largely going to depend on the risk of making bad predictions.
For example, the problem with the denominator containing "unknown unknowns" isn't much of an issue if you're searching photographs or optimizing ad revenue. It's much more important for something safety-related like building an airplane or a driverless car.
Finance is in the middle: not directly safety related, but modeling tail risk the wrong way could bankrupt the company.
> I'm also not sure why the OP thinks that calculating the normalizing constant is a huge issue. You rarely need it, since you're likely going to end up doing MCMC or some other sampling method for the posterior, in which case you only need proportionality.
Right, and when doing Bayesian model selection the evidence drops out of the problem completely. All you need are the priors and the likelihoods.
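As a toy illustration (all numbers invented), the evidence term never shows up in the posterior odds:

```python
# Toy model comparison: two hypotheses, their priors, and the
# likelihood each assigns to the observed data (made-up numbers).
prior = {"H1": 0.5, "H2": 0.5}
likelihood = {"H1": 0.08, "H2": 0.02}    # P(data | H)

# Posterior odds = prior odds * Bayes factor; P(data) never appears.
posterior_odds = (prior["H1"] * likelihood["H1"]) / (prior["H2"] * likelihood["H2"])

# If normalized posteriors are wanted, the evidence is just the sum of numerators.
evidence = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
```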
The claim that ~H is "not itself a valid hypothesis" is dubious. If H is the hypothesis that a certain continuous parameter has a value in the range [a, b], then it's perfectly obvious what ~H means.
Of course it's possible to choose a vague, overly broad hypothesis, but a frequentist analysis of such a hypothesis is going to be just as bad as a Bayesian analysis. "Garbage in, garbage out" is true no matter what tool you use.
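For instance, here's a quick grid-approximation sketch (toy uniform prior, made-up data) where H is an interval hypothesis and ~H is simply its complement:

```python
# Toy interval hypothesis: theta ~ Uniform(0, 1) prior,
# H: 0.2 <= theta <= 0.5, data: 7 successes in 10 Bernoulli trials.
# Grid approximation; the binomial coefficient is omitted since it cancels.
grid = [i / 1000 for i in range(1, 1000)]
weights = [t**7 * (1 - t)**3 for t in grid]
total = sum(weights)
post = [w / total for w in weights]

p_H = sum(p for t, p in zip(grid, post) if 0.2 <= t <= 0.5)
p_not_H = 1.0 - p_H    # ~H is the complement event -- perfectly well defined
```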
I do not understand the problem people have with priors in Bayesian methodology. Yes, it is true that a poor choice of prior can affect results. But classical, frequentist techniques incorporate priors implicitly: a flat prior indicating we have no information other than the data. And just as a poor Bayesian prior based on subjective belief can ruin an analysis, an implicitly assumed non-informative prior can be just as catastrophic. It is truly a rare case when absolutely nothing is known about a process; in every other case, a flat prior is exactly the kind of poor prior these people are so afraid of.
> But classical, frequentist techniques incorporate priors implicitly: a flat prior indicating we have no information other than the data.
That's not completely right. Frequentists don't assume a flat prior, but rather play a minimax strategy that gives a certain worst-case performance across all possible priors. For example, frequentist confidence intervals have coverage guarantees, while Bayesian intervals generally don't. The middle ground is using "objective Bayesian" methods that aim for good frequentist properties.
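A quick simulation sketch of the coverage point (all parameters made up):

```python
import math
import random

# Coverage of the textbook z-interval for a normal mean with known sigma = 1.
# The frequentist guarantee: ~95% of intervals cover the true mean,
# whatever that mean happens to be -- no prior enters anywhere.
random.seed(1)
true_mu, n, trials = 2.7, 25, 2000
half = 1.96 / math.sqrt(n)               # half-width of the 95% interval
hits = 0
for _ in range(trials):
    xbar = sum(random.gauss(true_mu, 1.0) for _ in range(n)) / n
    if xbar - half <= true_mu <= xbar + half:
        hits += 1
coverage = hits / trials                 # should land close to 0.95
```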
You misunderstand what's going on. If I publish a paper, people want to know what the data I found suggests, and that's it.
A reader can then build their own chain of logic combining several papers with prior knowledge. Further, if someone retracts a paper, they can update that chain of logic. But if a paper is based on another paper that was retracted, then its chain of logic is suspect.
PS: Bayesian reasoning is fine for meta analysis though.
So, he finds that intuitive guesstimates of probabilities are not very effective. No joke.
While probability is a really nice and useful theoretical construct, in practice getting an accurate numerical estimate often ranges from challenging to not doable.
Broadly there are three approaches:
(1) For something like coin flipping, just call it 1/2 and move on!
(2) There are stacks of theorems that can help, sometimes a lot. E.g., there is the renewal theorem that says that lots of stochastic arrival processes converge to Poisson processes, and often in practice a good estimate of the arrival rate is easy; then a lot more probabilities just drop right out of various expressions for Poisson processes. Can also make use of the central limit theorem, the law of large numbers, the martingale inequality, a Markov assumption, etc. Here one of the best little tricks is to use intuition to justify independence (generally much more effective than using intuition in estimation of Bayes priors) and then exploit that assumption.
(3) Start with a Bayes prior or whatever the heck but use an iterative scheme that can have several or many iterations, and for that scheme have some solid theorems that it converges. Then iterate your way there.
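A minimal sketch of (3), using conjugate Beta-Binomial updating as the iterative scheme (the true rate and counts are made up):

```python
import random

# Start from a Beta(1,1) (uniform) prior on a coin's bias and update
# iteratively; conjugacy makes each update a one-line bookkeeping step,
# and the posterior mean provably converges to the true rate.
random.seed(2)
true_p, a, b = 0.3, 1.0, 1.0
for _ in range(5000):
    if random.random() < true_p:
        a += 1.0    # observed a success
    else:
        b += 1.0    # observed a failure
post_mean = a / (a + b)    # iterates its way toward true_p
```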
Great post. Couldn't have said it any better myself. Point 3 is essential, IMHO. Superforecasting by Philip Tetlock & Dan Gardner [1] relates an excellent description of this process in the realm of human forecasting, even though they don't phrase it as a Bayesian approach. Essentially they found that those best able to predict world events continuously honed their estimates using an iterative process, updating what really could be described as the priors of the superforecasters.
It's an enlightening read as they describe some of the processes used to hone intuited estimates using outward- and inward-looking processes. I'm going to have to look into what you mean by using intuition to judge independence. Any good sources on that?
[1]: https://en.m.wikipedia.org/wiki/Superforecasting
I'd like to point out that frequentist statistics suffer from just as many philosophical shenanigans. (For examples, see any introduction to Bayesian statistics.)
If you want "rationality", you're going to have to look elsewhere.
I have yet to see any convincing criticism of Bayesian reasoning when performed using priors derived from the Principle of Maximum Entropy [0]. By incorporating all information available and nothing more, such a prior neither makes unwarranted assumptions nor throws away information (as is very commonly done with other methods). In principle, the process of generating such a prior demands absolutely no subjectivity; rather, it is the result of a logical deduction from all information available. In practice some information may be difficult to specify or incorporate, but this is automatically accounted for by the Principle of Maximum Entropy: it guarantees that nothing unspecified is assumed, and being unable to incorporate some information merely results in all possibilities being considered without bias. In the very worst case, when you have no relevant information which can be incorporated, this regresses to an uninformative prior (such as the uniform distribution), which is easily and rigorously handled by this principle even in far more complex cases where other approaches fail entirely. Furthermore, given a different prior, this process can tell you exactly what additional (unwarranted) assumptions are made.
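As a concrete sketch, here is Jaynes' dice version of the principle (the mean constraint of 4.5 is just an illustrative choice): given only an average-value constraint, the maximum-entropy distribution has the exponential form p_k ∝ exp(λk), and λ can be pinned down numerically:

```python
import math

# Maximum-entropy distribution on die faces {1..6} given only that the
# mean is 4.5. The maxent solution is p_k proportional to exp(lam * k);
# bisection solves for lam so the mean constraint holds exactly.
faces = list(range(1, 7))

def mean_for(lam):
    w = [math.exp(lam * k) for k in faces]
    z = sum(w)
    return sum(k * wk for k, wk in zip(faces, w)) / z

target, lo, hi = 4.5, -10.0, 10.0
for _ in range(100):                # mean_for is increasing in lam
    mid = (lo + hi) / 2.0
    if mean_for(mid) < target:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2.0

w = [math.exp(lam * k) for k in faces]
z = sum(w)
probs = [wk / z for wk in w]        # tilted toward high faces; nothing else assumed
```

With the constraint set to the fair-die mean of 3.5, the same code returns the uniform distribution, i.e. the uninformative prior mentioned above.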
Once the priors are specified, the actual process of Bayesian Reasoning is formal logical reasoning generalised to the case when you possess incomplete information. It tells you exactly the degree of belief you can assign to a proposition given some information; and given this information assigning either less or more belief than the precise amount this deductive process tells you are equally grave mistakes.
For further information on the Principle of Maximum Entropy I recommend reading Prior Probabilities (1968) [1] and chapters 11 and 12 of Probability Theory: The Logic of Science [2]. If you are unconvinced of the theoretical validity or universality of Bayesian reasoning I heartily recommend reading chapters 1 and 2 of Probability Theory: The Logic of Science [2].
[0]: https://en.wikipedia.org/wiki/Principle_of_maximum_entropy
[1]: http://bayes.wustl.edu/etj/articles/prior.pdf
[2]: http://bayes.wustl.edu/etj/prob/book.pdf
> This is on everyone’s short list of problems with Bayes. In the simplest interpretation of Bayes, old evidence has zero confirming power. If evidence E was on the books long ago and it suddenly comes to light that H entails E, no change in the value of H follows. This seems odd – to most outsiders anyway.
I don't understand what he's referring to here. If we now know that H entails E, then that means our model of the world changed, and thus our posterior on H changed as well. Did I miss anything?
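A toy numerical version of that reading (all probabilities invented): before noticing the entailment our model said P(E|H) = 0.5; learning that H entails E forces P(E|H) = 1, and re-conditioning on the same old evidence E moves the posterior.

```python
# Made-up numbers illustrating "old evidence" after a model change.
p_H = 0.4                  # prior on H
p_E_given_not_H = 0.5      # held fixed in both models

def posterior(p_E_given_H):
    num = p_E_given_H * p_H
    return num / (num + p_E_given_not_H * (1.0 - p_H))

before = posterior(0.5)    # model unaware of the entailment: 0.4
after = posterior(1.0)     # model knows H entails E: ~0.571
```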
It's an interesting article, but there is a lot of debate around the interpretation of Bayesian inference, and IMO there are answers to be found. In particular, Andrew Gelman argues against the subjective interpretation: http://www.stat.columbia.edu/~gelman/research/published/phil... It's the best article I've read on the subject.
If you consider subjective priors to be a problem, this can be addressed to some degree using so-called "objective priors". They are objective in the sense that if two people agree on the underlying principles of how priors should be assigned, then they will get the same priors. The catch is that you must decide what principles to use, as they are not objective themselves.
Updating multiple times on the same evidence can be bad, as it overstates the evidence you have for some hypothesis, but you could do much worse. Instead of discovering that H implies E, suppose that you conditioned on H, which as it later turned out is logically inconsistent. This is in general a serious mistake regardless of whether what you are doing has the word "frequentist" or "Bayes" attached to it, but the consequences are not necessarily always the same. Larry Wasserman, in the chapter titled "Strengths and Weaknesses of Bayesian Inference" of his "All of Statistics", has an example concerned with estimating a normalizing constant. He compares the two approaches: the frequentist one, which works just fine, and the Bayesian one, which fails miserably. There is no additional commentary, so I always wondered whether he never realized that the derivation makes inconsistent assumptions, or realized it but intended to show that the frequentist approach comes out just fine. Ex falso quodlibet.
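The double-counting point can be shown with toy numbers: applying Bayes' rule to the same evidence E twice inflates the posterior beyond what E actually supports.

```python
# Made-up likelihoods for one piece of evidence E.
p_E_given_H, p_E_given_not_H = 0.8, 0.4

def update(prior):
    num = p_E_given_H * prior
    return num / (num + p_E_given_not_H * (1.0 - prior))

once = update(0.5)      # the legitimate posterior: ~0.667
twice = update(once)    # conditioning on E a second time: 0.8 -- overstated
```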
Regarding the raven paradox, the underlying reasoning and conclusions always appeared to me perfectly natural and reasonable. I think it is to the great detriment of mathematics and statistics that people come up with catchy names containing the word "paradox" for things that are merely unintuitive to them. For example, Simpson's paradox is the simple observation that the probability of an event is not a plain average of the event probabilities across all groups; instead, the within-group probabilities must also be weighted by the relative group sizes. What's paradoxical about that?
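For example, with the classic kidney-stone treatment numbers: treatment A wins within each stone-size group, yet B wins overall, purely because the group sizes weight the averages differently.

```python
# (successes, patients) per stone-size group, classic kidney-stone data.
A = {"small": (81, 87), "large": (192, 263)}
B = {"small": (234, 270), "large": (55, 80)}

def rate(d, group):
    s, n = d[group]
    return s / n

def overall(d):
    # Pooled rate: within-group rates weighted by group sizes.
    return sum(s for s, n in d.values()) / sum(n for s, n in d.values())
```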
Regarding the negation of H not being a real hypothesis -- this is only true if you claim to somehow consider all alternative hypotheses. I don't think people claim that. It seems to me that ~H is rather taken to represent only those alternative hypotheses that are under consideration given your modeling assumptions. Then it is perfectly fine and valid. I like how Jaynes avoided this kind of misinterpretation by conditioning everything on the background information and other assumptions used. Let all of those be represented by B. Then you would talk about P(H|B) and P(~H|B), which makes it clearer that you are not talking about all unknown unknowns.