top | item 39419472

fergal_reid | 2 years ago

I think most of the replies, here and on stack exchange, are answering slightly the wrong question.

It is fair to ask why the likelihoods are useful if they are so small, and it's not a good answer to talk about how they could be expressed as logs, or even to talk about the properties of continuous distributions.

I think the answer is:

Yes, individual likelihoods really are that small, and yes, even the MLE solution is extremely unlikely to be exactly correct.

However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.

Much like how the average is unlikely to be the exact value of a new sample from the distribution, but it's a good way of describing what to expect. (And gets better if you augment it with some measure of dispersion, and so on). (If the distribution is very dispersed, then while the average is less useful as an idea of what to expect, it still minimises prediction error in some loss; but that's a different thing and I think less relevant here).
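A quick numerical check of that last parenthetical (a sketch assuming NumPy; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5, scale=5, size=50)
mean = data.mean()

# Scan candidate "guesses" around the mean and measure squared prediction error.
candidates = np.linspace(mean - 3, mean + 3, 601)
mse = np.array([((data - c) ** 2).mean() for c in candidates])
best = candidates[mse.argmin()]

# The minimiser of squared error is (up to grid resolution) the mean itself,
# even though the mean almost never equals any individual sample exactly.
print(best, mean)
```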

bscphil|2 years ago

> It is fair to ask why the likelihoods are useful if they are so small

The way the question demonstrates "smallness" is wrong, however. It quotes the product of the likelihoods of 50 randomly sampled values - 9.183016e-65 - as if the smallness of this value were significant or meant anything at all. Forget the issue of continuous sampling from a normal distribution, and just consider the simple discrete case of flipping a coin. The combined probability of any particular sequence of 50 flips is 0.5 ^ 50, a really small number. That's because the probability is, in fact, really small!
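The arithmetic is easy to verify (a sketch in Python, assuming NumPy and SciPy are available):

```python
import numpy as np
from scipy.stats import norm

# Any one specific sequence of 50 fair coin flips:
p_sequence = 0.5 ** 50
print(f"{p_sequence:.3e}")  # ≈ 8.882e-16

# The joint likelihood of 50 draws from N(5, 5), evaluated at the true
# parameters, is similarly tiny: it is a product of 50 densities each well
# below 1, which is exactly the ~1e-65 figure the question is surprised by.
rng = np.random.default_rng(0)
samples = rng.normal(loc=5, scale=5, size=50)
joint = np.prod(norm.pdf(samples, loc=5, scale=5))
print(f"{joint:.3e}")
```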

knightoffaith|2 years ago

Right - and so the more appropriate thing to do is not look at the raw likelihood of any one particular value but instead look at relative likelihoods to understand what values are more likely than other values.

anon946|2 years ago

For the discrete case, it seems that a better thing to do is consider the likelihood of getting that number of heads, rather than the likelihood of getting that exact sequence.

I am not sure how to handle the continuous case, however.
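The discrete contrast is stark (stdlib-only sketch); for the continuous case the analogous move is to integrate the density over an interval rather than evaluate it at a point:

```python
from math import comb

n, k, p = 50, 25, 0.5

p_sequence = p ** n  # any one exact ordering of 50 flips
p_count = comb(n, k) * p ** k * (1 - p) ** (n - k)  # exactly 25 heads, any order

print(p_sequence)  # astronomically small
print(p_count)     # ≈ 0.112 -- a perfectly ordinary probability
```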

jvanderbot|2 years ago

Yes - the most enlightening concept for me was the "Highest Probability Density Interval", which is basically always clustered around the mean. But you can choose any interval that contains the same amount of probability mass!

https://en.wikipedia.org/wiki/Credible_interval#Choosing_a_c...

It's a fairly common "mistake" to treat the MLE as a useful point estimate without considering covariance/spread/CI/HPDI/FIM/CRLB/entropy/MI/KLD or some other measure of precision given the measurement set.

TobyTheCamel|2 years ago

> However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.

This may be true for low dimensions but doesn’t generalise to high dimensions. Consider, for example, a 100-dimensional standard normal distribution. The MLE will still be at the origin, but most of the mass lives in a thin shell at a distance of roughly 10 units (≈ √100) from the origin.
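This concentration is easy to see numerically (a sketch assuming NumPy; for a d-dimensional standard normal the typical distance from the origin is about √d):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
x = rng.standard_normal((100_000, d))  # 100k draws from a d-dim standard normal
norms = np.linalg.norm(x, axis=1)

print(norms.mean())        # ≈ sqrt(d) = 10: the mass lives in a thin shell
print(norms.std())         # ≈ 0.71: the shell really is thin
print((norms < 5).mean())  # essentially 0: no mass anywhere near the mode
```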

blt|2 years ago

I think the "mass" they are referring to might be the mass of the Bayesian posterior in parameter space, not the mass of the data in event space.

lupire|2 years ago

Concentration of mass is density. A shell is not dense.

If I am looking for a needle in a hyperhaystack, it's not important to know that it's more likely to be "somewhere on the huge hyperboundary" than "in the center hypercubic inch".

crazygringo|2 years ago

> Yes, individual likelihoods are so small, that yes even a MLE solution is extremely unlikely to be correct.

Can you elaborate? An MLE is never going to come up with the exact parameters that produced the samples, but in the original example, as long as you know it's a normal distribution, MLE is probably going to come up with a mean between 4 and 6 and a SD within a similar range as well (I haven't calculated it, just eyeballing it) -- when the original parameters were 5 and 5.

I guess I don't know what you mean by "correct", but that's as correct as you can get, based on just 50 samples.
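Concretely (a sketch assuming NumPy; the closed-form MLE for a normal is just the sample mean and the n-divisor sample SD):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5, scale=5, size=50)  # true parameters: mu=5, sd=5

mu_hat = samples.mean()          # MLE of the mean
sigma_hat = samples.std(ddof=0)  # MLE of the SD (divides by n, not n-1)

# The point estimates land near (5, 5) but essentially never hit them exactly;
# with n=50 the standard error of mu_hat alone is 5/sqrt(50) ~ 0.7.
print(mu_hat, sigma_hat)
```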

fergal_reid|2 years ago

Right - I think this is what's at the heart of the original question.

I know they asked with a continuous example, but I don't interpret their question as limited to continuous cases, and I think it's easier to address using a discrete example, as we avoid the issue of each exact parameter having infinitesimal mass which occurs in a continuous setting.

Let's imagine the parameter we're trying to estimate is discrete, taking integer values between 1 and 500, and that most of the posterior mass is clustered in the middle, between 230 and 270.

Given some data, it would actually be possible that MLE would come up with the exact value, say 250.

But maybe, given the data, a range of values between 240 and 260 are also very plausible, so the value 250 on its own has fairly low posterior probability.

The original poster is confused, because they are basically saying, well, if the actual probability is so low, why is this MLE stuff useful?

You are pointing out they should really frame things in terms of a range rather than a point estimate. You are right; but I think their question is still legitimate, because often in practice we do not give a range, and just give the maximum likelihood estimate of the parameter. (And also, separately, in a discrete parameter setting, a specific parameter value could have substantial mass.)

So why is the MLE useful?

My answer would be, well, that's because for many posterior distributions, a lot of the probability mass will be near the MLE, if not exactly at it - so knowing the MLE is often useful, even if the probability of that exact value of the parameter is low.
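A toy version of that discrete story (a sketch assuming NumPy; the posterior shape is invented purely for illustration):

```python
import numpy as np

values = np.arange(1, 501)
# Hypothetical posterior: a discretised bell curve centred on 250.
weights = np.exp(-0.5 * ((values - 250) / 5.0) ** 2)
posterior = weights / weights.sum()

mle = values[posterior.argmax()]       # 250
p_mle = posterior[posterior.argmax()]  # probability of exactly 250
p_near = posterior[(values >= 240) & (values <= 260)].sum()

print(mle, p_mle, p_near)  # the exact MLE value is unlikely (~8%),
                           # but ~96% of the mass sits within +/-10 of it
```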

aquafox|2 years ago

I agree with your points, and that's why it's useful to compare an MLE against an alternative model via a likelihood-ratio test, in which case you see how much better the generative model performs compared to the wrong model.

Similarly, AIC values do not make a lot of sense on an absolute scale but only relative to each other, as written in [1].

[1] Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociological methods & research, 33(2), 261-304.
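A minimal illustration of the "relative, not absolute" point (a sketch assuming NumPy/SciPy; the two candidate models are invented for the example):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=5, size=50)

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

# Model A: normal with mean and SD fit by MLE (2 free parameters).
ll_a = norm.logpdf(x, loc=x.mean(), scale=x.std(ddof=0)).sum()
# Model B: a deliberately wrong fixed model, N(0, 1) (0 free parameters).
ll_b = norm.logpdf(x, loc=0, scale=1).sum()

# Neither AIC value means anything on its own; the gap between them does.
print(aic(ll_a, 2), aic(ll_b, 0))
```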

agnosticmantis|2 years ago

> However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.

This is a Bayesian point of view. The other answers are more frequentist, pointing out that likelihood at a parameter theta is NOT the probability of theta being the true parameter (given data). So we can't and don't interpret it like a probability.

klipt|2 years ago

Given enough data, Bayesian and frequentist models tend to converge to the same answer anyway.

Bayesian priors have a similar effect to regularization (e.g. ridge regression / penalizing large parameter values).
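The prior-as-regularizer correspondence is concrete: ridge regression is exactly the MAP estimate under a zero-mean Gaussian prior on the coefficients (a sketch assuming NumPy; the data and penalty are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.standard_normal(n)

# Ridge solution: argmin ||y - X b||^2 + lam * ||b||^2.
# Equivalently, the MAP estimate under y ~ N(Xb, I) with prior b ~ N(0, I/lam).
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```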

LudwigNagasena|2 years ago

That's not a Bayesian point of view. You can re-word it in terms of a confidence interval / coverage probability. It is true that in frequentist statistics parameters don't have probability distributions, but their estimators very much do. And one of the main properties of a good estimator is formulated in terms of convergence in probability to the true parameter value (consistency).