fergal_reid|2 years ago
It is fair to ask why the likelihoods are useful if they are so small, and it's not a good answer to say they could be expressed as logs, or to appeal to the properties of continuous distributions.
I think the answer is:
Yes, individual likelihoods are so small that even the MLE solution is extremely unlikely to be exactly correct.
However, a lot of the probability mass - an amount that is not small - is often concentrated around the maximum likelihood estimate, which is why it makes a good estimate and is worth using.
Much like how the average is unlikely to be the exact value of a new sample from the distribution, but it's a good way of describing what to expect - and it gets better if you augment it with some measure of dispersion. (If the distribution is very dispersed, the average is less useful as a description of what to expect, but it still minimises prediction error under squared loss; that's a different point, though, and I think less relevant here.)
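The point above about minimising prediction error can be checked numerically. A minimal sketch with made-up data: among all constant predictions, none beats the sample mean under squared loss, even for a very dispersed sample.

```python
# Hypothetical data, deliberately dispersed.
data = [1.0, 2.0, 4.0, 8.0, 16.0]
mean = sum(data) / len(data)  # 6.2

def mse(c, xs):
    """Mean squared error of predicting the constant c for every value."""
    return sum((x - c) ** 2 for x in xs) / len(xs)

# Scan constants around the mean: none of them beats the mean itself.
candidates = [mean + step / 10 for step in range(-50, 51)]
best = min(candidates, key=lambda c: mse(c, data))
print(round(best, 1))  # 6.2 -- the mean itself
```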
bscphil|2 years ago
The way the question demonstrates "smallness" is wrong, however. They quote the product of the likelihoods of 50 randomly sampled values - 9.183016e-65 - as if the smallness of this value is significant or means anything at all. Forget the issue of continuous sampling from a normal distribution, and just consider the simple discrete case of flipping a coin. The combined probability of any particular sequence of 50 flips is 0.5 ^ 50, a really small number. That's because the probability is, in fact, really small!
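A quick sketch of the coin-flip point: the probability of any one specific sequence of 50 fair flips is astronomically small, yet for a balanced sequence (25 heads, hypothetical numbers here) p = 0.5 still maximizes the likelihood - the smallness itself tells you nothing.

```python
# Probability of one specific sequence of 50 fair-coin flips.
seq_prob = 0.5 ** 50
print(f"{seq_prob:.3e}")  # 8.882e-16 -- tiny, but that's expected

# For a sequence with 25 heads out of 50, the likelihood as a function
# of p is p**25 * (1-p)**25; scanning a grid shows it peaks at p = 0.5.
heads, n = 25, 50

def likelihood(p):
    return p ** heads * (1 - p) ** (n - heads)

grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # 0.5 -- the MLE, despite the tiny likelihood value
```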
anon946|2 years ago
I am not sure how to handle the continuous case, however.
jvanderbot|2 years ago
https://en.wikipedia.org/wiki/Credible_interval#Choosing_a_c...
It's a fairly common "mistake" to treat the MLE as a useful point estimate without considering covariance/spread/CI/HPDI/FIM/CRLB/entropy/MI/KLD or some other measure of precision given the measurement set.
TobyTheCamel|2 years ago
This may be true for low dimensions but doesn't generalise to high dimensions. Consider, for example, a 100-dimensional standard normal distribution. The MLE will still be at the origin, but most of the mass lives in a thin shell roughly 10 units from the origin.
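The thin-shell claim is easy to check by simulation. A sketch using only the standard library: draw points from a 100-dimensional standard normal and look at their distances from the origin (the MLE of the mean). The density peaks at the origin, yet the norms concentrate near sqrt(100) = 10, with essentially no samples anywhere near 0.

```python
import math
import random

random.seed(0)
d, n = 100, 2000

# Euclidean norm of each of n draws from a d-dimensional standard normal.
norms = [
    math.sqrt(sum(random.gauss(0.0, 1.0) ** 2 for _ in range(d)))
    for _ in range(n)
]

avg = sum(norms) / n
print(round(avg, 2))  # close to 10
print(round(min(norms), 2), round(max(norms), 2))  # a narrow band, far from 0
```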
lupire|2 years ago
If I am looking for a needle in a hyperhaystack, it's not important to know that it's more likely to be "somewhere on the huge hyperboundary" than "in the center hypercubic inch".
crazygringo|2 years ago
Can you elaborate? An MLE is never going to recover the exact parameters that produced the samples, but in the original example, as long as you know it's a normal distribution, MLE will probably come up with a mean between 4 and 6 and an SD in a similar range (I haven't calculated it, just eyeballed it) -- when the original parameters were 5 and 5.
I guess I don't know what you mean by "correct", but that's as correct as you can get based on just 50 samples.
fergal_reid|2 years ago
I know they asked with a continuous example, but I don't read their question as limited to continuous cases, and I think it's easier to address with a discrete example, since that avoids the issue of each exact parameter value having infinitesimal mass in a continuous setting.
Let's imagine the parameter we're trying to estimate is discrete - say, the integers between 1 and 500 - and that most of the posterior mass is clustered in the middle, between 230 and 270.
Given some data, it would actually be possible that MLE would come up with the exact value, say 250.
But maybe, given the data, a range of values between 240 and 260 is also very plausible, so the value 250 itself has fairly low probability.
The original poster is confused because they are basically asking: if the actual probability is so low, why is this MLE stuff useful?
You are pointing out that they should really frame things in terms of a range rather than a point estimate. You're right, but I think their question is still legitimate, because in practice we often don't give a range and just report the maximum likelihood estimate of the parameter. (Also, separately, in a discrete parameter setting a specific parameter value can have substantial mass.)
So why is the MLE useful?
My answer would be: for many posterior distributions, a lot of the probability mass will be near the MLE, if not exactly at it - so knowing the MLE is often useful even if the probability of that exact parameter value is low.
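A hypothetical version of the discrete example above, with numbers of my own choosing: a posterior over the integers 1..500 shaped like a discretized normal centred at 250 with spread 5. The single most likely value carries little probability on its own, but the short interval around it carries most of the mass.

```python
import math

# Discretized normal posterior over the integers 1..500 (made-up shape).
weights = {k: math.exp(-((k - 250) ** 2) / (2 * 5 ** 2)) for k in range(1, 501)}
total = sum(weights.values())
post = {k: w / total for k, w in weights.items()}

mle = max(post, key=post.get)
print(mle)  # 250

# The MLE value alone has low probability...
print(round(post[mle], 3))  # ~0.08

# ...but the interval 240..260 around it holds most of the mass.
print(round(sum(post[k] for k in range(240, 261)), 3))  # ~0.96
```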
aquafox|2 years ago
Similarly, AIC values do not make a lot of sense on an absolute scale but only relative to each other, as written in [1].
[1] Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociological methods & research, 33(2), 261-304.
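A small sketch of the relative-scale point, with made-up parameter counts and log-likelihoods: compute AIC = 2k - 2 ln L for two hypothetical models, then the AIC differences and Akaike weights, which are what you actually compare.

```python
import math

# Hypothetical models: (number of parameters k, maximized log-likelihood lnL).
models = {
    "normal(mu, sigma)": (2, -151.2),
    "normal(mu, sigma=1)": (1, -163.8),
}

aic = {name: 2 * k - 2 * lnl for name, (k, lnl) in models.items()}

# Differences from the best model, and Akaike weights: these are the
# meaningful quantities; the absolute AIC values are not.
best = min(aic.values())
deltas = {name: a - best for name, a in aic.items()}
raw = {name: math.exp(-d / 2) for name, d in deltas.items()}
z = sum(raw.values())
weights = {name: w / z for name, w in raw.items()}

for name in models:
    print(name, round(aic[name], 1), round(deltas[name], 1), round(weights[name], 3))
```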
agnosticmantis|2 years ago
This is a Bayesian point of view. The other answers are more frequentist, pointing out that the likelihood at a parameter theta is NOT the probability of theta being the true parameter (given the data), so we can't and don't interpret it as a probability.
klipt|2 years ago
Bayesian priors have similar effect to regularization (e.g. ridge regression / penalizing large parameter values).
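A sketch of that correspondence on made-up data: for linear regression with a zero-mean Gaussian prior on the weight, the MAP estimate coincides with the ridge solution w = (X'X + lam)^-1 X'y. Here the 1-D, no-intercept case, checked two ways.

```python
# Hypothetical 1-D data with slope roughly 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]
lam = 2.0  # regularization strength = prior precision (up to noise variance)

# Closed-form ridge solution in one dimension.
xtx = sum(x * x for x in xs)
xty = sum(x * y for x, y in zip(xs, ys))
w_ridge = xty / (xtx + lam)

# Same answer by direct minimization of the penalized loss
# sum (y - w x)^2 + lam * w^2, whose minimizer is the formula above.
def loss(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + lam * w * w

grid = [i / 10000 for i in range(0, 50000)]
w_grid = min(grid, key=loss)

print(round(w_ridge, 3), round(w_grid, 3))  # 1.878 1.878
```

Without the penalty (lam = 0) the fit would be the plain least-squares slope; the prior/penalty shrinks it toward zero, which is exactly the ridge effect klipt describes.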