Not one mention of the EM algorithm, which, as far as I can understand, is what is being described here (https://en.m.wikipedia.org/wiki/Expectation%E2%80%93maximiza...). It has many applications, among which is estimating the number of clusters for a Gaussian mixture model.
EM can be used to impute data, but that would be single imputation. Multiple imputation as described here would not use EM since the goal is to get samples from a distribution of possible values for the missing data.
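The distinction can be sketched with a toy numpy example (not from the article; for simplicity it draws from a plug-in normal fit and ignores the parameter uncertainty a full multiple-imputation procedure would also propagate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a normal sample with ~20% of values missing (hypothetical example).
x = rng.normal(loc=10.0, scale=2.0, size=200)
missing = rng.random(200) < 0.2
x_obs = x[~missing]

m = 20  # number of imputed copies
means = []
for _ in range(m):
    # Draw each missing value from a distribution fitted to the observed data,
    # rather than plugging in a single "best guess" (which is single imputation).
    draws = rng.normal(x_obs.mean(), x_obs.std(ddof=1), size=missing.sum())
    completed = np.concatenate([x_obs, draws])
    means.append(completed.mean())

# Pool across copies: the spread between copies reflects the extra
# uncertainty introduced by the missing data.
pooled_mean = np.mean(means)
between_var = np.var(means, ddof=1)
```

Single imputation would keep only one such completed copy, so the between-copy variance (and thus the extra uncertainty from missingness) would be lost.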
> It has so many applications, among which is estimating number of clusters for a Gaussian mixture model
Any sources for that? As far as I remember, EM is used to calculate actual cluster parameters (means, covariances etc), but I'm not aware of any usage to estimate what number of clusters works best.
Source: I've implemented EM for GMMs for a college assignment once, but I'm a bit hazy on the details.
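That matches my understanding too: EM estimates the component parameters for a fixed K. A minimal 1-D, two-component sketch on synthetic data (K is fixed in advance here; choosing K is usually a separate step, e.g. fitting several values of K and comparing BIC):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: two well-separated Gaussian clusters.
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])

# Initial guesses for the K=2 component weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(component k | x_i).
    dens = np.stack([w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from responsibility-weighted data.
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```

After convergence the estimated means sit near -4 and 4; nothing in the loop itself decides how many components there should be.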
Does any living statistician come close to the level of Donald Rubin in terms of research impact? Missing data analysis, causal inference, the EM algorithm, and probably more. He just walks around creating new subfields.
I don't find the language of the article full of "hype"; they describe the history of different forms of imputation from single to multiple to ML-based.
The table is particularly useful as it describes what the article is all about in a way that can stick to students' minds. I'm very grateful for QuantaMagazine for its popular science reporting.
I agree with that. I skip the Quanta magazine articles, mainly because the titles seem to be a little too hyped for my taste and don't represent the content as well as they should.
I wish they actually engaged with this issue instead of writing a fluff piece. There are plenty of problems with multiple imputation.
Not the least of which is that it's far too easy to do the equivalent of p hacking and get your data to be significant by playing games with how you do the imputation. Garbage in, garbage out.
I think all of these methods should be abolished from the curriculum entirely. When I review papers in ML/AI, I automatically reject any paper or dataset that uses imputation.
This is all a consequence of the terrible statistics used in most fields. Bayesian methods don't need to do this.
I feel like multiple imputation is fine when you have data missing at random.
The problem is that data is never actually missing at random, and there's always some sort of interesting variable that confounds which pieces are missing.
Maybe in academia, where sketchy incentives rule. In industry, p-hacking is great till you’re eventually caught for doing nonsense that isn’t driving real impact (still, the lead time is enough to mint money).
My intuition would be that there are certain conditions under which Bayesian inference for the missing data and multiple imputation lead to the same results.
What is the distinction?
The scenario described in the paper could be represented in a Bayesian method or not. “For a given missing value in one copy, randomly assign a guess from your distribution.” Here “my distribution” could be Bayesian or not but either way it’s still up to the statistician to make good choices about the model. The Bayesian can p hack here all the same.
Does anyone else find it maddeningly difficult to read Quanta articles on desktop, because the nav bar keeps dancing around the screen? One of my least favorite web design things is the "let's move the bar up and down the screen depending on what direction he's scrolling, that'll really mess with him." I promise I can find the nav bar on my own when I need it.
That would push things towards the mean... not necessarily a bad thing, but presumably later steps of the analysis will be pooling/averaging data together, so it's not that useful.
A more interesting approach, let's call it OPTION2, would be to sample from the predictive distribution of a regression (regression mean + noise), which would result in more variability in the imputations, though it's random, so it might not be what you want.
The multiple imputation approach seems to be a resampling method of obtaining OPTION2, without the need to assume a linear regression model.
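A toy sketch of the contrast (hypothetical linear data; `option2` here just names the predictive-distribution draw described above):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical setup: y depends linearly on x; ~30% of y values are missing.
n = 500
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, n)
miss = rng.random(n) < 0.3

# Fit a simple linear regression on the observed pairs.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[~miss], y[~miss], rcond=None)
resid_sd = np.std(y[~miss] - X[~miss] @ beta, ddof=2)

X_miss = X[miss]
mean_impute = X_miss @ beta                                   # regression-mean imputation
option2 = mean_impute + rng.normal(0, resid_sd, miss.sum())   # mean + noise

# Mean imputation shrinks the imputed values toward the fitted line;
# OPTION2 restores roughly the right amount of scatter around it.
```

Mean imputation understates variability in the completed data; sampling from the predictive distribution avoids that, at the cost of randomness in any single completed dataset, which is part of why one averages over several imputed copies.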
Jun8 | 1 year ago
An ELI5 intro: https://abidlabs.github.io/EM-Algorithm/
Sniffnoy | 1 year ago
CrazyStat | 1 year ago
miki123211 | 1 year ago
clircle | 1 year ago
selimthegrim | 1 year ago
j7ake | 1 year ago
selectionbias | 1 year ago
aquafox | 1 year ago
xiaodai | 1 year ago
jll29 | 1 year ago
vouaobrasil | 1 year ago
MiddleMan5 | 1 year ago
TaurenHunter | 1 year ago
- Rubin Causal Model
- Propensity score matching
- Contributions to Bayesian inference
- Missing data mechanisms
- Survey sampling
- Causal inference in observational studies
- Multiple comparisons and hypothesis testing
light_hue_1 | 1 year ago
jll29 | 1 year ago
parpfish | 1 year ago
DAGdug | 1 year ago
aabaker99 | 1 year ago
fn-mote | 1 year ago
karaterobot | 1 year ago
paulpauper | 1 year ago
ivansavz | 1 year ago
bgnn | 1 year ago
This must be about the confidence of the approach. Maybe interpolation would be overconfident too.
SillyUsername | 1 year ago
hatmatrix | 1 year ago
userbinator | 1 year ago
a-dub | 1 year ago