top | item 41754774

underbiding | 1 year ago

True, true, but how do you account for missing data that depends on variables you care about versus variables you don't?

More specifically, how do you determine if the pattern you seem to be identifying is actually related to the phenomenon being measured and not an error in the measurement tools themselves?

For example, a significant share of the answers to "Yes / No: have you ever been assaulted?" are blank. This could be because (A) respondents who were assaulted are more likely to leave it blank out of shame, or (B) someone handling the spreadsheet accidentally dropped some rows in the data (because let's be serious here, it's all spreadsheets and emails...).

While you could say that (B) should theoretically be "more truly random", we can't assume that there isn't a pattern to the way those rows were dropped (e.g. a pattern imposed by some algorithm that bugged out and dropped those rows).

Xcelerate|1 year ago

> how do you determine if the pattern you seem to be identifying is actually related to the phenomenon being measured and not an error in the measurement tools themselves?

If the “which data is missing” information can be used to compress the data that isn’t missing further than it can be compressed on its own, then the missing data is missing at least in part due to the phenomenon being measured. Otherwise, it’s not.

We’re basically just asking if K(non-missing data | which data is missing) < K(non-missing data), where K is Kolmogorov complexity. This is uncomputable so it doesn’t actually answer your question regarding “how to determine”, but it does provide a necessary and sufficient theoretical criterion.

A decent practical approximation might be to see if you can develop a model that predicts the non-missing data better when augmented with the “which data is missing” information than via self-prediction. That could be an interesting research project...
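A rough sketch of that approximation, in case it helps make it concrete. Everything here is made up for illustration: the simulated survey, the `severity` latent variable, the always-answered `companion` column, and the `mask_gain` helper are all hypothetical, and "model" is about as crude as it gets (a global mean vs. one mean per missingness group, scored on a held-out half). It mimics scenarios (A) and (B) from the parent comment.

```python
import numpy as np

def mask_gain(observed, is_missing, seed=0):
    """Held-out MSE of a global-mean predictor minus held-out MSE of a
    predictor that also knows the missingness flag. Positive means the
    "which data is missing" information helps predict the observed data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(observed))
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]

    # Self-prediction baseline: one mean learned on the training half.
    global_mean = observed[train].mean()
    baseline_mse = np.mean((observed[test] - global_mean) ** 2)

    # Augmented predictor: one mean per missingness group.
    group_means = {}
    for flag in (False, True):
        sel = is_missing[train] == flag
        group_means[flag] = observed[train][sel].mean() if sel.any() else global_mean
    preds = np.where(is_missing[test], group_means[True], group_means[False])
    augmented_mse = np.mean((observed[test] - preds) ** 2)
    return baseline_mse - augmented_mse

# Hypothetical simulated survey (assumption, not real data).
rng = np.random.default_rng(42)
n = 20_000
severity = rng.normal(size=n)                          # unobserved latent variable
companion = severity + rng.normal(scale=0.5, size=n)   # a question everyone answered

# Scenario (A): the sensitive answer is more likely blank when severity is high.
missing_shame = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * severity))
# Scenario (B): rows dropped uniformly at random, unrelated to anything measured.
missing_random = rng.random(n) < 0.3

gain_shame = mask_gain(companion, missing_shame)
gain_random = mask_gain(companion, missing_random)
print(f"gain with shame-driven blanks: {gain_shame:.4f}")
print(f"gain with random row drops:    {gain_random:.4f}")
```

Under these assumptions the gain should come out clearly positive for (A) and near zero for (B). Note that a positive gain only says the mask is informative about the observed data. A buggy script that drops rows in some systematic way could produce the same signal, so this doesn't by itself tell you why the data is missing.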

parpfish|1 year ago

There’s already a bunch of stats research on this problem. Some useful terms to look up are MCAR (missing completely at random) and MNAR (missing not at random).