human20190310 | 6 years ago

At the risk of embarrassing myself statistically, what exactly happens when you do this?

I.e., if you're controlling for country, that means you're bucketing by country, and looking at each subset, right? So if country is represented by a non-discrete value... what exactly happens?

Fomite|6 years ago

So let's pretend there are three types of trees we want to study: Oak, Maple, and Aspen, which we code as 0, 1, and 2 (there are some good reasons to do this).

Statistically, if you treat that code as a continuous variable, the estimate you get will act as if there's an ordering, and give you the effect of a one-unit increase in "tree". So it reports a single effect for Oak vs. Maple and for Maple vs. Aspen, assuming those two steps are equal and that Oak vs. Aspen is twice that.

This is...nonsense, for most categorical variables. They don't have a nice, ordinal stepping like that.
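
To make the tree example concrete, here is a minimal sketch in plain Python. The growth numbers are made up purely for illustration, and `fit_line` is a hypothetical helper that does ordinary least squares on a single predictor. It fits a straight line through the 0/1/2 codes and shows that the fitted values are forced into equal steps, whatever the real group means are.

```python
# Hypothetical mean growth per species (made-up numbers, for illustration only).
codes = {"Oak": 0, "Maple": 1, "Aspen": 2}
growth = {"Oak": 10.0, "Maple": 30.0, "Aspen": 20.0}

xs = [float(codes[s]) for s in codes]
ys = [growth[s] for s in codes]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(xs, ys)

# The model can only express "each step in the code adds `slope`".
predicted = {s: intercept + slope * codes[s] for s in codes}

for species in codes:
    print(species, "actual:", growth[species], "fitted:", predicted[species])
```

With these invented numbers the fitted values rise steadily from Oak to Maple to Aspen, even though Maple is actually the highest group. The ordering comes from the coding, not from the data.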

kachnuv_ocasek|6 years ago

In short, ANOVA is usually what you want to do: https://en.wikipedia.org/wiki/One-way_analysis_of_variance

In practice, if you have n countries, you'll add n-1 binary variables to your regression equation. The first country is the reference level (all zeros); for the second country, set the first new variable to one and the rest to zero; and so on.
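
The n-1 coding described above can be sketched in a few lines of plain Python. The function name `dummy_code`, its `reference` parameter, and the country list are all assumptions made for this illustration, not any particular library's API.

```python
def dummy_code(values, reference=None):
    """Return (column_names, rows) with one 0/1 column per non-reference level."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # default: first level alphabetically
    others = [lvl for lvl in levels if lvl != reference]
    # Each row has a 1 in the column matching its level, 0 elsewhere.
    # The reference level matches no column, so it is all zeros.
    rows = [[1 if v == lvl else 0 for lvl in others] for v in values]
    return others, rows


# Hypothetical data: three countries, so 3 - 1 = 2 binary variables.
countries = ["France", "Japan", "Brazil", "Japan", "France"]
columns, rows = dummy_code(countries, reference="Brazil")

print(columns)
for country, row in zip(countries, rows):
    print(country, row)
```

In a regression that also has an intercept, the coefficient on each of these columns is then read as that country's difference from the reference country.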

_0ffh|6 years ago

So one-hot encoding, plus one "none-hot" base case. Why not just one-hot for all? To save one input?