top | item 40835787

Well-known paradox of R-squared is still buggin me

71 points | luu | 1 year ago | statmodeling.stat.columbia.edu

103 comments

[+] cperciva|1 year ago|reply
I don't see the problem. The R-squared is 0.01 for blue/red predicting individual votes, because both of the states in question are really just different shades of purple. The R-squared is 1.00 for predicting the total vote share and which party wins the state, because of course the red/blue binary completely determines those.
[+] energy123|1 year ago|reply
R-squared is very small when the effect is small because it's squared, as the name implies.

If he doesn't like that, he should just use R by itself, which would turn that 0.01 into 0.1, and would turn that 0.16 into 0.4. R is the Pearson correlation coefficient of a univariable linear regression.

[+] mihaaly|1 year ago|reply
> predicting individual votes

Isn't that near the top of the statistics no-nos list, or of probability theory in general? Predicting an individual result from the whole sample? It was long ago and I am not in this field at all, but my recollection is that this was mentioned at the beginning of the first class of Probability Theory 101.

[+] kylebenzle|1 year ago|reply
Yes, you are right and statistics is confusing from the outside.

My opinion is LLMs are just applied statistics and you see people losing their minds thinking the models have "come to life". Most people just really have no intuition for stats.

[+] lmm|1 year ago|reply
> The two states are much different in their politics!

Are they? Sounds like they're both swing states, pretty close to 50-50, so which state you're from doesn't have a big effect on what your politics are likely to be. Which is exactly what the R^2 tells us. Where's the paradox?

[+] spenczar5|1 year ago|reply
A ten-point gap is not a swing state; that's big. It's about the same as the gap between New Jersey and Texas in the 2020 presidential vote.
[+] DougBTX|1 year ago|reply
> Where's the paradox?

Exactly. I propose that the paradox is in first-past-the-post voting, a 5% swing leads to a 100% change in representation. How can that be?

[+] derbOac|1 year ago|reply
That was my first thought as well.

Also, traditional R-squared with binary (or, more generally, categorical) variables never made much sense to me. The "meaning" of the variance isn't quite the same, I think. You generally have to nonlinearly transform (e.g., logit) any linear-model quantities to put them on the observed-variable scale.

[+] gpsx|1 year ago|reply
My statistics are a little rusty, so I might be off here. Someone correct me if I have this wrong. R^2 = 1 would mean every voter in one state votes blue and every voter in the other votes red. R^2 = 0 would mean both states are exactly even between red and blue. The states are a lot closer to the latter. Again, my statistics are rusty, so I'm not sure if this next part is valid, but the square root of 0.01 is 0.1, which doesn't seem like such a bad representation of the situation.
[+] parpfish|1 year ago|reply
Part of this has to do with the fact that our intuitive sense of effect size doesn't really run on proportions; we subconsciously start factoring in sample sizes.

If a state held an election with millions of voters and got a 55-45 result, it would be a decisive landslide victory; if an elementary school classroom of 20 voters had an election with a 55-45 split (11 to 9), it'd be the narrowest possible margin of victory.

Most would likely say that the effect in the former 'feels' much larger even though proportions are identical, which suggests that under the hood we're factoring in sample size to our intuition about effect sizes (probably something chi-square-ish).

The result is that the framing of the problem can change our sense of how big the effects are. When we hear that these are state-level elections, we think it's a huge effect and feel that we should be able to do reverse inference. If it was reframed as an election on a much smaller sample, the paradox disappears and you'd say "of course you wouldn't be able to reverse that inference"
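The "something chi-square-ish" intuition can be made concrete. This sketch uses a hypothetical `chi2_5545` helper (my naming, not from the thread) that computes the one-sample chi-square statistic for a 55-45 split against a 50-50 null; the statistic grows linearly with n even though the proportion is fixed:

```python
def chi2_5545(n):
    """One-sample chi-square statistic for a 55-45 split of n votes,
    tested against an expected 50-50 split."""
    observed = [0.55 * n, 0.45 * n]
    expected = [0.5 * n, 0.5 * n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Identical proportions, wildly different evidence against a 50-50 null:
print(round(chi2_5545(20), 4))         # 0.2     -> nowhere near significant
print(round(chi2_5545(1_000_000), 4))  # 10000.0 -> overwhelming
```

Same 55-45 proportion in both cases, but the classroom gives essentially no evidence of a real split while the statewide election gives certainty, which matches the intuition described above.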

[+] thaumasiotes|1 year ago|reply
This ties in to the difference between whether an effect is statistically significant ("does the effect exist?") and whether it's practically significant ("does the effect matter?").

It's very common to confuse the two ideas.

In particular, in an election with many millions of votes and a 55-45 margin, it's common to describe the winner as receiving a mandate to rule, because it was so easy to determine who the winner was, despite the fact that they appear to be extremely unpopular. That's not a mandate in any ordinary sense.

[+] jncfhnb|1 year ago|reply
R2 is more simply explained as the share of the error variance of the best constant guess that the model eliminates; the best constant guess in this case is 0.5.

Guessing 0.5 will have you wrong by 0.5 100% of the time. SST is 25 for a 100-sample example.

Guessing 0.55 for the 0.55 state will have you wrong by 0.45 for 55% of voters and by 0.55 for the other 45% (and symmetrically for the other state). SSE is 24.75.

1- 24.75 / 25 = 0.01

Looking at it this way it’s not too hard to see why the R2 is bad. It barely explains any more difference in the individual behavior than the basic guess.

R2 is not a great metric for percentages or classification problems like this.
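The arithmetic above checks out in a few lines; this sketch just restates the SST/SSE calculation using the expected error frequencies:

```python
n = 100  # total ballots, split evenly between the two states

# Baseline: always guess 0.5. Every 0/1 ballot is off by exactly 0.5.
sst = n * 0.5 ** 2  # 25.0

# Model: guess 0.55 in the 55-45 state and 0.45 in the 45-55 state.
# In each state the guess misses by 0.45 for the majority (55% of ballots)
# and by 0.55 for the minority (45% of ballots).
sse = n * (0.55 * 0.45 ** 2 + 0.45 * 0.55 ** 2)  # 24.75

r_squared = 1 - sse / sst
print(sst, round(sse, 6), round(r_squared, 6))  # 25.0 24.75 0.01
```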

[+] kgwgk|1 year ago|reply
> Looking at it this way it’s not too hard to see why the R2 is bad. It barely explains any more difference in the individual behavior than the basic guess.

Right. R² is 1% because the prediction is bad - only marginally better than the basic guess.

> R2 is not a great metric for percentages or classification problems like this.

Using a different metric won't improve the prediction.

Is the Brier score a great metric for problems like this?

The Brier score for the model is 0.2475.

The Brier score for the "basic guess" is 0.25.

The improvement in the Brier score for the model relative to the basic guess is 1%.
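Those Brier numbers are straightforward to verify. This sketch computes the expected score from the 0.55/0.45 outcome frequencies; by symmetry one state's score equals the overall score:

```python
def brier(p):
    """Expected Brier score (mean squared error of the predicted
    probability p against 0/1 outcomes) in a state where the true
    blue share is 0.55; the 0.45 state gives the same value."""
    return 0.55 * (1 - p) ** 2 + 0.45 * (0 - p) ** 2

model = brier(0.55)    # predict the true state-level share
baseline = brier(0.5)  # always predict 0.5

print(round(model, 6), round(baseline, 6))  # 0.2475 0.25
print(round(1 - model / baseline, 6))       # 0.01 -> the same 1% improvement
```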

[+] justk|1 year ago|reply
The math is correct, but I think the model is not, since it doesn't reflect that the variable s is dichotomous; a mixed model should be used instead. If we insist on treating s as continuous, consider this example: s = state is encoded as a continuous variable between -1 and 1, for a population where people change state frequently. s = -1 means the person will vote in the blue state with probability 1, s = 1 means they will vote in the red state with probability 1, and s = 0 means they are equally likely to vote in either state. When s is near zero the model cannot predict the voter's preferences, and this is the reason for the low predictive power of the model for a continuous s. The extreme cases s = -1 and s = 1 could be rare in populations that move between states frequently, so the initial intuition is misled into this paradox.
[+] mtts|1 year ago|reply
This.

R2 is not the correct measure to use.

This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful.

[+] jncfhnb|1 year ago|reply
A mixed model is not relevant here. A simple linear regression with one variable will achieve exactly the same results. Coding it as -1 and 1 has no difference to coding it as 0 and 1. You just stuff the rest into the intercept.

You would also want to be predicting 0.45 and 0.55 not 1 and 0 because we solve for squared error.

[+] blt|1 year ago|reply
I'm no statistician, but the whole premise seems mismatched. Why are we using a tool from regression to analyze a classification problem?
[+] chipdart|1 year ago|reply
> Why are we using a tool from regression to analyze a classification problem?

Because classification is a regression problem.

Think about it for a second. You want to put together a tool that tells which class an input belongs to. You have training data you can use to build your tool around, already divided into sets, each belonging to a specific class. Your goal is to put together a model that tells you the closest class for your input by comparing how close your input is to elements of the training data belonging to each class.

What's your strategy?

Well, one of the textbook strategies starts by specifying how you measure the distance between elements of your training set. From that point you work on putting together a function that not only minimizes the distance between elements of your training set but also, when used to evaluate new elements, does a good job of telling which training elements are closest to them. Then you assume the class of your input element is the same as the class of the training elements closest to it.

In the example above, the minimization step is... yes, regression. You use regression to fit your model to your training data so that it can measure how close your input element is to elements of a certain class, and then output how close it is to each of the classes.

[+] vcdimension|1 year ago|reply
I am a statistician, and you're right, for this kind of thing we would normally use a binary response model such as a logit or probit model that constrains the response variable to be between 0 & 1. However in this case it doesn't matter since there's only one independent variable (state), and it's binary so there's only 2 different predictions the model could make (which will be the correct probabilities of 0.45 & 0.55, even with a linear model).

The normal R^2 formula can't be applied to a logit/probit model; instead you use an alternative such as McFadden's or Cox & Snell pseudo R-squared. I'd be interested to see what value they take for this example.

Linear models are sometimes used even in models with many independent variables since it can be shown that the coefficients in a linear model are unbiased estimators for the average partial effects of any non-linear binary response model.
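Out of curiosity about the pseudo-R² value asked for above: since the fitted logit probabilities here are just the group frequencies (0.55 and 0.45), McFadden's pseudo R² can be computed analytically from per-observation log-likelihoods rather than by fitting anything. A sketch under those assumptions:

```python
import math

# Null model: predict p = 0.5 for every voter.
ll_null = math.log(0.5)

# State model: predict 0.55 in one state, 0.45 in the other.
# By symmetry both states contribute the same per-observation value.
ll_model = 0.55 * math.log(0.55) + 0.45 * math.log(0.45)

mcfadden = 1 - ll_model / ll_null
print(round(mcfadden, 4))  # ~0.0072
```

So McFadden's pseudo R² comes out around 0.0072, if anything slightly smaller than the linear R² of 0.01; switching to a logit-appropriate measure doesn't dissolve the "paradox" here.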

[+] greesil|1 year ago|reply
Yep. I'm also not a statistician, but the linear regression the blogger is using predicts the mean for each state, and this is being conflated with trying to predict p(color | state). The goodness of fit here would be better measured by cross-entropy than by a standard deviation.
[+] dash2|1 year ago|reply
This comes up a lot in genetics. One crowd says "polygenic scores for education don't tell you much, because look how low the R-squared is!" Another crowd (including me) says "polygenic scores for education are a big deal, because look how big the effect size is!"
[+] akira2501|1 year ago|reply
What paradox? People don't vote a particular way because they live in a state. The logic here would imply "Welp, I live in Kentucky, so I guess Red?" would be the expected mode at the voting booth.
[+] jncfhnb|1 year ago|reply
Statistics does not require asserting causality.
[+] fosdad2131321|1 year ago|reply
There are two ways to resolve the paradox

1. If you insist on using R-squared (i.e., a linear regression measure), then properly center and normalize your data, and model what you actually predict: the difference between the baseline (0.5) and the probability of voting for party 0 or party 1. If you model the outcomes as 0/1 without this, you are using a model made for Gaussian variables on what should be a logistic regression.

2. If you can live with something that more accurately captures the idea of "explanatory power", you can use a GLM (logistic link function), do a logistic regression, and then use the log odds or another measure.

In both cases, the variance explained by the state that you are in is 1, because of course it is, that's how the thought experiment is constructed: p(vote for party 1) = 0.5 + \delta(state).

"Paradoxes" like this are often interesting in the sense that they point to the math being the wrong math or you using it wrong, but instead people tend to assume that they are obviously understanding things correctly so it must be some weird property of the world (which then sometimes is used to construct some faulty conclusions as in some of the cited papers)

[+] justk|1 year ago|reply
From (1): "On the other hand, if the variation between the group means and the grand mean is small, and the variation within groups is large, this suggests there are no real differences in the group means, i.e. the variation we observe is just sampling variation."

The above is in the context of analysis of variance. In our example the means in each state are 0.55 and 0.45 and the grand mean is 0.50, so the first summand is small, but the variances in the red and blue states are both about 0.247, a large summand, so the variation we observe is just sampling variation. Hence the state factor is not important, which explains the low R^2 value. Note that in each state the model's predicted value is the group mean of that group. So analysis of variance shows that the OP's result is not a paradox or anything strange.

https://saestatsteaching.tech/analysis-of-variance
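The decomposition described above can be sketched numerically. Assuming 100 voters per state and exact expected counts, the between-group and within-group sums of squares come out as follows, and their ratio recovers the R² of 0.01:

```python
n = 100               # voters per state (assumed)
means = [0.55, 0.45]  # group (state) means
grand = 0.5           # grand mean

# Between-group sum of squares: how far each state mean sits from the grand mean.
ssb = sum(n * (m - grand) ** 2 for m in means)

# Within-group sum of squares: for 0/1 outcomes with group mean m,
# the squared deviations in a group sum to n * m * (1 - m).
ssw = sum(n * m * (1 - m) for m in means)

print(round(ssb, 6), round(ssw, 6))        # 0.5 49.5
print(round(ssb / (ssb + ssw), 6))         # 0.01 -> R^2
```

The between-group variation (0.5) is dwarfed by the within-group variation (49.5), which is exactly the ANOVA reading of the low R².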

[+] c76|1 year ago|reply
Isn’t the phenomenon just related to the way the vote options are encoded? Use different encodings and you will see different R^2 results. Aren’t the votes artificially represented on a continuous domain for the R^2 calculation, when the actual values are categorical?
[+] ayhanfuat|1 year ago|reply
Not really. No matter how you encode it, the extremes will be 0% and 100%, and one option will be 45% and the other 55%.
[+] leto_ii|1 year ago|reply
As other commenters have pointed out in one way or another, the problem seems to be that this simplistic model of voter choice can't capture all the structure of the real world that humans quickly infer from the setup: state elections have millions of voters, 55/45 is actually a decisive win rather than a narrow one, etc.

In a generic setup, imagine you have a binary classifier that outputs probabilities in the .45-.55 range - likely it won't be a really strong classifier. You would ideally like polarized predictions, not values around .5.

Come to think of it, could this be an issue of non-ergodicity too (hope I'm using the term right)? I.e., the state-level prior is not that informative with respect to the individual vote?

[+] jncfhnb|1 year ago|reply
No, you want your model to be well calibrated. If the model accurately assessed a 0.55 probability of going blue, then that is what you want.

People who try to correct for “unbalanced classes” and contort their model to give polarizing predictions are frankly being pretty dumb.

The correct answer is to take your well-calibrated probabilities and use your brain on what to do with them.

[+] ninjinxo|1 year ago|reply
If voters are split 60-40 on an issue, that doesn't mean that the odds are 60-40.

You should instead be asking: what are the odds that X voters could change their vote?

[+] gweinberg|1 year ago|reply
The states (and even more so the sub-state regions) really are much more different than what you would think just looking at R vs D. A Democrat in a city the Democrats win 90-10 is likely a very different Democrat from one in a city where they lose 60-40.
[+] Chinjut|1 year ago|reply
If you think that's bad, the R^4 coefficient is even lower.
[+] kazinator|1 year ago|reply
Nothing but endless Cloudflare captchas here for me.

Removing cookies for the domain doesn't help, because (doh) I've never visited it before.