Ask HN: Regression to the mean math question
I have a question about regression to the mean.
Suppose you have a set of pairs (a,b) corresponding to students in a class. a = the student's score on the first midterm, b = score on second midterm.
If you plot the pairs with a on the x-axis and b on the y-axis, then fit a least-squares line, you get an upward-sloping line.
The line slope should be less than 1, indicating regression to the mean.
If you instead plot b on the x-axis and a on the y-axis, the slope is now necessarily greater than 1. But I fail to see what has changed in the analysis -- a and b are both just supposed to be samples from the same distribution, right?
This has been driving me crazy, so I'd love some help.
Thank you!
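(One quick way to check the premise is a simulation. This is a sketch, not part of the thread: it assumes standardized scores and a made-up test-retest correlation of 0.6, and fits both regressions with NumPy.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 10_000, 0.6  # hypothetical class size and test-retest correlation

# Standardized scores: b is partly a, partly fresh noise, with corr(a, b) = r.
a = rng.standard_normal(n)
b = r * a + np.sqrt(1 - r**2) * rng.standard_normal(n)

slope_b_on_a = np.polyfit(a, b, 1)[0]  # regress second exam on first
slope_a_on_b = np.polyfit(b, a, 1)[0]  # regress first exam on second

# Both fitted slopes come out near r, i.e. below 1 -- swapping the axes
# gives a *different* least-squares line, not the old line inverted.
print(slope_b_on_a, slope_a_on_b)
```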
[+] [-] roundsquare|16 years ago|reply
If the scores are normalized, then in expectation:
x < mean => y > x
x > mean => y < x
Regression to the mean says that, on average, people move toward the mean in subsequent games/attempts/whatever.
> But I fail to see what has changed in the analysis -- a and b are both just supposed to be samples from the same distribution, right?
Not at all. b is not independent of a; that's the whole point of regression to the mean. If you take ordered pairs where there is no connection between a and b, you won't get any regression to the mean -- you'll get points scattered essentially at random on the plane.
[+] [-] noaharc|16 years ago|reply
Fair point about "no connection between a and b".
What I should have said was something like: Why is it important that a come before b chronologically? If we were mistaken, and we thought that b came first, then what we would be seeing is "progression from the mean".
Does the concept of regression to the mean depend on the chronology of events? That would be weird -- most probability doesn't, right?
[+] [-] mbrubeck|16 years ago|reply
This is exactly reversed. If A and B are perfectly correlated, then you will have no regression to the mean. If they are perfectly independent, then you will have full regression to the mean. If they are only partially correlated, then you will have only partial regression to the mean.
(This is easy to see if you run a simulation of each case.)
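(A simulation along those lines might look like this -- a sketch, assuming standardized normal scores, with the fitted slope computed for each correlation level.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

def fitted_slope(r):
    """Least-squares slope of b on a when corr(a, b) = r (standardized scores)."""
    a = rng.standard_normal(n)
    b = r * a + np.sqrt(1 - r**2) * rng.standard_normal(n)
    return np.polyfit(a, b, 1)[0]

for r in (1.0, 0.5, 0.0):
    print(r, fitted_slope(r))
# The slope tracks r: 1 for perfect correlation (no regression),
# about 0.5 for partial correlation, about 0 for independence (full regression).
```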
[+] [-] mbrubeck|16 years ago|reply
If for some reason only the above-average students regressed, then the slope would be <1. But regression to the mean also affects the scores of students who started below average; as a group we should expect them to regress upward toward the mean. Combine the two groups, and the effects exactly cancel out, leaving a slope of 1.
(Since you say the slope "should be" one, I assume the scores are normalized somehow so that the mean score for exam A is the same as the mean for exam B.)
[+] [-] dmlorenzetti|16 years ago|reply
Suppose the course material is really cumulative, so that some students "get it" and take off, while other students fall by the wayside. Then scoring well on the first test predicts scoring well on the second midterm, while scoring poorly on the first predicts scoring really badly on the second. Then the slope of your least-squares-fit line could easily be greater than 1.
In other words, the mean could stay the same, due to above-average students (on the first test) getting better, and below-average students getting worse. There's no reason to suppose that below-average students will magically get better.
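(That cumulative-material scenario is also easy to simulate. A sketch, with made-up numbers: strong students pull further ahead, weak students fall further behind, and the class mean stays roughly put, yet the fitted slope exceeds 1.)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Hypothetical "cumulative material" scenario: the gap between students
# widens on the second exam, but the mean score stays near the same level.
a = rng.standard_normal(n)
b = 1.5 * a + 0.3 * rng.standard_normal(n)

slope = np.polyfit(a, b, 1)[0]
print(np.mean(b), slope)  # mean stays near 0, but slope is near 1.5 (> 1)
```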
[+] [-] noaharc|16 years ago|reply
If you do poorly on the first test, the x-coordinate is low (close to the y-axis). You're then expected to do somewhat better on the second test, so the y-coordinate tends to be higher than the x-coordinate. This raises the left end of the line.
As you said, if you do well on the first test, the x-coordinate is high, and the y-coordinate tends to be lower than the x-coordinate. This lowers the right end, flattening the line to a slope below 1.
Right?