shadowmint|6 years ago
So, honest question:
If any survey of any size can be ignored on the basis that the sample is not random, then how is any survey meaningful?
Isn’t this a self-defeating argument?
You can’t prove a sample is random; all you can do is show differences between samples and suggest it’s not consistent... but how do we go and prove that some other survey we’re comparing it to came from a random sample?
i.e. isn’t this just a convenient excuse to deny that a survey is meaningful?
Statistically, how do you mathematically quantify the effect of selection bias?
...because, it seems to me, unless you can actually do that, you’re just doing some armchair hand-waving because you don’t like the results you’re seeing.
This has come up several times (e.g. the JS survey about React vs. Angular), and no one has ever given me a meaningful and mathematical response.
It’s always just... “it must be sample bias”, regardless of the 90,000 people they surveyed.
I don’t accept that you can survey 90,000 developers and be unable to offer any generalisation from those results, without someone quantitatively proving there is an overwhelming sample bias, and specifically quantifying the degree of that bias.
Am I missing something here? Everyone seems thoroughly convinced that this is perfectly normal.
(I’m not proud, I’ll take your down votes, but please answer and explain what I’m missing)
prepend|6 years ago
This was the author’s point. Just because you have 90k SO respondents doesn’t mean you can say anything about developers as a population. You can say lots of stuff about SO users. Or maybe developers who use SO. But just because you have lots of responses doesn’t mean you know what developers or jugglers or farmers or whatever population interests you.
The confusion rests with SO’s statement that their survey should be representative of developers in general (or CS graduates or whatever other than only SO visitors).
astazangasta|6 years ago
https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case1.h...
The way to deal with this is to try to construct a representative sample. Here is Gallup's method in 1936:
> But George Gallup knew that huge samples did not guarantee accuracy. The method he relied on was called quota sampling, a technique also used at the time by polling pioneers Archibald Crossley and Elmo Roper. The idea was to canvass groups of people who were representative of the electorate. Gallup sent out hundreds of interviewers across the country, each of whom was given quotas for different types of respondents; so many middle-class urban women, so many lower-class rural men, and so on. Gallup's team conducted some 3,000 interviews, but nowhere near the 10 million polled that year by the Digest.
Stack Overflow did not attempt to construct a representative sample of developers. Therefore they cannot claim that we can learn from their sample about the population.
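The quota-sampling idea described above can be sketched in a few lines. Everything here is hypothetical (two made-up strata, made-up opinion scores); the point is only that a small sample built to mirror known stratum shares estimates the population mean well:

```python
import random

random.seed(0)

# Hypothetical population of 100,000 people in two strata (made-up numbers);
# the value is some opinion score whose population mean we want to estimate.
population = (
    [("urban_woman", random.gauss(60, 10)) for _ in range(40_000)]
    + [("rural_man", random.gauss(40, 10)) for _ in range(60_000)]
)
# Stratum shares assumed known from an external source, e.g. a census.
known_shares = {"urban_woman": 0.4, "rural_man": 0.6}

def quota_sample(pop, shares, total_n):
    """Draw a fixed quota per stratum so the sample mirrors the population."""
    by_stratum = {}
    for stratum, value in pop:
        by_stratum.setdefault(stratum, []).append(value)
    sample = []
    for stratum, share in shares.items():
        sample.extend(random.sample(by_stratum[stratum], round(total_n * share)))
    return sample

sample = quota_sample(population, known_shares, 3_000)
# ~3,000 interviews, like Gallup's; mean lands near the true value (~48 here).
print(len(sample), sum(sample) / len(sample))
```

With quotas of 40%/60%, 3,000 interviews recover the population mean closely, which is Gallup's point: composition of the sample matters far more than its size.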
MaulingMonkey|6 years ago
One can take steps to make the sample more random. This is part of the reason responding to the U.S. Census is legally compelled, for example: to reduce self-selection bias. The same goes for the push for mandatory standardized tests in schools.
One can contextualize the results. Applying, say, English literacy rates from a U.S. survey to China is obviously going to be totally wrong. Applying a developer salary survey at Google to game developers is going to be totally wrong. But within their context, surveys can be more accurate. Outside of its original context, a survey can be re-run.
> ie. Isnt this just a convenient excuse to deny that a survey is meaningful?
While convenient, it's sometimes also inconveniently true that a survey isn't terribly meaningful, or isn't in the context it's being reapplied in. Statistical stuff is hard, a lot of surveys are bad, and while you can make some reasonable guesses and extrapolations, it's worth doing so with a giant grain of salt.
tonyarkles|6 years ago
This survey obviously does not tap directly into the brains of every developer on the planet and extract their unbiased answers to the questions. But it’s still a useful model for seeing trends in the software industry.
Further, from my personal perspective, I’m pretty OK with the self-selection bias inherent in the survey. The kinds of developers who see the value of Stack Overflow and are willing to participate voluntarily in the survey are the kinds of developers whose opinions I generally care about :). That’s my own bias, which I acknowledge exists, and it doesn’t particularly bother me.
Edit: further, none of the results jump out at me as particularly surprising. If there were some extraordinary results here, I’d want someone to do a more rigorous follow-up to dig into them, but there aren’t any, so...
ChrisSD|6 years ago
Surely you have this backwards? If you want to argue that a survey offers any generalisation, then surely the onus is on you to prove you've accounted for sample bias (amongst others)?
shadowmint|6 years ago
If you want to argue with it, surely the onus is on you to do it concretely?
> Because of your methodology, we must assume a biased sample.
^ I find this quote problematic.
Why must we assume that? If you want to do distribution comparisons and point out that their survey results are skewed by X compared to some other survey Y... OK.
...but that’s not what’s happening, right? It’s just a flat-out arbitrary assumption.
I don’t like arbitrary assumptions when I’m doing maths.
It’s easy to say something is wrong, but if you can’t quantify how it’s wrong, I’m struggling to see why I should accept the assumption being raised here.
The JS survey was very similar; it was arbitrarily asserted that it went out to more React developers... but no one actually proved that. They just... assumed it.
shkkmo|6 years ago
Nope, not at all.
It is true that no sample will ever be perfectly representative of a larger population. However, some samples can clearly be more representative than others and the easiest way to tell is to look at sampling methodology. Sample size has absolutely no effect on removing bias.
Here's some info so you can learn more about sampling methodologies: https://blog.socialcops.com/academy/resources/6-sampling-tec...
Now, drawing an actually representative sample is HARD in many situations, as is knowing how representative your sample is. This is why we can’t reliably predict things like who is going to win elections.
dahart|6 years ago
I didn’t see anyone point this out here yet specifically, but what you’re missing is that these 90k devs chose to respond to the survey, and the group is made up only of SO participants; they were not developers selected at random. That’s the problem here.
There is an overwhelming bias, and it has been proved. Stack Overflow admits that openly and Julia talked about it in her answer to the OP’s commentary:
“Developers from underrepresented groups in tech participate on Stack Overflow at lower rates, so we undersample those groups, compared to their participation in the software developer workforce. We have data that confirms that”
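When group participation rates relative to the real population are known, as Stack Overflow says they are here, the skew can be partially corrected by post-stratification weighting. A minimal sketch, with entirely made-up group names, counts, and scores:

```python
# Post-stratification sketch with made-up numbers: reweight survey responses so
# each group counts in proportion to its externally known population share.

# Hypothetical survey results: group -> (respondents, mean reported score)
survey = {
    "group_a": (80_000, 95.0),  # over-represented among respondents
    "group_b": (10_000, 70.0),  # under-represented among respondents
}
# Hypothetical external data: each group's true share of the workforce.
workforce_share = {"group_a": 0.70, "group_b": 0.30}

total = sum(n for n, _ in survey.values())
naive_mean = sum(n * m for n, m in survey.values()) / total
weighted_mean = sum(workforce_share[g] * m for g, (_, m) in survey.items())

print(f"naive:    {naive_mean:.1f}")  # pulled toward the over-sampled group
print(f"weighted: {weighted_mean:.1f}")
```

The naive mean is dominated by the over-sampled group; the weighted mean moves toward what a workforce-proportional sample would have reported. The correction is only as good as the external share estimates, of course.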
balfirevic|6 years ago
What you're missing is that one of the intuitions you have is simply wrong. The intuition is that sample size can undo the ill effects of non-random sample. As stated in the original article and elsewhere in the comments, it cannot:
> It is an error to use the sample size of a non-random sample to support the underlying comparison with the population of interest. Sample size can decrease random error, but not bias
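The quoted point can be demonstrated with a small simulation (all numbers hypothetical): if one group is more likely to respond, the estimate converges as the sample grows, but to the wrong value.

```python
import random

random.seed(1)

# Made-up numbers, purely for illustration: 30% of developers use tool X,
# but X users are three times as likely to respond to the survey.
TRUE_RATE = 0.30

def biased_survey(n):
    """Collect n responses under self-selected (non-random) participation."""
    responses = []
    while len(responses) < n:
        uses_x = random.random() < TRUE_RATE
        respond_prob = 0.9 if uses_x else 0.3  # the self-selection bias
        if random.random() < respond_prob:
            responses.append(uses_x)
    return sum(responses) / n

# The estimate stabilises as n grows, but around the wrong value (~0.56,
# not 0.30): a bigger sample shrinks the random error, never the bias.
for n in (100, 10_000, 100_000):
    print(n, round(biased_survey(n), 3))
```

Here the limit is P(uses X | responded) = 0.3·0.9 / (0.3·0.9 + 0.7·0.3) ≈ 0.56, so even a million respondents would report roughly 56%, nearly double the true 30%.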
daze42|6 years ago