Effect size is significantly more important than statistical significance

jerf|4 years ago

Speaking not to this study in particular necessarily, I strongly agree with the general point. Science has really been held back by an over-focusing on "significance". But I'm not really interested in a pile of hundreds of thousands of studies that establish a tiny effect with suspiciously-just-barely-significant results. I'm interested in studies that reveal robust results that are reliable enough to be built on to produce other results. Results of 3% variations with p=0.046 aren't. They're dead ends, because you can't put very many of those into the foundations of future papers before the probability of one of your foundations being incorrect is too large.

To the extent that those are hard to come by... Yeah! They are! Science is hard. Nobody promised this would be easy. Science shouldn't be something where labs are cranking out easy 3%/p=0.046 papers all the time just to keep funding. It's just a waste of money and time of our smartest people. It should be harder than it is now.

Too many proposals are obviously only going to be capable of turning up that result (insufficient statistical power is often obvious right in the proposal, if you take the time to work the math). I'd rather see more wood behind fewer arrows, and see fewer proposals chasing much more statistical power, than the chaff of garbage we get now.

If I were King of Science, or at least, editor of a prestigious journal, I'd want to put word out that I'm looking for papers with at least one of some sort of significant effect, or a p value of something like p = 0.0001. Yeah. That's a high bar. I know. That's the point.

"But jerf, isn't it still valuable to map out all the little things like that?" No, it really isn't. We already have every reason in the world to believe the world is drenched in 1%/p=0.05 effects. "Everything's correlated to everything", so that's not some sort of amazing find, it's the totally expected output of living in our reality. Really, this sort of stuff is still just below the noise floor. Plus, the idea that we can remove such small, noisy confounding factors is just silly. We need to look for the things that stand out from that noise floor, not spending billions of dollars doing the equivalent of listening to our spirit guides communicate to us over white noise from the radio.

naasking|4 years ago

> If I were King of Science, or at least, editor of a prestigious journal, I'd want to put word out that I'm looking for papers with at least one of some sort of significant effect, or a p value of something like p = 0.0001. Yeah. That's a high bar. I know. That's the point.

And study preregistration to avoid p-hacking and incentivize publishing negative results. And full availability of data, aka "open science".

Robotbeat|4 years ago

The problem is that when you’re on the cusp of a new thing, unless you’re super lucky, the result will necessarily be near the noise floor. Real science is like that.

But I definitely agree it’d be nice to go back and show something is true to p=.0001 or whatever. Overwhelmingly solid evidence is truly a wonderful thing, and as you say, it’s really the only way to build a solid foundation.

When you engineer stuff, it needs to work 99.99-99.999% of the time or more. Otherwise you’re severely limited to how far your machine can go (in terms of complexity, levels of abstraction and organization) before it spends most of its time in a broken state.

I’ve been thinking about this while playing Factorio: so much of our discussion and mental modeling of automation works under the assumption of perfect reliability. If you had SLIGHTLY below 100% reliability in Factorio, the game would be a terrible grind limited to small factories. Likewise with mathematical proofs or computer transistors or self driving cars or any other kind of automation. The reliability needs to be insanely good. You need to add a bunch of nines to whatever you’re making.

A counterpoint to this is when you’re in an emergency and inaction means people die. In that case, you need to accept some uncertainty early on.

hyperbovine|4 years ago

Come into Bayesian land, the water is fine. The whole NHST edifice starts to seem really shaky once you stop and wonder if "True" and "False" are really the only two possible states of a scientific hypothesis. Andrew Gelman has written about this in many places, e.g. http://www.stat.columbia.edu/~gelman/research/published/aban....

chuckcode|4 years ago

Don't get distracted by the click bait title. Effect size should be captured by statistical significance (larger effects are less likely to happen by chance). Author is really complaining that the original study didn't report enough data to check their analysis or do alternative analysis methods. Better title for article would be "Hard to peer review when you don't share the data"

tommiegannert|4 years ago

A few years ago, HN comments complained about the censorship that only leaves successful studies. We need to report on everything we've tried, so we don't walk around on donuts.

What's missing in my mind is admitting that results were negative. I'm reading up on financial literacy, and many studies end with some metrics being "great" at p 5%, but then some other metrics are also "great" at p 10%, without the author ever explaining what they would have classified as bad. They're just reported without explanation of what significance they would expect (in their field).

modeless|4 years ago

Not only is it not valuable to publish tons of studies with p=.04999 and small effect size, in fact it's harmful. With so many questionable results published in supposedly reputable places it becomes possible to "prove" all sorts of crackpot theories by selectively citing real research. And if you try to dispute the studies you can get accused of being anti-science.

shakezula|4 years ago

I blame most of this on pop science. It's absolutely ruined the average public's respect for the behind the scenes work doing interesting stuff in every field. What's worse is the attitude it breeds. Anti-intellectualism runs rampant amongst even well educated members of my social circle. It's frustrating to say the least.

bluGill|4 years ago

> Plus, the idea that we can remove such small, noisy confounding factors is just silly. We need to look for the things that stand out from that noise floor

We have found most of them, and all the easy ones. Today the interesting things are near the noise floor. 3000 years ago atoms were well below the noise floor, now we know a lot about them - most of it seems useless in daily life yet a large part of the things we use daily depend on our knowledge of the atom.

Science needs to keep separating things from the noise floor. Some of them become important once we understand it.

setgree|4 years ago

> or a p value of something like p = 0.0001

This has been proposed [0], albeit for a threshold of p < 0.005.

Here's Andy Gelman and others arguing otherwise [1]. They also got like 800 scientists to sign on to the general idea of no longer using statistical significance at all [2].

[0] https://www.nature.com/articles/s41562-017-0189-z

[1] http://www.stat.columbia.edu/~gelman/research/unpublished/ab...

[2] https://www.nature.com/articles/d41586-019-00857-9

phreeza|4 years ago

This is clearly a cost/benefit tradeoff, and the sweet spot will depend entirely on the field. If you are studying the behavior of heads of state, getting an additional N is extremely costly, and having a p=0.05 study is maybe more valuable than having no published study at all, because the stakes are very high and even a 1% chance of (for example) preventing nuclear war is worth a lot. On the other hand, if you are studying fruit flies, an additional N may be much cheaper, and the benefit of yet another low effect size study may be small, so I could see a good argument being made for more stringent standards. In fact I know that in particle physics the bar for discovery is much higher than p=0.05.

kuhewa|4 years ago

Nothing is wrong with publishing small effect size results. Setting a P threshold lower or a a higher bar for effect sizes for journal acceptance will just increase the positivity bias and also encourage more dodgy practices. Null results are important.

Understanding effect size is as important as significance can manifest by requiring effect size or variance explained to be reported every time the result of a statistical test is presented, e.g. rather than simply "a significant increase was observed (p = 0.01)" and also making that kind of parsing the standard in scientific journalism.

rscho|4 years ago

If you were the king of science, I'd kindly ask you to think about replacing grant financing and all other financial incentives that go along with publishing. Now that would be efficient. 'Cause I currently make .05-barely-significant-results but if you force me to up my game I will provide .0001-barely-significant-results no problem, even with 'preregistration' or whatever hoop you hold in front of me.

As an aside, could you also please make medicine a real science, so I can finally scientifically demonstrate that my boss is wrong?

Tycho|4 years ago

What do you (or anyone else) think about the statistical conclusions in this paper? Particularly the adjusted r-squared values reported.

https://www.cambridge.org/core/journals/american-political-s...

raxxorrax|4 years ago

The current science economy around publishing is partially responsible, although it should also be said that finding no correlation is still a gain of knowledge that is valuable to build upon for people in the same field, even if it might not generate the most exciting read for others.

sanxiyn|4 years ago

I agree we shouldn't listen to noise, but small effect size is not necessarily noise. (I agree it is highly correlated.) I mean, QED's effect size on g factor is 1.001. QED was very much worth finding out.

davidmanheim|4 years ago

Maybe all studies should be preregistered, including their methods... like this one was?

https://osf.io/vzdh6/

BenoitEssiambre|4 years ago

p = 0.0001 doesn't help much. You can get to an arbitrarily small p by just using more data. The problem is trying to reject a zero width null hypothesis. Scientists should always reject something bigger than infinitesimally small so that they are not catching tiny systematic biases in their experiments. There are always small biases.

Gwern's page "Everything Is Correlated" is worth reading: https://www.gwern.net/Everything

robocat|4 years ago

From the article:

Ernest Rutherford is famously quoted proclaiming “If your experiment needs statistics, you ought to have done a better experiment.”

“Of course, there is an existential problem arguing for large effect sizes. If most effect sizes are small or zero, then most interventions are useless. And this forces us scientists to confront our cosmic impotence, which remains a humbling and frustrating experience.”

mr_toad|4 years ago

Must be nice. Not everyone has the luxury of being able to carry out whatever experimentation they feel like. Sometimes we’re limited by what is affordable, practical, or ethical.

exporectomy|4 years ago

I wonder if we should separate the roles of scientist and researcher. Universities would have generalist "scientists" who's job would be to consult for domain-specialized researchers to ensure they're doing the science and statistics correctly. That way, we don't need every researcher in every field to have a deep understanding of statistics, which they often don't.

Either that or stop rewarding such bad behavior. Science jobs are highly competitive, so why not exclude people with weak statistics? Maybe because weak statistics leads to more spurious exciting publications which makes the researcher and institution look better?

civilized|4 years ago

The scientific establishment will never be convinced to stop doing bad statistics, so "the solution to bad speech is more speech". Statisticians should be rewarded for effective review and criticism of flawed studies, and critical statistical reviews of any article should be easy to find when they exist.

This is sounding like a great startup idea for a new scientific journal, actually.

neumann|4 years ago

Every medical researcher I've worked with had a biostatistician on hand to handle the stats. As a aerospace engineer, I always had interesting discussions with them on the meaningfulness of a clinical study with 15 people, but have come to appreciate the massive difficulty in progressing medical research if everybody were to wait for a clinical trial with a 1000 patients.

jltsiren|4 years ago

Such staff scientist roles for people with particular methodological skills do exist. They are not particularly common, because there are a few issues:

1. Who will pay for them?

2. How do we make staff scientist roles attractive to people who could also get tenure-track faculty positions or do ML/data science in the industry?

3. How do we ensure that a staff scientist position is not a career dead end if the funding dries up after a decade or two?

The standard academic incentives (long-term stability provided by tenure, freedom to work on whatever you find interesting, recognition among other experts in the field) don't really apply to support roles.

Robotbeat|4 years ago

We exclude people who don’t publish. Papers tend not to publish stuff that isn’t a positive result.

abeppu|4 years ago

I think the weird thing is that a bunch of people in tech understand this well _with respect to tech_, but often fall into the same p-value trap when reading about science.

If you're working with very large datasets generated from e.g. a huge number of interactions between users and your system, whether as a correlation after the fact, or as an A/B experiment, getting a statistically significant result is easy. Getting a meaningful improvement is rarer, and gets harder after a system has received a fair amount of work.

But then people who work in these big-data contexts can read about a result outside their field (e.g. nutrition, psychology, whatever), where n=200 undergrads or something, and p=0.03 (yay!) and there's some pretty modest effect, and be taken in by whatever claim is being made.

RandomLensman|4 years ago

These discussions are fun but rather pointless: e.g., sometimes a small effect is really interesting but it needs to be pretty strongly supported (for instance, claiming a 1% higher electron mass or a 2% survival rate in rabies).

Also, most published research is inconsequential so it really does not matter other than money spent (and that is not only related to findings but also keeping people employed etc.). If confidence in results is truly an objective might need to link it directly to personal income or loss of income, ie force bets on it.

versteegen|4 years ago

If you have a tiny effect size on X, you probably haven't discovered a significant cause of X, but just something incidental.

For example, smoking was finally proved to cause lung cancer because the effect size was so large that the argument that 'correlation does not imply causation' became absurd: it would have required the existence of a genetic or other common cause Z that both causes people to smoke and causes them to develop cancer with correlations at least as large as between smoking and lung cancer, but there just isn't anything correlated that strongly. It would imply that almost everyone who smokes heavily does so because of Z.

hammock|4 years ago

>Effect Size Is Significantly More Important Than Statistical Significance

Ok, but by how much?

stickfigure|4 years ago

You forgot "...and how often?"

ummonk|4 years ago

Agree with the title, but not the contents. The study in question is actually an example of a huge effect size (10% reduction in cases just from instructing villages they should wear masks is amazing) possibly hampered by poor statistical significance (as the blog post outlines).

robocat|4 years ago

Without knowing how many people were wearing masks, you can’t say the much about the 10% figure.

You get approximately[1] the same outcome if:

(a) masks are 100% effective but only 10% wear them, and

(b) masks are 10% effective and 100% wear them.

Is this study showing (a) or (b)?

Let us assume (b) masks only help by 10% and R0 is 2 without masks. If exponential transmission is occurring then in ~11.5 days you have the same number infected with masks as in 10 days without masks.

Either way the study has ended up with a 10% figure, and that figure gets misunderstood or intentionally misrepresented. If you want to argue for the effectiveness of masks against those that don’t wish to wear them, then personally I think it is a terrible study to argue with because 10% sounds shitty.

[1] Actual numbers depends on a heap of other things, but just assume those figures are right for the sake of making things easy to understand.

Disclaimer: I wear a mask during Level 2 lockdown in the South Island of New Zealand, and mask wearing has no partisan meaning here AFAIK.

vannevar|4 years ago

It should also be noted that the positive effect rose to 35% for people over the age of 60, who make up the overwhelming majority of serious Covid-19 cases. The omission of this important fact from the article leads me to question the motivation of the author.

unknown|4 years ago

[deleted]

georgewsinger|4 years ago

HN comments are usually the time for spicy contrarian takes to OP, but this post is dead on.

Low effect sizes are often a code smell for scientific incrementalism/stagnation.

agnosticmantis|4 years ago

An investigator needs to rule out all conceivable ways their modeling can go wrong, among them the possibility of a statistical fluke, which statistical significance is supposed to take care of. So statistical significance may best be thought of as a necessary condition, but is typically is taken to be a sufficient condition for publication. If I see a strange result (p-value < 0.05), could it be because my functional form is incorrect? or because I added/removed some data? Or I failed to include an important variable? These are hard questions and not amenable to algorithmic application and mass production. Typically these questions are ignored, and only the possibility of a statistical fluke is ruled out (which itself depends on the other assumptions being valid).

Dave Freedman's Statistical Models and Shoe Leather is a good read on why such formulaic application of statistical modeling is bound to fail.[0]

[0:https://psychology.okstate.edu/faculty/jgrice/psyc5314/Freed...]

fmajid|4 years ago

The studies are in villages, but the real concern is dense urban environments like New York (or Dhaka) where people are tightly packed together and at risk of contagion. I'm pretty sure masks make little difference in Wyoming either, where the population is 5 people per square mile.

asdff|4 years ago

Whats more important than population density is activity. A New Yorker who is mostly keeping to themselves and wearing a mask is unlikely to get the virus. A Wyoming native attending church service maskless and singing indoors for an hour is more likely to get the virus.

mrtranscendence|4 years ago

> If most effect sizes are small or zero, then most interventions are useless.

But this doesn't necessarily follow, does it? If there really were a 1.1-fold reduction in risk due to mask-wearing it could still be beneficial to encourage it. The salient issue (taking up most of the piece) seems to be not the size of the effect but rather the statistical methodology the authors employed to measure that size. The p-value isn't meaningful in the face of an incorrect model -- why isn't the answer a better model rather than just giving up?

Small effects are everywhere. Sure, it's harder to disentangle them, but they're still often worth knowing.

ummonk|4 years ago

> If there really were a 1.1-fold reduction in risk due to mask-wearing it could still be beneficial to encourage it.

That's understating it. The study doesn't measure the reduction in risk due to mask-wearing, but rather the reduction simply from encouraging mask-wearing (which only increases actual mask wearing by a limited amount). If the study's results hold up statistically, then they're really impressive. With the caveat of course that they apply to older variants with less viral loads than Delta - it's likely Delta is more effective against masks simply due to its viral load.

> The salient issue (taking up most of the piece) seems to be not the size of the effect but rather the statistical methodology the authors employed to measure that size. The p-value isn't meaningful in the face of an incorrect model -- why isn't the answer a better model rather than just giving up?

Exactly. The irony of this article is that this is an example where effect size is actually not the issue - it's potential issues with statistical significance due to imperfect modeling, and an inability for other researchers to rerun an analysis on statistical significance, due to not publishing the raw data.

sanxiyn|4 years ago

I agree the problem here is an incorrect model. Mask does not act on seroprevalence. Measuring mask's effect on seroprevalence is just wrong study design, although it may be easier to do.

whatshisface|4 years ago

Who cares if each effect is a factor of 2^(1/100) improvement, just give me 100 interventions and I'll double the value being measured.

kbrtalan|4 years ago

There’s a whole book about this idea, Antifragile by Nassim Taleb, highly recommended

_Nat_|4 years ago

The title's misinformation: effect-size ISN'T more important than statistical significance.

The article itself makes some better points, e.g.

> I worry that because of statistical ambiguity, there’s not much that can be deduced at all.

, which would seem like a reasonable interpretation of the study that the article discusses.

However, the title alone seems to assert a general claim about statistical interpretation that'd seem potentially harmful to the community. Specifically, it'd seem pretty bad for someone to see the title and internalize a notion of effect-size being more important than statistical significance.

spywaregorilla|4 years ago

Not so fast. If you win your first jackpot on the first ticket. You'll require 500,000 failures (at $1 per ticket) in order to fail to reject the null hypothesis at p < 0.05. Assuming you're just doing a t test (which isn't really appropriate tbh).

If you bought just ten tickets you would have a p value below 0.0000001

And that makes sense, because a p value of 0.01 says the probability of getting a sample this far from the null hypothesis is less than 1 in a million by random chance... which is what happened when you got the extremely unlikely but highly profitable answer.

edit: post was edited making this seem out of context...

sanxiyn|4 years ago

Mask's effect size on seroprevalence is probably zero. So no effect is expected result.

That's because mask acts on R0, not seroprevalence. After acting on R0, if R0 is >1, exponential growth, if <1, exponential decay. So no effect, unless it is the thing that pushes one from >1 to <1.

unknown|4 years ago

[deleted]

lotu|4 years ago

Also, they aren't testing masking effect on seroprevalence (or R0), they are testing the effect of sending out free masks and encouraging masking. That is only going to move the percent of people masking up or down a few percent at best.

Ice_cream_suit|4 years ago

The better medical journals do stress the hazard ration efficacy and the confidence interval.

See the extract below from the NEJM: Seasonal Malaria Vaccination with or without Seasonal Malaria Chemoprevention

"The hazard ratio for the protective efficacy of RTS,S/AS01E as compared with chemoprevention was 0.92 (95% confidence interval [CI], 0.84 to 1.01), which excluded the prespecified noninferiority margin of 1.20.

The protective efficacy of the combination as compared with chemoprevention alone was 62.8% (95% CI, 58.4 to 66.8) against clinical malaria, 70.5% (95% CI, 41.9 to 85.0) against hospital admission with severe malaria according to the World Health Organization definition, and 72.9% (95% CI, 2.9 to 92.4) against death from malaria.

The protective efficacy of the combination as compared with the vaccine alone against these outcomes was 59.6% (95% CI, 54.7 to 64.0), 70.6% (95% CI, 42.3 to 85.0), and 75.3% (95% CI, 12.5 to 93.0), respectively."

https://www.nejm.org/doi/full/10.1056/NEJMoa2026330?query=fe...

ammon|4 years ago

But how much more important? :) Sorry, could not help myself.

nabla9|4 years ago

If you have one BALB/c lab mouse, you give it something, and it glows in the dark few months after, effect size alone makes it significant.

jiggunjer|4 years ago

there is no effect size if you have no control group. so you'd need 2 mice at the least.

yesOfCourse9|4 years ago

[deleted]

unknown|4 years ago

[deleted]

159 comments