I'm very skeptical of this; the paper they linked is not convincing. It says GPT-4 correctly predicts the direction of experimental outcomes 69% of the time, versus 66% for human forecasters. But this is a silly benchmark, because people don't trust human forecasters in the first place: that's the whole reason the experiment gets run. Knowing that GPT-4 is slightly better at predicting experiments than a human guessing doesn't make it a useful substitute for the actual experiment.
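A back-of-the-envelope check of how thin that 69% vs 66% gap is (the forecast count n below is my assumption, not a figure from the paper):

    # Is 69% vs 66% even a statistically meaningful gap?
    # n_forecasts is a made-up assumption, not a number from the paper.
    from scipy.stats import binomtest

    n_forecasts = 200                         # hypothetical forecast count
    gpt4_correct = round(0.69 * n_forecasts)

    # Test GPT-4's hit rate against the 66% human-forecaster baseline.
    result = binomtest(gpt4_correct, n_forecasts, p=0.66, alternative="greater")
    print(f"{gpt4_correct}/{n_forecasts} correct; p = {result.pvalue:.2f} vs 66% baseline")
    # At this n the gap is statistically indistinguishable from noise --
    # and either way, "slightly better than a guess" is not a substitute
    # for running the experiment.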
addcn|1 year ago
+ the experiments may already be in the training data, so it's really testing whether it remembers pop psychology
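One crude way to probe that contamination worry (my suggestion, not something from the paper; the data schema below is hypothetical): compare prediction accuracy on studies run before versus after the model's training cutoff.

    from datetime import date

    TRAINING_CUTOFF = date(2023, 4, 30)  # assumed cutoff; varies by model

    def split_by_cutoff(studies):
        """studies: iterable of (run_date, gpt_correct) pairs -- hypothetical schema."""
        seen, unseen = [], []
        for run_date, correct in studies:
            (seen if run_date <= TRAINING_CUTOFF else unseen).append(correct)
        return seen, unseen

    # Toy records: one pre-cutoff study, one post-cutoff study.
    seen, unseen = split_by_cutoff([(date(2021, 6, 1), True),
                                    (date(2024, 2, 1), False)])
    for label, hits in (("possibly memorized", seen), ("post-cutoff", unseen)):
        if hits:
            print(f"{label}: {sum(hits) / len(hits):.0%} accuracy")
    # A large accuracy drop on post-cutoff studies would suggest the
    # benchmark is partly testing recall, not prediction.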
yas_hmaheshwari|1 year ago
I was not able to put my finger on what felt wrong about the article -- till I read this
authorfly|1 year ago
Also important is that in Psychology/Sociology, it's the counter-intuitive results that get published. But these results disproportionately fail to replicate!
Nobody cares if you confirm something obvious, unless it's about something divisive (e.g. sexual behavior, politics) or there's an agenda (dieting, etc.). So people can predict those results more easily than they could predict a randomly generated premise. The studies that made their way into the prediction set were the ones researchers expected to be counter-intuitive (and a significant proportion were likely P-hacked to find that result). People know this (there are more positive, confirming papers than negative/fail-to-replicate ones).
This means the counter-intuitive, negatively forecast results are the ones that get published. In other words, the dataset behind the 66% human-forecaster figure is disproportionately built from studies that found counter-intuitive results, compared with the neutral pre-publication pool of studies, because scientists and grant winners are incentivised to publish counter-intuitive work. I would even suggest the selected studies are more tantalizing than average: they are key findings, rather than the minutiae of comments on methods or re-analyses.
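A toy simulation of that selection effect (every parameter below is invented for illustration): if journals preferentially publish counter-intuitive results, forecaster accuracy measured on the published set reflects the publication filter, not forecasting skill.

    # Forecasters always bet on the "intuitive" direction; journals
    # preferentially publish counter-intuitive outcomes.
    import random

    random.seed(0)
    N = 10_000
    p_intuitive_true = 0.80      # assumed: most true effects match intuition
    p_publish_surprising = 0.90  # assumed: counter-intuitive results get published
    p_publish_boring = 0.20      # assumed: confirmations get shelved

    all_hits, pub_hits, pub_total = 0, 0, 0
    for _ in range(N):
        result_is_intuitive = random.random() < p_intuitive_true
        forecaster_correct = result_is_intuitive
        all_hits += forecaster_correct
        p_pub = p_publish_boring if result_is_intuitive else p_publish_surprising
        if random.random() < p_pub:
            pub_total += 1
            pub_hits += forecaster_correct

    print(f"accuracy on all studies:       {all_hits / N:.2f}")          # ~0.80
    print(f"accuracy on published studies: {pub_hits / pub_total:.2f}")  # ~0.47
    # Same forecasters, much worse measured accuracy -- purely an
    # artifact of which results got published.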
By the way, the 66% result has not held up especially well in other research; in one study, for example, only 58% of people could predict whether papers would later replicate: https://www.bps.org.uk/research-digest/want-know-whether-psy... - Results with random participants show they are better than chance for psychology, but on average below 66% and with massive variance. This figure doesn't differ for psychology professors, which should tell you the statistic reflects the context of the field and its research apparatus more than any capability to predict research. What if we revisit this GPT-4 paper in 20 years, see which studies have replicated, and ask people to predict that? Will GPT-4 still score higher if its data is frozen today? If it is kept up to date? Will people hit 66%, 58%, or 50%?
My point is, predicting the results now is not that useful, because historically up to "most" of the results have turned out wrong anyhow. Predicting which results will be true and remain true would be more useful. The article tries to dismiss the replication crisis by avoiding it and by using pre-registered studies, but such tools are only bandages. Studies still get cancelled, or are never proposed after internal experimentation; we don't have a "replication reputation meter" to measure those (which inflate false-positive rates), and we likely never will under this model of science for psychology/sociology statistics. If the authors read my comment and disagree, they should collect predictions from GPT-4 and humans for replications currently underway, wait a few years for the results, and then conduct the analysis.
Also, more to the point, as a Psychology grant-funded researcher once told me, the way to get a grant in Psychology is to:

1) Acquire a counter-intuitive result first. Quick'n'dirty research method like students filling in forms, small sample size, not even published, whatever. Just make the story good for this one and get some preliminary numbers on some topic by casting a big web of many questions; a few will hit P < 0.05 by chance in most topics at this sample size anyway (see the sketch below).

2) Find an angle whereby said result says something about culture or development (e.g. "The Marshmallow experiment shows that poverty is already determined by your response to tradeoffs at a young age", or better still, "The Marshmallow experiment is rubbish because it's actually entirely explained by SES as a third factor, so wealth disparity is ergo the real cause"). Importantly, switch to a more "proper" research method, and instead apply P-hacking where possible when you actually carry out the research. The biggest P-hack is so simple and obvious nobody cares: you drop results that contradict or are insignificant and just don't report them -- running alternate analyses, collecting slightly different data, switching from online to in-person experiments, whatever you can to get a result.

3) On the premise of these further tantalizing results, propose several studies that can fund you over 5 years, sprinkled with the buzzwords of the day. Instead of "Thematic Analysis", it's "AI Summative Assessment" of the word-frequency counts, etc. If you know the grant judges, avoid contradicting whatever they say, but stay just outside the dogma (usually culturally) to represent movement/progress of "science".
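Here's the step-1 effect in miniature (all numbers are illustrative): run enough null comparisons and some clear P < 0.05 by luck alone.

    # 30 survey questions, no real effects anywhere: both groups are
    # drawn from the same distribution. Illustrative numbers only.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(42)
    n_questions, n_per_group = 30, 25

    false_positives = sum(
        ttest_ind(rng.normal(0, 1, n_per_group),
                  rng.normal(0, 1, n_per_group)).pvalue < 0.05
        for _ in range(n_questions)
    )
    print(f"{false_positives}/{n_questions} 'significant' findings, zero real effects")
    # Expect ~1.5 hits on average (30 * 0.05) -- enough to seed a grant story.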
This is how 99% of this research works. The grant holder directs the other researchers. When directing them to carry out an alternate version of the experiment, or to change what is being analyzed, you motivate them that it's for the good of the future, of society, of being at the cutting edge, and of supporting the overarching theory (which, of course, already has "hundreds" of pieces of supporting evidence from other studies constructed in the same fashion).
As to sociology/psychology experiments: do social experiments reflect language and culture more than they reflect people and groups? It varies, essentially at random.
Do they reflect whatever is counter-intuitive, or support developing and entrenching models and agendas? Yes.
90% of social science studies have insufficient data to say anything at the P < 0.01 level, which should realistically be our goal if we even want to do statistics under the current dogma of this field (said kindly, because some large datasets are genuine and get reused across several studies to make up the 10%). I fully expect a revolution in psychology/sociology within the next 50 years to establish a new foundation.
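For a sense of scale (the effect size below is my assumption): the sample needed per group for a two-sample test at alpha = 0.01 with 80% power, via the usual normal approximation.

    # Required n per group at alpha = 0.01, 80% power, for a "small"
    # standardized effect d = 0.2 (the effect size is my assumption).
    from scipy.stats import norm

    alpha, power, d = 0.01, 0.80, 0.2
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2
    print(f"~{n_per_group:.0f} participants per group")  # ~584
    # Typical convenience samples of 30-100 per cell don't come close.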
equinox12|1 year ago
Even granting a historic bias toward counter-intuitive results in social science, this has no bearing on the results of the paper being discussed. Most of the survey experiments the researchers used came from TESS, an NSF-funded program that collects well-powered, nationally representative samples for researchers. A key point: not every TESS study gets published. Some do, of course, but the researchers find that GPT-4 predicts the results of published and unpublished studies at a similar rate of accuracy (r = 0.85 for published studies and r = 0.90 for unpublished ones). Also, given that the majority of these studies 1) were pre-registered (down to the sample size), 2) had their data collected through TESS (an independent survey vendor), and 3) were well-powered and nationally representative, it is extremely unlikely that they were p-hacked. So regardless of what the researchers hypothesized, TESS still collected the data, and that data is of the highest quality within social science.
Moreover, the researchers don't just look at psychology and sociology studies; there are studies from other fields, such as political science and social policy, so your critiques of psychology don't apply to all the survey experiments.
Lastly, the study also includes a number of large-scale behavioral field experiments and finds that GPT-4 can accurately predict their results even when the dependent variable is a behavioral metric rather than a text-based response (e.g., figuring out which text messages encourage greater gym attendance). It's hard for me to see how your critique survives this fact, either.