Eiim | 1 year ago
The whole paper is "we took an existing dataset and ran the simplest reasonable model (a logistic regression) on it". That's about 5-10 minutes in R (or Python, or SAS, or whatever else). It's a very well-understood process, and it's a good starting point for understanding the data, but it can't be the only thing in your paper; this isn't the 80's anymore.
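For scale, that "5-10 minute" baseline looks roughly like this. A minimal sketch in Python rather than R; the column names, effect sizes, and data below are simulated stand-ins, not the actual diabetes dataset:

```python
# A minimal sketch of the paper's entire analysis: one logistic regression.
# The predictors and their "true" effects are simulated, hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
bmi = rng.normal(28, 6, n)                   # hypothetical BMI column
age = rng.integers(1, 14, n)                 # hypothetical age-bucket column
logit = -6 + 0.12 * bmi + 0.15 * age         # assumed true relationship
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([bmi, age])
model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)         # both fitted effects come back positive
```

With a real CSV, the only extra step is a `read_csv` and a column selection, which is the commenter's point about how little analysis is actually here.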
The overall style is verbose and flowery, typical of LLMs. Good research papers should be straightforward and to the point. There's also strange mixing of "we" and "I" throughout.
We learn in the introduction that interaction effects were tested. That's fine, but I'd want it set up earlier why these interaction effects are posited to be interesting. The paper says earlier that "a comprehensive investigation considering a multitude of diabetes-influencing lifestyle factors concurrently in relation to obesity remains to be fully considered", but quite frankly, I don't believe that. Diabetes is remarkably well-studied, especially in observational studies like this one, due to its prevalence. I haven't searched the literature, but I really doubt that no similar analysis has been done. This is one of the hardest parts of a research paper, finding existing research and where its gaps are, and I don't think an LLM will be sufficiently capable of that any time soon.
There's a complete lack of EDA in the paper. I don't need much (the whole analysis of this paper could be part of the EDA for a proper paper), but some basic distributional statistics of the variables: How many respondents in the dataset were diabetic? Is there a sex bias? What about the age distribution? Are any values missing? These are really important for observational studies, because if there are any issues they should be addressed in some way. As it is, the paper is basically saying "trust us, our data is perfect", which is a huge ask. It's really weird that a bunch of this is in the appendix but not mentioned anywhere in the paper itself (the appendix is also way too long to be included in the paper and would need to be supplementary materials, but that's fine; it's poorly formatted, too). Looking at the appendix, my main concern is that only 14% of the dataset is diabetic. This means that models will be biased towards predicting non-diabetic (if you just predict non-diabetic all of the time, you're already 86% accurate!). It's not as big of an issue for logistic regression, or for observational modeling like this, but I would have preferred an adjustment for it.
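The 14% point is easy to make concrete. A small sketch on synthetic data (the 14% mirrors the appendix figure; everything else is made up), showing the accuracy of the constant "non-diabetic" guess and one standard adjustment, class reweighting:

```python
# With ~14% positives, the all-negative classifier is already ~86% accurate,
# so raw accuracy flatters any model. Data below is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 10_000
y = (rng.random(n) < 0.14).astype(int)       # ~14% diabetic, as in the appendix
baseline_acc = np.mean(y == 0)               # accuracy of always guessing "non-diabetic"
print(f"always-negative accuracy: {baseline_acc:.2f}")

# One common adjustment: weight the minority class more heavily in the fit.
X = rng.normal(size=(n, 3))                  # uninformative stand-in features
model = LogisticRegression(class_weight="balanced").fit(X, y)
```

Reweighting (or resampling) matters more for prediction than for the kind of coefficient-reading done here, which is why the commenter calls it a preference rather than a fatal flaw.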
In the results, I'm disappointed by the over-reliance on p-values. This is something the statistics field is trying to move away from, for a multitude of reasons, one of which is demonstrated quite nicely here: p-values are (almost) always minuscule with large n, and in this case n=253680 is very large. Standard errors and CIs have the same issue. The Z-value is the most useful measure of confidence here in my eyes; effect sizes are typically the more interesting metric for such studies. On that note, I would have liked to see predictors normalized so that coefficients can be directly compared. BMI, for example, has a small coefficient, but that's likely just because it has a large range and variance.
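The scale point is easy to demonstrate. A sketch on simulated data (all names and effect sizes are invented): a wide-range predictor like BMI gets a deceptively small raw coefficient, and standardizing puts everything on a comparable per-standard-deviation footing:

```python
# A raw logistic coefficient shrinks when its variable has a large scale;
# standardized coefficients are comparable per SD. Synthetic illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 20_000
bmi = rng.normal(28, 6, n)                   # wide-range predictor
smoker = rng.integers(0, 2, n)               # 0/1 predictor
logit = -4 + 0.1 * bmi + 0.6 * smoker        # assumed true relationship
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([bmi, smoker])

raw = LogisticRegression().fit(X, y).coef_[0]
std = LogisticRegression().fit(StandardScaler().fit_transform(X), y).coef_[0]
print("raw:", raw)   # the BMI coefficient looks much smaller than the binary one...
print("std:", std)   # ...but per SD the ordering flips
```

Note the ranking of the two predictors reverses between the raw and standardized fits, which is exactly why raw coefficients can't be compared directly.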
It's claimed that the AIC shows improved fit for the second model, but the change is only ~0.5%, which isn't especially convincing. In fact, it could be much less, because we don't have enough significant figures to see which way the rounding went. The p-value is basically meaningless here, as stated above.
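For intuition on why a small relative AIC change is weak evidence at large n, AIC can be computed by hand (AIC = 2k − 2·logL). A sketch on synthetic data, comparing a main-effects model to one with a single added interaction; C is set large so sklearn's fits are effectively unpenalized:

```python
# AIC = 2k - 2*logL, computed by hand for two nested logistic fits on
# synthetic data with a weak interaction term.
import numpy as np
from sklearn.linear_model import LogisticRegression

def aic(model, X, y):
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = model.coef_.size + 1                 # slopes + intercept
    return 2 * k - 2 * loglik

rng = np.random.default_rng(3)
n = 20_000
X = rng.normal(size=(n, 3))
true_logit = 0.5 * X[:, 0] + 0.05 * X[:, 1] * X[:, 2]    # weak interaction
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

m1 = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)  # main effects only
X2 = np.column_stack([X, X[:, 1] * X[:, 2]])             # add the interaction
m2 = LogisticRegression(C=1e6, max_iter=1000).fit(X2, y)
a1, a2 = aic(m1, X, y), aic(m2, X2, y)
print(a1, a2)   # the relative change stays well under a percent
```

Because logL grows with n, even a term that passes a significance test typically moves AIC by a tiny fraction of its total, which is the commenter's objection to reading much into ~0.5%.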
The methods section says almost nothing that isn't already stated at least once. I'd like to know something about the tools that were used, which is completely lacking from this section. I do want to highlight this quote: "Both models employed a method to adjust for all possible confounders in the analysis." What??? All possible confounders? If you know what that means, you know that that's BS. "A method"? What is your magic tool to remove all variance not reflected in the dataset? I need to know! I certainly don't see it reflected in the code.
The code itself seems fine, maybe a little over-complicated, but that might be necessary for how it interfaces with the LLM. The actual analysis is equivalent to 3 basic lines of R (read CSV, basic log reg with default parameters 1, basic log reg with default parameters 2).
This paper would probably get about a B+ in 261, but shouldn't pass a 400-level class. The analysis is very simple and unimpressive for a few reasons. For one, the questions asked of the dataset are very light. More interesting, for example, might have been to do variable selection on all interaction terms and find which are important. More models should have been compared. The dataset is also extremely simple and doesn't demand complex analysis. An experimental design, or messy data with errors and missing values, or something requiring multiple datasets, would be a more serious challenge. It's quite possible that one of the other papers addresses this though.
roykishony | 1 year ago
You suggested some directions for more complex analysis that could be done on this data - I would be so curious to see what you get if you could take the time to try running data-to-paper as a co-pilot on your own. You can then give it directions and feedback on where to go - it will be fascinating to see where you take it!
We also must look ahead: complexity and novelty will rapidly increase as ChatGPT5, ChatGPT6, etc. are rolled out. The key with data-to-paper is to build a platform that harnesses these tools in a structured way that creates transparent and well-traceable papers. Your ability to read, understand, and follow all the analysis in these manuscripts so quickly speaks to your talent, of course, but also to the way these papers are structured. Speaking from experience, it is much harder to review human-created papers at such speed and accuracy...
As for your comments on “it's certainly not close to something I could submit to a journal” - please kindly look at the examples where we show reproducing peer reviewed publications (published in a completely reasonable Q1 journal, PLOS One). See this original paper by Saint-Fleur et al: https://journals.plos.org/plosone/article?id=10.1371/journal...
and here are 10 different independent data-to-paper runs in which we gave it the raw data and the research goal of the original publication and asked it to do the analysis, reach conclusions, and write the paper: https://github.com/rkishony/data-to-paper-supplementary/tree... (look up the 10 manuscripts designated “manuscriptC1.pdf” - “manuscriptC10.pdf”)
See our own analysis of these manuscripts and their reliability in our arXiv preprint: https://arxiv.org/abs/2404.17605
Note that the original paper was published after the training horizon of the LLM that we used, and also that we programmatically removed the original paper from the results of the literature search that data-to-paper does, so that it cannot see it in the search.
Thanks so much again and good luck for the exam tomorrow!