Very cool appendix describing how they collected the data. I was kind of surprised to learn that they collected arXiv abstracts + metadata from Kaggle, but it definitely makes sense. I was also surprised that 6 years of SSRN papers was only ~1.3M documents. If you assume 20 pages/document, 400 words/page, and 1.3 tokens/word, that's about 13.5B input tokens, so it would only cost (ballpark) $1000 to pass the full corpus through the 4o-mini completions API. I think it would be really neat to build out a table of "Dataset Used", "Model Used", etc. for SSRN papers. I imagine more complicated questions would be harder to answer (because you might have to analyze non-text parts of the documents).
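For anyone who wants to check the back-of-envelope math above, here is a rough sketch. The document count and per-page figures come from the comment. The per-million-token prices are my assumptions (roughly $0.15 standard and $0.075 batch for 4o-mini input), so check current pricing before relying on them. The function names are made up for illustration, and output tokens are ignored.

```python
def estimate_tokens(docs, pages_per_doc=20, words_per_page=400, tokens_per_word=1.3):
    """Rough input-token count for a corpus, using the comment's assumptions."""
    return docs * pages_per_doc * words_per_page * tokens_per_word


def estimate_cost_usd(tokens, usd_per_million_tokens):
    """Input-only cost; output tokens are not counted here."""
    return tokens / 1_000_000 * usd_per_million_tokens


# ~1.3M SSRN documents, per the comment.
tokens = estimate_tokens(1_300_000)

# ASSUMED prices in USD per 1M input tokens; not from the original comment.
ASSUMED_STANDARD_PRICE = 0.15
ASSUMED_BATCH_PRICE = 0.075

standard_cost = estimate_cost_usd(tokens, ASSUMED_STANDARD_PRICE)
batch_cost = estimate_cost_usd(tokens, ASSUMED_BATCH_PRICE)

print(f"tokens:        {tokens / 1e9:.2f}B")
print(f"standard cost: ${standard_cost:,.0f}")
print(f"batch cost:    ${batch_cost:,.0f}")
```

Under those assumed prices the ~$1000 figure lines up with batch pricing, and standard pricing lands at about twice that. Either way it is the same order of magnitude.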
barishnamazov|1 month ago
[0] https://en.wikipedia.org/wiki/Paul_Ginsparg