> Although the author OCR’ed the SAT questions and believes that they weren’t in the training data

I agree that the author of the tweet considerably underestimates how much OCR'ed content may be in OpenAI's training data. In late August, Meta released Nougat[1], an OCR model. Its performance is wild, and the model is open source.

I find it hard to believe that OpenAI is not putting effort into getting more training data from OCR'ed content. I also find it hard to believe that OpenAI would wait for a Meta paper before having a performant internal OCR model.

[1]: https://arxiv.org/abs/2308.13418
