Using generative AI as part of historical research: three case studies

[+] eviks|1 year ago|reply

For a case study would be nice if the case were actually studied…

> had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.

Why would you need weeks of training to use some OCR tool? No comparison to any used alternatives in the article. And only using "unusually legible" isn't that relevant for the… usual cases

> This is basically perfect,

I’ve counted at least 5 errors on the first line, how is this anywhere close to perfection???

Same with translation: first, is this an obscure text that has no existing translation to compare the accuracy to instead of relying on your own poor knowledge? Second, what about existing tools?

> which I hadn’t considered as being relevant to understanding a specific early modern map, but which, on reflection, actually are (the Peter Burke book on the Renaissance sense of the past).

How?

> Does this replace the actual reading required? Not at all.

With seemingly irrelevant books like the previous one, yes, it does, the poor student has a rather limited time budget

[+] benbreen|1 year ago|reply

I agree, I probably should've gone into more detail on the actual case studies and implications. I may write this up as a more academic article at some point so I have space to do that.

To your point about OCR: I think you'll find that the existing OCR tools will not know where to begin with the 18th century Mexican medical text in the second case study. If you can find one that is able to transcribe that lettering, please do let me know because it would be incredibly useful.

Speaking entirely for myself here, a pretty significant part of what professional historians do is to take a ton of photos of hard-to-read archival documents, then slowly puzzle them out after the fact - not by using any OCR tool (because none of them that I'm aware of are good enough to deal with difficult paleography) but the old fashioned way, by printing them out, finding individual letters or words that are readable, and then going from there. It's tedious work and it requires at least a few days of training to get the hang of.

If anyone wants to get a sense of what this paleography actually looks like, this is something I wrote about back in 2013 when I was in grad school - https://resobscura.blogspot.com/2013/07/why-does-s-look-like...

For those looking for a specific example of an intermediate-difficulty level manuscript in English, that post shows a manuscript of the John Donne poem "A Triple Fool" which gives a sense of a typical 17th century paleography challenge that GPT-4o is able to transcribe (and which, as far as I know, OCR tools can't handle - though please correct me if I'm wrong). The "Sea surgeon" manuscript below it is what I would consider advanced-intermediate and is around the point where GPT-4o, and probably most PhD students in history, gets completely lost.

re: basically perfect, the errors I see are entirely typos which don't change the meaning (descritto instead of descritta, and the like). So yes, not perfect, but not anything which would impact a historical researcher. In terms of existing tools for translation, the state of the art that I was aware of before LLMs is Google Translate, and I think anyone who tries both on the same text can see which works better there.

re: "irrelevant books," there's really no way to make an objective statement about what's relevant and what's not until you actually read something rather than an AI summary. For that reason, in my own work, this is very much about augmenting rather than replacing human labor. The main work begins after this sort of LLM-augmented research. It isn't replaced by it in any way.

[+] carschno|1 year ago|reply

I wanted to say this, but could not express it as well. I think what your points also reveal is the biggest success factor of ChatGPT: it can do many things that specialised tools have been doing (better), but many ChatGPT users had not known about those tools.

I do understand that a mere user of e.g. OCR tooling does not perform a systematic evaluation with the available tools, although it would be the scientific way to decide for one. For a researcher, however, the lack of knowledge about the tooling ecosystem seems concerning.

[+] simonw|1 year ago|reply

Full quote:

> Granted, Monte had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.

He isn't talking about weeks of training to learn to use OCR software, he means weeks of training to learn to read that handwriting without any assistance from software at all.

[+] pjc50|1 year ago|reply

Do you know any OCR tools that work on early modern English handwriting?

[+] simonw|1 year ago|reply

I'd love to read way more stuff like this. There are plenty of people writing about LLMs from a computer science point of view, but I'm much more interested in hearing from people in fields like this one (academic history) who are vigorously exploring applications of these tools.

[+] dr_dshiv|1 year ago|reply

I’m working with Neo-Latin texts at the Ritman Library of Hermetic Philosophy in Amsterdam (aka Embassy of the Free Mind).

Most of the library is untranslated Latin. I have a book that was recently professionally translated but it has not yet been published. I’d like to benchmark LLMs against this work by having experts rate preference for human translation vs LLM, at a paragraph level.

I’m also interested in a workflow that can enable much more rapid LLM transcriptions and translations — whereby experts might only need to evaluate randomized pages to create a known error rate that can be improved over time. This can be contrasted to a perfect critical edition.

And, on this topic, just yesterday I tried and failed to find English translations of key works by Gustav Fechner, an early German psychologist. This isn’t obscure—he invented the median and created the field of “empirical aesthetics.” A quick translation of some of his work with Claude immediately revealed concept I was looking for. Luckily, I had a German around to validate the translation…

LLMs will have a huge impact on humanities scholarship; we need methods and evals.

[+] benbreen|1 year ago|reply

Thank you! Have been a big fan of your writing on LLMs over the past couple years. One thing I have been encouraged by over this period is that there are some interesting interdisciplinary conversations starting to happen. Ethan Mollick has been doing a good job as a bridge between people working in different academic fields, IMO.

[+] grobbyy|1 year ago|reply

A basic problem is they're trained on the Internet, and take on all the biases. Ask any of them so purposed edX to MIT or wrote the platform. You'll get back official PR. Look at a primary source (e.g. public git history or private email records) and you'll get a factual story.

The tendency to reaffirm popular beliefs would make current LLMs almost useless for actual historical work, which often involves sifting fact from fiction.

[+] dmix|1 year ago|reply

Couldn’t LLMs cite primary sources much the same way as a textbook or Wikipedia? Which is how you circumvent the biases in textbooks and wikipedia summaries?

[+] dartos|1 year ago|reply

This is a showcase of exactly what LLMs are good at.

Handwriting recognition, a classic neural network application, and surfacing information and ideas, however flawed, that one may not have had themselves.

This is really cool. This is AI augmenting human capabilities.

[+] BeefWellington|1 year ago|reply

Good read on what someone in a specific field considers to have been achieved (rightly or wrongly). It does lead me to wonder how many of these old manuscripts and their translations are in the training set. That may limit its abilities against any random sample that isn't included.

Then again, maybe not; OCR is one of the most worked on problems, so the quality of parsing characters into text maybe shouldn't be as surprising.

Off topic: it's wild to me that in 2025 sites like substack don't apply `prefers-color-scheme` logic to all their blogs.

[+] satisfice|1 year ago|reply

The intractable problem, here, is that “LLMs are good historians” is a nearly useless heuristic.

I’m not a historian. I don’t speak old spanish. I am not a domain expert at all. I can’t do what the author of this post can do: expertly review the work of an LLM in his field.

My expertise is in software testing, and I can report that LLMs sometimes have reasonable testing ideas— but that doesn’t mean they are safe and effective when used for that purpose by an amateur.

Despite what the author writes, I cannot use an LLM to get good information about history.

[+] simonw|1 year ago|reply

This is similar to the problem with some of the things people have been doing with o1 and o3. I've seen people share "PhD level" results from them... but if I don't have a PhD myself in that subject it's almost impossible for me to evaluate their output and spot if it makes sense or not.

I get a ton of value out of LLMs as a programmer partly because I have 20+ years of programming experience, so it's trivial for me to spot when they are doing "good" work as opposed to making dumb mistakes.

I can't credibly evaluate their higher level output in other disciplines at all.

[+] amelius|1 year ago|reply

You __can__ get good information from an LLM, however you just have to backtrack every once in a while because the information turned out to be false.

[+] jolmg|1 year ago|reply

> explicación poética

> There are, again, a couple errors here: it should be “explicación phisica” [physical explanation] not “poetic explanation” in the first line, for instance.

The image seems to say "phicica" (with a "c"), but that's not Spanish. "ph" is not even a thing in Spanish. "Physical" is "física", at least today, IDK about the 1700's. So, if you try to make sense of it in such a way that you assume a nonsense word is you misreading rather than the writer "miswriting", I can see why it assumes it might say "poética", even though that makes less sense semantically.

[+] benbreen|1 year ago|reply

Author here, I agree that my read may not be correct either. It’s tough to make out. Although keep in mind that “ph” is used in Latin and Greek (or at least transliterations of Greek into the Roman alphabet) so in an early modern medical context (I.e. one in which it is assumed the reader knows Latin, regardless of the language being used) “ph” is still a plausible start to a word. Early modern spelling in general is famously variable - common to see an author spell the same word two different ways in the same text.

[+] throwup238|1 year ago|reply

> After all (he said, pleadingly) consciousness really is an irreducible interior fortress that refuses to be pinned down by the numeric lens (really, it is!)

I love this line and the “flattening of human complexity into numbers” quote above it. It sums up perfectly how I feel about the whole LLM to AGI hype/debate (even though he’s talking about consciousness).

Everyone who develops a model has to jump through the benchmark hoop which we all use to measure progress but we don’t even have anything approaching a rigorous definition of intelligence. Researchers are chasing benchmarks but it doesn’t feel like we’re getting any closer to true intelligence, just flattening its expression into next token prediction (aka everything is a vector).

[+] voidhorse|1 year ago|reply

Yeah precisely. Ever since the "brain as computer" metaphor was birthed in the 50s-60s the chief line of attack in the effort to make "intelligent" machines has been to continually narrow what we mean by intelligence further and further until we can divest it of any dependence on humanist notions. We have "intelligent" machines today more as a byproduct of our lowering the bar for what constitutes intelligence than by actually producing anything we'd consider remotely capable of the same ingenuity as the average human being.

[+] zwischenzug|1 year ago|reply

I wrote this piece in 2023, which argues similarly that LLMs are a boon, not a threat to historians

https://zwischenzugs.com/2023/12/27/what-i-learned-using-pri...

[+] adamredwoods|1 year ago|reply

>> One of the well-known limitations with ChatGPT is that it doesn’t tell you what the relevant sources are that it looked at to generate the text it gives you.

This isn't a limitation, this is critically dangerous. Commercial AI is a centralized, controlled, biased LLM. At what point will someone train it to say something they want people to believe? How can it be trusted?

Consensus based information is still best, and I don't feel LLMs will give us that.

[+] dang|1 year ago|reply

Discussed here!

What I learned using private LLMs to write an undergraduate history essay - https://news.ycombinator.com/item?id=38813297 - Dec 2023 (81 comments)

[+] Animats|1 year ago|reply

"LLMs, which are exquisitely well-tuned machines for finding the median viewpoint on a given issue..."

That's an excellent way to put it. It's the default mode of an LLM. You can ask an LLM for biases, and get them, of course.

[+] gcanyon|1 year ago|reply

I wonder (hope) that for any given issue, the majority of the internet/the training data, and therefore the model's output, will be fairly near to the truth. Maybe not for every topic, but most.

E.g., the models won't report that unicorns are real because the majority of the internet doesn't report that unicorns are real. Of course, there may be issues (like ghosts?) where the majority of the internet isn't accurate?

[+] DennisP|1 year ago|reply

It was pretty neat seeing this because a recent paper found that AI models are bad historians: https://techcrunch.com/2025/01/19/ai-isnt-very-good-at-histo...

But the gist of its argument just seems to be that they don't know fine details of history, and make the same generalized assumptions that humans would make with only a cursory knowledge of a particular topic. This seems unavoidable for a model that compresses a broad swath of human knowledge down to a couple hundred gigabytes.

Using AI as a research tool instead of a fact database is of course a whole different thing.

[+] trgn|1 year ago|reply

One thing I'd love if models would get to help me confirm a thing or find the source od soemthing I have a vague memory of and which may be right or wrong, I just don't know.

E.g. I have this recollection of a quote, slightly pithy, from around the 19 hundreds about hobby clubs controlling social life, maybe from Mark twain, maybe not.

I just cannot come up with the prompt that gets me the answer, instead I just get hallucination after hallucination, just confirming whatever I put in, like a student who didn't study for the test and is just going along with what the professor is asking at the oral exam.

[+] sloproth|1 year ago|reply

In my experience, these AI models haven't been great with knowledge about one specific figure (like a President). I wonder if there's a movement to start introducing these AI models to books or e-books that aren't accessible online? I wish I could be able to discuss the less publicly known details of historical figures' lives or upbringings with AI, but it's clear that more niche information that you can only read about isn't available to it.

[+] ris|1 year ago|reply

Still waiting for someone to train an LLM entirely from sources written before a chosen date and be able to discuss concepts with someone apparently lacking any knowledge of the world after that date.

[+] lionkor|1 year ago|reply

Try to get an LLM to admit it doesn't know something, first

[+] EcommerceFlow|1 year ago|reply

Would be fascinating trying to get an LLM trained with 1900 data to discover Einstein physics

[+] waveBidder|1 year ago|reply

might work for say post the 1800's in literate countries, but for e.g. Rome our sources are so sparse and so far removed from the time they're writing about that it would be worse than nothing.

[+] Jordan-117|1 year ago|reply

"What would have happened if ChatGPT was invented in the 17th century? MonadGPT is a possible answer. MonadGPT is a finetune of Mistral-Hermes 2 on 11,000 early modern texts in English, French and Latin, mostly coming from EEBO and Gallica. Like the original Mistral-Hermes, MonadGPT can be used in conversation mode. It will not only answer in an historical language and style but will use historical and dated references. This is especially visible for science questions (astronomy, medecine). Obviously, it's not recommended to follow any advice from Monad-GPT." Available to install and run locally -- or you can try it out for free online."

https://www.metafilter.com/201537/O-brave-new-world-that-has...

[+] dataviz1000|1 year ago|reply

In the 1950s, most people believed that the Soviets made the biggest contribution to stopping the Nazis. However, today, most people think it was actually the Americans who played the biggest role in defeating the Nazis.

> "In 1945, the French public said the Soviets did the most to defeat Nazi Germany - but in 2024 they're most likely to say it was the Americans"[0]

[0] https://yougov.co.uk/politics/articles/49613-d-day-anniversa...

[+] aero142|1 year ago|reply

Are there any successful models that weren't trained with RLHF, or using a system with RLHF. I'm curious if this could be done without a fine tune step that would't meaningfully bias this.

[+] Uehreka|1 year ago|reply

Normally I balk when commenters go “well they you’re the perfect person to go do it!”, but actually… this is the kind of thing that sounds like it could be a fun project if you’re legit interested. The necessary datasets are likely not hard to gather and collate, a lot of it is probably on places like Project Gutenberg or can be gleaned through OCR of images downloaded from various publicly available archives.

Granted, you’d need to spend about a year on this and for a lot of that time your graphics card (and possibly whole computer) would be unusable, but then if the results were compelling you’d get a cool 15 minutes of internet fame when you posted your results.

[+] sloproth|1 year ago|reply

yes! There's this measure of historical expertise that involves "eating the brains", so to speak, of the people living back then such that if you time traveled back to a bar or street in [insert period], you could carry on a conversation about events going on in that time :) I would love something that uses newspaper fragments, books, etc. to simulate this experience!

[+] csmpltn|1 year ago|reply

The only reason LLMs “work” is because they are trained on a vast corpus of (text-based) human interactions online. The main reason LLMs weren’t a thing 25 years ago, was because there just wasn’t enough scrapeable and useful data available online…

Reduce the dataset to “knowledge as of year 1880” - and it’s not certain you’d even be able to “interact” with the LLM in any meaningful way…

[+] unknown|1 year ago|reply

[deleted]

[+] monktastic1|1 year ago|reply

I think I'm slow. Can you explain this again, maybe with more words?

[+] tptacek|1 year ago|reply

This was so good. I'm super curious to learn more about the strategies used to set up system prompts for the custom GPT that was set up here.

[+] urbandw311er|1 year ago|reply

How would one know that the translation of the Italian text (that he gives as an example) was not just already baked into the model’s training data?

[+] cyrillite|1 year ago|reply

Now the question is how can I, someone without a PhD in history but currently a PhD candidate in another discipline, use these tools to reliably interrogate topics of interest and produce at least a graduate level understanding of them?

I know this is possible, but the further away I get from my core domains, the harder it is for me to use these tools in a way that doesn’t feel like too much blind faith (even if it works!)

[+] simonw|1 year ago|reply

I think the trick here is to treat everything these models tell you as part of a larger information diet.

Like if you have a friend who's very well-read and talkative but is also extremely confident and loves the sound of their own voice. You quickly learn to treat them as a source of probably-correct information, but only part of they way you learn any given topic.

I do this with LLMs all the time: I'm constantly asking them clarifying questions about things, but I always assume that they might be making mistakes or feeding me convincing sounding half-truths or even full hallucinations.

Being good at mixing together information from a variety of sources - of different levels of accuracy - is key to learning anything well.

[+] aquafox|1 year ago|reply

You ask them for references and check yourself. They are good exploratory and hypothesis generating tools, but not more. Getting a sensible sounding answer should not be an excuse for you to confirm. Often, the devil is in the details.

[+] kozikow|1 year ago|reply

> the harder it is for me to use these tools in a way that doesn’t feel like too much blind faith (even if it works!)

I tend to ask multiple models and if they all give me roughly the same answer, then it's probably right.

[+] sdesol|1 year ago|reply

I wrote a chat app built around mistrust for LLM responses. You can see an example here:

https://beta.gitsense.com/?chat=ed907b02-4f03-477f-a5e4-ce9a...

If you click on the Evaluation links, you can see how you can use multiple LLMs to validate LLM response. The evaluation of the accurate response is interesting since Llama 3.3 was the most critical.

https://beta.gitsense.com/?chat=fdfb053d-f0e2-4346-bdfc-7305...

At this point, you would ask Llama to explain why the response was not 100% which you can use to cross reference other LLMs or to do your own research.

[+] AdieuToLogic|1 year ago|reply

> Now the question is how can I, someone without a PhD in history but currently a PhD candidate in another discipline, use these tools to reliably interrogate topics of interest and produce at least a graduate level understanding of them?

You can't. Because LLM's are statistical generative text algorithms, dependent upon their training data set and subsequent reinforcement. Think Bayesian statistics.

What you are asking for is "to reliably interrogate topics of interest", which is not what LLM's do. Concepts such as reliability are orthogonal to their purpose.

[+] yannis|1 year ago|reply

I find them useful in summarizing State of the Art to get me going in a new topic, but then again so is Wikipedia. A useful side angle, if you using LaTeX, you can cut-and-paste references into ChatGPT and can turn them into Bibtex format with >80% success. For a PHD study though starting from textbooks, papers etc. it will be better, but can augment successfully, like any tool use it for what is best.

[+] pelagicAustral|1 year ago|reply

I'm not sure what good will a system that only focuses on targeted truths will ever do to humanity, we already live in a world were stats are only valid if they do not offend a single person. The reason AI's are so doctored are that sometimes we just do not want to hear the truth, and we dont.

[+] britch|1 year ago|reply

Interesting perspective. I appreciate that it tests the models at different "layers" of understanding.

I have always felt that LLMs would fall apart beyond the summarization. Maybe they would be able to regurgitate someone else's analysis. The author seems to think there's some level of intelligent creativity at play

I'm hopeful that the author is right. That truly creative thinking may be beyond the abilities of LLMs and be decades away.

I think the author doesn't consider the implications of broad use of LLM societally. Will people be willing to fund human historian grad students when they can get a LLM for a fraction of the price? Will prospective historians have gained the training necessary if they've used an LLM through all of school?

I believe the education system could figure it out over time. I'm more worried that LLMs like this will be used as further justification to defund or halt humanities research. Who needs a history department when I can get 80% for the cost of a few chatGPT queries?

203 comments