thor-rodrigues | 7 months ago
Patents are difficult because they can include anything from abstract diagrams and chemical formulas to mathematical equations, so it tends to be really tricky to prepare the data in a way that can later be used by an LLM.
The simplest approach I found was to “take a picture” of each page of the document and ask an LLM to generate a JSON explaining the content (plus some other metadata such as page number, number of visual elements, and so on).
If any complicated image is present, simply ask the model to describe it. Once that is done, you have a JSON file that can be embedded into your vector store of choice.
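The per-page flow described above can be sketched roughly like this. The `describe` callable stands in for a real vision-model call (the commenter doesn't name a specific API), and `fake_describe`, the record fields, and the prompt wording are all illustrative assumptions, not the commenter's actual code:

```python
import base64
import json

def page_to_record(page_number: int, image_bytes: bytes, describe) -> dict:
    """Build a JSON-serializable record for one patent page.

    `describe` is a placeholder for a vision-LLM call: send the
    base64-encoded page image with a prompt such as "Explain this
    page as JSON; describe any diagrams or formulas in plain text".
    """
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    llm_output = describe(image_b64)  # hypothetical model call
    return {
        "page": page_number,                                   # metadata
        "content": llm_output["content"],                      # text to embed
        "visual_elements": llm_output.get("visual_elements", 0),
    }

# Stub standing in for a real vision model, for illustration only.
def fake_describe(image_b64: str) -> dict:
    return {"content": "Claim 1: a widget comprising ...", "visual_elements": 2}

record = page_to_record(1, b"fake-page-image-bytes", fake_describe)
print(json.dumps(record))
```

The resulting records (one per page) are what would then be embedded and written to the vector store of choice.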
I can’t speak to the price-to-performance ratio, but this approach seems easier and more efficient than what the author is proposing.
Adityav369 | 7 months ago
monkeyelite | 7 months ago
But it also illustrates to me that the opportunities with LLMs right now are primarily about reclassifying or reprocessing existing sources of value like patent documents. In the ’90s and 2000s, many successful software businesses were building databases to replace traditional filing.
Creating fundamentally new collections of value which require upfront investment seems to still be challenging for our economy.
cheschire | 7 months ago