As the title says, I have many PDFs - mostly scans via ScanSnap - but also non-scans. These are sensitive in nature, e.g. bills, documents, etc. I would like a local-first AI solution that allows me to say things like: "show me all tax documents for August 2023" or "show my home title". Ideally it is Mac software that can access iCloud too, since that's where I store it all. I would prefer to not do any tagging. I would like to optimize for recall over precision, so false positives in the search results are ok. What are modern approaches to do this, without hacking one up on my own?
bastien2|1 year ago
andai|1 year ago
Do keyword search systems have workarounds for this? My own idea was for each keyword to generate a list of neighbor keywords in semantic space. I figured with such a dataset, I'd get something approximating vector search for free.
I made some attempts at that (found neighbors by their proximity in text), but I ended up with a lot of noise (words that often go together without having the same meaning). So I'd probably have to use actual embeddings instead.
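For illustration, the neighbor-keyword idea with actual embeddings reduces to cosine similarity over vectors. A minimal sketch, with toy 2-d vectors standing in for real embedding vectors (which you'd get from e.g. a sentence-transformers model, at hundreds of dimensions); the function names are my own:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def neighbor_keywords(word, embeddings, k=3):
    """Return the k words whose vectors are closest to `word`'s vector."""
    target = embeddings[word]
    scored = [(cosine(target, v), w) for w, v in embeddings.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

# Toy 2-d vectors standing in for real embeddings.
toy = {
    "invoice": [0.9, 0.1],
    "bill":    [0.85, 0.15],
    "receipt": [0.8, 0.2],
    "beach":   [0.1, 0.9],
}
print(neighbor_keywords("invoice", toy, k=2))  # ['bill', 'receipt'] with these toy vectors
```

Precomputing such neighbor lists per keyword gives a cheap, index-time approximation of vector search, at the cost of missing matches whose phrasing is outside the neighbor set.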
More generally, any suggestions for full-text indexing? Elasticsearch seems like overkill. I built my own keyword search in Python (simple tf-idf) which was surprisingly easy. (Long-term project is to have an offline copy of a useful/interesting subset of the internet. Acquiring the datasets is also an open question. Common Crawl is mostly random blogs and forum arguments...)
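For anyone curious, a simple tf-idf ranker like the one described really is only a few lines. A minimal sketch (the names and the exact scoring formula are my own, not a specific library's):

```python
import math
from collections import Counter

def build_index(docs):
    """docs: {doc_id: text}. Returns per-doc term counts and document frequencies."""
    tf = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return tf, df

def search(query, tf, df, n_docs):
    """Score docs by sum of term-frequency * inverse-document-frequency."""
    scores = Counter()
    for term in query.lower().split():
        if term not in df:
            continue
        idf = math.log(n_docs / df[term])
        for doc, counts in tf.items():
            if counts[term]:
                scores[doc] += counts[term] * idf
    return [doc for doc, _ in scores.most_common()]

docs = {
    "a": "property tax bill august 2023",
    "b": "home title deed",
    "c": "grocery receipt august",
}
tf, df = build_index(docs)
print(search("tax august", tf, df, len(docs)))  # ['a', 'c']
```

A real index would add tokenization, stemming, and an inverted index for speed, but the ranking core is exactly this.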
yreg|1 year ago
What does that even mean? When you know the exact keywords, you use full-text search.
When you don't know them then other tools can be helpful.
vikramkr|1 year ago
rahimnathwani|1 year ago
barrenko|1 year ago
Could you expand on the answer? Thanks!
pierre|1 year ago
https://docs.llamaindex.ai/en/stable/getting_started/starter...
homarp|1 year ago
but for now you will have to use Python.
You can try it here https://ai.azure.com/explore/models/Phi-3-vision-128k-instru... to get an idea of its OCR + QA abilities
nl|1 year ago
jd3|1 year ago
https://news.ycombinator.com/item?id=38759877
https://news.ycombinator.com/item?id=36832572
tspann|1 year ago
Pretty easy to run locally and lightweight with Milvus Lite and LlamaIndex.
ekianjo|1 year ago
m0shen|1 year ago
As far as AI goes, not sure.
whynotmaybe|1 year ago
Ey7NFZ3P0nzAe|1 year ago
It supports virtually all LLMs and embeddings, including local LLMs and local embeddings. It scales surprisingly well, and I have tons of improvements to come when I have some free time (or procrastinate). Don't hesitate to ask for features!
Here's the link: https://github.com/thiswillbeyourgithub/DocToolsLLM/
samspenc|1 year ago
You do need a beefy GPU to run the local LLM, but I think it's a similar requirement for running any LLM on your machine.
constantinum|1 year ago
There is a 20 min read on why parsing PDFs is hell: https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
To parse PDFs for RAG applications, you'll need tools like LLMwhisperer[1] or unstructured.io[2].
Now back to your problem:
This solution might be overkill for your requirements, but you can try the following:
To set things up quickly, try Unstract[3], an open-source document processing tool. You can set this up and bring your own LLM models; it also supports local models. It has a GUI to write prompts to get insights from your documents.[4]
[1] https://unstract.com/llmwhisperer/ [2] https://unstructured.io/ [3] https://github.com/Zipstack/unstract [4] https://github.com/Zipstack/unstract/blob/main/docs/assets/p...
jszymborski|1 year ago
https://tika.apache.org/
fooker|1 year ago
Well, Claude and GPT-4 seem to be.
elrostelperien|1 year ago
Without AI, but searching the PDF content, I use Recoll (https://www.recoll.org/) or ripgrep-all (https://github.com/phiresky/ripgrep-all)
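If you want to drive ripgrep-all from a script, something like this sketch should work, assuming rga forwards ripgrep's `--json` flag (it passes most rg options through; treat that as an assumption to verify):

```python
import json
import subprocess

def parse_rg_json(output):
    """Pull (file, matched line) pairs out of ripgrep's --json event stream."""
    hits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "match":
            data = event["data"]
            hits.append((data["path"]["text"], data["lines"]["text"].strip()))
    return hits

def search_pdfs(query, directory):
    """Run ripgrep-all over a directory; rga OCRs/extracts PDFs transparently."""
    proc = subprocess.run(["rga", "--json", query, directory],
                          capture_output=True, text=True)
    return parse_rg_json(proc.stdout)
```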
gumboshoes|1 year ago
hm-nah|1 year ago
It’s not local, but the Azure Document Intelligence OCR service has a number of prebuilt models. The “prebuilt-read” model is $1.50/1k pages. Once you OCR your docs, you’ll have a JSON of all the text, AND you get breakdowns by page/word/paragraph/tables/figures, all with bounding boxes.
Forget the Lang/Llama/Chain-theory. You can do it all in vanilla Python.
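A vanilla-Python sketch of that flow, using the client names from the azure-ai-formrecognizer v3 SDK (the service has since been renamed to Document Intelligence, so double-check the current package and method names against the docs):

```python
def ocr_pdf(path, endpoint, key):
    """OCR a PDF with the prebuilt-read model. SDK imports are kept local
    so the pure helper below works without the package installed."""
    from azure.ai.formrecognizer import DocumentAnalysisClient  # pip install azure-ai-formrecognizer
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    with open(path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-read", f)
    return pages_to_text(poller.result().pages)

def pages_to_text(pages):
    """Flatten per-page line objects into one newline-separated, searchable string."""
    return "\n".join(line.content for page in pages for line in page.lines)
```

From there the flattened text can go straight into whatever local index you like; no framework needed.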
Kikawala|1 year ago
SecureAI-Tools: https://github.com/SecureAI-Tools/SecureAI-Tools
pixelmonkey|1 year ago
https://github.com/phiresky/ripgrep-all
gyrovagueGeist|1 year ago
SoftTalker|1 year ago
datpiff|1 year ago
When it comes to a search solution - what kind of searches have you done in the past? What kind of problems did you come across? If the answer to either is "none" you are planning on building a useless system.
Kikobeats|1 year ago
Here's an example turning an arXiv paper into real text:
https://api.microlink.io/?data.html.selector=html&embed=html...
It looks like a PDF, but if you open devtools you can see it's actually a very precise HTML representation.
theolivenbaum|1 year ago
brailsafe|1 year ago
What I haven't seen suggested though, is the built-in spotlight. Press CMD+Space, type some unique words that might appear in the document, and spotlight will search it. This also works surprisingly well for non-OCRd images of text, anything inside a zip file, an email, etc..
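Spotlight's index is also scriptable via the `mdfind` command-line tool that ships with macOS, so the same search the Cmd+Space UI does can be driven from code. A small sketch (macOS only; function names are my own):

```python
import subprocess

def build_mdfind_cmd(query, folder=None):
    """Build an mdfind invocation; -onlyin scopes the search to one folder."""
    cmd = ["mdfind"]
    if folder:
        cmd += ["-onlyin", folder]
    cmd.append(query)
    return cmd

def spotlight_search(query, folder=None):
    """Query the Spotlight index; returns one matching file path per line."""
    proc = subprocess.run(build_mdfind_cmd(query, folder),
                          capture_output=True, text=True)
    return proc.stdout.splitlines()
```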
yousnail|1 year ago
I’ve used both for sensitive internal SOPs, and both work quite well. Private gpt excels at ingesting many separate documents, the other excels at customization. Both are totally offline, and can use mostly whatever models you want.
ssahoo|1 year ago
Get a Copilot PC with Recall enabled and quickly scan through the documents by opening them in Adobe Acrobat Reader. Voilà! You will have an SQLite DB that has your index. A few days later, Adobe could have your data in their LLM.
gibsonf1|1 year ago
pawelduda|1 year ago
ilaksh|1 year ago
https://andrejusb.blogspot.com/2024/03/optimizing-receipt-pr...
But I suggest that you just skip that and use gpt-4o. They aren't actually going to steal your data.
Sort through it ahead of time to find anything with a credit card number or the like.
Or you could look into InternVL..
Or a combination of PaddleOCR first and then use a strong LLM via API, like gpt-4o or llama3 70b via together.ai
If you truly must do it locally, then if you have two 3090s or 4090s it might work out. Otherwise the LLMs may not be smart enough to give good results.
Leaving out the details of your hardware makes it impossible to give good advice about running locally. Other than that, it's not really necessary.
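The PaddleOCR-then-LLM pipeline mentioned above can be sketched roughly like this. The result shape follows PaddleOCR's documented per-line `[box, (text, confidence)]` output, but treat the details as assumptions and check the current API:

```python
def ocr_lines(result, min_conf=0.5):
    """Flatten PaddleOCR output ([box, (text, confidence)] per line) to text,
    dropping low-confidence lines."""
    return [text for _box, (text, conf) in result if conf >= min_conf]

def extract_text(image_path):
    """Run PaddleOCR on one image. Import kept local so ocr_lines works
    without paddleocr installed; this is a sketch, not the exact API."""
    from paddleocr import PaddleOCR  # pip install paddleocr
    ocr = PaddleOCR(lang="en")       # downloads model weights on first run
    result = ocr.ocr(image_path)[0]  # line results for the first (only) image
    return "\n".join(ocr_lines(result))
```

The extracted text then goes into the prompt of whatever strong LLM you pick (gpt-4o, llama3 70b, etc.) along with the question.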
gnicholas|1 year ago
Why do you have this confidence? Is it based on reading their TOS, and assuming they'll follow it?
bendsawyer|1 year ago
The result is a huge step up from 'full text search' solutions, for my use case. I can have conversations with decades of documents, and it's incredibly helpful. The support scheme keeps my original documents unconnected from the machine, which I own, while updates are done over a remote link. It's great, and I feel safe.
Things change so fast in this space that there did not seem to be a cheap, stable, local alternative. I honestly doubt one is coming. This is not a one-size-fits-all problem.
skapa_flow|1 year ago
phodo|1 year ago
westcort|1 year ago
hulitu|1 year ago
Adobe Reader can search all PDFs in a directory. They hide this function though.
kkfx|1 year ago
ocrmypdf + ripgrep-all, or Recoll (a GUI/CLI Xapian wrapper) if you prefer an indexed version. For mere full-text search, currently nothing gives better results. Semantic search is still not there; Paperless-ngx, TagSpaces and so on demand way too much time per added document to be useful at a certain scale.
My own personal version is org-mode. I keep all my stuff org-attached, so instead of searching the PDFs I search the notes linking them: a kind of metadata-rich, taggable, quick full-text search. Even though org-ql is there, I almost never use it, just org-roam-node-find and counsel-rg on notes. Once done, this allows quick manual and variously automated archiving, but doing it on a large home directory is very long and tedious manual work. For me it's worth doing since I keep adding documents and using them, but it took more than a year to be "almost done enough" and it's still unfinished after 4 years.
treetalker|1 year ago
If you’re having trouble thinking of search terms to plug into HoudahSpot (or grep etc.) then I suppose you could ask a chatbot to assist your brainstorming, and then plug those terms into HoudahSpot/grep/etc.
epirogov|1 year ago
https://products.aspose.org/pdf/net/chat-gpt/
dudus|1 year ago
If you trust Google that is.
hobo_mark|1 year ago
bendsawyer|1 year ago
jesterson|1 year ago
There is no AI or any other modern fad, but full-text search (including OCR for image files inside PDFs) works great.
1123581321|1 year ago
If you're okay with some false positives, Devonthink would work as is, actually.
bendsawyer|1 year ago
"act as an expert in Y, looking across all times I've typed X, summarize my changing position over thee years, and suggest other terms that have a similar pattern of change, in a list."
The kind of thing I used to give to an intern over a month, with results that are not far off what that intern produced...
edgyquant|1 year ago
nl|1 year ago
Tables and especially multi-column PDFs often need one-off handling and - worse - you don't know when one is being misparsed until you start getting weird search results. At that point you need to debug your entire search pipeline, which isn't fun!
jeffreyq|1 year ago
hypefi|1 year ago
Tylast|1 year ago
sciencesama|1 year ago
gandalfthepink|1 year ago
vrighter|1 year ago
finack|1 year ago
hiq|1 year ago
I wanted to convert some equations from some maths textbook back into latex, and I found that taking a screenshot and feeding the image into some LLM service supporting images was a good way to do that.
adyashakti|1 year ago
borg16|1 year ago
unknown|1 year ago
[deleted]