item 43330164

Local Deep Research – ArXiv, wiki and other searches included

190 points | learningcircuit | 1 year ago | github.com

34 comments

[+] mentalgear|1 year ago|reply
I applaud the effort for the local (lo-fi) space! Yet, reading over the example linked in the docs (which does not seem cherry-picked, kudos for that!), my impression is that the resulting document is rather messy [1].

I think what's missing is one (or more) step in between, possibly a graph database (e.g. [2]), in which the LLM can place all its information, see relevant interconnections, query to question itself, and then generate the final report.

(Maybe the final report could be an interactive HTML file that the user can ask questions of, or edit themselves.)
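A minimal stdlib-only sketch of the suggested in-between step, using a plain Python adjacency structure as a stand-in for a real graph database (all names here are hypothetical, not any project's actual API):

```python
from collections import defaultdict

class FactGraph:
    """Toy triple store: the LLM deposits (subject, relation, object)
    facts here, then queries for interconnections before writing the
    final report."""

    def __init__(self):
        self.triples = []                   # all (s, r, o) facts
        self.out_edges = defaultdict(list)  # subject -> [(relation, object)]

    def add(self, subject, relation, obj):
        self.triples.append((subject, relation, obj))
        self.out_edges[subject].append((relation, obj))

    def neighbors(self, subject):
        """Everything directly linked from a subject."""
        return self.out_edges.get(subject, [])

    def related(self, a, b):
        """Do two entities share any direct link, in either direction?"""
        return any(o == b for _, o in self.out_edges.get(a, [])) or \
               any(o == a for _, o in self.out_edges.get(b, []))

g = FactGraph()
g.add("transformer", "introduced_in", "Attention Is All You Need")
g.add("transformer", "used_by", "GPT")
print(g.neighbors("transformer"))
```

A real implementation would add typed edges, deduplication, and a query language, which is exactly what an embedded graph database like Kuzu [2] provides.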

There's also a similar open deep-research tool called onyx [3], which I think has better UI/UX, albeit not local. Maybe the author could consider porting that to local instead of rolling and maintaining yet another deep-research tool themselves?

I'm saying this not because I think it's a bad project, but because there are a ton of open deep-research projects which I'm afraid will just fizzle out; it would be better if people joined forces, each working on the aspects they care most about (e.g. the local aspect, or RAG strategies, etc.).

[1] https://github.com/LearningCircuit/local-deep-research/blob/...

[2] "In-Browser Graph RAG with Kuzu-WASM and WebLLM" https://news.ycombinator.com/item?id=43321523

[3] https://github.com/onyx-dot-app/onyx

[+] TeMPOraL|1 year ago|reply
> I think what's missing is one (or more) step in between, possibly a graph database (e.g. [2]), in which the LLM can place all its information, see relevant interconnections, query to question itself, and then generate the final report.

Quickly, productize this (and call it DeepRAG, or DERP) before it explodes in late 2025 - you may just beat the market to it!

See: https://news.ycombinator.com/item?id=43267539

[+] learningcircuit|11 months ago|reply
I wanted to share some updates on Local Deep Research since posting here 12 days ago. Thanks to everyone who gave feedback and suggestions - the project has improved significantly with your input.

Recent improvements:

- Better inline citation: Sources from PubMed, arXiv, Wikipedia, etc. are now properly cited directly in the text

- Improved report structure: Reports now have better organization with logical sections and clearer source attribution

- Added support for multiple research domains: Works well across scientific, historical, economic, and technical topics

- Enhanced search iterations: Now performs multiple rounds of research with follow-up questions for deeper analysis

- More flexible LLM integration: Works with pretty much any model (local via Ollama or cloud-based)

- Expanded search engine options: Easy to add new sources for specialized research
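A rough sketch of how a pluggable search-engine abstraction like the one described above might look (the class and registry names are made up for illustration; the project's real interface may differ):

```python
from abc import ABC, abstractmethod

class SearchEngine(ABC):
    """Hypothetical base class: each source implements one method."""
    name = "base"

    @abstractmethod
    def search(self, query: str, max_results: int = 5) -> list[dict]:
        """Return a list of {'title', 'url', 'snippet'} dicts."""

REGISTRY: dict[str, SearchEngine] = {}

def register(engine: SearchEngine) -> None:
    """Adding a new source is just implementing search() and registering."""
    REGISTRY[engine.name] = engine

class DummyWiki(SearchEngine):
    name = "wikipedia"

    def search(self, query, max_results=5):
        # A real engine would call the Wikipedia API here.
        return [{"title": f"Stub result for {query}", "url": "", "snippet": ""}]

register(DummyWiki())
results = REGISTRY["wikipedia"].search("graph databases")
print(results[0]["title"])
```

Under this kind of design, "expanded search engine options" means each new source is one small subclass, with the research loop iterating over whatever is registered.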

For those who mentioned concerns about report quality and organization - we've made significant improvements in this area. The citation tracking now provides much better provenance information throughout the research pipeline.

I'd also like to thank HashedViking who joined as a contributor and has been improving the UI/UX side of things. We're committed to keeping this as a truly local, privacy-focused tool that doesn't rely on expensive APIs.

For anyone interested in contributing, we're looking for help with:

1. Further improving report organization

2. More local search engines and sources

3. Documentation and examples

4. UI/UX enhancements

5. Testing with different models and research domains

The project is at: https://github.com/LearningCircuit/local-deep-research/

What features would be most useful to you in a research tool like this? We're particularly interested in ideas for better knowledge organization and making the research outputs more valuable.

[+] jeffreyw128|1 year ago|reply
This is cool!

If you want to add embeddings over the internet as a source, you should try out exa.ai. It includes Wikipedia, tens of thousands of news feeds, GitHub, 70M+ papers including all of arXiv, etc.

disclaimer: I am one of the founders (:

[+] learningcircuit|1 year ago|reply
I will add it. It's very easy to integrate new search engines.
[+] nhggfu|1 year ago|reply
looks siiiick. congrats + good luck
[+] HashedViking|1 year ago|reply
Hello there,

I'm the co-author of this project (the UI part). I joined when it was below 100 stars (a week ago), motivated by the 'local' sentiment. I think all of those 'open' alternatives are just wrappers around PAID 'Open'AI APIs, which undermines the 'Open' term. My vision for this repo is a system independent of LLM providers (and middlemen) and overpriced web-search services ($5 per 1000 search requests at Google is just insane). Initially, I just wanted to experiment a bit and didn't expect the repo to explode, so feel free to critique the UI code I hacked together over a few evenings.

The ultimate goal:

A corporation-free LLM usage (local graph database integration sounds good).

A corporation-free web search (this is a massive challenge — even SearXNG relies on Google/Bing under the hood).

So, if you feel the same, join the project, and let's build something great!

[+] wahnfrieden|1 year ago|reply
Is anyone using (local) LLMs to directly search for (by scanning over) relevant materials from a corpus rather than relying on vector search?
[+] suprjami|1 year ago|reply
Generally this fails.

Most LLMs lose the ability to track facts over about 20k words of content, the best can manage maybe 40k words.

Look for "needle" benchmark tests, as in needle-in-haystack.

Not to mention the memory requirements of such a huge context like 128k or 1M tokens. Only people with enterprise servers at home could run that locally.
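Given that limitation, scanning a corpus directly means splitting it into windows that each stay under the model's reliable fact-tracking range and running one LLM pass per window. A rough stdlib-only sketch, with word counts standing in for tokens (the 20k budget comes from the comment above):

```python
def window_corpus(docs, budget_words=20_000, overlap_words=200):
    """Split a corpus into overlapping windows small enough that an
    LLM can still track facts within each pass (~20k words). Token
    counts are approximated by word counts."""
    words = []
    for doc in docs:
        words.extend(doc.split())
    windows, start = [], 0
    while start < len(words):
        end = min(start + budget_words, len(words))
        windows.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap_words  # overlap so boundary facts aren't lost
    return windows

# 60k words of dummy text -> multiple passes needed
chunks = window_corpus(["lorem ipsum " * 30_000], budget_words=20_000)
print(len(chunks))
```

Each window would then be scanned by the model separately, with findings merged afterwards; this trades the single-pass "needle" problem for more (but tractable) LLM calls.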

[+] CGamesPlay|1 year ago|reply
I tried this out, but I hit so many errors that I could never generate a report. There is no way to resume a failed generation, so if any API call fails, even 10 minutes in, you have to start over from scratch.
[+] alchemist1e9|1 year ago|reply
Nice work!

I’ve been thinking recently that a local collection of curated, focused, structured information pre-processed for RAG might be a good complement to this dynamic searching approach.

I see this uses LangChain; might be worth checking into txtai.

https://neuml.github.io/txtai/examples/

[+] throwaway24681|1 year ago|reply
Looks very cool. How does this compare to the RAG features provided by open-webui?

There is web search and a way to embed documents, but so far the results seem subpar, as details are lost in the embeddings. Is this much better?

[+] learningcircuit|1 year ago|reply
Give me a question and I can give you the output, so you can compare.
[+] ein0p|1 year ago|reply
Is there some kind of tool which would provide an AI search experience _and mix in the contents of my bookmarks_ (that is, fetch/cache/index/RAG the contents of the pages those bookmarks point to) when creating the report? Bookmarking is a useless dumpster fire right now. This could make it useful again.

Currently, the failure mode I see quite often in e.g. OpenAI's deep research is that it sources its answer from an obviously low-authority source and provides a reference to it as if it were a scientific journal. The answer gets screwed up as well, because such sources rarely contain anything of value, and even if the other sources are high quality, the low-quality source(s) mess everything up.

Emphasizing the content I've already curated (via bookmarks) could significantly boost the SNR.

[+] learningcircuit|1 year ago|reply
If you have a PDF collection, you could include it in the local search and give it very high relevance.
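A toy sketch of what "very high relevance" for a curated local collection could mean at ranking time (the source names and boost factor are made up for illustration):

```python
def rank_results(results, curated_boost=2.0):
    """Merge hits from different sources, boosting curated local
    documents (e.g. a personal PDF library or bookmarks) above
    generic web results. Each hit is (source, score, title)."""
    def weighted(hit):
        source, score, _ = hit
        boost = curated_boost if source == "local_pdfs" else 1.0
        return score * boost
    return sorted(results, key=weighted, reverse=True)

hits = [
    ("web", 0.9, "random blog post"),
    ("local_pdfs", 0.6, "curated survey paper"),
    ("web", 0.5, "forum thread"),
]
ranked = rank_results(hits)
for source, score, title in ranked:
    print(source, title)
```

With a 2x boost, the curated paper (0.6 → 1.2) outranks the higher-scoring blog post, which is one simple way to address the low-authority-source failure mode described upthread.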
[+] antonkar|1 year ago|reply
I think the guy who'll make the 3D game-like GUI for LLMs is the next Jobs/Gates/Musk and a Nobel Prize winner (I think it'll solve alignment by putting millions of eyes on the internals of LLMs), because computers became popular only after OSes with GUIs appeared; current chatbots are a bit like a command line in comparison. I just started an Ask HN to let people (and me) share their AI safety ideas, both crazy and not: https://news.ycombinator.com/item?id=43332593
[+] tecleandor|1 year ago|reply
You just posted the same comment three times in three different posts in 10 minutes. I'd say it would be nice to take it a bit slower...
[+] Der_Einzige|1 year ago|reply
You are 100000% correct. It's telling that shitty Gradio web UIs like oobabooga or automatic1111 got SO many GitHub stars.

ComfyUI is huge despite literally just bringing the node-based editor paradigm to Stable Diffusion.

UI/UX for LLMs and GenAI is so hilariously shit right now. So many investors want to invest in yet another LLMOps company instead of a meaningful competitor to the terrible LM Studio.