From a quick glance, this project doesn't seem to use tool/function calling, streaming, format enforcement, or any other "fancy" API features, so chances are it will just work, although I have some reservations about the quality, especially with smaller models.
I’m curious how this compares to the open-source version made by HuggingFace [1]. As far as I can tell, the HF version uses reasoning LLMs to search/traverse and parse the web and gather results, then evaluates the results before eventually synthesizing an answer.
This version appears to show off a vector store for documents generated from a web crawl (the writer is a vector-store-as-a-service company).
There are quite a few differences between HuggingFace's Open Deep-Research and Zilliz's DeepSearcher.
I think the biggest one is the goal: HF's is to replicate the performance of Deep Research on the GAIA benchmark, whereas ours is to teach agentic concepts and show how to build research agents with open-source tools.
Also, we go into the design in a lot more detail than HF's blog post does. On the design side, HF uses code writing and execution as a tool, whereas we use prompt writing and calling as a tool. We do an explicit breakdown of the query into sub-queries, sub-sub-queries, and so on, whereas HF uses a chain of reasoning to decide what to do next.
I think ours is a better approach for producing a detailed report on an open-ended question, whereas HF's is better for answering a specific, challenging question in short form.
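For illustration, the explicit sub-query breakdown described above can be sketched as a short recursive loop. This is a toy sketch, not DeepSearcher's actual code: `ask_llm` stands in for any chat-completion call, and the prompt wording is invented.

```python
# Minimal sketch of explicit query decomposition (not DeepSearcher's real code).
# `ask_llm` is a placeholder for any chat-completion call that returns text.

def decompose(query, ask_llm, depth=0, max_depth=2):
    """Recursively break a query into sub-queries; return the leaf queries
    that would each be sent to search/retrieval."""
    if depth >= max_depth:
        return [query]  # leaf: search this directly
    subs = ask_llm(
        "Break the research question into focused sub-questions, "
        f"one per line:\n{query}"
    ).splitlines()
    leaves = []
    for sub in subs:
        leaves.extend(decompose(sub.strip(), ask_llm, depth + 1, max_depth))
    return leaves

# Example with a fake LLM that always splits a question into two parts:
fake = lambda prompt: "part A\npart B"
print(decompose("What is the state of open deep research?", fake, max_depth=1))
# ['part A', 'part B']
```

In a real agent, each leaf query would be answered via retrieval and the partial answers synthesized back up the tree into a report.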
I think the magic of Grok's implementation of this is that they already have most of the websites cached (guessing via their Twitter crawler), so it all feels very snappy. Bing/Brave search don't seem to offer that in their search APIs. Does such a thing exist as a service?
I’ve been wondering about this and searching for solutions too.
For now we’ve just managed to optimize how quickly we download pages, but haven’t found an API that actually caches them. Perhaps companies are concerned that they’ll be sued for it in the age of LLMs?
The Brave API provides ‘additional snippets’, meaning that you at least get multiple slices of the page, but it’s not quite a substitute.
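For reference, a request for those extra snippets can be built like this. This is a sketch from memory of the Brave Search API: the endpoint, header name, and `extra_snippets` flag should be checked against the current docs before relying on them.

```python
# Sketch of a Brave Search API request asking for extra snippets.
# Endpoint/header/parameter names are from memory of the Brave docs -- verify.
from urllib.parse import urlencode
from urllib.request import Request

def brave_search_request(query: str, token: str) -> Request:
    params = urlencode({"q": query, "extra_snippets": "true"})
    req = Request(f"https://api.search.brave.com/res/v1/web/search?{params}")
    req.add_header("Accept", "application/json")
    req.add_header("X-Subscription-Token", token)
    return req  # send with urllib.request.urlopen(req); the extra snippets
                # appear per result in the JSON response

req = brave_search_request("vector database", token="YOUR_API_KEY")
print(req.full_url)
```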
Web search APIs can't present the full document due to copyright. They can only present the snippet contextual to the query.
I wrote my own implementation using various web search APIs and a Puppeteer service to download individual documents as needed. It wasn't that hard, but I do get blocked by some sites (Reddit, for example).
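A minimal version of that fetch-and-cache layer might look like the sketch below. It uses plain `urllib` rather than a Puppeteer service, so JS-heavy or bot-hostile sites would still need a headless browser behind it; the cache directory and user-agent string are invented for the example.

```python
# Rough sketch of a page fetcher with a local on-disk cache.
# Plain urllib only; JS-heavy sites need a headless-browser service instead.
import hashlib
from pathlib import Path
from urllib.request import Request, urlopen

CACHE_DIR = Path("page_cache")  # example location

def cache_path(url: str) -> Path:
    """Deterministic on-disk location for a URL's cached HTML."""
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")

def fetch(url: str, timeout: float = 10.0) -> str:
    path = cache_path(url)
    if path.exists():                 # serve from cache when possible
        return path.read_text()
    req = Request(url, headers={"User-Agent": "research-agent/0.1"})
    html = urlopen(req, timeout=timeout).read().decode("utf-8", "replace")
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(html)
    return html
```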
Considering all the major AI companies have basically created the same deep-research product, it would make sense for them to focus on a shared open-source platform instead.
Have been searching for a deep research tool that I can hook up to both my personal notes (in Obsidian) and the web, and this looks like it has those capabilities. Now the only piece left is to figure out a way to export the deep research outputs back into Obsidian somehow.
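Since an Obsidian vault is just a folder of Markdown files, exporting a report back into it can be as simple as a file write. A hypothetical sketch — the vault path, tag, frontmatter fields, and filename scheme are all invented:

```python
# Sketch: drop a deep-research report into an Obsidian vault as a Markdown note.
# Obsidian reads plain .md files from the vault folder, so a file write suffices.
import datetime
from pathlib import Path

def export_to_obsidian(vault: Path, title: str, body: str) -> Path:
    # sanitize the title into a filesystem-safe note name
    safe = "".join(c if c.isalnum() or c in " -_" else "_" for c in title).strip()
    note = vault / f"{safe}.md"
    frontmatter = (
        "---\n"
        f"created: {datetime.date.today().isoformat()}\n"
        "tags: [deep-research]\n"
        "---\n\n"
    )
    vault.mkdir(parents=True, exist_ok=True)
    note.write_text(frontmatter + f"# {title}\n\n{body}\n")
    return note
```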
Sometimes I want to do a little coding to automate things with my personal productivity tool, so I find the programmatic interface that an open-source implementation like this provides very convenient.
I'm wondering about the practical implications of integrating web crawling. Could this, in theory, be used solely for reading papers from Sci-Hub and producing valid graduate-level research?
It could be useful for comparing reports built using DeepSeek R1 vs. GPT-4o and other large models. The code being open source might highlight the limitations of different LLMs much faster and help develop better reasoning loops in future prompts for specific needs. Really interesting stuff.
Cloudflare is going to ruin self-hosted things like this and force centralization to a few players. I guess we'll need decentralized efforts to scrape the web so tools like this can run on that data.
gslepak|1 year ago
Is there a deep searcher that can also use local LLMs like those hosted by Ollama and LM Studio?
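Both Ollama and LM Studio expose OpenAI-compatible HTTP endpoints by default (`http://localhost:11434/v1` and `http://localhost:1234/v1` respectively), so any tool that lets you override the base URL can target a local model. A minimal standard-library sketch of building such a request (model name and prompt are placeholders):

```python
# Sketch: build an OpenAI-style chat request aimed at a local server
# (Ollama: http://localhost:11434/v1, LM Studio: http://localhost:1234/v1).
import json
from urllib.request import Request, urlopen

def chat_request(base_url: str, model: str, prompt: str) -> Request:
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# e.g. chat_request("http://localhost:11434/v1", "llama3", "Summarize X"),
# then urlopen(...) and read choices[0].message.content from the JSON reply.
```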
drdaeman|1 year ago
learningcircuit|11 months ago
[deleted]
vineyardmike|1 year ago
[1] https://github.com/huggingface/smolagents/tree/main/examples...
stefanwebb|1 year ago
parhamn|1 year ago
tekacs|1 year ago
binarymax|1 year ago
swyx|1 year ago
fragmede|1 year ago
http://commoncrawl.org
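Common Crawl's CDX index can be queried over HTTP for cached captures of a URL. A sketch of building such a query — the collection name below is an example; pick a current one from the index listing on commoncrawl.org:

```python
# Sketch: build a Common Crawl CDX index query URL for captures of a page.
# The collection name is an example -- check commoncrawl.org for current ones.
from urllib.parse import urlencode

def cc_index_query(url: str, collection: str = "CC-MAIN-2024-10") -> str:
    params = urlencode({"url": url, "output": "json"})
    return f"https://index.commoncrawl.org/{collection}-index?{params}"

print(cc_index_query("example.com/*"))
# each JSON line in the response points at a WARC file + offset holding the page
```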
bilater|1 year ago
https://github.com/btahir/open-deep-research
fuddle|1 year ago
stefanwebb|1 year ago
https://milvus.io/blog/i-built-a-deep-research-with-open-sou...
https://milvus.io/blog/introduce-deepsearcher-a-local-open-s...
Daniel_Van_Zant|1 year ago
jianc1010|1 year ago
zitterbewegung|1 year ago
The QuickStart had a good response: https://gist.github.com/zitterbewegung/086dd344d16d4fd4b8931...
mtrovo|1 year ago
namlem|1 year ago
redskyluan|1 year ago
Search is not the problem. What to search for is! Using a reasoning model, it is much easier to split the task and decide what to search for.
gnatnavi|1 year ago
cma|1 year ago