Launch HN: Airweave (YC X25) – Let agents search any app
164 points | lennertjansen | 5 months ago | github.com
Here’s an example of Cursor using Airweave https://www.youtube.com/watch?v=IvxidK9Ciy4. And here’s a general example of our new search functionality: https://www.youtube.com/watch?v=iqEqc_iGUO8
We came to this problem while building agentic applications for webshop owners and customer service, where we noticed that most failure modes weren’t about tool execution but about finding the right internal context to enable the right actions.
We started solving what seemed at the time to be a problem specific to our own use case, and quickly fell into a rabbit hole of issues. Company and user data lives across SaaS and databases; it’s sparse, messy, and constantly changing. Agents need a data orchestration and retrieval layer that accepts free-form natural language queries and returns actionable results quickly.
Simply pointing an agent at an MCP server does not equate to fine-grained search functionality or deep understanding of the underlying resource. Most MCP servers are thin wrappers that expose an existing API in a more LLM-friendly way, but this doesn’t actually give the agent any new capabilities beyond what the resource or app already offered. Specifically, it doesn’t give the agent a way to thoroughly search and understand the contents of the resource.
Airweave connects to sources via their APIs, crawls and normalizes content, chunks it, extracts entity relationships, and indexes the chunks in a vector store alongside keyword fields and lightweight graph metadata in Postgres. Data sync is orchestrated with Temporal (handles pagination/rate limits, schedules, and change detection via timestamps and content hashes) so collections stay close to real-time with their sources.
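To make the change-detection step concrete, here is a minimal sketch of hash-based sync planning. It is not Airweave's actual code: `content_hash`, `plan_sync`, and the dictionary shapes are hypothetical names chosen for illustration, and it ignores timestamps, pagination, and rate limits entirely.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of an item's normalized content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(indexed: dict[str, str], fetched: dict[str, str]) -> dict[str, list[str]]:
    """Decide what to do with each item on a sync run.

    indexed: item id -> content hash stored at the previous sync (assumed shape)
    fetched: item id -> content just pulled from the source API (assumed shape)
    """
    plan: dict[str, list[str]] = {"insert": [], "update": [], "delete": [], "skip": []}
    for item_id, text in fetched.items():
        if item_id not in indexed:
            plan["insert"].append(item_id)   # new at the source
        elif indexed[item_id] != content_hash(text):
            plan["update"].append(item_id)   # content changed: re-chunk and re-embed
        else:
            plan["skip"].append(item_id)     # unchanged: no embedding cost
    # Anything indexed earlier but no longer returned by the source is stale.
    plan["delete"] = [item_id for item_id in indexed if item_id not in fetched]
    return plan
```

The point of hashing is that unchanged items can be skipped before any chunking or embedding happens, which is usually where most of the sync cost sits.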
On retrieval, Airweave can run semantic and BM25 keyword search in parallel, fuse results (RRF), apply recency bias, and re-rank. Agents can fetch ranked chunks with citations or ask for a synthesized answer. The same interface is exposed via REST, Python/TS SDKs, and MCP so agents can discover it like any other tool.
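For readers unfamiliar with reciprocal rank fusion, a rough sketch of the fuse-then-bias step might look like the following. The RRF formula (score = sum of 1 / (k + rank) over the input rankings, with k = 60 as the commonly used constant) is standard; the recency weighting, the half-life, and all function names are assumptions for illustration, not Airweave's implementation.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per doc."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


def apply_recency(scores: dict[str, float], age_days: dict[str, float],
                  half_life_days: float = 30.0, weight: float = 0.3) -> dict[str, float]:
    """Blend in an exponential recency decay (assumed scheme).

    A doc keeps (1 - weight) of its score regardless of age; the remaining
    `weight` share halves every `half_life_days`. Docs with no known age are
    treated as fresh.
    """
    biased = {}
    for doc_id, score in scores.items():
        decay = 0.5 ** (age_days.get(doc_id, 0.0) / half_life_days)
        biased[doc_id] = score * ((1.0 - weight) + weight * decay)
    return biased


def top_k(scores: dict[str, float], k: int = 10) -> list[str]:
    """Highest score first; ties broken by doc id for determinism."""
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


semantic = ["a", "b", "c"]   # ranked ids from vector search
keyword = ["b", "d", "a"]    # ranked ids from BM25
fused = rrf_fuse([semantic, keyword])
```

RRF only looks at rank positions, so it avoids having to normalize cosine similarities against BM25 scores, which live on different scales. A re-ranker would then run over the top results of this list.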
It’s been fun to see what users have built with Airweave, from legal AI assistants to research discovery agents and context augmentation for coding agents. We’re currently experimenting with agentic search patterns, layering different types of enrichment and indexing, RBAC on indexed data, and streaming architectures.
If this is interesting to you, feel free to take it for a spin. Curious to hear your thoughts and feedback on the problem and our solution!
btown|5 months ago
I see in another comment that you encourage each user to build their own dataset with their own permissions, but often this breaks for founders. If I have a Super Secret Personnel Planning Google Doc at a founder level, how can I be the one to set up the system for our company, but ensure that only files that I've explicitly shared with the company are ingested? What if a file needs to be made anyone-with-link-can-access for sharing with a strategic partner, but that shouldn't be indexed for the entire company?
Far too much of the world relies on the security-by-obscurity of public-but-unindexed links, and communications that might look public from a metadata perspective but were carefully designed for a very specific group of people who have verbal/mental context about confidentiality expectations. Being able to categorize by likely confidentiality, and allowing an administrator to partition access on a project and sub-project basis based on that, might be crucial for growth.
My recollection is that Onyx had limited support for some security use cases, but very rudimentary. Hoping you can solve this in a thoughtful way!
Onyx links for comparison:
https://www.onyx.app/
https://docs.onyx.app/developers/guides/chat_guide
https://docs.onyx.app/admin/connectors/official/
raufakdemir|5 months ago
As for intelligently - but probabilistically - determining confidentiality (if I read that correctly), that does sound pretty interesting in scenarios where metadata is just simply insufficient. Also tricky. Sounds like you thought about these problems pretty deeply.
lennertjansen|5 months ago
On permissioning: we default to per-user syncs that adopt the permissions of the syncing user and mirror source ACLs (e.g., Drive items a user owns or that are sharedWithMe). In practice, founders avoid leaking private docs by either (a) having each user sync their own corpus, or (b) using a centrally-scoped token limited to Shared Drives/team folders and excluding personal “My Drive.” You can also keep separate collections and only expose cross-user search behind your own checks. We’re exploring richer org-level RBAC mapping on a per-customer basis (e.g., mapping Drive/SharePoint groups to index ACLs), but the above works today.
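As a rough illustration of the "only expose cross-user search behind your own checks" option, a query-time filter over mirrored ACLs could look like this. The `Chunk` shape, the `allowed` field, and the principal strings are all hypothetical; this is a sketch of the general pattern, not how Airweave stores or enforces permissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    # Principals (users or groups) copied from the source ACL at sync time.
    allowed: frozenset[str]


def visible_chunks(chunks: list[Chunk], user: str,
                   groups: frozenset[str] = frozenset()) -> list[Chunk]:
    """Keep only chunks whose mirrored ACL includes the user or one of their groups."""
    principals = {user} | set(groups)
    return [c for c in chunks if c.allowed & principals]


corpus = [
    Chunk("1", "Personnel planning", frozenset({"founder@example.com"})),
    Chunk("2", "Company handbook", frozenset({"group:everyone"})),
    Chunk("3", "Partner deck", frozenset({"founder@example.com", "partner@example.org"})),
]
```

Note that an anyone-with-link document has no explicit principal in a scheme like this, so it would be excluded unless the sync deliberately maps link sharing to a group, which is exactly the judgment call raised above.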
@Weves: Thanks, appreciate it!
Weves|5 months ago
Congratulations on the launch Rauf & Lennert! Always great to have more innovation in the open source AI space :D. It looks like Airweave works well with Cursor, something we don't have nailed down yet!
ashu1461|5 months ago
1. How do you decide whether to cache the data in a vector database or fetch it at runtime using a tool call?
2. Slowly, all the players like OpenAI / Claude are trying to provide a roughly equivalent offering: connecting your workspaces and then providing search on top of them, via either direct integrations or MCP servers. How do you see that panning out?
raufakdemir|5 months ago
lennertjansen|5 months ago
suprnurd|5 months ago
lennertjansen|5 months ago
ameyamk|5 months ago
raufakdemir|5 months ago
We plan to implement unified ACL syncs to dedupe the data or even have 1 sync per org, but that’s mostly a cost optimization; Airweave will just scale horizontally until then.
Blahah|5 months ago
A couple of bits of feedback:
1. Code samples on the site have broken whitespace on mobile (Android/Brave) so look a bit intense.
2. The pricing is complex to reason about - I have to consider the technical aspects and the number of users? Why don't I just get an API key?
lennertjansen|5 months ago
and ofc, feel free to reach out if your team needs help with setup
candiddevmike|5 months ago
raufakdemir|5 months ago
We usually sync per user. That way we make sure that no information leaks to another interface.
janwilmake|5 months ago
lennertjansen|5 months ago
hommes-r|5 months ago
lennertjansen|5 months ago
ripped_britches|5 months ago
raufakdemir|5 months ago
orliesaurus|5 months ago
lennertjansen|5 months ago
ori_b|5 months ago
lennertjansen|5 months ago
EGreg|5 months ago
And who is "us"?
"Well, our agents, of course. We'll send the information down to our servers, because -- surprise -- we have the GPU infrastructure to run it, and you don't. Don't worry, it's secure."
"Alright, well--"
https://www.wiz.io/blog/38-terabytes-of-private-data-acciden...
"Oops! Well don't worry, it's not like we're the first ones to sell your usage data..."
https://ferrumit.com/resources/it-s-now-legal-for-isps-to-se...
"You see! Well, just send us your DNA we'll analyze it -- with science! I mean with AI..."
"Alright, here is--"
https://www.nytimes.com/2025/05/19/business/regeneron-pharma...
"Oops! Well don't worry, it's not like the company that bought us will do anything with your data, that we wouldn't have done."
Here's my question...
1) How much can we feasibly run on consumer-grade hardware today, on-device, on either the latest MacBook or the latest iPhone? Does Apple Metal + Silicon ship with any on-board models in the latest iOS 26?
2) How can we extend the security boundary to GPU servers that are attested black boxes that store data encrypted at rest, guaranteed not to train on it and are not owned by some corporation that can peek at the data?