Launch HN: Exa (YC S21) – The web as a database
We started working on Exa because we were frustrated that while LLM state-of-the-art is advancing every week, Google has gotten worse over time. The Internet used to feel like a magical information portal, but it doesn’t feel that way anymore when you’re constantly being pushed towards SEO-optimized clickbait.
Websets is a step in the opposite direction. For every search, we perform dozens of embedding searches over Exa’s vector database of the web to find good search candidates, then we run agentic workflows on each result to verify they match exactly what you asked for.
Websets results are good for two reasons. First, we train custom embedding models for our main search algorithm instead of relying on typical keyword-matching search algorithms. Our embedding models are trained specifically to return exactly the type of entity you ask for. In practice, that means if you search “startups working in nanotech”, keyword-based search engines return listicles about nanotech startups, because these listicles match the keywords in the query. In contrast, our embedding models return actual startup homepages, because these startup homepages match the meaning of the query.
Second, LLMs provide the last-mile intelligence needed to verify every result. Each result and piece of data is backed by the supporting references that we used to validate that the result is actually a match for your search criteria. That’s why Websets can take minutes or even hours to run, depending on your query and how many results you ask for. For valuable search queries, we think this is worth it.
Also notably, Websets are tables, not lists. You can add “enrichment” columns to find more information about each result, like “# of employees” or “does author have blog?”, and the cells asynchronously load in. This table format hopefully makes the web feel more like a database.
A few examples of searches that work with Websets:
- “Math blogs created by teachers from outside the US”: https://websets.exa.ai/cma1oz9xf007sis0ipzxgbamn
- "research paper about ways to avoid the O(n^2) attention problem in transformers, where one of the first author's first name starts with "A","B", "S", or "T", and it was written between 2018 and 2022”: https://websets.exa.ai/cm7dpml8c001ylnymum4sp11h
- “US based healthcare companies, with over 100 employees and a technical founder": https://websets.exa.ai/cm6lc0dlk004ilecmzej76qx2
- “all software engineers in the Bay Area, with experience in startups, who know Rust and have published technical content before”: https://youtu.be/knjrlm1aibQ
You can try it at https://websets.exa.ai/ and API docs are at https://docs.exa.ai/websets. We’d love to hear your feedback!
[+] [-] AznHisoka|10 months ago|reply
But if it filtered it first to "start with the letter R", it would only have to look at perhaps 5% of the results it's trying to verify!
So it's doing needless verification of results that will be thrown out by another filter that should've been applied first!
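A minimal sketch of the ordering point above, not a description of how Websets actually plans its searches. The names `verify_with_prefilter`, `starts_with_r`, and `slow_verify` are hypothetical, and `slow_verify` merely stands in for an expensive LLM-based check. The idea is that a cheap deterministic filter runs first, so the expensive verifier only sees the survivors.

```python
from typing import Callable, Iterable

def verify_with_prefilter(
    candidates: Iterable[str],
    cheap_filter: Callable[[str], bool],
    expensive_verify: Callable[[str], bool],
) -> list[str]:
    """Apply the cheap filter first; `and` short-circuits, so the
    expensive verifier is only called on candidates that pass it."""
    return [c for c in candidates if cheap_filter(c) and expensive_verify(c)]

# Counter for how often the (hypothetical) expensive step runs.
stats = {"expensive_calls": 0}

def starts_with_r(name: str) -> bool:
    # Cheap, deterministic substring check.
    return name.upper().startswith("R")

def slow_verify(name: str) -> bool:
    # Stand-in for a slow verification step (e.g. an LLM call).
    stats["expensive_calls"] += 1
    return len(name) > 3

names = ["Rust", "Ruby", "Go", "Racket", "Python", "R"]
matches = verify_with_prefilter(names, starts_with_r, slow_verify)
print(matches, stats["expensive_calls"])
```

Here the verifier runs on four of the six candidates instead of all six; with a more selective filter the savings grow accordingly.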
[+] [-] liam-hinzman|10 months ago|reply
We use an agentic search planner that adapts its search strategy as matches are found, but it could be smarter with substrings.
https://websets.exa.ai/cmad36arq009fl30i4dvkc7wn
[+] [-] hubraumhugo|10 months ago|reply
Since you were part of YC 21, could you share a bit about your pivots/product iterations you went through over the last 4 years?
[+] [-] willbryk|10 months ago|reply
- 2022: Consumer-facing embeddings search (back when we were known as Metaphor)
- 2023: Web search for AIs - once the AI ecosystem heated up, we made a business out of web search + crawling API. This is still our primary business.
- Now: Websets, a useful product built on top of our search tech
If you're curious, our company right now is fully devoted to:
1. Dramatically improving Websets quality
2. Building the best general search engine in the world
[+] [-] xp84|10 months ago|reply
Congrats on your launch. Given how naturally this lends itself to comparison shopping, this is an amazing tool for people trying to find "the best X for me", whether that's a TV, a school, etc. So much of the content you find on Google when trying to answer that type of query is designed to trick, bamboozle, and hide the facts you might use to answer the question (but most of all to get you to click affiliate links).
[+] [-] joshstrange|10 months ago|reply
The initial search/experience is good but then I got dumped here [0] and it's not clear to me if things are still happening or if it broke (it's been at least 5 min with no UI updates).
I can't see the full results yet but this is very interesting and a task I ask OpenAI's Deep Research to attempt periodically. It makes a good show of doing the work but the results are not great IMHO (for asking it to generate lists/tables of data like this). I can see this tool being incredibly useful for lead generation (how I am testing it out).
[0] https://cs.joshstrange.com/dySqK1mb
[+] [-] liam-hinzman|10 months ago|reply
“List of food festivals on the east coast specializing in small dishes or encourages sampling from multiple vendors. features more than 20 vendors”
https://websets.exa.ai/cmad3sonh001zhx0i1h7t692f
btw I like how you host screenshots on your personal website
[+] [-] byearthithatius|10 months ago|reply
The UI showed literally no change. So I checked and the console shows:
```
Try: 14 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 15 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 16 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 17 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 18 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 19 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 20 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Gave up after 10 seconds. 681-7df1b139fa2dc9f0.js:14:3379
filteredSuggestions Array(3) [ {…}, {…}, {…} ] 681-7df1b139fa2dc9f0.js:14:3379
```
Also your table doesn't fit in the viewport so I can't see the results.
Firefox Ubuntu.
[+] [-] pilingual|10 months ago|reply
[+] [-] tibbar|10 months ago|reply
[+] [-] liam-hinzman|10 months ago|reply
[+] [-] mbeavitt|10 months ago|reply
[+] [-] willbryk|10 months ago|reply
Types of searches Websets doesn't currently do well at:
- Products (e.g., ecommerce sites)
- Content that requires authentication/permissions to access
- Non-English content
Some of the above are on our roadmap, and let us know if there's some type of data you'd like us to support!
[+] [-] vetleen|10 months ago|reply
I did one search with 4 criteria, then added the two free columns, and at this point I had spent 750 of my 1000 free credits. The next tier is $49 with only 8000 credits, which means only about 10 searches a month.
The search I did was super useful, and I would love to use the product and recommend it to my coworkers. But the pricing is what stops me.
Best of luck. I'll probably use it once a month if I can remember :)
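The arithmetic behind the "10 searches a month" estimate can be sketched as follows, using only the figures quoted in the comment above. It assumes every search costs the same 750 credits as that one example, which may not hold for other queries.

```python
# Figures quoted in the comment above; the per-search cost is one data point.
credits_per_search = 750   # 4 criteria plus two enrichment columns
free_credits = 1000
paid_credits = 8000        # the $49 tier
paid_price_usd = 49

# Whole searches that fit in each credit allowance.
free_searches = free_credits // credits_per_search
paid_searches = paid_credits // credits_per_search

# Effective price per search on the paid tier under this assumption.
cost_per_search_usd = paid_price_usd / paid_searches

print(free_searches, paid_searches, cost_per_search_usd)
```

Under that assumption the free tier covers a single search and the paid tier works out to a little under $5 per search.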
[+] [-] WuxiFingerHold|10 months ago|reply
[+] [-] fanzhang|10 months ago|reply
This seems a lot better than those quizzes or quotes that ask a bunch of questions first and then ask for your email at the end -- or worse -- a payment.
[+] [-] vetleen|10 months ago|reply
[+] [-] a_n|10 months ago|reply
[+] [-] theamk|10 months ago|reply
"robotics servo motors with two-directional control for under $100"
1. https://mjbots.com/ - their motors are $1369. FAIL.
2. https://www.pololu.com/ - this is a huge store, but they do have some motors like that. Pass, but wish it linked to a specific page and not the top-level one.
3. dh-robotics.com - no prices, but some products on the open market are a few K$. Likely fail as well.
4. https://www.robotarticulation.com/ - The product is not for sale (early beta), and it looks likely much more than $1K. FAIL.
5. https://www.lynxmotion.com/ - another huge store, most two-directional motors are expensive but there are some under $100... Pass, but wish it linked to a specific page and not the top-level one.
[+] [-] 85392_school|10 months ago|reply
> So the search should work best for people, companies, papers, high quality written content.
> Types of searches Websets doesn't currently do well at: products, content that requires authentication/permissions to access, and non-English content
[+] [-] gertlex|10 months ago|reply
My experience around such started with pwm hobby servos, includes dynamixels, and I've worked with larger stuff using harmonic drive gearboxes. Can't recall encountering a "servo" that is one-directional.
[+] [-] esafak|10 months ago|reply
I searched for "alternatives to jq with a functional API" and one of the criteria it came up with was "Provides technical details or comparisons relevant to the alternatives" but the table only listed the repo's url and description. And the description was truncated with ellipses with no way for me to resize the columns. Also, it missed the opportunity to tell me that some shells can replicate jq's functionality. Finally, it would have to be faster to be a daily driver. At this speed, it is something I would reserve for backup, for when the workhorse fails. Which means I would not want to pay $49/month.
Hope that helps. Interesting idea.
[+] [-] willbryk|10 months ago|reply
Yeah, we'd love to make the product as accessible and cheap as possible, but given the state of AI costs in 2025, it's a very expensive product to run, so we have it login-gated. If you're willing to log in though, you'll find a lot of the features that you're mentioning :)
[+] [-] liam-hinzman|10 months ago|reply
If you sign in each result will be graded by an LLM, supporting references will be found, you can get agents to add arbitrary data to each result, and the table UI is much better.
I understand if you don’t want to sign up; in that case I’d just look at the examples linked in the OP.
[+] [-] dbuxton|10 months ago|reply
Our experimental use case is enabling quick and dirty integration of web-based docs into an employee service agentic chatbot - lots of the questions are around “how do I max out my 401k”, which connects to internal information, but some are more like “how do I link a calendar to calendly”.
The one thing I’d love to have in the search product is a cruft cleaner for the results of web queries. Where you have cached the data presumably this wouldn’t add much overhead. Reduces what you have to feed to the LLM downstream and might improve the embeddings performance.
[+] [-] willbryk|10 months ago|reply
If something else though, curious.
[+] [-] frankramos|10 months ago|reply
[+] [-] srameshc|10 months ago|reply
[+] [-] cobertos|10 months ago|reply
[+] [-] willbryk|10 months ago|reply
[+] [-] willbryk|10 months ago|reply
[+] [-] wdrw|10 months ago|reply
Anyway, the model used doesn't seem to be very good; it did not understand a basic "OR" criterion. I asked for a list of companies with an office in Toronto that are involved in hardware development such as custom silicon, robotics, satellites or drones. It completely misunderstood the "or" part (and the "such as" part). E.g. I see many robotics companies marked as a "Miss" because they only do robotics but not any of the other things on my list.
Overall though I love the idea, I would pay for your service (on a pay-as-you-go per-query basis) if the underlying model was smart enough for me to actually rely on the results.
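The difference between the intended "or" reading and the behavior described above can be sketched with plain set operations. This is purely illustrative: the function names and the tag sets are hypothetical, and nothing here reflects how Websets grades results internally.

```python
def matches_any(tags: set[str], wanted: set[str]) -> bool:
    """'OR' semantics: at least one wanted area is present."""
    return bool(tags & wanted)

def matches_all(tags: set[str], wanted: set[str]) -> bool:
    """'AND' semantics: every wanted area must be present."""
    return wanted <= tags

# Hardware areas from the query, joined by "or".
wanted = {"custom silicon", "robotics", "satellites", "drones"}

# A hypothetical company that only does robotics.
company_tags = {"robotics"}

print(matches_any(company_tags, wanted))  # intended reading
print(matches_all(company_tags, wanted))  # the misreading described above
```

A robotics-only company is a match under the "any" reading and a miss under the "all" reading, which is the pattern reported in the comment.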
[+] [-] jackienotchan|10 months ago|reply
Do you have any built-in features that address these issues?
[+] [-] antoniojtorres|10 months ago|reply
[+] [-] drob518|10 months ago|reply
1. I love the idea.
2. The UI needs to work on smaller screens (e.g., tablets). The current layout is VERY cramped.
3. Its ability to search for businesses in a given geography is poor. I asked it to search for businesses in a city and it was giving me results that were obviously incorrect from halfway across the country.
4. For a homepage URL for a business, it once gave me a parked domain name at GoDaddy's "domain for sale" page. That seemed like a blunder. Is that because it's pulling in WHOIS information and it connected some addresses?
5. Performance is quite poor. Perhaps that's because you're getting "Hackernews'd" with a surge of people consuming all your capacity.
[+] [-] campl3r|10 months ago|reply
[+] [-] whoisjuan|10 months ago|reply
When I checked this a year or so ago, I might have gotten the impression that it was cheaper. Now, it costs the same as what Perplexity charges for search-grounded queries, which is the same as Google charges for Gemini queries with search.
So basically, one player sets a price, and everyone is anchored on that as the pricing for the entire category? I'm just genuinely interested in why every offering in this space is priced like this.
It seems a bit misaligned with how pure LLM queries are priced.
I have a product that would benefit from search grounding, but this pricing wouldn't work with my volume of queries.
[+] [-] liam-hinzman|10 months ago|reply
Perplexity charges the same on their lowest tier model, and three times as much for their more expensive models.
Gemini charges $35 per 1000 requests.
https://exa.ai/pricing
https://docs.perplexity.ai/guides/pricing
https://ai.google.dev/gemini-api/docs/pricing
[+] [-] AznHisoka|10 months ago|reply
[+] [-] mfrye0|10 months ago|reply
How do you dedupe entities, like companies and people? I've noticed ChatGPT tends to provide "great" results when asking about different entities, but in reality it just groups similar sounding entities together in its answer.
For example, I asked ChatGPT about a well known startup. It gave me a confident answer about how much they raised, their current status, etc. When looking at the 3 sources they cited though, it was actually 3 different companies that all had similar sounding names that it just grouped together to form its answer.
Basically, how do I trust the output of your system?
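One common way to avoid merging similarly named entities is to key records on something more stable than the name, such as the website domain. The sketch below is an assumption for illustration only, not a description of how Exa or ChatGPT handle deduplication; `canonical_domain`, `group_by_domain`, and the sample records are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def canonical_domain(url: str) -> str:
    """Lowercased hostname without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def group_by_domain(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by domain so that similar names stay separate
    unless they actually point at the same site."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[canonical_domain(record["url"])].append(record)
    return dict(groups)

# Hypothetical records with near-identical names.
records = [
    {"name": "Acme AI", "url": "https://www.acme.ai/about"},
    {"name": "Acme", "url": "https://acme.ai"},
    {"name": "Acme AI", "url": "https://acme-ai.com"},
]

groups = group_by_domain(records)
print({domain: len(items) for domain, items in groups.items()})
```

Here the two `acme.ai` records collapse into one entity while `acme-ai.com` stays separate, even though its name is identical to one of the others.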
[+] [-] liam-hinzman|10 months ago|reply
https://imgur.com/dsGK5dS
[+] [-] upcoming-sesame|10 months ago|reply
Most of the time I want to find some vendors / companies and Deep Research does that but also responds with a wall of unnecessary text where I just want the table