Show HN: Hacker Search – A semantic search engine for Hacker News
233 points | jnnnthnn | 1 year ago | hackersearch.net | reply
I'm Jonathan and I built Hacker Search (https://hackersearch.net), a semantic search engine for Hacker News. Type a keyword or a description of what you're interested in, and you'll get top links from HN surfaced to you along with brief summaries.
Unlike HN's otherwise very valuable search feature, Hacker Search doesn't require you to get your keywords exactly right. That's achieved by leveraging OpenAI's latest embedding models alongside more traditional indexes extracted from the scraped and cleaned up contents of the links.
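As a minimal sketch of the embedding-similarity idea (my stand-in code, not the site's actual implementation): each story and the query get embedded via an embeddings API such as OpenAI's `text-embedding-3-small`, and results are ranked by cosine similarity. The ranking step itself is just linear algebra:

```python
import numpy as np

def rank_by_cosine(query_vec, story_vecs, k=5):
    """Return indices of the k stories most similar to the query vector.

    Vectors would come from an embeddings API (e.g. OpenAI's
    text-embedding-3-small); here they are plain arrays.
    """
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    s = np.asarray(story_vecs, dtype=float)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    sims = s @ q  # dot product of unit vectors = cosine similarity
    return list(np.argsort(-sims)[:k])
```

The "more traditional indexes" would then be blended in on top of this ranking (e.g. full-text matches against the scraped page contents).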
I think there are many more interesting things one could build atop the HN dataset in the age of LLMs (e.g. more explicitly searching for technical opinions, recommending stories to you based on your interests, and making the core search feature more useful). If any of those sound interesting to you, head over to https://hackersearch.net/signup to get notified when I launch them!
Note: at least one person has built something similar before (https://news.ycombinator.com/item?id=36391655). Funnily enough, I only found out about it through my own implementation, and based on my testing, I think Hacker Search generally performs better when doing keyword/sentence searches (vs. whole-document similarity lookup), thanks to the way the data is indexed.
[+] [-] v1sea|1 year ago|reply
Testing it out, I'd say the results for "graph visualization" are focused if a bit incomplete. So to me it has high precision, but lower recall.
I don't see this searching comments. That could be a nice extension. Thanks for sharing.
[+] [-] jnnnthnn|1 year ago|reply
If you feel up to it, you should share your email in the righthand "Unhappy with your results?" widget. My plan is to manually look into the disappointing searches and follow up with better results for folks, in addition to fixing whatever can be fixed.
Agreed re: searching comments (which it indeed currently doesn't do).
[+] [-] isoprophlex|1 year ago|reply
Loving your LLM-generated summaries! Very nice user experience to see at a glance what a hit is about. Also, your back button actually works, haha.
Well done!
[+] [-] jnnnthnn|1 year ago|reply
[+] [-] codethief|1 year ago|reply
Unfortunately, though, it didn't find what I was looking for in the following real-world test case: The other day I tried to remember the name of a SaaS to pin/cache/back up my apt/apk/pip dependencies, which I think I had read about either here[0] or here[1]. After quite a bit of time and some elaborate Google-fu, I did end up finding those HN threads again. However, they did not show up on hackersearch.net for me, neither when entering the service's name nor when I searched for "deterministic Docker builds" or "cache apt apk pip dependencies".
[0]: https://news.ycombinator.com/item?id=39684416
[1]: https://news.ycombinator.com/item?id=39723888
[+] [-] jnnnthnn|1 year ago|reply
I'm planning to fix that in short order; feel free to sign up at https://hackersearch.net/signup if you'd like an update when that goes live!
[+] [-] awendland|1 year ago|reply
I built mine on top of an RSS feed I generate from Hacker News, which filters out any posts linking to the top 1 million domains [1] and creates a readable version of the content. I use it to surface articles on smaller blogs/personal websites; it's become my main content source. It's generated via GitHub Actions every 4 hours and stored in a detached branch on GitHub (~2 GB of data from the past 4 years). Here's an example for posts with >= 10 upvotes [2].
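The filtering step above boils down to a domain check against a big-domains list such as the Majestic Million. A rough sketch (function name and normalization are my assumptions, not the actual pipeline code):

```python
from urllib.parse import urlparse

def is_small_site(post_url, top_domains):
    """Keep only posts whose link domain is NOT in a top-domains set.

    `top_domains` would be loaded from a list like the Majestic Million;
    here it's just a set of bare hostnames.
    """
    host = urlparse(post_url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com and example.com the same
    return host not in top_domains
```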
It only took several hours to build the semantic search on top. And that included time for me to try out and learn several different vector DBs, embedding models, data pipelines, and UI frameworks! The current state of AI tooling is wonderfully simple.
In the end I landed on (selected in haste optimizing for developer ergonomics, so only a partial endorsement):
I generated the index locally on my M2 Mac, which ripped through the ~70k articles in ~12 hours to generate all the embeddings. I run the search site with Podman on a VM from Hetzner (along with other projects) for ~$8/month. All requests are handled on CPU w/o calls to external AI providers. Query times are <200 ms, which includes embedding generation → vector DB lookup → metadata retrieval → page rendering. The server source code is here [3].
Nice work @jnnnthnn! What you built is fast, the rankings were solid, and the summaries are convenient.
[1] https://majestic.com/reports/majestic-million
[2] https://github.com/awendland/hacker-news-small-sites/blob/ge...
[3] https://github.com/awendland/hacker-news-small-sites-website...
[+] [-] jasonjmcghee|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
[+] [-] ofermend|1 year ago|reply
Clearly many of us see the need here. I have also been working on a similar demo: https://search-hackernews.vercel.app/

1. Stack: Vectara for RAG, Vercel for hosting
2. Results show the main story and top 3-4 comments from the story
3. Focused mostly on the search aspect, so clicking a result redirects you to the HN page itself. No summaries, although they'd be easy to add.
Would love to get some feedback and any suggestions for improvement. I'm still working on this as a side project.
Example query to try: "What did Nvidia announce in GTC 2024?" (regular HN search returns empty)
[+] [-] rdli|1 year ago|reply
Here's a question for this crowd: Do we see domain/personalized RAG as the future of search? In other words, instead of Google, you go to your own personal LLM, which has indexed all of the content you care about (whether it's everything from HN, or an extra informative blog post, or ...)? I personally think this would be great. I would still use Google for general-purpose search, but a lot of my search needs are trying to remember that really interesting article someone posted to HN a year ago that is germane to what I'm doing now.
[+] [-] jnnnthnn|1 year ago|reply
Quality aside, I think the primary challenge is figuring out the right UX for delivering that at scale. One of the great advantages of Google is that it's right there in your URL bar, and for many of the searches you might do, it works just fine. Figuring out when it doesn't, and how to provide better results then, seems like a big unsolved UX component of personalized search.
[+] [-] tinyhouse|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
One big distinction with the "site:https://news.ycombinator.com" hack is that the search on Hacker Search directly runs against the underlying link's contents, rather than whatever happens to be on HN. We also more directly leverage HN's curation by factoring in scores.
Appreciate your suggestions; will look into building those!
[+] [-] manca|1 year ago|reply
My only piece of advice, though: try doing the reranking with a dedicated reranker model instead of an LLM -- you'll save on both latency AND cost.
Other than that, good job.
[+] [-] jnnnthnn|1 year ago|reply
[+] [-] hubraumhugo|1 year ago|reply
> recommending stories to you based on your interests
I built this as a service that monitors and classifies HN stories based on your interests (solved my FOMO): https://www.kadoa.com/hacksnack
[+] [-] jnnnthnn|1 year ago|reply
[+] [-] BohuTANG|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
[+] [-] pjot|1 year ago|reply
https://github.com/patricktrainer/duckdb-embedding-search
[+] [-] jasonjmcghee|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
[+] [-] curious_cat_163|1 year ago|reply
> e.g. more explicitly searching for technical opinions...
Yes, please! I would love to be able to search for strongly held opinions by folks who _know_ what they are talking about.
> recommending stories to you based on your interests...
I am curious how, in principle, you would do that? Where do you think the signal that indicates my "interest" lies?
[+] [-] jnnnthnn|1 year ago|reply
To learn your interests we'd at a minimum need to know which HN stories you tend to click or comment on, e.g. via a separate reader view or a browser extension. Presumably your comments and submissions could provide useful signal as well :)
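One naive way to turn that click signal into recommendations (purely my sketch, not jnnnthnn's stated design): average the embeddings of the stories a user engaged with into a single profile vector, then rank new stories against it.

```python
import numpy as np

def interest_profile(clicked_vecs):
    # Average the embeddings of stories the user clicked/commented on,
    # then normalize to get a single "interest" direction.
    p = np.mean(np.asarray(clicked_vecs, dtype=float), axis=0)
    return p / np.linalg.norm(p)

def recommend(profile, story_vecs, k=5):
    # Score candidate stories by cosine similarity to the profile.
    s = np.asarray(story_vecs, dtype=float)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    return list(np.argsort(-(s @ profile))[:k])
```

Real systems would weight by recency and engagement type rather than averaging everything equally, but this is the basic shape.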
[+] [-] fuzzfactor|1 year ago|reply
Apparently the old Algolia search has not been accessible around the world for a few months at least.
[+] [-] simonw|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
I actually generate two summaries: one is part of the ingestion pipeline and used for indexing and embedding, and another is generated on-the-fly based on user queries (the goal there is to "reconcile" the user query with each individual item being suggested).
I use GPT-3.5 Turbo, which works well enough for that purpose. The cost of generating the original summaries from raw page contents came down to about $0.01 per item. That could add up quickly, but I was lucky enough to have some OpenAI credits lying around, so I didn't have to think much about this or explore alternative options.
GPT-4 would produce nicer summaries for the user-facing portion, but the latency and costs are too high for production. With GPT-3.5 however those are super cheap since they require very few tokens (they operate off of the original summaries mentioned above).
Worth noting that I processed stories by score descending and skipped anything under 50 points, which substantially reduced the number of tokens to process.
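The cheap second-stage summary works because its input is the short stored summary, not the full page. The prompt wording below is my guess; only the two-summary structure comes from the description above:

```python
def reconcile_prompt(query, stored_summary):
    """Build the prompt for the on-the-fly, query-conditioned summary.

    Feeding the short precomputed summary (instead of raw page contents)
    to a cheap model like GPT-3.5 Turbo keeps the token count tiny.
    """
    return (
        "Article summary:\n"
        f"{stored_summary}\n\n"
        "In one sentence, explain how this article relates to the "
        f"search query: {query!r}"
    )
```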
[+] [-] avereveard|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
[+] [-] levkk|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
Agreed it could be faster for uncached queries. The embeddings retrieval itself is actually pretty fast (it uses pgvector). However, I found that having an LLM rerank results + generate summaries related to the search query made results more useful, and that's what accounts for much of the latency.
Maybe I should make that a user-customizable setting!
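A user-customizable fast path could look roughly like this: pgvector handles candidate retrieval (its `<=>` operator is cosine distance), and the LLM rerank becomes an optional second stage. Table and column names below are invented for illustration:

```python
# Candidate retrieval via pgvector (SQL sketch; schema names invented):
CANDIDATES_SQL = """
SELECT id, title
FROM stories
ORDER BY embedding <=> %(query_embedding)s  -- pgvector cosine distance
LIMIT 50;
"""

def maybe_rerank(candidates, use_llm=False, llm_rerank=None):
    # Fast path: return pgvector's ordering directly.
    # Slow path: pay the extra latency for an LLM-refined ordering
    # plus query-specific summaries.
    if use_llm and llm_rerank is not None:
        return llm_rerank(candidates)
    return candidates
```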
[+] [-] HanClinto|1 year ago|reply
What about using the embeddings for nearest-neighbor search on similar articles? I.e., for any given article, could you use its embedding to run a search, rather than encoding my query? That would let me find similar/related articles much more easily.
[+] [-] jnnnthnn|1 year ago|reply
Yup, totally feasible. I might add that!
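A minimal version of that lookup, assuming the story embeddings are already in memory (in production it would be a pgvector query): use the article's own stored vector as the query and exclude the article itself.

```python
import numpy as np

def similar_articles(story_idx, story_vecs, k=3):
    """Find the k nearest neighbors of one article's embedding."""
    v = np.asarray(story_vecs, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ v[story_idx]      # article-to-article cosine similarity
    sims[story_idx] = -np.inf    # don't return the article itself
    return list(np.argsort(-sims)[:k])
```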
[+] [-] xwowsersx|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
[+] [-] Fudgel|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
[+] [-] Scene_Cast2|1 year ago|reply
[+] [-] jnnnthnn|1 year ago|reply
- Next.js
- OpenAI's embeddings and GPT endpoints
- Postgres with pgvector (on neon.tech)
- Tailwind
- tRPC
- Vercel for web hosting
- Google Cloud products for data pipelines (GCS, Cloud Tasks)