It's somewhat ironic that the author advocates for keeping it simple and using pgvector but then buries a ton of complexity with an API server, auth server, Cloudflare workers, and durable objects. Especially given
> Supabase is easily the most expensive part of my stack (at $200/month, if we ran it in XL, i.e. the lowest tier with a 4-core CPU)
That could get you a pretty decent VPS and allow you to colocate everything with less complexity. This is exemplified in some of the gotchas, like
> Cloudflare Workers demand an entirely different pattern, even compared to other serverless runtimes like Lambda
If I'm hacking something together, learning an entirely different pattern for some third-party service is the last thing I want to do.
All that being said though, maybe all it would've done is prolong the inevitable death due to the product gap the author concludes with.
Totally fair point. Thanks for taking the time to read through it! I guess I didn't want to use a VPS and then have to switch to something else if the product really worked, though that rhymes with premature optimization.
Some other clarifications:
- I was also surprised by how expensive Supabase turned out to be and only got there because I was trying to sync very big repos ahead of time. I could see an alternative product where the cost here would be minimal too
- I did see this project as an opportunity to try out Cloudflare. As mentioned in the post, as a full stack TypeScript developer, I thought Cloudflare could be a good fit, and I still really want it to succeed as a cloud platform
- deploying separate API and auth servers is actually simpler than it sounds, since each is a Cloudflare Worker! I will try to open source this project so this is clearer
- the durable objects rate limiter was wholly experimental and didn't make it into production
> All that being said though, maybe all it would've done is prolong the inevitable death due to the product gap the author concludes with.
Very true :(
Not speaking for OP’s experience but I suppose that you might default to all this fancy serverless edge worker stuff if you learned how to code on their (usually generous) free-tier plans, or they were the only things you dealt with at work.
Meanwhile setting up a little VPS box would come more naturally if you learned in the era of the LAMP stack and got your hands dirty with Linux.
In fact I wonder if for some people that’s made worse by the tendency to split frontend and backend web development into completely separate disciplines when originally you did the whole thing.
Author here. Over the last few months, I have built and launched a free semantic search tool for GitHub called SemHub (https://semhub.dev/). In this blog post, I share what I’ve learned and why I’ve failed, so that other builders can learn from my experience. This blog post runs long and I have sign-posted each section. I have marked the sections that I consider particularly insightful with an asterisk (*).
I have also summarized my key lessons here:
1. Default to pgvector, avoid premature optimization.
2. You probably can get away with shorter embeddings if you’re using Matryoshka embedding models.
3. Filtering with vector search may be harder than you expect.
4. If you love full stack TypeScript and use AWS, you’ll love SST. One day, I hope to be able to recommend Cloudflare in equally strong terms too.
5. Building is only half the battle. You have to solve a big enough problem and meet your users where they’re at.
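On lesson 2, for concreteness: with a Matryoshka-trained embedding model, the leading dimensions carry most of the signal, so you can simply truncate the vector — but you must re-normalize afterwards or cosine/inner-product comparisons break. A minimal sketch in pure Python (the 1536-dim vector below is just a stand-in for a real model output):

```python
import math

def truncate_embedding(vec, dims):
    """Truncate a Matryoshka embedding to `dims` dimensions and L2-renormalize.

    Truncation alone leaves the shortened vector shorter than unit length,
    which silently skews cosine similarity, so renormalizing is essential.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]

# stand-in for a 1536-dim model output
full = [0.1] * 1536
short = truncate_embedding(full, 256)
```

Whether 256 dims is enough recall for your corpus is something only an eval can tell you; the point is that the mechanical part is this small.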
it's weird you consider this a failure. you spent a few months and learned how to work with embedding models to build an efficient search. the fact that your search works well is a successful outcome. if your goal was to turn a few month effort into a thriving business that's never going to happen period - it only seems possible because when it does happen for people we completely discount the luck factor.
if you want to turn your search into a business, now that's a new and different effort, mostly marketing and stuff that most self respecting engineers give zero shits about, but if that's your real goal don't call it a failure yet because you haven't even tried.
Hi, thanks for building a great tool and a great write-up! I was trying to add a number of repos under oslc/, oslc-op/, and eclipse-lyo/* orgs but no joy - internal server error. Hopefully, you will reconsider shutting down the project (just heard about it and am quite excited)!
I think a project like yours is going to be helpful to OSS library maintainers to see which features are used in downstream projects and which have issues. Especially, as in my case, when the project attempts to advance an open standard and just checking issues in the main repo will not give you the full picture. For this use case, I deployed my own instance to index all OSS repos implementing OSLC REST or using our Lyo SDK - https://oslc-sourcebot.berezovskyi.me/ . I think your tool is great in complementing the code search.
Having built a failed semantic search engine for life sciences (bioask, when it existed), I think the last point should be the first. Not finding product-market fit quickly is what killed mine.
Thanks for posting this, very timely as I'm also playing around with pgvector for semantic search. I saw that you ended up trimming inputs longer than 8K tokens. Have you looked into chunking (breaking input into smaller chunks and doing vector search on the chunks)? Embedding models I'm playing with have a max of 512 tokens, so chunking is pretty much a must. Choosing a chunking strategy seems to be a deep rabbit hole of its own.
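For anyone starting down that rabbit hole, the simplest baseline is fixed-size windows with overlap, so content straddling a boundary lands intact in at least one chunk. A rough sketch (word-based for illustration; a real pipeline would count tokens with the model's own tokenizer and prefer sentence/paragraph boundaries):

```python
def chunk_words(text, max_words=512, overlap=64):
    """Split text into overlapping fixed-size word windows.

    Overlap gives context near a boundary a chance to appear whole in at
    least one chunk. Real systems count tokenizer tokens, not words.
    """
    words = text.split()
    if not words:
        return []
    step = max(1, max_words - overlap)  # guard against overlap >= max_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

chunks = chunk_words("word " * 1000, max_words=512, overlap=64)
```

More sophisticated strategies (semantic chunking, recursive splitting) are refinements of this same shape: pick boundaries, keep some shared context.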
I was the first employee at a company which uses RAG (Halcyon), and I’ve been working through issues with various vector store providers for almost two years now. We’ve gone from tens of thousands to billions of embeddings in that timeframe - so I feel qualified to at least offer my opinion on the problem.
I agree that starting with pgvector is wise. It’s the thing you already have (postgres), and it works pretty well out of the box. But there are definitely gotchas that don’t usually get mentioned. Although the pgvector filtering story is better than it was a year ago, high-cardinality filters still feel like a bit of an afterthought (low-cardinality filters can be solved with partial indices even at scale). You should also be aware that the workload for ANN is pretty different from normal web-app stuff, so you probably want your embeddings in a separate, differently-optimized database. And if you do lots of updates or deletes, you’ll need to make sure autovacuum is properly tuned or else index performance will suffer. Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.
Dedicated vector stores often solve some of these problems but create others. Index builds are often much faster, and you’re working at a higher level (for better or worse) so there’s less time spent on tuning indices or database configurations. But (as mentioned in other comments) keeping your data in sync is a huge issue. Even if updates and deletes aren’t a big part of your workload, figuring out what metadata to index alongside your vectors can be challenging. Adding new pieces of metadata may involve rebuilding the entire index, so you need a robust way to move terabytes of data reasonably quickly. The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.
> Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale
For anyone coming across this without much experience here, for building these indexes in pgvector it makes a massive difference to increase your maintenance memory above the default. Either as a separate db like whakim mentioned, or for specific maintenance periods depending on your use case.
```
SHOW maintenance_work_mem;    -- Postgres defaults to a modest 64MB
SET maintenance_work_mem = X; -- e.g. X = '2GB' during a large index build
```
In one of our semantic search use cases, we control the ingestion of the searchable content (laws, basically) so we can control when and how we choose to index it. And then I've set up classic relational db indexing (in addition to vector indexing) for our quite predictable query patterns.
For us that means our actual semantic db query takes about 10ms.
Starting from 10s of millions of entries, filtered to ~50k (jurisdictionally, in our case) relevant ones and then performing vector similarity search with topK/limit.
Built into our ORM and zero round-trip latency to Pinecone or syncing issues.
EDIT: I imagine whakim has more experience than me and YMMV, just sharing lesson learned. Even with higher maintenance mem the index building is super slow for HNSW
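For reference, the prefilter-then-topK pattern described above can be expressed as a single pgvector query: a cheap B-tree filter narrows the candidate set, then `<=>` (pgvector's cosine-distance operator) ranks the survivors. A sketch with made-up table and column names (whether Postgres actually uses the vector index after the filter depends on planner behavior and your pgvector version):

```python
# Hypothetical filtered-kNN query builder; "documents", "jurisdiction",
# and "embedding" are illustrative names, not from the original post.
def build_filtered_knn_query(filter_column: str) -> str:
    # filter_column must come from a trusted whitelist -- interpolating
    # identifiers from user input would be a SQL injection risk.
    return f"""
    SELECT id, title, embedding <=> %(query_vec)s AS distance
    FROM documents
    WHERE {filter_column} = %(filter_value)s
    ORDER BY embedding <=> %(query_vec)s
    LIMIT %(top_k)s;
    """

sql = build_filtered_knn_query("jurisdiction")
```

The query would then be executed with a driver like psycopg, passing the query vector, filter value, and topK as bound parameters.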
Thank you for the comment! Compared to you, I have only scratched the surface of this quite complex domain; would love to get more of your input!
> building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.
Yes, I experienced this too. I went from 1536 to 256 dimensions and did not try as many values as I'd have liked, because spinning up a new database and recreating the embeddings simply took too long. I’m glad it worked well enough for me, but without a quick way to experiment with these hyperparameters, who knows whether I’ve struck the tradeoff at the right place.
Someone on Twitter reached out and pointed out that one could quantize the embeddings to bit vectors and search with Hamming distance; supposedly the performance hit is actually negligible, especially if you add a quick rescore step: https://huggingface.co/blog/embedding-quantization
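The idea is simple enough to sketch: keep only the sign bit of each dimension, shortlist candidates by Hamming distance on the bit vectors, then rescore the shortlist with the full float vectors. A toy pure-Python version (real systems would use bit-packed numpy arrays; the tiny 3-dim corpus here is purely illustrative):

```python
def to_bits(vec):
    """Binary-quantize: 1 if a component is positive, else 0, packed into an int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two packed bit vectors."""
    return bin(a ^ b).count("1")

def search(query_vec, corpus, shortlist=10, top_k=3):
    """Hamming-distance shortlist over bit vectors, then exact rescore."""
    qbits = to_bits(query_vec)
    # cheap pass: rank by Hamming distance on the quantized vectors
    by_hamming = sorted(corpus, key=lambda item: hamming(qbits, to_bits(item[1])))

    # rescore pass: exact dot product on the full float vectors
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    return sorted(by_hamming[:shortlist], key=lambda item: -dot(query_vec, item[1]))[:top_k]

corpus = [("a", [0.9, -0.1, 0.8]), ("b", [-0.5, 0.4, -0.2]), ("c", [0.7, 0.2, 0.6])]
results = search([1.0, -0.2, 0.9], corpus, shortlist=3, top_k=2)
```

The memory win is what makes this attractive: 32x smaller vectors, with the rescore step recovering most of the lost recall.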
> But (as mentioned in other comments) keeping your data in sync is a huge issue.
Curious if you have any good solutions in this respect.
> The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.
I realize they market heavily on this, but for open source databases, wouldn't the fact that you can see the source code make it easier to reason about this? Or is your point that their implementations here are all custom and require much more specialized knowledge to evaluate effectively?
If you don't mind me giving you some unsolicited product feedback: I think SemHub didn't do well because it's unclear what problem it's actually solving. Who actually wants your product? What's the use case? I use GitHub issues all the time, and I can't think of a reason I'd want semhub. If I need to find a particular issue on, say, TypeScript, I'll just google "github typescript issue [description]" and pull up the correct thing 9 times out of 10. And that's already a pretty rare percentage of the time I spend on GitHub.
Thanks for the feedback! To be honest, my own experience is actually very similar to yours.
The original pain point probably only exists for a small minority of open source maintainers who manage multiple repos and actually search across them regularly. Most devs are probably like you and me, and the mediocre GitHub search experience is more than compensated for by using Google.
In its current iteration, it's quite hard to get regular devs to change their search behavior, and even for those who experience this pain point, it probably isn't acute enough to make them switch.
If I continue to work on this, I would want to (1) solve a bigger + more frequent pain point; (2) build something that requires a smaller change in user behavior.
* Search for "memory leak", get "index out of memory"
* Search "API rate limits", get "throttling", the "250 results" limit, and "rate limiting"
* Search issues for "user authentication" to see whether anyone has submitted your feature request
* Search for "SQL injection" to get "database infiltration" or "SQL vulnerability"
https://manticoresearch.com/blog/manticoresearch-github-issu...
You can index any GH repo and then search it with vector, keyword, hybrid and more. There's faceting and anything else you could ever want. And it is astoundingly fast - even vector search.
Here's the direct link to the demo https://github.manticoresearch.com/
Hey Warren, great job on the site, but what you'll need to do is SEO. You're a great writer, so all you need to add to your writing skills is SEO. I did a basic SEO audit of semhub.dev and you have no SEO. While this is niche, you'll need to add a blog to your website and use basic SEO keyword research to find what your target audience is searching for instead of just blogging to blog. Start reading https://backlinko.com/seo-basics-for-beginners and you'll be well on your way. It should take about a year for you to get some good traction. Don't rush, just keep learning more and more every day and you'll get there in a few years with organic SEO alone. The comments here alone are proof that you have a viable MVP.
GL!
Great write-up, especially agree on pgvector with small (ideally fine-tuned) embeddings. There’s so much complexity that comes with keeping your vector db in sync with your main db (especially once you start filtering with metadata). 90% of gen AI apps don’t need it.
> There’s so much complexity that comes with keeping your vector db in sync with your main db (especially once you start filtering with metadata)
Ohh, do you speak from experience? I know I will likely never do this, but I'm curious how you did it. When I looked into this, I found that Airbyte has something to connect the vector db with the main db, but I never bit that bullet (thankfully)
> * No way to search across multiple repos within GitHub.
> * No way to easily see open and closed issues in the same view.
I don't quite understand, because searching issues across all of Github and also within orgs is already supported. Those searches show both open and closed issues by default.
For searches on a single repo, just removing the "state" filter entirely from the query also shows open and closed issues.
I do think that semantic search on issues is a cool idea and the semantic/fuzzy aspect is probably the biggest motivator for the project. It just felt funny to see stuff that Github can actually already do listed at the top of motivating issues.
GitHub search is pretty unreliable in my experience. Search results are limited to 1000 items, and you never know if the index you’re searching against is up to date: unless a file has been opened recently in the GitHub web UI, there is a significant and unpredictable delay between a commit and the indexing.
So far I’ve been very happy with Livegrep, we are using it to search across ~10k repos, the index is rebuilt once an hour with a simple cron job. Searching is insanely fast, and it’s using very little resources, just a simple Compute Engine instance. The main downside is the lack of multiline search, but so far that hasn’t been too much of a problem.
Am I misunderstanding what is meant by semantic code search? I thought the idea was that you run something like a parser on the repo to extract function/class/variable names and then allow searching on a more rich set of data, rather than tokenizing it like English.
I know GitHub kind of added this but their version still falls apart even in common languages like C++. It's not unusual for it to completely miss cross references, even in smaller repos. A proper compiler's-eye view of symbolic data would be super useful, and GitHub's halfway attempt can be frustratingly daft about it.
For code search, I have used grep.app, which works reasonably well
I started a quick weekend project to do just that today: index my OSS project's [1] issues & discussions, so I can RAG-ask it to find references when I feel like I'm repeating myself (in "see issue/PR/discussion #123", finding the 123 is the hardest part).
This article might be super helpful, thanks! I don't intend to make a product out of it though, so I can cut a lot of corners, like using a PAT for auth and running everything locally.
[1] https://github.com/47ng/nuqs
After this failed experience with SemHub, I am actually thinking of building something like this, as open source maintainers like you are definitely the ICP! (nuqs seems really cool btw, storing state in URL params is definitely the way to go)
To elaborate, I was thinking of:
- running a cron that checks repos every X minutes
- for every new issue someone has opened, I will run an agent that (1) checks e.g. SemHub to look for similar issues; (2) checks the project's Discord server or Slack channel to see if anyone has raised something similar; (3) run a general search
- use LLMs to compose a helpful reply pointing the OP to that other issue/Discord discussion etc.
From other OSS maintainers, I've heard that being able to reliably identify duplicates would be a huge plus. Does this sound like something you'd be interested to try? Let me know how I can reach you if/when I have built something like this!
I am personally quite annoyed by all the AI slop being created on social media and even GitHub PRs and would love to use the same technology to do something pro-social.
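If useful to anyone thinking along the same lines, the core of step (1) is just a similarity threshold over issue embeddings. A toy sketch (the tiny hand-made vectors and issue IDs here are purely illustrative; a real version would call an embedding model and tune the threshold against labeled duplicates):

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def likely_duplicates(new_issue_vec, existing, threshold=0.85):
    """Return (issue_id, similarity) pairs above the threshold, best match first."""
    hits = [(iid, cosine(new_issue_vec, vec)) for iid, vec in existing]
    return sorted([h for h in hits if h[1] >= threshold], key=lambda h: -h[1])

# illustrative issue IDs and embeddings
existing = [(101, [1.0, 0.0, 0.0]), (202, [0.9, 0.1, 0.0]), (303, [0.0, 1.0, 0.0])]
dupes = likely_duplicates([0.95, 0.05, 0.0], existing)
```

Everything past this — checking Discord/Slack, composing the reply — is orchestration around that one ranked list.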
> When using Cloudflare Workers as an API server, I have experienced requests that would “fail silently” and leave a “hanging connection”, with no error thrown, no log emitted, and a frontend that is just loading. Honestly, no idea what’s up with this.
Yikes, these sorts of errors are so hard to debug. Especially if you don't have a real server to log into to get pcaps.
Cloudflare workers are not amazing in terms of communicating problems. The errors you get can also be out of sync with the docs and the support doesn't have access to poke at your issues directly. Together with the custom runtime and outdated TS types... it can be a very frustrating DX.
> Filtering with vector search may be harder than you expect.
I've only ever used it for a small proof of concept, but Qdrant is great at categorical filtering with HNSW.
https://qdrant.tech/articles/filtrable-hnsw/