zxt_tzx | 11 months ago

Author here. Over the last few months, I have built and launched a free semantic search tool for GitHub called SemHub (https://semhub.dev/). In this blog post, I share what I’ve learned and why I’ve failed, so that other builders can learn from my experience. This blog post runs long and I have signposted each section. I have marked the sections that I consider particularly insightful with an asterisk (*).

I have also summarized my key lessons here:

1. Default to pgvector, avoid premature optimization.

2. You probably can get away with shorter embeddings if you’re using Matryoshka embedding models.

3. Filtering with vector search may be harder than you expect.

4. If you love full stack TypeScript and use AWS, you’ll love SST. One day, I hope I can recommend Cloudflare in equally strong terms too.

5. Building is only half the battle. You have to solve a big enough problem and meet your users where they’re at.
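
To make lesson 2 concrete: with a Matryoshka-style model, a shorter embedding is usually obtained by keeping the leading dimensions and re-normalizing. The sketch below is only an illustration under that assumption. The function names and the toy vector are hypothetical and are not taken from SemHub, and truncation like this is only meaningful for models trained to support it.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    if dims <= 0 or dims > len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # Nothing to rescale; return the zero vector unchanged.
        return list(head)
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 4-dimensional "embedding" (made up for illustration).
full = [0.6, 0.0, 0.8, 0.0]
short = truncate_embedding(full, 2)
```

Shorter vectors mean a smaller index and cheaper distance computations. How much recall is lost depends on the model and the data, so it is worth measuring on your own queries.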


gfody|11 months ago

it's weird you consider this a failure. you spent a few months and learned how to work with embedding models to build an efficient search. the fact that your search works well is a successful outcome. if your goal was to turn a few months' effort into a thriving business, that's never going to happen, period - it only seems possible because when it does happen for people we completely discount the luck factor.

if you want to turn your search into a business now that's a new and different effort, mostly marketing and stuff that most self-respecting engineers give zero shits about, but if that's your real goal don't call it a failure yet because you haven't even tried.

zxt_tzx|11 months ago

> it's weird you consider this a failure. you spent a few months and learned how to work with embedding models to build an efficient search. the fact that your search works well is a successful outcome.

Thank you for your encouragement! I take your point that it was not a technical failure, but I think it's still a product failure in the sense that SemHub was not solving a big enough pain point for sufficiently many people.

> if you want to turn your search into a business now that's a new and different effort, mostly marketing and stuff that most self-respecting engineers give zero shits about, but if that's your real goal don't call it a failure yet because you haven't even tried.

Haha to be honest, my goal was even more modest: SemHub is intended to be a free tool for people to use, and we don't intend to monetize it. I also did try to market it (DMing people, Show HN), but the initial users who tried it did not stick around.

Sure, I could've marketed SemHub more, but I think the best ideas carry within themselves a certain virality and I don't think this is it.

fulafel|11 months ago

SST: https://github.com/sst/sst - vaguely similar to CDK but can also manage some non-AWS resources and seems TypeScript-only

e12e|11 months ago

Apparently they started on top of CDK, then migrated to Pulumi, adding support for Terraform providers.

Looks like one of the more interesting deploy toolkits I've seen in a while.

smarx007|11 months ago

Hi, thanks for building a great tool and a great write-up! I was trying to add a number of repos under oslc/, oslc-op/, and eclipse-lyo/* orgs but no joy - internal server error. Hopefully, you will reconsider shutting down the project (just heard about it and am quite excited)!

I think a project like yours is going to be helpful to OSS library maintainers to see which features are used in downstream projects and which have issues. Especially, as in my case, when the project attempts to advance an open standard and just checking issues in the main repo will not give you the full picture. For this use case, I deployed my own instance to index all OSS repos implementing OSLC REST or using our Lyo SDK - https://oslc-sourcebot.berezovskyi.me/ . I think your tool is great in complementing the code search.

zxt_tzx|11 months ago

Ohh apologies, I think there was a bug that led to the Internal Server Error, please try again, I _think_ it should be working now!

> I think a project like yours is going to be helpful to OSS library maintainers to see which features are used in downstream projects and which have issues.

That was indeed the original motivation! Will see if I can convince Ammar to reconsider shutting down the project, but no promises

> For this use case, I deployed my own instance to index all OSS repos implementing OSLC REST or using our Lyo SDK

Ohh, in case it's not clear from the UI, you could create an account and index your own "collection" of repos and search from within that interface. I had originally wanted to build out this "collection" concept a lot more (e.g. mixing private and public repos), but I thought it was more important to see if there's traction for the public search idea at all

vaidhy|11 months ago

Having built a failed semantic search engine for life sciences (bioask when it existed), I think the last point should be the first. Not getting product-market fit very quickly killed mine.

romanhn|11 months ago

Thanks for posting this, very timely as I'm also playing around with pgvector for semantic search. I saw that you ended up trimming inputs longer than 8K tokens. Have you looked into chunking (breaking input into smaller chunks and doing vector search on the chunks)? Embedding models I'm playing with have a max of 512 tokens, so chunking is pretty much a must. Choosing a chunking strategy seems to be a deep rabbit hole of its own.
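
As a rough illustration of the simplest chunking strategy mentioned above (fixed-size windows with some overlap), here is a minimal sketch. It assumes whitespace-separated words as a stand-in for model tokens; a real embedding model uses its own subword tokenizer, so actual counts will differ. The function name and default sizes are hypothetical.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token list into windows of at most `max_tokens`,
    where consecutive windows share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            # This window already reaches the end of the input.
            break
    return chunks


# Whitespace split is only an approximation of real tokenization.
words = "the quick brown fox jumps over the lazy dog again".split()
pieces = chunk_tokens(words, max_tokens=4, overlap=1)
```

The overlap is there so that a sentence cut at a window boundary still appears intact in at least one chunk. Structure-aware splitting (by heading, paragraph, or code block) is the next step down the rabbit hole.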

zxt_tzx|11 months ago

> Have you looked into chunking (breaking input into smaller chunks and doing vector search on the chunks)?

Ohh I had not seriously considered this until reading this. I could have multiple embeddings per issue and search across those embeddings and if the same issue is matched multiple times, I would probably take the strongest match and dedupe it.

I could create embeddings for comments too and search across those.

Thanks for the suggestion, would be a good thing to try!

> Choosing a chunking strategy seems to be a deep rabbit hole of its own.

Yes this is true. In my case, I think the metadata fields like Title and Labels are probably doing a lot of the work (which would be duplicated across chunks?) and, within an issue body, off the top of my head, I can't see any intuitive ways to chunk it.

I have heard that for standard RAG, chunking goes a surprisingly long way!
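
The "take the strongest match and dedupe" idea from this reply could look roughly like the sketch below. It assumes the vector search returns `(issue_id, score)` pairs, one per matching chunk, with higher scores meaning closer matches. The function name and data shape are hypothetical, not SemHub's actual code.

```python
def best_match_per_issue(chunk_hits, limit=10):
    """Collapse chunk-level hits to one row per issue, keeping the
    highest score seen for each issue, sorted best first."""
    best = {}
    for issue_id, score in chunk_hits:
        if issue_id not in best or score > best[issue_id]:
            best[issue_id] = score
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]


# Made-up hits: issue "a" matches twice through different chunks.
hits = [("a", 0.70), ("b", 0.90), ("a", 0.95), ("c", 0.20)]
top = best_match_per_issue(hits, limit=2)
```

Because several chunks can collapse into one issue, the chunk-level query typically needs to fetch more than `limit` rows to end up with `limit` distinct issues.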

cynicalsecurity|11 months ago

By 5, do you mean promoting the app? It is by far the biggest problem, yes. In many cases even bigger than building the app itself.

niel|11 months ago

Thanks for writing this up!

> Filtering with vector search may be harder than you expect.

I've only ever used it for a small proof of concept, but Qdrant is great at categorical filtering with HNSW.

https://qdrant.tech/articles/filtrable-hnsw/
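
One commonly cited reason filtering is harder than expected: if the filter is applied after the nearest-neighbour step, the top-k candidates may all fail the filter, leaving fewer than k results. The toy sketch below shows this, with an exact brute-force search standing in for an approximate index. The data and function names are made up, and this is not how Qdrant or pgvector are implemented internally.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(items, query, k):
    """Exact nearest neighbours by dot product (stand-in for an ANN index)."""
    return sorted(items, key=lambda it: dot(it["vec"], query), reverse=True)[:k]


def post_filter(items, query, k, label):
    """Search first, then drop candidates that fail the filter."""
    return [it for it in top_k(items, query, k) if it["label"] == label]


def pre_filter(items, query, k, label):
    """Restrict to matching items first, then search within them."""
    return top_k([it for it in items if it["label"] == label], query, k)


items = [
    {"id": 1, "label": "bug", "vec": [1.0, 0.0]},
    {"id": 2, "label": "bug", "vec": [0.9, 0.1]},
    {"id": 3, "label": "docs", "vec": [0.8, 0.2]},
    {"id": 4, "label": "docs", "vec": [0.1, 0.9]},
]
query = [1.0, 0.0]

after = post_filter(items, query, k=2, label="docs")   # both top hits are "bug"
before = pre_filter(items, query, k=2, label="docs")
```

Pre-filtering returns the right rows here, but on a large table it may force an exact scan over everything that matches the filter. Making the index itself filter-aware, as the linked article describes, is an attempt to avoid both problems.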

zxt_tzx|11 months ago

Thanks for sharing! Do you have more details to share, e.g. did you just have a vector db, or did you have a main db as well?

In my research, Qdrant was also the top contender and I even created an account with them, but the need to sync two dbs put me off

wrs|11 months ago

Fantastic writeup — thank you for taking the time to do this!

zxt_tzx|11 months ago

I'm glad you found it helpful :)