If you need semantic search locally then it's fine, but serving an embedding model might still be challenging. And if you want to expose it publicly, your laptop might not be enough.
I've hosted embedding models on AWS Lambda (fair point that this is a vendor, but it's 1 vs. 3). If you try an LLM with 1B+ parameters you will struggle, but if the difference between a lightweight BERT-like transformer and an LLM is only a few % of loss, why bother getting your credit card out?
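To make the Lambda idea concrete, here's a minimal sketch of a handler for serving embeddings. The hash-based `embed` function is a self-contained stand-in so the example runs anywhere; a real deployment would replace it with a small BERT-style model (e.g. an E5 variant loaded via sentence-transformers), and the `DIM` of 64 is arbitrary.

```python
import hashlib
import json
import math

DIM = 64  # toy embedding size; small transformer models typically use 384+

def embed(text: str) -> list[float]:
    """Stand-in embedder: hashes character trigrams into a fixed-size
    vector and L2-normalizes it. In production this would be a forward
    pass through a lightweight transformer instead."""
    vec = [0.0] * DIM
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def handler(event, context=None):
    """Lambda-style entry point: expects {"texts": [...]} in the request body."""
    body = json.loads(event["body"])
    return {
        "statusCode": 200,
        "body": json.dumps({"embeddings": [embed(t) for t in body["texts"]]}),
    }
```

The point of the shape: cold starts stay cheap because the model (here, nothing at all) is loaded at module level, outside the handler.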
Edit: another thought, skip Lambda entirely and run the embedding job on the server as a background process, and use an on-disk vector store (LanceDB)
Shameless plug: I built Mighty Inference Server to solve this problem. Fast embeddings with a minimal footprint, and better BEIR and MTEB scores using the lightning-fast, small E5 V2 models. Scales linearly on CPU; no GPU needed.