downsplat | 2 months ago

Yeah, at $WORK we use various LLM APIs to analyze text. It's not heavy usage in terms of tokens, but it's maybe 10K calls per day. We've found that response times vary a lot, sometimes going over a minute for simple tasks, and random failures happen. Retry logic is definitely mandatory, and it's good to have multiple providers ready. We're abstracting calls across three different APIs (OpenAI, Gemini, and Mistral; btw, we're getting pretty good results with Mistral!) so we can switch workloads quickly if needed.
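
For illustration, the retry-plus-fallback pattern described here could look roughly like the sketch below. It is not the actual code from $WORK: `call_with_fallback`, `AllProvidersFailed`, and the provider callables are hypothetical names, and each callable is assumed to wrap one vendor's client (with its own request timeout) and to raise an exception on failure.

```python
import random
import time
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider has used up its retries (hypothetical name)."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try providers in order; retry each with exponential backoff and jitter.

    Returns (provider_name, response_text) from the first successful call.
    """
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(max_attempts):
            try:
                return name, call(prompt)
            except Exception as exc:  # assumption: real code would catch narrower errors
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                if attempt < max_attempts - 1:
                    # Back off before retrying the same provider.
                    sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
        # This provider is exhausted; fall through to the next one.
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    # Stub providers standing in for real API clients.
    def flaky(prompt: str) -> str:
        raise TimeoutError("simulated timeout")

    def stable(prompt: str) -> str:
        return f"summary of: {prompt}"

    print(call_with_fallback("some text", [("primary", flaky), ("backup", stable)],
                             base_delay=0.01))
```

Keeping the providers as plain callables behind one function is what makes switching workloads cheap: reordering the list changes the preferred provider without touching the callers.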

jwillp|2 months ago

I've been impressed by Ollama running locally for my work, which involves grouping short text snippets by semantic meaning using embeddings, as well as summarization tasks. Depending on your needs, a local GPU can sometimes beat the cloud. (I get no failures and consistent response times with no extra bill.) Obviously YMMV, and it's not ideal for scaling up unless you love hardware.
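
For anyone wondering what "grouping by semantic meaning using embeddings" can look like, here is a minimal sketch. It is one simple approach (greedy grouping by cosine similarity), not necessarily the method used above. The vectors are made up for the example; in practice they would come from whatever embedding model you run locally. `cosine` and `group_by_similarity` are hypothetical helper names, and the 0.8 threshold is an arbitrary assumption.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> list[list[int]]:
    """Greedy grouping: each item joins the first group whose seed is similar enough.

    Returns groups of indices into `embeddings`; the first index in each
    group is that group's seed.
    """
    groups: list[list[int]] = []
    for i, vec in enumerate(embeddings):
        for group in groups:
            if cosine(embeddings[group[0]], vec) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([i])
    return groups


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real embeddings.
    vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
    print(group_by_similarity(vectors))
```

The result depends on input order and on the threshold, so this is only a starting point; a proper clustering algorithm is the usual next step once the snippets number in the thousands.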

duckmysick|2 months ago

Which models have you been using?

phantasmish|2 months ago

It'd be kinda nice if they exposed whatever queuing is going on behind the scenes, so you could at least communicate that to your users.

aamoscodes|2 months ago

IIRC this is almost exactly the use case for OpenRouter, down to provider fallback https://openrouter.ai/docs/guides/best-practices/uptime-opti...