item 35530478

Ask HN: How to deploy LLMs to serve a million users?

17 points | nothrowaways | 2 years ago

What are the current practices for serving inherently large LLMs to a large number of users (with fast tokens/second)?

15 comments

[+] oceanplexian|2 years ago|reply
I don't work at OpenAI but as a Systems Engineer it's pretty easy to conceptualize how they'd do it.

- Instead of serving up one 175B parameter model, ChatGPT is probably a collection of smaller models that classify the input and then send it off to an optimized model to reduce the memory footprint for that task, such as classification, summarization, coding, and so on.

- Faking the memory from the chat. Context size = compute time. If they can "compress" a conversation by summarizing previous context, it will likely reduce the tokens they need to generate for each response. It definitely feels like they are doing something like this, as the chat gets longer and the model is less able to recall context from the chat.

- Model quantization: This is being heavily used in the OSS world and I'd be shocked if OpenAI weren't trying something similar. You can use quantized models without losing much accuracy and further reduce the memory footprint.

- Heavy use of caching. If I were them I'd be caching like crazy. There are a lot of clever tricks with caching variants, so users don't realize they're looking at a cached response. You could take the cached response from the big LLM, then use a simple LLM to reword it so users think they're still talking to the big, smart model.

- All the standard stuff in a BigCo stack (Containers, some kind of IaC, CI/CD stacks, testing environments, various types of databases or caching software, service discovery, observability tooling, etc) And probably a shitload of glue code that makes the magic happen behind the scenes.
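The caching idea in the list above could be sketched roughly like this. Everything here is illustrative: `big_model` and `reworder` are hypothetical stand-ins for the expensive and cheap models, and the dict stands in for something like Redis in a real deployment.

```python
import hashlib

def normalize(prompt: str) -> str:
    # Collapse whitespace and lowercase so trivially different prompts share a cache key.
    return " ".join(prompt.lower().split())

def cache_key(prompt: str) -> str:
    return hashlib.sha256(normalize(prompt).encode()).hexdigest()

# In-memory cache standing in for a shared cache service.
_cache: dict[str, str] = {}

def serve(prompt: str, big_model, reworder) -> str:
    key = cache_key(prompt)
    if key in _cache:
        # Cache hit: reword the stored answer with a cheap model so repeats don't look canned.
        return reworder(_cache[key])
    answer = big_model(prompt)
    _cache[key] = answer
    return answer
```

The sha256 of a normalized prompt only catches near-identical wording; a real system would more likely use embedding similarity to decide what counts as "the same question".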

[+] dbish|2 years ago|reply
Very much doubt #1; the whole reason this works is model size. Smaller models with some intent model on top are very unlikely to produce the results they've been able to (and it goes against everything they've shared about how this works).

For #2, they've talked about this openly (and if you use any of the underlying models directly you can see it): there is a limited context window (~4k tokens for davinci-003), so, compute time aside, they have to compress to fit that window.
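That compression can be sketched like this, assuming a hypothetical `summarize()` helper; the ~4 characters/token estimate is a common rule of thumb, not OpenAI's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def compress_history(turns: list[str], summarize, budget: int = 4000) -> list[str]:
    """Keep recent turns verbatim; fold the oldest turns into a summary
    whenever the estimated token count exceeds the context budget."""
    while sum(estimate_tokens(t) for t in turns) > budget and len(turns) > 2:
        # Replace the two oldest turns with one summary turn.
        merged = summarize(turns[0] + "\n" + turns[1])
        turns = [merged] + turns[2:]
    return turns
```

This matches the observed behavior in the parent comment: the most recent turns stay verbatim, while older context degrades into lossy summaries.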

[+] greatpostman|2 years ago|reply
It’s a lot simpler than you think: model serving is horizontally scalable. You deploy the model on top of GPUs, each replica behind a web server. A load balancer distributes the requests, and probably a websocket streams the inference back to the client in chunks.
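A minimal sketch of that shape, with stub workers standing in for GPU-backed model servers and a trivial round-robin balancer (all names here are illustrative; a real deployment would push the chunks over a websocket or SSE):

```python
import itertools

def make_worker(name: str):
    def infer(prompt: str):
        # Stand-in for a GPU-backed model server: stream the answer in small chunks.
        answer = f"[{name}] echo: {prompt}"
        for i in range(0, len(answer), 8):
            yield answer[i:i + 8]
    return infer

workers = [make_worker(f"gpu{i}") for i in range(4)]
_rr = itertools.cycle(workers)  # trivial round-robin "load balancer"

def handle_request(prompt: str):
    worker = next(_rr)        # pick the next replica
    yield from worker(prompt)  # relay its streamed chunks to the client
```

Real load balancers track per-replica load and health rather than blindly cycling, but the topology (many identical replicas behind one front door) is the same.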
[+] FrenchDevRemote|2 years ago|reply
I guess you'd need a good relationship with a cloud provider, because I'm not even sure a regular customer could rent that many GPUs through normal means; that, or build your own servers/datacenters.

then I guess some basic decent load balancing/queuing on your inference endpoints

[+] dbish|2 years ago|reply
Exactly. All the big ones have partnerships/alliances or already own their cloud: look at OpenAI and Microsoft, or Google Bard on top of GCP, and similarly Stability seems to have a strategic partnership with AWS. You won't run at their scale without something like that, especially with the GPU shortages we're seeing.
[+] blibble|2 years ago|reply
serious answer? get given $10 billion of cloud computing credit
[+] MuffinFlavored|2 years ago|reply
how long would that roughly last at 0% markup (cost) from microsoft?
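Purely as a back-of-envelope answer: every number below is an assumption (fleet size, at-cost hourly rate), not anything Microsoft has disclosed.

```python
# All numbers here are illustrative assumptions, not actual pricing.
credit = 10e9            # $10B of cloud credit
cost_per_gpu_hour = 2.0  # assumed at-cost rate for one datacenter GPU
gpus = 30_000            # assumed fleet size, running around the clock

burn_per_day = gpus * 24 * cost_per_gpu_hour  # dollars burned per day
days = credit / burn_per_day
print(round(days), "days, or about", round(days / 365, 1), "years")
```

Under these made-up numbers the credit lasts on the order of a couple of decades of pure compute; double the fleet or the hourly rate and it halves, so the real answer depends entirely on inputs neither party has published.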
[+] quickthrower2|2 years ago|reply
Make it open source, let the community figure it out. Hobbyists will optimize. :-)
[+] la64710|2 years ago|reply
From the horse's mouth ;)

Deploying a large language model like OpenAI's GPT-3 for a million users requires careful planning and consideration of various technical and logistical aspects. Here are some steps that can be taken to deploy a large language model for a large number of users:

- Determine the hardware and infrastructure requirements: Deploying a large language model for a million users requires significant computing power and infrastructure. You will need to decide whether to use cloud-based solutions like AWS, Azure, or Google Cloud, or build your own hardware infrastructure.

- Build a scalable architecture: To ensure that your deployment can handle a million users, you need to build a scalable architecture that can accommodate additional users as needed. This may involve setting up a load balancer and multiple servers to distribute the load.

- Ensure data security: Since a large language model like GPT-3 deals with sensitive information, it is important to ensure data security. This may involve encrypting data, using secure APIs, and setting up firewalls to protect against cyber-attacks.

- Develop APIs: To enable users to access the language model, you need to develop APIs that can be accessed through a web or mobile application. These APIs should be designed to handle multiple requests simultaneously.

- Test and monitor the deployment: Before deploying the language model for a million users, it is important to thoroughly test the system to ensure it is functioning as expected. You should also set up monitoring tools to track performance and detect any issues.

- Provide user support: When deploying a large language model for a million users, it is important to provide user support to address any issues or concerns that users may have. This may involve setting up a helpdesk or providing documentation to help users understand how to use the system.

Overall, deploying a large language model like OpenAI's GPT-3 for a million users requires a significant investment of resources and expertise. By carefully planning and executing each step of the deployment process, you can ensure that the system is scalable, secure, and user-friendly.

[+] nothrowaways|2 years ago|reply
I hope the "horse" is not what I think it is.
[+] verdverm|2 years ago|reply
Is it that much different from other APIs serving large numbers of users?

(other than the large GPU costs and lower request throughput per instance)

[+] KatrKat|2 years ago|reply
I think some domain-specific considerations include:

1. You need a really big in-memory data set that you touch ~all of several times for each request, so you really want to e.g. memory-map it and make sure it actually fits in memory on the machine.

2. If using a GPU, you have to make sure the GPU is hooked up to the serving process. You probably want your processes to be heavier-weight than they otherwise would be.

3. You might want to batch requests from several users into the same stream of commands to the GPU. So you need to collect the right number of requests before processing any of them, without making any request wait too long. You might also need to sort requests by which inference parameters they override and route them to different servers, since only requests with compatible parameters can be batched together.

4. You might want to stream the output more or less character by character. Possibly to several users, from one live run on a GPU, after having batched up enough requests to justify a run.

5. Content moderation when you are sending data to the user before you have even seen all of it yourself is an unsolved problem.
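Point 3 above, collecting a batch without making the first request wait too long, can be sketched roughly like this; `batching_loop`, `max_batch`, and `max_wait` are illustrative names, not any particular serving framework's API.

```python
import queue
import time

def batching_loop(requests: "queue.Queue", run_batch,
                  max_batch: int = 8, max_wait: float = 0.05):
    """Collect up to max_batch requests, but never make the first request
    wait longer than max_wait seconds before the batch goes to the GPU."""
    while True:
        first = requests.get()
        if first is None:  # shutdown sentinel
            return
        batch = [first]
        deadline = time.monotonic() + max_wait
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # deadline hit: ship a partial batch
            try:
                item = requests.get(timeout=remaining)
            except queue.Empty:
                break
            if item is None:  # sentinel arrived mid-batch
                run_batch(batch)
                return
            batch.append(item)
        run_batch(batch)
```

The tension the comment describes is exactly the `max_batch` vs `max_wait` trade-off: larger batches mean better GPU utilization, a shorter deadline means better tail latency.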

[+] travisjungroth|2 years ago|reply
It’s not different from other GPU ML deployments, but those are already a little less solved than the standard web API portion.