item 41311088 | Run AI inference apps with self-hosted models on Cloud Run with Nvidia GPUs (cloud.google.com) | 15 points | LyalinDotCom | 1 year ago | 2 comments
wietsevenema | 1 year ago
One NVIDIA L4 GPU (24 GB vRAM) per Cloud Run instance (many instances per Cloud Run service).
Scale to zero: When there are no incoming requests, Cloud Run stops all remaining instances and you’re not charged.
Fast cold start: When scaling from zero, processes in the container can use the GPU in approximately 5 seconds.
Open large language models up to 13B parameters run great, including: Gemma 2 (9B), Llama 3.1 (8B), Mistral (7B), Qwen2 (7B).
You can get Gemma 2 (2B, Q4_0) to return its first tokens about 11 seconds after a cold start (best case).
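A minimal sketch of how a service like this might be deployed with the gcloud CLI. The service name, image path, and region here are hypothetical, and at the time of this thread the GPU flags were in preview (so `gcloud beta run deploy` may be required); treat this as a configuration sketch, not the announced product's exact invocation:

```shell
# Deploy a container to Cloud Run with one NVIDIA L4 GPU attached.
# Service name, image, and region are placeholders.
gcloud run deploy my-inference-service \
  --image=us-docker.pkg.dev/my-project/my-repo/llm-server:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --min-instances=0 \
  --max-instances=4
```

With `--min-instances=0` the service scales to zero as described above, so you pay nothing while idle and accept the cold-start latency on the first request.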
steren | 1 year ago
Cloud Run PM here, ask me anything!