dikobraz | 4 months ago
The hardware configurations used:
- 1x4090, 2x4090, 4x4090
- 1x5090; 2x5090; 4x5090
- 1x6000
All machines have at least 50GB of RAM and a minimum of 7 CPU cores per GPU. The 4090 machines use EPYC Milan (3rd Gen) processors, while the 5090/6000 machines use EPYC Genoa (4th Gen) processors, resulting in slightly faster overall performance.
I optimized the benchmark setup for throughput. Models are served with vLLM. When a model does not fit on a single GPU, it is split across GPUs with vLLM's --pipeline-parallel-size option. I run as many vLLM instances as the GPUs allow, with an NGINX load balancer on top to distribute requests across them and maximize throughput (replica parallelism). For example, if a model needs only two GPUs on a 4-GPU machine, I run two vLLM instances with --pipeline-parallel-size=2 behind the NGINX load balancer; if it needs all four GPUs, a single vLLM instance with --pipeline-parallel-size=4 is used.
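The replica layout can be sketched as a small launch script. The model name, ports, and GPU numbering below are illustrative assumptions, not the exact setup; in the real deployment each vllm serve process runs in the background and NGINX fronts the replica ports.

```shell
#!/bin/sh
# Sketch: split NUM_GPUS into replicas of PP_SIZE GPUs each, one vLLM
# instance per replica, fronted by an NGINX round-robin load balancer.
NUM_GPUS=4
PP_SIZE=2                        # GPUs needed per model replica
REPLICAS=$((NUM_GPUS / PP_SIZE)) # 2 replicas on a 4-GPU machine

i=0
while [ "$i" -lt "$REPLICAS" ]; do
  FIRST=$((i * PP_SIZE))
  GPUS=$(seq -s, "$FIRST" $((FIRST + PP_SIZE - 1)))
  PORT=$((8001 + i))
  # The real launch (hypothetical model name), backgrounded:
  #   CUDA_VISIBLE_DEVICES=$GPUS vllm serve <model> \
  #     --pipeline-parallel-size $PP_SIZE --port $PORT &
  echo "replica $i: GPUs $GPUS, port $PORT"
  i=$((i + 1))
done
```

NGINX then needs an upstream block listing 127.0.0.1:8001 and 127.0.0.1:8002 so requests round-robin across the replicas.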
The vllm bench serve tool is used for benchmarking, with random data and a sequence length of 1000. The number of concurrent requests is set to 400 to saturate the LLM's token-generation capacity.
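For reference, the invocation looks roughly like this (flag names follow vLLM's benchmark CLI; the endpoint and prompt count here are assumptions, and the command needs a running server):

```shell
# Benchmark the load-balanced endpoint with random ~1000-token prompts
# at 400 concurrent requests (illustrative values except where stated above).
vllm bench serve \
  --host 127.0.0.1 --port 8000 \
  --dataset-name random \
  --random-input-len 1000 \
  --max-concurrency 400 \
  --num-prompts 2000
```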
I benchmarked three different models to better understand the effect of PCIe communication on final LLM performance. I tried to find the largest modern model that fits into a single 4090, two 4090s, and four 4090s. Larger GGUF models would fit, but vLLM supports GGUF poorly, and I wanted to use vLLM because it is optimized for high-throughput serving.
Here is the model selection and the logic behind it:
- Qwen3-Coder-30B-A3B-Instruct-AWQ (fits 24GB). This 4-bit quantized model fits on a single RTX 4090, so adding GPUs scales throughput linearly, and the 4x4090 and 4x5090 configurations should have an edge thanks to more raw compute power.
- Meta-Llama-3.3-70B-Instruct-AWQ-INT4 (fits 48GB). This 4-bit quantized model fits on 2x4090, so communication over PCIe can lower the performance of multi-GPU setups somewhat.
- GLM-4.5-Air-AWQ-4bit (fits 96GB). This model requires all four 4090s, so PCIe communication will likely be a bottleneck, and the Pro 6000 should have an edge.
Besides raw throughput, the graphs show the serving cost per million tokens for each model on each hardware configuration. The rental price is set to $0.39 per hour for the 4090, $0.65 for the 5090, and $1.29 for the Pro 6000. These are typical GPU rental prices at neuralrack.ai, which provided the hardware for this benchmark. You can adjust the GPU prices in the config.yml file in the benchmark repository and invoke make report to generate a report that better reflects your situation.

Results
The overall winner is the RTX PRO 6000, thanks to its consistent performance across all model sizes and the best cost-efficiency for larger models. However, if your workload primarily involves smaller models, multi-GPU RTX 5090 configurations can offer better absolute throughput at a lower cost.
- Small models (fit 24GB): multi-GPU consumer configurations offer the best value due to replica parallelism, but the RTX PRO 6000 is very close.
- Medium models (fit 48GB): the RTX 5090 configurations provide the best balance of performance and cost, followed by the RTX PRO 6000.
- Large models (fit 96GB): the RTX PRO 6000 emerges as the clear winner despite its higher hourly cost, thanks to the elimination of PCIe overhead.
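The cost-per-million-tokens metric on the graphs follows directly from the hourly rental price and the measured throughput. As a sketch, with a made-up throughput figure (not a benchmark result):

```shell
# cost per 1M tokens = hourly_price / (tokens_per_sec * 3600) * 1e6
PRICE=0.39   # $/hour for a single RTX 4090 (from the pricing above)
TPS=2500     # total generation throughput in tokens/sec, hypothetical
awk -v p="$PRICE" -v t="$TPS" \
  'BEGIN { printf "$%.2f per 1M tokens\n", p / (t * 3600) * 1e6 }'
```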
Medium article: https://medium.com/ai-advances/rtx-4090-vs-rtx-5090-vs-rtx-p...
Non-medium link: https://www.cloudrift.ai/blog/benchmarking-rtx-gpus-for-llm-...
GitHub: https://github.com/cloudrift-ai/server-benchmark/tree/main