1 points| dikobraz | 19 days ago

This is an LLM inference throughput benchmark comparing the RTX PRO 6000 SE with the H100, H200, and B200 GPUs. It uses the vllm serve and vllm bench serve tools to compare the cost-efficiency of these datacenter GPU options.

Benchmarking Setup: The benchmark is optimized for throughput. All models are served with vLLM. If a model does not fit on a single GPU, it is split across several GPUs with vLLM's --tensor-parallel-size option. When a model needs fewer GPUs than the machine has, multiple vLLM instances are launched and an NGINX load balancer distributes requests across them to maximize throughput (replica parallelism). For example, if a model needs only 4 GPUs on an 8-GPU machine, two vLLM instances are launched with --tensor-parallel-size=4 behind the NGINX load balancer. If the model needs all eight GPUs, a single vLLM instance with --tensor-parallel-size=8 is used.
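
The replica layout described above can be sketched in Python. This is a minimal illustration, not the author's actual launch scripts: the helper names (Replica, plan_replicas, nginx_upstream), the port numbering, the use of CUDA_VISIBLE_DEVICES to pin each instance to its GPUs, and the upstream name are all assumptions. The NGINX fragment relies on the default round-robin balancing, since the post does not say which method was used.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One vLLM instance pinned to a contiguous group of GPUs (hypothetical helper)."""
    gpu_ids: list
    port: int

    def command(self, model: str) -> list:
        # `vllm serve` with tensor parallelism across this replica's GPUs.
        return [
            "vllm", "serve", model,
            "--tensor-parallel-size", str(len(self.gpu_ids)),
            "--port", str(self.port),
        ]

    def env(self) -> dict:
        # Assumption: each instance is restricted to its GPUs via CUDA_VISIBLE_DEVICES.
        return {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in self.gpu_ids)}


def plan_replicas(total_gpus: int, gpus_per_replica: int, base_port: int = 8000) -> list:
    """Split the machine into as many equally sized replicas as fit."""
    if not 1 <= gpus_per_replica <= total_gpus:
        raise ValueError("gpus_per_replica must be between 1 and total_gpus")
    count = total_gpus // gpus_per_replica
    return [
        Replica(
            gpu_ids=list(range(i * gpus_per_replica, (i + 1) * gpus_per_replica)),
            port=base_port + i,
        )
        for i in range(count)
    ]


def nginx_upstream(replicas: list, name: str = "vllm_backends") -> str:
    """Render an NGINX upstream block listing one local server per replica."""
    lines = [f"upstream {name} {{"]
    lines += [f"    server 127.0.0.1:{r.port};" for r in replicas]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # A model that needs 4 GPUs on an 8-GPU machine: two replicas with TP=4.
    replicas = plan_replicas(total_gpus=8, gpus_per_replica=4)
    for r in replicas:
        print(r.env(), " ".join(r.command("MODEL_NAME")))
    print(nginx_upstream(replicas))
```

With gpus_per_replica=8 the same helper yields a single instance, in which case the load balancer is optional.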

The vllm bench serve tool is used for benchmarking with random data and a sequence length of 1000. The number of concurrent requests is set between 64 and 256 so that the server's token-generation capacity is saturated.
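
A benchmark invocation along these lines could be assembled as follows. This is a sketch under several assumptions: the post does not say whether the sequence length of 1000 applies to input, output, or both, so it is applied to both here; the concurrency steps (64, 128, 256), the prompts_per_slot multiplier, and the base URL are hypothetical choices.

```python
def bench_command(model: str, base_url: str, concurrency: int,
                  seq_len: int = 1000, prompts_per_slot: int = 10) -> list:
    """Build a `vllm bench serve` command line for one concurrency level."""
    return [
        "vllm", "bench", "serve",
        "--model", model,
        "--base-url", base_url,
        "--dataset-name", "random",
        # Assumption: 1000 tokens for both the prompt and the generated output.
        "--random-input-len", str(seq_len),
        "--random-output-len", str(seq_len),
        "--max-concurrency", str(concurrency),
        # Assumption: enough prompts to keep every concurrency slot busy for a while.
        "--num-prompts", str(concurrency * prompts_per_slot),
    ]


if __name__ == "__main__":
    # Point the benchmark at the load balancer (or a single instance) and sweep concurrency.
    for concurrency in (64, 128, 256):
        print(" ".join(bench_command("MODEL_NAME", "http://127.0.0.1:8080", concurrency)))
```

Running the sweep and keeping the best throughput per configuration is one way to check that the server is actually saturated.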

Three models are benchmarked to better understand the effect of PCIe communication on the 8xPro6000 server vs. NVLink on the H100/H200/B200.

Here is the model selection and the logic behind it:

- GLM-4.5-Air-AWQ-4bit (fits 80GB). Testing single-GPU performance and maximum throughput with replica scaling on 8 GPU setups. No PCIe bottleneck.

- Qwen3-Coder-480B-A35B-Instruct-AWQ (fits 320GB). This 4-bit-quantized model fits into 4 GPUs. Some PCIe communication overhead in Pro 6000 setups may reduce performance relative to NVLink-enabled datacenter GPUs.

- GLM-4.6-FP8 (fits 640GB). This model requires all eight GPUs. PCIe communication overhead expected. The H100 and H200 configurations should have an advantage.

Besides raw throughput, graphs show the serving cost per million tokens for each model on its respective hardware. The hourly rental price is set at $0.93 for Pro6000, $1.91 for H100, $2.06 for H200, and $2.68 for B200.
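
The cost-per-million-tokens figure can be derived from an hourly price and a measured throughput. The sketch below assumes the quoted prices are per GPU-hour and that throughput is the aggregate for the whole server. The post states neither explicitly, so treat both as assumptions, along with the function name and the 1,000 tok/s example value.

```python
def cost_per_million_tokens(price_per_gpu_hour: float, num_gpus: int,
                            tokens_per_second: float) -> float:
    """Dollars spent per one million generated tokens.

    Assumes price_per_gpu_hour is a per-GPU hourly rate and tokens_per_second
    is the aggregate throughput of all num_gpus GPUs.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    server_cost_per_hour = price_per_gpu_hour * num_gpus
    tokens_per_hour = tokens_per_second * 3600
    return server_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical example: an 8-GPU server at $0.93 per GPU-hour producing 1,000 tok/s.
    print(f"${cost_per_million_tokens(0.93, 8, 1000.0):.2f} per million tokens")
```

Because cost scales with price divided by throughput, a GPU that costs about 2.9x more per hour still comes out cheaper per token if it is more than about 2.9x faster.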

Results:

- B200 wins on throughput, with the largest gap on the most communication-heavy workload:
  – GLM-4.6-FP8 (8-way TP): B200 is 4.87x faster than PRO 6000 (8,036.71 vs 1,651.67 tok/s)
  – Qwen3-Coder-480B (4-way TP): B200 is 4.02x faster than PRO 6000 (6,438.43 vs 1,602.96 tok/s)
  – GLM-4.5-Air (single-GPU replicas): B200 is 4.22x faster than PRO 6000 (9,675.24 vs 2,290.69 tok/s)

- B200 is also the cost-efficiency leader under updated run-cost estimates. B200's throughput advantage more than compensates for its higher hourly cost.

- PRO 6000 is an attractive low-capex option. It beats H100 on cost per token across all models and is on par with H200 on GLM-4.5-Air.

- H200 is a major step up over H100. H200 delivers ~1.83x to 2.14x H100 throughput across the three models.

- H100 looked worse than expected in this specific setup. It's on par with PRO 6000 in throughput on GLM-4.5-Air and behind all other contenders in cost per token across all workloads.
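
The B200 vs PRO 6000 speedups quoted above follow directly from the reported throughputs, as this short check shows. Only the tok/s figures come from the post. The dictionary layout and the speedup helper are illustrative.

```python
# Reported throughput in tokens per second, taken from the results above.
THROUGHPUT = {
    "GLM-4.6-FP8": {"B200": 8036.71, "PRO 6000": 1651.67},
    "Qwen3-Coder-480B": {"B200": 6438.43, "PRO 6000": 1602.96},
    "GLM-4.5-Air": {"B200": 9675.24, "PRO 6000": 2290.69},
}


def speedup(model: str, fast: str = "B200", slow: str = "PRO 6000") -> float:
    """Throughput ratio of `fast` over `slow` for the given model."""
    return THROUGHPUT[model][fast] / THROUGHPUT[model][slow]


if __name__ == "__main__":
    for model in THROUGHPUT:
        print(f"{model}: {speedup(model):.2f}x")
```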