somerandomdude2 | 4 months ago
"A paper presented at SOSP 2025 details how token-level scheduling helped one GPU serve multiple LLMs, reducing demand from 1,192 to 213 H20s."
Which, if you scale it, matches the GP's statement.
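A quick sanity check on the numbers quoted above (the scaling to the GP's figure is the commenter's inference, not stated in the paper):

```python
# Back-of-envelope check of the reduction reported in the quoted
# SOSP 2025 result: 1,192 H20s needed before, 213 after.
before = 1192  # H20 GPUs without token-level scheduling
after = 213    # H20 GPUs with it

factor = before / after
print(f"reduction factor: {factor:.1f}x")  # ~5.6x fewer GPUs
savings = 1 - after / before
print(f"GPUs saved: {savings:.0%}")        # ~82% of the fleet
```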
yorwba | 4 months ago