top | item 45661100

jewel | 4 months ago

In my imagination, I thought that the large GPU clusters were dynamically allocating whole machines to different tasks depending on load.

So, hypothetically, if ChatGPT's peak load and their minimum load were a 3× ratio, they'd reallocate 2/3 of their servers to training when it's not peak time.
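That back-of-the-envelope calculation can be sketched as a one-liner (a hypothetical illustration, not anything from an actual scheduler):

```python
def reallocatable_fraction(peak_load, current_load):
    """Fraction of a cluster provisioned for peak load that sits idle
    at the current load and could be lent out for training."""
    return 1.0 - current_load / peak_load

# With a 3x peak-to-minimum ratio, 2/3 of the fleet is free at the trough.
free = reallocatable_fraction(peak_load=3.0, current_load=1.0)
print(f"{free:.0%} of servers reallocatable at minimum load")
```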

Doing the same thing inside an individual GPU seems irrelevant to anyone operating at scale when they can approximate the same behavior with entire servers or even entire racks.

Jrxing | 4 months ago

Sharing the big GPU cluster with non-latency-critical load is one solution we also explored.

For this work, we are focusing more on the problem of running smaller models on SOTA GPUs. Distilled/fine-tuned small models have shown comparable performance on vertical tasks.