jewel | 4 months ago
So, hypothetically, if ChatGPT's peak load were 3× its minimum load, a fleet sized for peak would be only one-third utilized at the trough, so they could reallocate up to 2/3 of their servers to training off-peak.
Doing the same thing inside an individual GPU seems irrelevant to anyone operating at scale when they can approximate the same behavior with entire servers or even entire racks.
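A minimal sketch of that whole-server reallocation idea, with made-up fleet sizes and function names just for illustration:

    import math

    # Hypothetical numbers: fleet is sized for peak demand,
    # and trough demand is 1/3 of peak (the 3x ratio above).
    FLEET_SIZE = 900   # total inference servers
    PEAK_LOAD = 3.0    # normalized peak load (trough = 1.0)

    def loanable_to_training(current_load: float) -> int:
        """Servers that can be lent to training at the given load level."""
        # Servers still needed to handle current serving demand.
        needed_for_serving = math.ceil(current_load / PEAK_LOAD * FLEET_SIZE)
        return FLEET_SIZE - needed_for_serving

    print(loanable_to_training(1.0))  # 600 -> 2/3 of the fleet free at trough
    print(loanable_to_training(3.0))  # 0   -> peak demand needs everything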
Jrxing | 4 months ago
For this work, we are focusing more on the problem of smaller models running on SOTA GPUs. Distilled/fine-tuned small models have shown comparable performance on vertical (domain-specific) tasks.