(no title)
joaohaas | 4 months ago
>First, we evaluate DeepSeek's open-source EPLB implementation. This employs a greedy bin-packing strategy: experts are sorted by load in descending order, and each is placed onto the least-loaded GPU that has capacity (Figure 3a, Example 1). While simple, the solution is slow because it written in Python and uses a for-loop to performs linear search for finding the best-fit GPU choice.
This is because when considering a load balancing algorithm, unless the work being done (in this case by the GPU) lasts only a few ms, the load balancing algorithm being fast will never be the bottleneck. The post does not mention whether this is the case at all.
Also, I don't want to sound rude, but if all they managed to get is a 5x increase over a simple python algorithm, I don't think this is impressive at all...? Any rewrite of the 'dumb' algorithm in a language with more memory control and cache continuity should result in much better results.
teunlao|4 months ago
abmfy|4 months ago
The original algorithm was provided by DeepSeek, and our optimized implementation achieves a 92× speedup over it. The 5x number is comparing with another baseline that is undisclosed yet.
When integrating EPLB into vLLM, I discovered—somewhat unexpectedly—that the open-source algorithm consumes nearly half of the total time of a rearrangement step, with the remaining time spent transferring weights across GPUs. To address this, I applied OpenEvolve to the algorithm, setting the primary objective to improve speed while maintaining the same balance factor. It performed remarkably well. With additional optimizations on the weight transferring, the overall overhead has now become almost negligible.
kristjansson|4 months ago
e: also comparison a fixed (nothing faster than 0!) and random policy might be informative if your intent is to publish this as improvement for the object problem, not just a demonstration of ARDS.