Why cut up a wafer of chips, package each one with HBM, put the package on a board, connect the boards to CPUs over a fabric, and then tie them all back together with networking chips and cables? That's Cerebras's pitch for keeping the wafer whole. Their three new clusters are claimed to be the three biggest AI training platforms in the world. Comparing the WSE-3 against Nvidia's H100: "It's got 52 times more cores, 800 times more memory on chip, 7,000 times more memory bandwidth, and more than 3,700 times more fabric bandwidth." They don't sell the chips, though; they build the clusters and sell the compute power, except for a couple of systems they built in Dubai.
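A quick sanity check on those multipliers, using spec-sheet numbers I believe are roughly right but which are my assumptions, not figures from the comment (WSE-3: ~900,000 cores, 44 GB of on-chip SRAM, 21 PB/s memory bandwidth; H100: 16,896 FP32 cores, 50 MB of L2 cache, ~3 TB/s of HBM3 bandwidth):

```python
# Back-of-the-envelope check of the quoted multipliers.
# All spec numbers below are assumptions from public spec sheets,
# not from the comment itself.
wse3 = {"cores": 900_000, "on_chip_bytes": 44e9, "mem_bw_bytes_s": 21e15}
h100 = {"cores": 16_896, "on_chip_bytes": 50e6, "mem_bw_bytes_s": 3e12}

for key in wse3:
    print(f"{key}: {wse3[key] / h100[key]:,.0f}x")
# cores: ~53x, on-chip memory: ~880x, memory bandwidth: ~7,000x --
# in the same ballpark as the quoted 52x / 800x / 7,000x.
# The fabric multiplier is omitted: it depends on which NVLink
# aggregate you compare against, so I can't reproduce it cleanly.
```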
Plus, newer models are moving from Transformers toward Mamba-style architectures, which don't need as much memory: they compress the context into a fixed-size state, keeping the important information rather than caching everything they've seen so far.
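To make that memory argument concrete, here's a minimal sketch; all model dimensions are hypothetical, chosen only to show the scaling, not taken from any real model:

```python
# Why a Transformer's inference memory grows with context length
# while a Mamba-style state-space model's does not.
# All sizes below are hypothetical illustration values.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Transformer KV cache: K and V tensors per layer, one entry per
    token seen so far -- memory is linear in context length."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers=32, d_inner=4096, d_state=16, bytes_per_elem=2):
    """Mamba-style recurrent state: a fixed-size summary per layer,
    independent of how many tokens have been processed."""
    return n_layers * d_inner * d_state * bytes_per_elem

for seq_len in (1_000, 100_000, 1_000_000):
    kv = kv_cache_bytes(seq_len) / 2**30
    ssm = ssm_state_bytes() / 2**30
    print(f"{seq_len:>9} tokens: KV cache {kv:8.2f} GiB vs SSM state {ssm:.3f} GiB")
```

The point is the growth rate: the KV cache scales linearly with context length (over 100 GiB at a million tokens in this sketch), while the recurrent state stays constant at a few MiB.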