xAI picked Ethernet over InfiniBand for its H100 Colossus training cluster

Work already underway to expand system to 200,000 Nvidia Hopper chips


Unlike most AI training clusters, xAI's Colossus with its 100,000 Nvidia Hopper GPUs doesn't use InfiniBand. Instead, the massive system, which Nvidia bills as the "world's largest AI supercomputer," was built using the GPU giant's Spectrum-X Ethernet fabric. Colossus was built to train xAI's Grok series of large language models, which power the chatbot built into Elon Musk's echo chamber colloquially known as Tw..., right, X.

The system as a whole is massive, boasting more than 2.5 times as many GPUs as the US's top-ranked Frontier supercomputer at Oak Ridge National Laboratory, which packs nearly 38,000 AMD MI250X accelerators. Perhaps more impressively, Colossus was deployed in just 122 days, and went from first deployment to training in only 19 days.

In terms of peak performance, the xAI cluster boasts 98.9 exaFLOPS of dense FP16/BF16 compute. That doubles if xAI's models can take advantage of sparsity during training, and doubles again to roughly 395 exaFLOPS when training at sparse FP8 precision. However, those performance figures won't last for long.

Nvidia reports that xAI has already begun adding another 100,000 Hopper GPUs to the cluster, which would effectively double the system's performance. Even if xAI were to run the High Performance Linpack (HPL) benchmark used to rank the world's largest and most powerful publicly known supercomputers on the system, Colossus would almost certainly claim the top spot with 6.7 exaFLOPS of peak FP64 matrix performance.
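As a rough sanity check, those peak numbers fall straight out of the per-GPU specs. The sketch below assumes Nvidia's published figures for the H100 SXM (roughly 989 teraFLOPS of dense FP16/BF16, doubled with structured sparsity, doubled again for sparse FP8, and 67 teraFLOPS of FP64 Tensor Core math); sustained throughput in practice will be lower.

```python
# Back-of-envelope check of Colossus' quoted peak figures, using Nvidia's
# published per-GPU peaks for the H100 SXM (approximate, and theoretical).
GPUS = 100_000  # doubling to 200,000 GPUs simply doubles every figure below

H100_TFLOPS = {
    "fp16_dense": 989,     # dense FP16/BF16 Tensor Core
    "fp16_sparse": 1_979,  # with 2:4 structured sparsity
    "fp8_sparse": 3_958,   # sparse FP8
    "fp64_tensor": 67,     # FP64 Tensor Core, the HPL-relevant number
}

for precision, tflops in H100_TFLOPS.items():
    exaflops = GPUS * tflops / 1_000_000  # 1 exaFLOPS = 1,000,000 teraFLOPS
    print(f"{precision:12}: {exaflops:6.1f} exaFLOPS")

# fp16_dense  :   98.9 exaFLOPS
# fp16_sparse :  197.9 exaFLOPS
# fp8_sparse  :  395.8 exaFLOPS
# fp64_tensor :    6.7 exaFLOPS
```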

However, that assumes the Ethernet fabric used to stitch those GPUs together can keep up. There is a reason, after all, that HPC centers tend to opt for InfiniBand. So beyond Colossus' massive performance figures, that networking choice is worth talking about.

As we previously discussed, as of early 2024, about 90 percent of AI clusters used Nvidia's InfiniBand networking. The reason comes down to scale: training large models requires distributing workloads across hundreds or even thousands of nodes.

Any amount of packet loss can result in higher tail latencies and therefore slower time to train models. InfiniBand is designed to keep packet loss to an absolute minimum. On the other hand, packet loss is a fact of life in traditional Ethernet networks.
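To make the straggler effect concrete, here is a toy simulation rather than a model of any real fabric: a synchronous training step only finishes when the slowest worker's gradient exchange does, so even a 0.1 percent chance of a retransmission delay per worker ends up gating almost every step once a job spans thousands of workers. The 10 ms base exchange and 50 ms retransmit penalty are invented purely for illustration.

```python
# Toy straggler simulation: every synchronous step waits for its slowest worker.
import random

def step_time(workers, base_ms=10.0, loss_prob=1e-3, retransmit_penalty_ms=50.0):
    """Return how long one step takes: the max over all workers' exchange times."""
    times = []
    for _ in range(workers):
        t = base_ms
        if random.random() < loss_prob:  # a dropped packet forces a retransmit
            t += retransmit_penalty_ms
        times.append(t)
    return max(times)

random.seed(0)
for n in (8, 128, 2_048, 16_384):
    avg = sum(step_time(n) for _ in range(200)) / 200
    print(f"{n:6d} workers: mean step time ~{avg:5.1f} ms")

# Small jobs rarely hit a retransmit, but at thousands of workers nearly every
# step is delayed by at least one straggler, so the whole run slows down.
```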

Despite this, Ethernet remains attractive for a variety of reasons, including cross-compatibility, vendor choice, and often higher per-port bandwidth. So, to overcome Ethernet's limitations, Nvidia developed its Spectrum-X family of products, which includes its Spectrum Ethernet switches and BlueField SuperNICs.

Specifically, Colossus used the 51.2 Tbps Spectrum SN5600, which crams 64 800GbE ports into a 2U form factor. Meanwhile, the individual nodes used Nvidia's BlueField-3 SuperNICs, which feature a single 400GbE connection to each GPU in the cluster. While Nvidia can't deliver 800 Gbps networking to each accelerator just yet, its next-generation ConnectX-8 SuperNICs will.
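The switch capacity quoted above is straightforward port math, sketched below. The assumptions are ours: every port runs at line rate, each 800GbE port can break out into two 400GbE links, and uplinks are ignored, so the leaf-switch figure is only a lower bound for a fabric that, per Nvidia, actually spans three tiers.

```python
# Port math for the Spectrum SN5600 and the GPU-facing edge of the fabric.
SN5600_PORTS = 64
PORT_SPEED_GBPS = 800
NIC_SPEED_GBPS = 400   # one BlueField-3 SuperNIC link per GPU
GPUS = 100_000

switch_capacity_tbps = SN5600_PORTS * PORT_SPEED_GBPS / 1_000
print(f"SN5600 switching capacity: {switch_capacity_tbps} Tbps")  # 51.2 Tbps

# Assumption: each 800GbE port breaks out into two 400GbE links, and every
# port faces a GPU (no uplinks reserved), giving an optimistic lower bound.
gpus_per_leaf = SN5600_PORTS * (PORT_SPEED_GBPS // NIC_SPEED_GBPS)  # 128
min_leaf_switches = -(-GPUS // gpus_per_leaf)                       # ceiling division
print(f"GPU-facing links per leaf switch: {gpus_per_leaf}")
print(f"Leaf switches needed (lower bound): {min_leaf_switches}")   # 782
```

In practice a leaf switch reserves a large share of its ports for uplinks toward the aggregation and spine tiers, so the real switch count is considerably higher.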

The idea is that by building logic into both the switch and the NIC, the two can take advantage of high-speed packet reordering, advanced congestion control, and programmable I/O pathing to achieve InfiniBand-like loss and latencies over Ethernet.

"Across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions," Nvidia claimed in a recent blog post, adding that it has also managed to achieve 95 percent data throughput thanks to the fabric's congestion controls.

For comparison, Nvidia argues that, at this scale, standard Ethernet would have created thousands of flow collisions and would have only achieved 60 percent of its data throughput.
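For a sense of where those flow collisions come from, here is another toy illustration, not a model of Spectrum-X or any real deployment: classic ECMP hashes each long-lived flow onto a single link, so a few links end up carrying several elephant flows while others sit idle, and a synchronous collective is gated by the busiest link. Per-packet spraying spreads every flow across all links instead.

```python
# Toy comparison of hash-based ECMP flow placement versus per-packet spraying.
import random
from collections import Counter

LINKS = 64   # parallel paths between two fabric tiers
FLOWS = 64   # long-lived "elephant" flows, one per communicating GPU pair

random.seed(1)

# ECMP: hash pins each flow to one link for its whole lifetime.
flows_per_link = Counter(random.randrange(LINKS) for _ in range(FLOWS))
busiest = max(flows_per_link.values())

# If every flow wants line rate and the collective finishes only when the
# slowest flow does, effective throughput is set by the busiest link.
ecmp_fraction = 1 / busiest
spray_fraction = 1.0  # per-packet spraying keeps all links evenly loaded

print(f"Idle links under ECMP: {LINKS - len(flows_per_link)} of {LINKS}")
print(f"ECMP effective throughput: {ecmp_fraction:.0%} of line rate")
print(f"Spraying effective throughput: {spray_fraction:.0%} of line rate")
```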

Nvidia isn't the only networking vendor looking to overcome Ethernet's limitations using SmartNICs and switches. As we've previously discussed, Broadcom is doing something quite similar, but rather than working at the NIC level, it's focusing primarily on reducing packet loss between its Jericho3-AI top-of-rack switches and its Tomahawk 5 aggregation switches. AMD is also getting in on the fun with its upcoming Ultra Ethernet-based Pensando Pollara 400, which will feature the same kind of packet spraying and congestion control tech we've seen from Nvidia, Broadcom, and others to achieve InfiniBand-like loss and latencies.
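The congestion control these vendors lean on follows the same broad pattern, very loosely sketched below. This is a generic ECN-driven rate controller of our own devising, not any vendor's actual algorithm: the switch marks packets before its queues overflow, and the sender ramps up additively and backs off multiplicatively when it sees marks, so buffers stay shallow and drops are avoided rather than recovered from.

```python
# Generic ECN-driven AIMD rate control sketch (illustrative only).
LINE_RATE_GBPS = 400.0       # e.g. one 400GbE NIC port
MARK_THRESHOLD_GBPS = 300.0  # offered load at which the switch starts marking

def simulate(steps=20, rate=50.0, increase=25.0, backoff=0.8):
    for step in range(steps):
        marked = rate > MARK_THRESHOLD_GBPS  # switch signals congestion via ECN
        if marked:
            rate *= backoff                  # multiplicative decrease on marks
        else:
            rate = min(rate + increase, LINE_RATE_GBPS)  # additive increase otherwise
        print(f"step {step:2d}: {'ECN mark' if marked else 'no mark '}  rate {rate:6.1f} Gbps")

simulate()
# The sender's rate oscillates just above and below the marking threshold,
# keeping the queue shallow instead of filling it until packets drop.
```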

®.