Google offers 7th-gen Ironwood TPUs for AI, with AI-inspired comparisons


Sure, we're doing FP8 versus a supercomputer's FP64. What of it?

Cloud Next Google's seventh-generation Tensor Processing Units (TPU), announced Wednesday, will soon be available to cloud customers to rent in pods of 256 or 9,216 chips. The Chocolate Factory cheekily says a pod with 9,216 of the homegrown AI accelerators will deliver 24x the compute power of the world's number-one most-powerful publicly known supercomputer, America's El Capitan, at 42.5 exaFLOPS versus 1.7.

As impressive as that may sound, Google's marketing team is leaving out a rather important detail here. That 42.5 exaFLOPS peak performance figure uses FP8 precision, while El Cap achieved 1.74 exaFLOPS at FP64 in the HPC-centric LINPACK benchmark. Its peak theoretical performance is actually closer to 2.74 FP64 exaFLOPS. Normalized to FP8, the AMD-powered HPE-Cray super's peak theoretical performance comes in at a little over 87 exaFLOPS for dense workloads, or twice that for sparse workloads.

Google marketing compares 42.5 exaFLOPS of FP8 to 1.74 exaFLOPS of FP64, when it should be 42.5 versus at least 87, meaning El Capitan comfortably comes out on top over a 9,216-TPU v7 pod.
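For those playing along at home, the back-of-envelope arithmetic looks like this. A minimal sketch in Python; the 32x dense FP64-to-FP8 throughput multiplier is our assumption for El Capitan's MI300A silicon, not a published vendor figure:

```python
# Back-of-envelope FLOPS comparison. fp8_scale is an assumption about
# dense FP64 -> FP8 throughput scaling, not vendor data.
EL_CAP_HPL_FP64 = 1.74    # exaFLOPS, El Capitan's measured LINPACK result
EL_CAP_PEAK_FP64 = 2.74   # exaFLOPS, peak theoretical FP64
IRONWOOD_POD_FP8 = 42.5   # exaFLOPS, Google's quoted dense FP8 pod figure

fp8_scale = 32            # assumed dense FP64 -> FP8 multiplier
el_cap_fp8_dense = EL_CAP_PEAK_FP64 * fp8_scale    # ~87.7 exaFLOPS
el_cap_fp8_sparse = el_cap_fp8_dense * 2           # ~175 exaFLOPS

print(f"Google's ratio:   {IRONWOOD_POD_FP8 / EL_CAP_HPL_FP64:.1f}x")    # ~24.4x
print(f"Apples to apples: {IRONWOOD_POD_FP8 / el_cap_fp8_dense:.2f}x")   # ~0.48x
```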

That claim of 24x just ain't right to us. We asked Google about that, and a spokesperson said the cloud titan was simply comparing the best number available for El Capitan that it could find at the time. Dare we say, Gemini AI would be proud.

"We do not have information on El Capitan’s sustained FP8 performance," we're told. "Our assumption behind the comparison is that El Capitan put their best number forward for peak compute for AI workloads, given their focus also includes AI. "Although El Capitan may be able to support FP8, we are unable to make the comparison without additional data on sustained performance.

We cannot automatically assume linear improvement on peak performance with lower precision. Furthermore, note that Ironwood can scale beyond a single pod to 400,000 chips, or 43 TPU v7x pods, connected through our high-speed Jupiter data center network." Setting aside those comparisons, Google's latest TPUs, codenamed Ironwood, represent a major upgrade over last year's Trillium parts.

A look at Google's TPU v7 with four chips per board ...

Designed with large language model (LLM) inferencing in mind, each TPU boasts as much as 192 GB of high bandwidth memory (HBM) good for somewhere between 7.2 TB/s and 7.4 TB/s of bandwidth.

Both figures are quoted in Google's launch announcement, one in the text and one in a graphic. As we've previously discussed, memory bandwidth is the primary bottleneck for inference workloads. The larger memory capacity means the chips can accommodate larger models.
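To see why, consider that during autoregressive decoding every generated token streams roughly the full set of model weights through memory, so bandwidth sets a hard ceiling on token rate. A rough sketch under assumed numbers, using a hypothetical 70-billion-parameter model with 8-bit weights:

```python
# Crude decode-speed ceiling for a memory-bandwidth-bound LLM.
# Model size and weight precision are illustrative assumptions.
bandwidth_bytes_per_sec = 7.2e12   # low end of Google's quoted 7.2-7.4 TB/s
model_params = 70e9                # hypothetical 70B-parameter model
bytes_per_param = 1                # FP8/INT8 weights

model_bytes = model_params * bytes_per_param

# Each decoded token reads roughly all weights once per forward pass,
# so peak single-stream throughput ~= bandwidth / model size.
tokens_per_sec = bandwidth_bytes_per_sec / model_bytes
print(f"~{tokens_per_sec:.0f} tokens/s ceiling per chip")   # ~103
```

Batching amortizes those weight reads across concurrent requests, which is why real-world aggregate throughput can exceed this single-stream figure.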

In terms of raw floating-point performance, Google boasts each liquid-cooled TPU v7 is capable of churning out 4.6 petaFLOPS of dense FP8. That puts performance in the same ballpark as Nvidia's Blackwell B200.
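That per-chip figure also squares with the pod-level claim, as a quick multiplication shows:

```python
# Sanity check: per-chip dense FP8 compute scaled to a full pod.
per_chip_pflops = 4.6          # Google's quoted dense FP8 per TPU v7
pod_chips = 9_216
pod_exaflops = per_chip_pflops * pod_chips / 1_000   # PFLOPS -> exaFLOPS
print(f"{pod_exaflops:.1f} exaFLOPS")                # ~42.4, near the 42.5 claim
```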

Alongside its namesake tensor processing engines, the Ironwood features Google's SparseCore, which is designed to accelerate "ultra-large embeddings" common in ranking and recommender systems. The chips, further detailed at The Next Platform, are expected to reach general availability later this year.

In order to build these pods, each TPU is equipped with a specialized inter-chip interconnect (ICI) that Google says is good for 1.2 terabits per second of bidirectional per-link bandwidth, a 1.5x uplift over Trillium. According to Google, the larger of the two offered pods will consume roughly 10 megawatts under full load.
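Dividing that pod-level figure out per chip gives a crude estimate; the overhead share we set aside for cooling, host CPUs, and networking is our guess, not a Google number:

```python
# Rough per-chip power estimate from the quoted 10 MW pod figure.
# overhead_fraction is an assumption; Google has not published a TDP.
pod_power_watts = 10e6
pod_chips = 9_216
overhead_fraction = 0.2    # assumed share for cooling, hosts, switches

watts_per_chip = pod_power_watts * (1 - overhead_fraction) / pod_chips
print(f"~{watts_per_chip:.0f} W per chip")   # ~868 W
```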

Google didn't say what the per-chip TDP is, but this suggests it's somewhere in the 700 W to 1 kW range observed on similar-level GPUs. While that may sound like a lotta power, Google emphasizes that the parts are still 30x more efficient than its first crop of TPUs from 2015 and deliver 2x the performance-per-watt of last year's chips. ®