This week is the annual Supercomputing event, where all the major players in high performance computing put their cards on the table when it comes to hardware, installations and design wins. As part of the event, Intel is giving a presentation on its hardware offerings, which reveals additional details about the next generation of hardware going into the Aurora Exascale supercomputer.
Aurora is a contract that Intel has had for some time. The original plan was a 10nm Xeon Phi based system, an idea that was dropped when Xeon Phi was discontinued, and the design has been a constantly shifting target as Intel's hardware offerings changed. A few years ago it was finalized that the system would use Intel's Sapphire Rapids processors (those that come with high-bandwidth memory) combined with the new Ponte Vecchio Xe-HPC based GPU accelerators, and the target was raised from several hundred PetaFLOPs to an ExaFLOP of compute. More recently, Intel CEO Pat Gelsinger disclosed that the Ponte Vecchio accelerator is achieving double the performance, above the expectations of the original disclosures, and that Aurora will be a 2+ EF supercomputer when built. Intel expects to ship the first batch of hardware to Argonne National Laboratory by the end of the year, although this comes with a $300 million write-off on Intel's fourth-quarter financial statements. Intel expects to deliver the rest of the machine through 2022, as well as ramp production of the hardware for mainstream use in the first quarter, for a wider launch in the first half of the year.
Today we have additional details about the hardware.
On the processor side, we know that each Aurora unit will feature two of Intel's latest Sapphire Rapids CPUs (SPR), with four compute tiles, DDR5, PCIe 5.0 and CXL 1.1 (not CXL.mem), and will make extensive use of EMIB connections between the tiles. Aurora will also use SPR with built-in high-bandwidth memory (SPR + HBM), and the main disclosure is that SPR + HBM will offer up to 64 GB of HBM2e using 8-Hi stacks.
Based on the renders, Intel intends to use four 16 GB HBM2e stacks for a total of 64 GB. Intel has a relationship with Micron, and the physical dimensions of Micron's HBM2e are in line with the renders in Intel's materials (compared to, say, Samsung or SK Hynix). Micron currently offers two versions of 16 GB HBM2e with hardware ECC: one at 2.8 Gbps per pin (358 GB/s per stack) and one at 3.2 Gbps per pin (410 GB/s per stack). All in all, we are looking at a peak bandwidth of 1.432 TB/s to 1.640 TB/s, depending on which version Intel uses. Versions with HBM will use an additional four tiles to connect each HBM stack to one of the SPR chiplets.
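As a quick check on those bandwidth figures, the sketch below derives per-stack and per-package peak bandwidth from the pin rate. It assumes a 1024-bit interface per HBM2e stack, which is not stated above, and the function names are illustrative only. Because it multiplies the unrounded per-stack values, its totals can differ in the last digit from summing the rounded 358 GB/s and 410 GB/s figures.

```python
# Peak HBM2e bandwidth from pin rate.
# Assumption (not stated in the article): each stack has a 1024-bit interface.
HBM_BUS_WIDTH_BITS = 1024


def stack_bandwidth_gbs(pin_rate_gbps: float,
                        bus_width_bits: int = HBM_BUS_WIDTH_BITS) -> float:
    """Peak bandwidth of one stack in GB/s: pins * Gbit/s per pin / 8 bits per byte."""
    return pin_rate_gbps * bus_width_bits / 8


def package_bandwidth_tbs(pin_rate_gbps: float, stacks: int = 4) -> float:
    """Peak bandwidth of all stacks on one package in TB/s (1 TB = 1000 GB)."""
    return stacks * stack_bandwidth_gbs(pin_rate_gbps) / 1000


if __name__ == "__main__":
    for rate in (2.8, 3.2):
        print(f"{rate} Gbps/pin: "
              f"{stack_bandwidth_gbs(rate):.1f} GB/s per stack, "
              f"{package_bandwidth_tbs(rate):.3f} TB/s across four stacks")
```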
Based on Intel's diagram, and despite Intel saying that SPR + HBM will share a socket with traditional SPR, it is clear that some versions will not be compatible. This may be a case where the Aurora versions of SPR + HBM are tuned specifically for that machine.
As for the Ponte Vecchio (PVC) side of the equation, Intel has already disclosed that a single server inside Aurora will have six PVC accelerators for every two SPR processors. The accelerators will be interconnected in an all-to-all topology using the new Xe-Link protocol built into each PVC. Xe-Link supports up to eight accelerators in fully connected mode, so Aurora, needing only six, leaves more power for the hardware. How they are connected to the SPR processors has not been disclosed; Intel has stated that there will be a unified memory architecture between the CPUs and GPUs.
The insight added by Intel today is that each dual-stack Ponte Vecchio implementation (the diagram Intel has repeatedly shown is two stacks side by side) will have a total of 64 MB of L1 cache and 408 MB of L2 cache, backed by HBM2e.
408 MB of L2 cache in two stacks means 204 MB per stack. If we compare this with other hardware:
- The NVIDIA A100 has 40 MB of L2 cache
- AMD's Navi 21 has 128 MB of Infinity Cache (effectively an L3)
- AMD's CDNA2 MI250X in Frontier has 8 MB of L2 per 'stack', or a total of 16 MB
Whichever way you slice it, Intel is betting hard on having the right cache hierarchy for PVC. The PVC diagrams also show four HBM2e stacks per half, suggesting that each dual-stack PVC design could have 128 GB of HBM2e. It is likely that none of them are held as 'spare' for yield purposes, as a chiplet-based design allows Intel to build PVC using known good die from the start.
On top of that, we also get the official number for how many Ponte Vecchio GPUs and Sapphire Rapids (+ HBM) processors are needed for Aurora. Back in November 2019, when Aurora was only listed as a 1 EF supercomputer, I ran some rough numbers based on Intel saying that Aurora would have 200 racks, plus some educated guesswork. I came up with 5,000 CPUs and 15,000 GPUs, at which each PVC would need about 66.6 TF of performance. At the time, Intel was already showing 40 TF of performance per card on early silicon. Intel's official numbers for the Aurora 2 EF machine are:
18000+ CPUs and 54000+ GPUs is a lot of hardware. But dividing 2 ExaFlops by 54,000 PVC accelerators comes to only 37 TeraFlops per PVC as an upper bound, and that number assumes zero performance coming from the CPUs.
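The per-accelerator figures above are simple divisions, sketched here for anyone who wants to plug in other system sizes. The helper name is hypothetical, and the calculation keeps the same simplification as the text: all of the system's FLOPs are attributed to the GPUs.

```python
def tflops_per_gpu(system_exaflops: float, gpu_count: int) -> float:
    """Upper bound on per-GPU throughput in TF, assuming the CPUs contribute nothing."""
    TF_PER_EF = 1_000_000  # 1 ExaFLOP = 1e6 TeraFLOPs
    return system_exaflops * TF_PER_EF / gpu_count


if __name__ == "__main__":
    # 2019 estimate: 1 EF spread over a guessed 15,000 GPUs
    print(f"1 EF / 15,000 GPUs: {tflops_per_gpu(1.0, 15_000):.1f} TF each")
    # Official figures: 2 EF spread over 54,000 GPUs (the stated lower bound)
    print(f"2 EF / 54,000 GPUs: {tflops_per_gpu(2.0, 54_000):.1f} TF each")
```

Since Intel quotes "54000+", the real card count is a lower bound, so the per-card figure for the 2 EF machine is an upper limit.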
To add to the mix: Intel CEO Pat Gelsinger said just a few weeks ago that PVC is coming in at double the performance originally expected, which is what allows Aurora to be a 2 EF machine. Does this mean that the original performance target for PVC was ~20 TF of FP64? Either way, AMD's announcement of the MI250X last week showcased a dual-GPU chip with 47.9 TF of FP64 vector performance, rising to 95.7 TF of FP64 matrix performance. The end result could be that AMD's MI250X actually has better raw performance than PVC; however, AMD needs 560 W for that card, while Intel's power numbers have not been disclosed. We can do some napkin math here too.
- Frontier uses 560 W MI250X cards and is rated for 1.5 ExaFlops of FP64 vector at 30 MW. This means Frontier needs about 31,300 cards (1.5 EF / 47.9 TF) to meet its performance target, and for each 560 W MI250X card Frontier has allocated 958 W of power (30 MW / 31,300 cards). That is a 71% overhead on each card (covering cooling, storage systems, other compute and management, and so on).
- Aurora uses PVC at an unknown power and is rated for 2 ExaFlops of FP64 vector at 60 MW. We know Aurora has 54000+ PVC cards to meet its performance goals, which means the system has about 1,111 W (that's 60 MW / 54,000) per card to cover the PVC accelerator and the other overheads. If we assume (a big assumption, I know) that Frontier and Aurora have similar overheads, then we are looking at roughly 650 W per PVC.
- This would put PVC at roughly 650 W for 37 TF, against MI250X at 560 W for 47.9 TF.
- This raw numbers discussion says nothing about the specific features each card has for its use case.
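The napkin math in the bullets above can be sketched as a short script. The inputs are the ones quoted in the text (560 W and 47.9 TF per MI250X, 1.5 EF at 30 MW for Frontier, 2 EF at 60 MW with 54,000 cards for Aurora). The function names are illustrative, and the key assumption, that both systems carry the same per-card overhead ratio, is the same big guess made in the text rather than anything Intel has disclosed.

```python
def cards_needed(system_exaflops: float, card_tflops: float) -> float:
    """Cards required to reach a system rating, ignoring any CPU contribution."""
    return system_exaflops * 1_000_000 / card_tflops  # 1 EF = 1e6 TF


def budget_per_card_w(system_mw: float, cards: float) -> float:
    """Total system power divided evenly across accelerator cards, in watts."""
    return system_mw * 1_000_000 / cards  # 1 MW = 1e6 W


def overhead_ratio(budget_w: float, card_w: float) -> float:
    """System power per card relative to the card's own power (1.0 = no overhead)."""
    return budget_w / card_w


if __name__ == "__main__":
    # Frontier: card power is known, so the overhead ratio can be derived.
    frontier_cards = cards_needed(1.5, 47.9)
    frontier_budget = budget_per_card_w(30, frontier_cards)
    ratio = overhead_ratio(frontier_budget, 560)
    print(f"Frontier: {frontier_cards:,.0f} cards, {frontier_budget:.0f} W per card, "
          f"{(ratio - 1) * 100:.0f}% overhead")

    # Aurora: card power is unknown, so apply Frontier's ratio (assumption).
    aurora_budget = budget_per_card_w(60, 54_000)
    aurora_card_w = aurora_budget / ratio
    print(f"Aurora: {aurora_budget:.0f} W per card budget, "
          f"~{aurora_card_w:.0f} W per PVC if overheads match Frontier")
```

Changing the overhead ratio or the card count shifts the PVC estimate directly, which is why the result should be read as a rough bound rather than a specification.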
Compute GPU Accelerator Comparison

| Product | Ponte Vecchio | MI250X | A100 80GB |
|---|---|---|---|
| Transistors | 100 B | 58.2 B | 54.2 B |
| Tiles (including HBM) | 47 | 10 | 6 + 1 spare |
| Compute Units | 128 | 2 x 110 | 108 |
| Matrix Cores | 128 | 2 x 440 | 432 |
| INT8 Tensor | ? | 383 TOPs | 624 TOPs |
| FP16 Matrix | ? | 383 TFLOPs | 312 TFLOPs |
| FP64 Vector | ? | 47.9 TFLOPs | 9.5 TFLOPs |
| FP64 Matrix | ? | 95.7 TFLOPs | 19.5 TFLOPs |
| L2 / L3 | 2 x 204 MB | 2 x 8 MB | 40 MB |
| VRAM Capacity | 128 GB (?) | 128 GB | 80 GB |
| VRAM Type | 8 x HBM2e | 8 x HBM2e | 5 x HBM2e |
| VRAM Bandwidth | ? | 3.2 TB/s | 2.0 TB/s |
| Chip-to-Chip Total BW | 8 | 8 x 100 GB/s | 12 x 50 GB/s |
| CPU Coherency | Yes | With IF | With NVLink 3 |
| Manufacturing process | | TSMC N6 | TSMC N7 |
| Form Factors | OAM | OAM (560 W) | SXM4 (400 W *) |
| Release Date | 2022 | 11/2021 | 11/2020 |

\* Some custom implementations go up to 600 W
Intel has also revealed that it will partner with SiPearl to deploy PVC hardware in European HPC efforts. SiPearl is currently building an Arm-based CPU called Rhea, built on TSMC N7.
Going forward, Intel has also released a mini roadmap. There is nothing too surprising here: Intel has plans for designs beyond Ponte Vecchio, and future Xeon Scalable processors will also have HBM-enabled options.