One of the critical deficits Intel has against its competition in the server space is core count – other companies enable more cores in one of two ways: smaller cores, or individual chiplets connected together. At Architecture Day 2021, Intel disclosed features of its next-generation Xeon Scalable platform, one of which is the move to a tiled architecture. Intel is set to combine four tiles/chiplets via its fast embedded silicon bridges, enabling better CPU scalability at higher core counts. As part of the disclosure, Intel also expanded on its new Advanced Matrix Extensions (AMX) technology, support for CXL 1.1, DDR5, and PCIe 5.0, and an accelerator interfacing architecture that could lead to custom Xeon CPUs in the future.
What is Sapphire Rapids?
Built on the Intel 7 process, Sapphire Rapids (SPR) will be Intel’s next-generation Xeon Scalable server processor for its Eagle Stream platform. Using its latest Golden Cove processor cores, which we detailed last week, Sapphire Rapids will bring together a number of key technologies for Intel: Acceleration Engines, native half-precision FP16 support, DDR5, Optane DC Persistent Memory 300 series, PCIe 5.0, CXL 1.1, a wider and faster UPI, its latest bridging technology (EMIB), new QoS and telemetry, HBM, and specialized workload acceleration.
Scheduled for launch in 2022, Sapphire Rapids will be Intel’s first modern CPU product to take advantage of a multi-die architecture, one that aims to minimize latency and maximize bandwidth thanks to its Embedded Multi-Die Interconnect Bridge (EMIB) technology. This allows for more high-performance cores (Intel has not yet said how many), with a focus on ‘metrics that matter to its customer base, such as node performance and data center performance’. Intel calls SPR “the biggest leap in DC capabilities in a decade.”
The headline benefits are easy to list. PCIe 5.0 is an upgrade over the PCIe 4.0 of the previous-generation Ice Lake, and we move from six 64-bit DDR4 memory controllers to eight 64-bit DDR5 memory controllers. But the bigger improvements are in the cores, the accelerators, and the packaging.
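The memory-controller change alone is a sizeable jump in peak theoretical bandwidth. As a rough sketch, assuming Ice Lake’s DDR4-3200 and the commonly expected DDR5-4800 for Sapphire Rapids (Intel did not confirm supported DDR5 speeds at Architecture Day):

```python
def peak_bandwidth_gbs(channels: int, transfer_rate_mt: int, bus_width_bits: int = 64) -> float:
    """Peak theoretical bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * transfer_rate_mt * (bus_width_bits // 8) / 1000

# Ice Lake: six DDR4-3200 channels (assumption: running at max supported speed)
ice_lake = peak_bandwidth_gbs(channels=6, transfer_rate_mt=3200)   # 153.6 GB/s
# Sapphire Rapids: eight DDR5 channels (DDR5-4800 is our assumption, not confirmed)
sapphire = peak_bandwidth_gbs(channels=8, transfer_rate_mt=4800)   # 307.2 GB/s
print(ice_lake, sapphire)
```

At those numbers the new platform would double per-socket bandwidth, before even counting the optional HBM.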
Golden Cove: A high-performance core with AMX and AIA
With the same core design used in both the enterprise Sapphire Rapids platform and the consumer Alder Lake platform, Intel recreates some of the synergies we saw in the early 2000s, when it last shared cores this way. We covered Golden Cove in detail in our Alder Lake architecture deep dive, but here is a short summary:
The new core, according to Intel, will deliver a +19% increase in single-threaded IPC compared to Cypress Cove, which was Intel’s backport of Ice Lake. This comes down to several big core changes, including:
- 16B → 32B length decoding
- 4-wide → 6-wide decoding
- 5K → 12K branch targets
- 2.25K → 4K μop cache
- 5-wide → 6-wide allocation
- 10 → 12 execution ports
- 352 → 512-entry reorder buffer
The goal of any new core is to process more things faster, and the latest generation tries to do that better than before. Many of Intel’s changes make sense, and those who want the deeper details are encouraged to read our deep dive.
There are some big differences between the consumer version of this core in Alder Lake and the server version in Sapphire Rapids. The most obvious one is that the consumer version does not have AVX-512 enabled, while SPR will have it. SPR also has 2 MB of private L2 cache per core, whereas the consumer model has 1.25 MB. Beyond that, we are talking about Advanced Matrix Extensions (AMX) and a new Accelerator Interface Architecture (AIA).
So far in Intel’s CPU cores we have had scalar operation (normal) and vector operation (AVX, AVX2, AVX-512). The next stage up from that is dedicated matrix math, something akin to the tensor cores in a GPU. AMX does this by adding a new expandable register file along with dedicated AMX instructions in the form of TMUL instructions.
AMX uses eight 1 KB tile registers for basic data operations, and through memory references, the TMUL instructions will operate on tiles of data using those tile registers. TMUL is supported through a dedicated co-processor engine built into the core (each core has one), and the idea behind AMX is that TMUL is just one such co-processor. Intel has designed AMX to be wider in scope than that – should Intel go deeper with its multi-die silicon strategy, at some point we might see custom accelerators enabled through AMX.
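To make the tile concept concrete, here is a minimal sketch of what a TMUL operation such as TDPBSSD computes: multiply a tile of signed INT8 values by another and accumulate into an INT32 tile, i.e. C += A × B. The real instruction works on the 1 KB tile registers (up to 16 rows × 64 bytes each); plain Python lists stand in for those registers here, purely for illustration.

```python
def tdpbssd(C, A, B):
    """Emulated tile dot-product: C[m][n] += sum over k of A[m][k] * B[k][n].

    In hardware, A and B hold signed INT8 elements and C holds INT32
    accumulators; Python integers model that without overflow concerns.
    """
    for m in range(len(A)):
        for n in range(len(B[0])):
            C[m][n] += sum(A[m][k] * B[k][n] for k in range(len(B)))
    return C

# Tiny 2x4 by 4x2 example "tiles" (real tiles are far larger)
A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 0], [0, 1]]
C = [[0, 0], [0, 0]]
print(tdpbssd(C, A, B))  # [[4, 6], [12, 14]]
```

The point of doing this in a dedicated co-processor is that the whole multiply-accumulate over a tile happens as one instruction, rather than as a long sequence of vector operations.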
Intel has confirmed that we should not see frequency dips any worse than those of AVX – there are new fine-grained per-core power controllers for when vector and matrix instructions are invoked.
This ties in nicely to the discussion of AIA, the new accelerator interface. Typically, when working with add-in accelerator cards, commands must navigate between kernel space and user space, set up memory, and direct any virtualization between multiple hosts. The way Intel describes its new Acceleration Engine interface is akin to talking to a PCIe device as if it were simply an accelerator on the CPU itself, even though it is attached through PCIe.
Initially, Intel will have two pieces of AIA-capable hardware.
Intel QuickAssist Technology (QAT) is one we have seen before, as it featured in special variants of Skylake Xeon chipsets (where it required a PCIe 3.0 x16 link) as well as in an add-in PCIe card form. The version in Sapphire Rapids will support up to 400 Gb/s of symmetric cryptography, or 160 Gb/s of compression plus 160 Gb/s of decompression simultaneously, double the previous version.
The other is Intel’s Data Streaming Accelerator (DSA). Intel has had documentation about DSA on the web since 2019, describing it as a high-performance data copy and transformation accelerator for moving data to or from storage and memory, or to other parts of the system, via a DMA remapping hardware unit / IOMMU. DSA has been requested by specific hyperscaler customers who are looking to deploy it within their own internal cloud infrastructure, and Intel is keen to point out that some customers will use DSA, some will use Intel’s new Infrastructure Processing Unit, and some will use both, depending on the level of integration or abstraction they are interested in. Intel told us that DSA is an upgrade over the Crystal Beach DMA engine present on the Purley (SKL+CLX) platforms.
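The programming model for an engine like DSA is descriptor-based: software fills in a small record describing the operation, source, destination, and length, submits it, and the engine completes it asynchronously while the CPU does other work. The toy model below illustrates that shape only; all names are hypothetical, and the real interface uses hardware work queues with instruction-level submission rather than a Python function call.

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    """Hypothetical work descriptor: operation type, buffers, and length."""
    opcode: str        # e.g. "memmove"; DSA also offers fill, compare, CRC, etc.
    src: bytearray
    dst: bytearray
    length: int

def submit(desc: Descriptor) -> str:
    """Pretend engine: performs the copy synchronously and returns a status.

    A real engine would queue the descriptor and raise a completion
    record/interrupt later, freeing the CPU in the meantime.
    """
    if desc.opcode == "memmove":
        desc.dst[:desc.length] = desc.src[:desc.length]
        return "success"
    return "unsupported"

src = bytearray(b"hello dsa")
dst = bytearray(len(src))
status = submit(Descriptor("memmove", src, dst, len(src)))
print(status, dst)  # success bytearray(b'hello dsa')
```

The win is that bulk copies and transformations stop consuming CPU cycles and cache capacity, which is exactly what the hyperscalers asking for DSA care about.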
On top of this, Sapphire Rapids also supports AVX512_FP16 instructions for half-precision, mostly for AI workloads as part of its DLBoost strategy (Intel was fairly quiet on DLBoost during the event). These FP16 instructions can also be used as part of AMX, alongside INT8 and BF16 support. Intel now also supports CLDEMOTE for cache-line management.
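Half-precision trades range and accuracy for density: an FP16 value has roughly 11 bits of significand, so many values round noticeably compared to FP32. The snippet below only demonstrates the IEEE binary16 number format (via Python’s `struct` `'e'` code), not the AVX512_FP16 instructions themselves:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_fp16(0.1))      # 0.0999755859375 - 0.1 is not exactly representable
print(to_fp16(65504.0))  # 65504.0 - the largest finite FP16 value
print(to_fp16(1e-8))     # 0.0 - underflows below the FP16 subnormal range
```

This limited range is why AI workloads pair FP16 (or BF16, which keeps FP32’s exponent range) with wider accumulators, as AMX does with its INT32 and FP32 accumulation.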
A side word about CXL
Throughout the Sapphire Rapids presentations, Intel was keen to stress that it will support CXL 1.1 at launch. CXL is a connectivity standard designed to handle much more than what PCIe does – beyond simply acting as a data transfer from host to device, CXL has three branches of support, known as IO, cache, and memory. As defined in the CXL 1.0 and 1.1 standards, these three form the basis of a new way to connect a host to a device.
Naturally, we expected all CXL 1.1 devices to support all three of these standards. It was only after Hot Chips, a few days later, that we learned Sapphire Rapids supports only part of the CXL standard, specifically CXL.io and CXL.cache – CXL.memory will not be part of SPR. We are not sure to what extent this means SPR is not CXL 1.1 compliant, nor what it means for CXL 1.1 devices; without CXL.mem, according to the diagram above, all Intel loses is support for Type-2 devices. Perhaps this is more an indication that the market around CXL is better served by CXL 2.0, which will no doubt come in a later product.
On the next page, we’ll take a look at Intel’s new tiled architecture for Sapphire Rapids.