Intel’s next-gen Alder Lake processors are set to release later this year, bringing with them a new design philosophy on a new node meant to challenge AMD. But before its release, Intel provided a breakdown of its best features at Intel Architecture Day 2021 to answer one question: what makes Alder Lake tick?
Alder Lake diverges significantly from Intel’s previous processor designs. Reminiscent of the system-on-chips (SoCs) in smartphones, it features not just one but two core architectures linked together using Intel’s new packaging technology. It will feature up to 16 cores, split between eight Golden Cove performance cores (Intel calls them P-cores) and eight Gracemont efficiency cores (E-cores). Alder Lake also features up to 30MB of cache, 16 lanes of PCIe 5.0, support for both DDR5 and DDR4 memory, and Tiger Lake’s Xe-LP graphics ported to the Intel 7 node.
Alder Lake uses three fabrics to link all its parts together and fine-tune power consumption. The path between the compute cores, graphics, last-level cache (LLC) and memory is the compute fabric, which can operate at 1TB/s. The input/output (I/O) fabric, which operates at 64GB/s, passes data between I/Os and internal devices. And lastly, the memory fabric operates at 204GB/s and can dynamically adjust the bus width and frequency for multiple operating points. Intel says that having multiple dynamically scaling fabrics allows Alder Lake to more efficiently direct power to where it’s needed the most.
Gracemont efficiency core
Gracemont is the architecture name for Intel’s efficiency cores. It features an overhauled design with a deeper frontend and a wider backend, and will be built on the Intel 7 node, previously known as 10nm Enhanced SuperFin. Its numerous energy and performance enhancements, as well as advanced transistors, converge to form the efficient cores that will debut in Alder Lake.
Branch prediction is a critical feature in modern CPUs. It predicts which instructions are needed next before a program even requests them, thereby reducing CPU wait times and wasted work. Many of the CPU’s processing stages depend on accurate branch predictions; for example, if a branch is mispredicted, the instructions queued in the out-of-order buffer may need to be flushed. Gracemont has a 5,000-entry branch target cache for its history-based branch prediction to help generate accurate instruction pointers, reducing the chances of a mispredict.
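To make the idea concrete, here is a minimal sketch in Python of a history-based predictor paired with a branch target cache. The scheme (2-bit saturating counters keyed by a global history register) is a textbook approach, not Gracemont’s actual design, and the sizes are illustrative only.

```python
# Textbook history-based branch predictor with a branch target cache (BTC).
# Not Intel's implementation; entry counts and hashing are invented.

class BranchPredictor:
    def __init__(self, history_bits=8):
        self.btc = {}               # branch address -> last seen target
        self.counters = {}          # (address, history) -> 2-bit counter
        self.history = 0            # global history of recent outcomes
        self.history_bits = history_bits

    def predict(self, pc):
        """Guess (taken?, target) before the branch actually executes."""
        key = (pc, self.history)
        taken = self.counters.get(key, 1) >= 2   # default: weakly not-taken
        target = self.btc.get(pc, pc + 4) if taken else pc + 4
        return taken, target

    def update(self, pc, taken, target):
        """Train the predictor with the real outcome after execution."""
        key = (pc, self.history)
        ctr = self.counters.get(key, 1)
        self.counters[key] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        if taken:
            self.btc[pc] = target
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

After a few iterations of a hot loop, the predictor has learned both the outcome and the target of the loop branch, so the frontend can fetch the right instructions without waiting.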
In addition to cutting wait times, more branch prediction resources reduce cache misses by loading relevant data into the cache, again before the program requests it. Gracemont carries a 64KB instruction cache that keeps the most frequently used instructions close at hand, as well as Intel’s first “on-demand instruction length decoder” to decode new code quickly.
The main instruction decoder got an upgrade too. It can now decode up to six instructions per cycle while maintaining the efficiency of a much narrower core. The decoder, which translates instructions into micro-ops (uOps), is key to keeping the backend fed at all times so the processor achieves maximum efficiency; being able to decode more instructions per clock is of course better for overall performance.
The decoders are aided by a new hardware-driven load balancer. Instead of dumping a long chain of sequential instructions onto a few decoders, the load balancer breaks them up into smaller segments and distributes them across all of the decoders, increasing parallelism.
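The balancing idea can be sketched as follows. The fixed-size, round-robin segmentation here is invented for illustration; in real hardware the split points follow the instruction stream itself (for example, branch boundaries) rather than fixed offsets.

```python
# Illustrative sketch of distributing a sequential instruction stream
# across multiple decode clusters instead of feeding just one of them.

def distribute(stream, clusters=2, segment=3):
    """Cut the stream into segments and hand them out round-robin."""
    queues = [[] for _ in range(clusters)]
    for i in range(0, len(stream), segment):
        queues[(i // segment) % clusters].append(stream[i:i + segment])
    return queues
```

Each cluster now works on its own segment in parallel, rather than idling while a single decoder chews through the whole chain.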
On the backend, Gracemont features a five-wide allocation stage and a 256-entry out-of-order window. The allocation stage bridges the frontend with the backend of the CPU, while the out-of-order window determines how many uOps the core can buffer out of order before they’re dispatched to the execution units.
Intel says that Gracemont’s microarchitecture enhancements deliver a higher general IPC while consuming a fraction of the power.
Further down the process flow are the execution units, or EUs for short. Gracemont’s 17 execution ports can be scaled to the requirements of each unit. The integer EU ports are complemented by dual integer multipliers and dividers. In addition, the single-instruction, multiple-data (SIMD) arithmetic logic units (ALUs) handling vector operations now support Intel’s Vector Neural Network Instructions (VNNI).
Two floating-point pipelines allow the execution of two independent add or multiply operations, as well as two multiply-add instructions per cycle thanks to new vector extension instructions. Gracemont’s vector stack also comes with cryptography units that provide AES and SHA acceleration, allowing it to offload encryption workloads in security-sensitive applications.
Finally, there’s the memory subsystem. To increase cache bandwidth, Intel has added two load and two store pipelines that enable simultaneous 32-byte reads and writes. The L2 cache size is configurable between 2 and 4 MB.
In a core-on-core comparison, Intel said that Gracemont delivers 40 per cent more performance at the same power as Skylake, and delivers the same performance using less than 40 per cent of the power. That second figure works out to Gracemont being around 2.5 times more efficient in single-core scenarios. In a four-core configuration, Gracemont delivered 80 per cent more performance than two Skylake cores running four threads while still consuming less power. Moreover, Intel noted that four Gracemont cores can fit into the same footprint as a single Skylake core.
Golden Cove performance core
The story is much the same with Golden Cove, Alder Lake’s performance-core (P-core). The theme of making them deeper, wider and smarter persists, starting with branch prediction.
Like Gracemont, Golden Cove also features a deeper out-of-order scheduler and buffer, more physical registers, a wider allocation window and more execution ports to increase parallelism.
It can perform four page table walks in parallel. A page table is a “map” of the virtual addresses assigned to a program and is used to help allocate memory more effectively. A table walk is the tracing of page tables to find out which virtual memory addresses are mapped to which physical ones. The resulting mappings are stored in a translation lookaside buffer (TLB) to keep table walks to a minimum.
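A toy model of the mechanism: a two-level page table that is walked on a TLB miss, with the result cached so the next access to the same page skips the walk entirely. The two-level layout and 512-entry tables are simplifications; real x86-64 page tables have four or five levels.

```python
# Toy page table walk with a TLB. Dict-based tables and a two-level
# layout are used for clarity; this is not the real x86-64 format.

PAGE = 4096  # 4K pages, as referenced in the article

class MMU:
    def __init__(self, root):
        self.root = root   # level-1 table: index -> level-2 table
        self.tlb = {}      # virtual page number -> physical frame number

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE)
        if vpn in self.tlb:                # TLB hit: no walk required
            return self.tlb[vpn] * PAGE + offset
        l1, l2 = divmod(vpn, 512)          # the "walk": index each level
        frame = self.root[l1][l2]
        self.tlb[vpn] = frame              # cache the mapping for next time
        return frame * PAGE + offset
```

The first access to a page pays for the walk; every later access to that page is a cheap TLB lookup, which is why larger TLBs and parallel walkers help programs with big code and data footprints.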
For programs with larger code footprints, Alder Lake’s P-cores double the number of 4K page entries stored in the iTLB, and feature improved branch prediction accuracy to reduce jump mispredicts along with a better code prefetch mechanism. The branch target buffer is also twice as large as the previous generation’s and uses a machine-learning algorithm to dynamically adjust its size to reduce power consumption or boost performance.
It also includes new dedicated hardware and ISA extensions for matrix multiplication, which Intel says will greatly improve AI-accelerated workloads.
Being the performance core doesn’t mean efficiency is left by the wayside; power management is also one of Golden Cove’s key focuses. On that front, Golden Cove features a new microcontroller that can measure and adjust power consumption in microseconds instead of milliseconds. Intel says the change is based on actual application behaviour instead of general speculation. The finer power tuning enables a higher average frequency in any application without a severe power penalty.
Golden Cove has a wider, six-way decoder, up from four-way, fed at 32 bytes per cycle. The uOp cache has been enlarged to hold 4,000 operations instead of 2,250, increasing frontend bandwidth while letting more instructions bypass the decoders for a shorter, lower-latency pipeline.
The frontend certainly saw some improvements, but Intel credits the out-of-order engine as the component that separates Alder Lake from previous architectures. The P-cores feature a six-wide register rename and allocation stage and 12 execution ports, up from the five and 10 of the previous generation. Other enhancements include more physical registers, a deeper scheduling window and a new 512-entry reorder buffer.
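Register renaming, which that wider rename and allocation stage performs for up to six instructions per cycle, can be sketched with the standard free-list scheme below. This is the textbook technique, not a description of Golden Cove’s internals, and the pool size is invented.

```python
# Textbook register renaming with a free list: each new write to an
# architectural register gets a fresh physical register, so writes that
# reuse the same architectural name no longer serialize execution.

class Renamer:
    def __init__(self, physical=280):
        self.free = list(range(physical))  # pool of unused physical regs
        self.latest = {}                   # arch register -> physical reg

    def rename_dest(self, arch):
        """Allocate a fresh physical register for a new write."""
        phys = self.free.pop(0)
        self.latest[arch] = phys
        return phys

    def rename_src(self, arch):
        """A source operand reads the newest mapping for its register."""
        return self.latest[arch]
```

More physical registers mean the free list runs dry less often, which is one way a deeper out-of-order engine keeps more instructions in flight.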
Both the L1 and L2 caches have been expanded and their fetch rates increased. Two L2 cache configurations are available: 1.25MB for consumer parts and 2MB for enterprise parts.
All in all, Intel says that these enhancements lend Alder Lake’s P-cores a 19 per cent average performance lead over its previous-gen Rocket Lake’s Cypress Cove core at the same frequency. Rocket Lake’s Cypress Cove microarchitecture is built on Intel’s 14nm node and backports the Sunny Cove microarchitecture from Ice Lake.
Intel Thread Director
Improving performance while reducing power consumption is a perpetual balancing act in processor design. With previous processor generations, a single core architecture had to do double duty on both ends. With Alder Lake’s dedicated performance and efficiency cores, Intel hopes to better address the two ends, similarly to the way it’s done in today’s smartphone chips.
But blending architectures introduces its own set of challenges. Now that the processor is no longer monolithic, it needs a data highway connecting its parts, and the latency that highway introduces is something engineers work hard to minimize. Thread scheduling also becomes an issue: which workloads should be prioritized, and how? And how can they be optimized for both today’s and emerging workloads?
To address these issues, Intel added a new hardware scheduler called Intel Thread Director. Its job is to keep tabs on the types of instructions being fed into the processor and help the operating system make optimal scheduling decisions. Beyond the running programs, Thread Director also accounts for thermals, operating conditions, and power limits. It picks out the threads that need the most performance and assigns them to the P-cores, delegates background tasks to the E-cores, and steers AI threads to the P-cores. Everything is dynamic, based on the tasks at hand, and fully autonomous.
But that doesn’t mean Thread Director locks heavy workloads exclusively to P-cores. It will take advantage of any idle cores if there are resources available. In a heavy multithreaded workload, Thread Director will distribute the workload across all P and E-cores.
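As a rough illustration of the policy described above, here is a toy software model. The real Thread Director is a hardware unit that supplies hints to the operating system’s scheduler; the thread classes, the priority order and the core count below are invented for the sketch.

```python
# Toy model of the scheduling policy described in the article: demanding
# threads get P-cores first, while background tasks and any overflow
# from a full set of P-cores spill onto the E-cores.

def schedule(threads, p_cores=8):
    """Assign (name, cls) thread tuples to core types.

    cls is one of 'performance', 'ai', or 'background'.
    """
    hot = [name for name, cls in threads if cls in ("performance", "ai")]
    background = [name for name, cls in threads if cls == "background"]

    assignment = {}
    for name in hot[:p_cores]:               # demanding threads -> P-cores
        assignment[name] = "P-core"
    for name in background + hot[p_cores:]:  # background + spillover -> E-cores
        assignment[name] = "E-core"
    return assignment
```

With ten demanding threads and eight P-cores, the two overflow threads land on E-cores alongside the background work rather than waiting for a P-core to free up, mirroring the spillover behaviour described above.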