Is the L4 cache about to go mainstream?
According to Wikipedia, a cache is a type of memory with faster access than ordinary random access memory (RAM). It usually does not use the DRAM technology of system main memory, but instead uses SRAM technology, which is more expensive but faster.
When the CPU processes data, it first looks in the cache. If the data was already stored there by a previous operation, there is no need to read it from main memory. Because the CPU generally runs much faster than main memory can respond, a main-memory access cycle (the time needed to access main memory) takes several clock cycles, so every access to main memory forces the CPU to wait and waste those cycles.
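As a rough, illustrative calculation (the latency and clock figures below are assumptions for the sake of the example, not measurements of any particular chip), the cost of a main-memory access can be expressed in wasted CPU cycles:

```cpp
#include <cstdio>

// Back-of-envelope sketch: cycles lost while waiting for one memory access.
// All numbers are illustrative assumptions, not measurements.
int main() {
    const double cpu_clock_ghz = 3.0;      // assumed CPU clock: 3 GHz
    const double dram_latency_ns = 100.0;  // assumed DRAM access latency: ~100 ns
    const double sram_latency_ns = 1.0;    // assumed on-die SRAM (L1) latency: ~1 ns

    // cycles lost = latency (ns) * clock (GHz), since at 1 GHz one cycle lasts 1 ns
    printf("Main-memory access costs ~%.0f cycles\n", dram_latency_ns * cpu_clock_ghz);
    printf("L1 cache hit costs       ~%.0f cycles\n", sram_latency_ns * cpu_clock_ghz);
    return 0;
}
```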
As early as the PC/XT, PC/AT and 80286 era there was no cache: both the CPU and the memory were slow, and the CPU accessed memory directly. The 80386 chipsets added support for an optional cache, with better motherboards carrying 64 KB, and high-end boards even 128 KB, of write-through cache.
The 80486 added an 8 KB L1 unified cache on the CPU, also called the internal cache, which made no distinction between code and data; the cache in the chipset became the L2, also called the external cache, ranging from 128 KB to 256 KB; the write-back cache attribute was also introduced. The Pentium split the L1 cache into code and data, 8 KB each, while the L2 still sat on the motherboard. The Pentium Pro moved the L2 into the CPU package, and starting with the Pentium III the L2 cache was placed on the CPU die. From the Intel Core CPUs onward, the L2 cache was shared by multiple cores.
The CPU cache was once an advanced technology found only on supercomputers, but the AMD and Intel microprocessors in today's computers integrate data caches and instruction caches of various sizes on the chip. The common terms are the L1 cache (Level 1 on-die cache, or on-chip buffer); the L2 cache, larger than the L1, which was once placed outside the CPU (on the motherboard or on a CPU interface card) but is now a standard component inside the CPU; and, on more expensive CPUs, an L3 cache (Level 3 on-die cache) that is larger still.
But until now, mainstream chips have stopped at the L3.
01 The evolution of the cache
According to Jerason Banes, a senior architect answering the question on Quora, the reason L4 has not been adopted has a lot to do with how CPU architecture evolved.
"If we roll the clock back to the 6502 CPU, you'll find it's a very simple design, and quite precise in practice, mainly because the 6502 uses the von Neumann architecture," he said. "This is a very simple structure, in which the memory, CPU, and I/O all operate in lockstep with a master clock."
This means that if the CPU runs at 1.4 MHz, the memory also has to run at 1.4 MHz, and that quickly becomes a problem.
As CPUs accelerated toward 10 MHz, RAM that could keep up became very expensive, he says. In fact, the physical distance from the RAM to the CPU alone made it almost impossible to keep pace with the CPU. In addition, the growth of RAM capacity (to megabyte sizes) meant that more complex control circuits were needed to address memory locations, so when memory is read non-sequentially, its settling time becomes unpredictable.
In response, the industry's solution was to keep a small amount of fast memory next to the CPU: the L1 cache. The L1 plus the CPU now behaves like the old 6502 system, but it can call on RAM as if it were an ultra-fast hard drive, fetching any block of RAM that is not currently present in L1. As with a hard drive, the CPU only waits when the necessary data has to be pulled into the L1 cache.
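To make the "fetch a block into the cache on a miss" idea concrete, here is a minimal sketch of a direct-mapped cache lookup. The sizes, names and in-memory backing store are illustrative assumptions; real L1 caches are set-associative and built in hardware, not software.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Minimal direct-mapped cache sketch: 64-byte lines, 512 sets (32 KB total).
constexpr size_t kLineSize = 64;
constexpr size_t kNumSets  = 512;

struct CacheLine {
    bool     valid = false;
    uint64_t tag   = 0;
    uint8_t  data[kLineSize] = {};
};

struct Cache {
    CacheLine sets[kNumSets];
    std::vector<uint8_t>& memory;  // the "main memory" backing store
    size_t hits = 0, misses = 0;

    explicit Cache(std::vector<uint8_t>& mem) : memory(mem) {}

    uint8_t read(uint64_t addr) {
        uint64_t line_addr = addr / kLineSize;
        uint64_t set       = line_addr % kNumSets;
        uint64_t tag       = line_addr / kNumSets;
        CacheLine& line = sets[set];
        if (!line.valid || line.tag != tag) {  // miss: pull the whole line in
            ++misses;
            line.valid = true;
            line.tag = tag;
            std::memcpy(line.data, &memory[line_addr * kLineSize], kLineSize);
        } else {
            ++hits;
        }
        return line.data[addr % kLineSize];
    }
};

int main() {
    std::vector<uint8_t> ram(1 << 20, 1);  // 1 MB of "main memory"
    Cache cache(ram);
    uint64_t sum = 0;
    for (uint64_t a = 0; a < ram.size(); ++a) sum += cache.read(a);  // sequential scan
    printf("sum=%llu hits=%zu misses=%zu\n",
           (unsigned long long)sum, cache.hits, cache.misses);
    return 0;
}
```

A sequential scan like the one above misses only once per 64-byte line; the other 63 accesses to that line are hits, which is exactly the locality the next paragraph relies on.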
This approach works because the key parts of a program are often very small, small enough that the CPU can spend a long time running them before needing to return to main memory. The fly in the ointment is that code and data share the cache. A program may be processing a large amount of data, which makes poor use of the L1 cache and causes a large number of cache misses. The CPU then effectively slows down to main-memory speed, which would be acceptable except that the CPU is capable of running far faster than that.
The problem is that the code the CPU is running easily falls into the "least recently used" bucket and gets evicted from L1 in favor of the data being processed. When the CPU returns to that block of code, it has to stall and fetch the code from main memory again. This is very inefficient.
To solve this problem, CPUs turned to the Harvard architecture (named after the early Harvard Mark I computer). In the Harvard architecture, code and data are stored in separate memories. This has the advantage of simplifying the data paths, because executable code flows through one set of paths and data through another. More importantly, by splitting the L1 cache into two parts (code and data), the data can never accidentally evict the running code.
So the 64 KB L1 cache was split into 32 KB for code and 32 KB for data.
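On a Linux machine you can see this split, and the rest of the hierarchy, directly. The sketch below simply walks the sysfs cache directories, assuming the usual /sys/devices/system/cpu/cpu0/cache layout is present:

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

// Print the cache hierarchy of CPU 0 as Linux reports it in sysfs.
// Typical output shows separate L1 "Instruction" and "Data" caches,
// plus the unified L2 and L3 levels discussed in the article.
int main() {
    namespace fs = std::filesystem;
    const fs::path base = "/sys/devices/system/cpu/cpu0/cache";
    if (!fs::exists(base)) {
        std::cerr << "No sysfs cache information found\n";
        return 1;
    }
    auto read = [](const fs::path& p) {
        std::ifstream f(p);
        std::string s;
        std::getline(f, s);
        return s;
    };
    for (const auto& entry : fs::directory_iterator(base)) {
        if (entry.path().filename().string().rfind("index", 0) != 0) continue;
        std::cout << "L" << read(entry.path() / "level")
                  << " " << read(entry.path() / "type")
                  << " " << read(entry.path() / "size") << "\n";
    }
    return 0;
}
```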
The rest of the cache hierarchy exists to cope with ever-faster CPUs. The faster the CPU clock, the shorter the distance electricity can travel within one clock cycle. Since this is an insurmountable physical limit, CPU designers began to prefetch the data they expect to need from main memory. The L2 usually runs at half the CPU's speed, but it is larger, so it can stream in the next few blocks sequentially while the CPU itself is busy. When the CPU asks for the next piece of data, it hardly has to wait at all. That is how we got the 256 KB L2 cache.
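Software can give the same kind of "fetch it before I need it" hint. Below is a minimal sketch using the GCC/Clang __builtin_prefetch intrinsic; the distance of eight cache lines ahead is an arbitrary assumption for illustration, not a tuned value.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sum an array while asking the hardware to pull future cache lines
// closer in advance. __builtin_prefetch is a GCC/Clang intrinsic.
uint64_t sum_with_prefetch(const std::vector<uint64_t>& v) {
    constexpr size_t kAhead = 8 * 64 / sizeof(uint64_t);  // 8 cache lines ahead (assumed)
    uint64_t total = 0;
    for (size_t i = 0; i < v.size(); ++i) {
        if (i + kAhead < v.size())
            __builtin_prefetch(&v[i + kAhead], /*rw=*/0, /*locality=*/1);
        total += v[i];
    }
    return total;
}

int main() {
    std::vector<uint64_t> v(1 << 22, 1);  // 32 MB working set, larger than typical L2/L3
    printf("sum=%llu\n", (unsigned long long)sum_with_prefetch(v));
    return 0;
}
```

In practice hardware prefetchers already handle a simple sequential scan like this well; explicit hints tend to matter for less regular access patterns.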
But what about L3? Where did that come from?
The last piece of the puzzle comes from multicore processors. Each core wants to keep working, but the cores would stall every time they needed more data from main memory. The L3 cache therefore acts as a buffer between the L2 caches of the different cores. It, in turn, tries to service all main-memory requests a little at a time, to increase the likelihood that it already holds the data when an L2 cache asks for it. This is why L3 size tends to grow with core count: the more cores there are, the more likely they are to end up fighting over main memory.
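That "fighting over main memory" is easy to provoke. In the sketch below, several threads each stream over their own slice of a shared array; once the combined working set no longer fits in the shared L3, the cores start competing for DRAM bandwidth. The 64 MB size and four-thread count are illustrative assumptions.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Several cores streaming over a working set larger than the shared L3
// all end up contending for main-memory bandwidth.
int main() {
    constexpr size_t kThreads = 4;
    std::vector<uint64_t> data(64ull * 1024 * 1024 / sizeof(uint64_t), 1);  // ~64 MB
    std::vector<uint64_t> partial(kThreads, 0);

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    const size_t chunk = data.size() / kThreads;
    for (size_t t = 0; t < kThreads; ++t) {
        workers.emplace_back([&, t] {
            uint64_t sum = 0;
            for (size_t i = t * chunk; i < (t + 1) * chunk; ++i) sum += data[i];
            partial[t] = sum;
        });
    }
    for (auto& w : workers) w.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();

    uint64_t total = 0;
    for (uint64_t p : partial) total += p;
    printf("sum=%llu in %lld ms\n", (unsigned long long)total, (long long)ms);
    return 0;
}
```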
According to the available data, the L3 cache became mainstream with Nehalem (the first Core i7 series), a single quad-core CPU chip, meaning all four cores sit on the same piece of silicon. By contrast, its predecessor, the Core 2 Quad, consisted of two separate Core 2 Duo dies in the same package.
02 Why stop at L4?
Before this section begins, we should note that while L4 has not yet been adopted by mainstream chips, IBM added L4 cache to some of its x86 chipsets as early as the early 2000s, and in 2010 added L4 cache to the NUMA interconnect chipset of its System z11 mainframes.
IBM's z11 processor has four cores, each with 64 KB of L1 instruction cache and 128 KB of L1 data cache, as well as 1.5 MB of L2 cache per core and 24 MB of L3 cache shared by the four cores. The NUMA chipset for the z11 has two banks of 96 MB of L4 cache, for a total of 192 MB.
On the question of why L4 cache has not been used, Zhihu user tjunangang answered that the benefit of caching is limited by locality: the investment and the return are not proportional, so it is simply not cost-effective. As shown in the figure below, he says, the L3 cache already occupies nearly the area of two cores. What happens if you add a fourth level? A fifth? Given how quickly cache capacity grows from level to level, an L4 cache alone would be larger than the entire existing CPU.
In tjunangang's view, introducing an L4 cache could lead to the following problems:
1. A wafer that could originally yield 32 CPUs might not yield even 10 once an L4 cache is added; the price would skyrocket and nobody would buy it.
2. A large die means high power consumption, high heat output and demanding cooling requirements, which runs counter to the direction the industry is heading. A large die also lowers the yield rate, and worse, such chips break easily (I have broken quite a few myself).
3. Piling on cache is obviously a money-is-no-object approach; after all, Intel's early processors did not even have an L3 cache. The key problem is that if all you do is stack more cache, much like earlier spec races, you would be better off finding ways to improve the architecture instead.
4. In terms of contribution to overall performance, the L1 cache matters most, the L2 comes next, and the L3 contributes less than a tenth of what the L1 does. Ignoring cost, an L4 is still understandable, but an L5 would be little different from system memory. (A rough average-access-time calculation of this diminishing return is sketched after this list.)
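One quick way to see the diminishing return is the usual average memory access time (AMAT) recurrence. The hit times and miss rates below are illustrative assumptions chosen only to show the shape of the curve, not data for any real chip:

```cpp
#include <cstdio>

// Average memory access time for a cache hierarchy:
//   AMAT = hit_time[0] + miss_rate[0] * (hit_time[1] + miss_rate[1] * (... + DRAM))
// All latencies (in cycles) and miss rates are illustrative assumptions.
double amat(const double* hit_time, const double* miss_rate, int levels, double dram) {
    double t = dram;
    for (int i = levels - 1; i >= 0; --i)
        t = hit_time[i] + miss_rate[i] * t;
    return t;
}

int main() {
    const double hit[4]  = {4, 12, 40, 60};          // assumed L1..L4 hit latency, cycles
    const double miss[4] = {0.10, 0.30, 0.50, 0.70}; // assumed miss rate at each level
    const double dram = 300;                         // assumed main-memory latency, cycles

    // Each added level improves the average less than the one before it.
    printf("L1+L2 only   : %.1f cycles\n", amat(hit, miss, 2, dram));
    printf("L1+L2+L3     : %.1f cycles\n", amat(hit, miss, 3, dram));
    printf("L1+L2+L3+L4  : %.1f cycles\n", amat(hit, miss, 4, dram));
    return 0;
}
```

With these assumed numbers, L3 cuts the average from roughly 14 cycles to about 11, while the added L4 shaves off only a fraction of a cycle, which is exactly the "investment versus return" argument above.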
The well-known blogger "Lao Wolf" points out that an L4 already exists in two forms: eDRAM and Optane DIMMs. In Intel's Iris series, for example, a high-speed DRAM die is placed in the package (eDRAM); it normally acts as graphics memory but can also be configured as an L4 cache.
In an interview with nextplatform, Rabin Sugumar, a former chip architect at Cray Research, Sun Microsystems, Oracle, Broadcom, Cavium and Marvell, said that no one stipulated that the L4 cache must be made up of embedded DRAM (as IBM does with its chips) or more expensive SRAM.
In his view, today's L3 caches are already large. As for the L4, Rabin Sugumar also thinks it could be eDRAM, or even HBM or DRAM. One interesting possibility here is an L4 implementation that uses HBM as a cache, serving as a bandwidth cache rather than a latency cache.
"The idea is that because HBM has limited capacity and high bandwidth, we can get some performance improvements, and we do see significant gains in bandwidth-limited use cases. Number of cache missed hits. But in terms of performance and cost, the math is worth adding another cache layer."Rabin Sugumar said.
Amid all these differing opinions, in my view the key factor is Intel, and a few days ago Intel brought some new information.
03 Going to the mainstream?
According to Tom's Hardware, Intel is about to launch a processor code-named Meteor Lake, and unofficial reports that it will carry an L4 cache have been circulating for some time. Now, a new Intel patent discovered by VideoCardz shows that Intel has prepared an L4 cache tile, code-named Adamantine, for use in some CPUs.
"The IC can compete with AMD's 3 DV-Cache in some applications, but the small chip will not be used only as a performance booster."VideoCardz representation.
The patent shows that Intel's Adamantine (or ADM) cache can improve not only communication between the CPU and memory, but also between the CPU and the security controller. For example, the L4 can be used to improve boot optimization and even to retain data in the cache across a reset, shortening loading times.
According to the report, although the patent itself does not mention Meteor Lake, the accompanying image clearly shows a processor with two high-performance Redwood Cove cores and eight power-efficient Crestmont cores, a small graphics chiplet based on Intel Gen 12.7, an SoC tile containing two more Crestmont cores, and a small I/O chiplet, all interconnected with Intel's Foveros 3D technology. The description matches Intel's Meteor Lake processor.
Meanwhile, the Adamantine L4 cache can be used for a wide range of applications beyond Meteor Lake.
Introducing the patent, Intel said the next-generation client SoC architecture could introduce large in-package caches, which would enable new uses.
They believe that an L4 cache (for example "Adamantine" or "ADM") can have a much shorter access time than DRAM and can be used to improve communication between the host CPU and the security controller. Embodiments of the patent enable innovations in boot optimization: having more memory available before initialization, at reset, adds value to high-end chips and may increase revenue. Making memory available at reset also helps eliminate traditional BIOS assumptions and supports modern device use cases (such as automotive in-vehicle infotainment, home and industrial robots, and so on), pushing the product into new market segments.
That said, is the L4 cache finally on its way?