What chip is Jim Keller working on anyway?
Startup Tenstorrent, headed by industry icon Jim Keller, has assembled a first-class team of AI and CPU engineers with ambitious plans for general-purpose processors and AI accelerators.
Currently, the company is developing an eight-wide-decode RISC-V core that can handle both client and HPC workloads; it will be used first in a 128-core high-performance CPU for data centers. The company also has a multi-generation processor roadmap, which we will look at below.
01 Why RISC-V?
We recently spoke to Wei-Han Lien, the chief CPU architect at Tenstorrent, about the company's vision and roadmap. Lien has an impressive background, having worked at NexGen, AMD, P.A. Semi, and Apple; most notable is his work on the CPU microarchitecture and implementation of Apple's A6, A7 (the world's first 64-bit Arm SoC), and M1.
The company has many world-class engineers with extensive experience in x86 and Arm design, so one may ask why Tenstorrent decided to develop a RISC-V CPU, given that this instruction set architecture (ISA) has a data-center software stack that is not as comprehensive as those of x86 and Arm. Tenstorrent's answer is simple: x86 is controlled by AMD and Intel, while Arm is controlled by Arm Holdings, which limits the pace of innovation.
"There are only two major companies in the world that can produce x86 CPUs," says Wei-Han Lien. "Due to x86 licensing restrictions, innovation is basically controlled by one or two companies. When companies become very big, they become bureaucratic, and the pace of innovation [slows]. [...] Arm is somewhat similar. They claim that they are like a RISC-V company, but if you look at their specification, [it] has become so complicated. It is actually somewhat dominated by one architect. [...] Arm is somewhat [wary] of all possible scenarios, even architecture [licensing] partners."
In contrast, RISC-V evolves rapidly. According to Tenstorrent, because it is an open-source ISA, it is easier and faster to innovate with, especially for emerging and rapidly developing AI solutions.
"I was looking for a CPU solution to match [Tenstorrent's] AI solution, and we wanted the BF16 data type, so we went to Arm and said, 'Hey, can you support us?' They said 'no,' [as] it could take two years of internal discussions and discussions with partners and so on," Lien explains. "But we talked to SiFive; they just put it in there. So, no limits: they built it for us, and it is free."
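BF16 is attractive for AI work because it keeps FP32's 8-bit exponent, and therefore FP32's dynamic range, while cutting the mantissa to 7 bits; converting an FP32 value is essentially truncating it to its upper 16 bits. A minimal sketch of that truncation in Python (the helper names are ours for illustration, not part of any Tenstorrent or SiFive API, and rounding modes are ignored):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate an FP32 value to its BF16 bit pattern (the top 16 bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(b: int) -> float:
    """Expand BF16 bits back to FP32 by zero-filling the low 16 mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# 1.0 survives the round trip exactly; pi loses low mantissa bits
# but keeps its magnitude thanks to the shared 8-bit exponent.
assert bf16_bits_to_fp32(fp32_to_bf16_bits(1.0)) == 1.0
approx_pi = bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265))  # 3.140625
```

This range-preserving property is why BF16 displaced FP16 as the favored training format: gradients that would overflow or underflow FP16's 5-bit exponent still fit in BF16.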
On the one hand, Arm Holdings' approach ensures high quality standards and a comprehensive software stack; on the other hand, it means the pace of ISA innovation is slower, which can be a problem for emerging applications such as AI processors that need to evolve rapidly.
02 One microarchitecture, five CPU IPs in a year
Since Tenstorrent aims to address the entire range of AI applications, it needs not only different systems-on-chip or system-in-packages, but also a variety of CPU microarchitecture implementations and system-level architectures to hit different power and performance targets. That is exactly what Wei-Han Lien's team is working on.
The humble consumer-electronics SoC and the powerful server processor have little in common, but they can share the same ISA and microarchitecture (implemented differently). That is where Lien's team comes in. Tenstorrent says its CPU team has developed an out-of-order RISC-V microarchitecture and implemented it in five different ways to address a variety of applications.
Tenstorrent now has five different RISC-V CPU core IPs, with two-, three-, four-, six-, and eight-wide decode, intended for its own processors and for licensing to interested parties. For customers who need a very basic CPU, the company can offer a small two-wide core; for those requiring higher performance in edge, client PC, and high-performance computing applications, it has the six-wide Alastor and eight-wide Ascalon cores.
Each out-of-order, eight-wide-decode Ascalon (RV64ACDHFMV) core has six ALUs, two FPUs, and two 256-bit vector units, making it very powerful. Given that modern x86 designs use either four-wide (Zen 4) or six-wide (Golden Cove) decode, we are looking at a very capable core.
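For a rough sense of per-core vector throughput, two 256-bit vector units give 16 FP32 lanes per cycle, or 32 FP32 operations per cycle if the units are FMA-capable. The FMA assumption is ours; Tenstorrent has not published per-unit details:

```python
# Back-of-the-envelope peak vector throughput for one Ascalon core.
# Assumes FMA-capable vector units (2 FLOPs per lane per cycle),
# which Tenstorrent has not confirmed.
VECTOR_UNITS = 2
VECTOR_WIDTH_BITS = 256
FP32_BITS = 32
FLOPS_PER_LANE = 2  # fused multiply-add counts as two operations

lanes = VECTOR_UNITS * VECTOR_WIDTH_BITS // FP32_BITS  # 16 FP32 lanes
peak_fp32_per_cycle = lanes * FLOPS_PER_LANE           # 32 FLOPs/cycle
```

At a hypothetical 3 GHz, that would be on the order of 100 GFLOPS of FP32 per core; the heavy tensor math is still meant for the Tensix cores, not the CPU's vector units.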
Wei-Han Lien is one of the designers responsible for Apple's "wide" CPU microarchitectures, which can execute up to eight instructions per clock. Apple's A14 and M1 SoCs, for example, have eight-wide high-performance Firestorm CPU cores and, two years after their launch, remain among the most energy-efficient designs in the industry. Lien may be one of the industry's foremost experts in "wide" CPU microarchitectures and, as far as we know, is the first processor designer to lead a team of engineers developing an eight-wide high-performance RISC-V CPU core.
In addition to its various general-purpose RISC-V cores, Tenstorrent also has proprietary Tensix cores tailored for neural network inference and training. Each Tensix core contains five RISC-V cores, an array math unit for tensor operations, a SIMD unit for vector operations, 1MB or 2MB of SRAM, and fixed-function hardware for accelerating network packet operations and compression/decompression. The Tensix core supports a variety of data formats, including BF4, BF8, INT8, FP16, BF16, and even FP64.
03 An impressive roadmap
Currently, Tenstorrent has two products: Grayskull, a machine-learning processor that delivers about 315 INT8 TOPS and plugs into a PCIe Gen4 slot, and Wormhole, a networked ML processor that delivers about 350 INT8 TOPS and uses a GDDR6 memory subsystem, a PCIe Gen4 x16 interface, and 400GbE connectivity to other machines.
Both devices require a host CPU and can be used as add-in boards or in pre-built Tenstorrent servers. A 4U Nebula server with 32 Wormhole ML cards delivers about 12 INT8 POPS at 6 kW.
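The quoted server figure is consistent with simple per-card math: 32 cards at roughly 350 INT8 TOPS each give about 11.2 POPS, which rounds to the "about 12" the company quotes. A quick sanity check:

```python
# Back-of-the-envelope check of the Nebula server throughput quoted above.
CARD_TOPS = 350        # INT8 TOPS per Wormhole card
CARDS_PER_SERVER = 32  # cards in a 4U Nebula server
SERVER_POWER_KW = 6

total_tops = CARD_TOPS * CARDS_PER_SERVER  # 11,200 TOPS
total_pops = total_tops / 1000             # 11.2 POPS, i.e. "about 12"
tops_per_watt = total_tops / (SERVER_POWER_KW * 1000)  # ~1.87 INT8 TOPS/W
```

The roughly 1.9 TOPS/W figure is for the whole server, including host CPUs, networking, and cooling overhead, so the cards themselves run somewhat more efficiently.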
Later this year, the company plans to launch its first standalone CPU+ML solution, Blackhole, which combines 24 SiFive X280 RISC-V cores with multiple third-generation Tensix cores interconnected by two 2D torus networks running in opposite directions. The device will provide 1 INT8 POPS of compute throughput (roughly a threefold improvement over its predecessor), eight channels of GDDR6 memory, 1,200 Gb/s of Ethernet connectivity, and PCIe Gen5 lanes.
In addition, the company plans to add a 2 TB/s die-to-die interface for dual-chip solutions and future use. The chip will be manufactured on a 6nm-class process (we expect TSMC N6, but Tenstorrent has not yet confirmed this), and at 600 mm² it will be smaller than its predecessor, which is made on a 12nm-class node. One thing to remember is that Tenstorrent has not yet finished developing Blackhole, and its final feature set may differ from what the company has revealed so far.
Grendel is the product Tenstorrent plans to release next year: a multi-chiplet solution that pairs an Aegis chiplet, with high-performance Ascalon general-purpose cores built on the company's own eight-wide-decode RISC-V microarchitecture, with one or more chiplets carrying Tensix cores for ML workloads. Depending on business requirements (and the company's finances), Tenstorrent may implement the AI chiplet on a 3nm-class process to take advantage of higher transistor density and newer Tensix cores, or it may continue to use the Blackhole chiplet for AI workloads (even assigning some work to its 24 SiFive X280 cores, the company says). The chiplets will communicate with each other over the aforementioned 2 TB/s interconnect.
The Aegis chiplet, with 128 general-purpose eight-wide RISC-V Ascalon cores organized in four 32-core clusters with inter-cluster coherency, will be manufactured on a 3nm-class process. In fact, the Aegis CPU chiplet will be among the first designs to use a 3nm manufacturing process, which could put the company at the forefront of high-performance CPU design.
Meanwhile, Grendel will use an LPDDR5 memory subsystem, PCIe, and Ethernet connectivity, and it will provide significantly higher inference and training performance than the company's existing solutions. Speaking of Tensix cores, it is important to note that while all of Tenstorrent's AI cores are called Tensix, the cores are actually evolving.
"The [Tensix] changes are gradual, but they do exist," company founder Ljubisa Bajic explains. "[They add] new data formats; change the ratios of FLOPS to SRAM capacity, SRAM bandwidth, and on-chip network bandwidth; and add new sparsity features and general features."
Interestingly, different Tenstorrent slides list different memory subsystems for the Blackhole and Grendel products. This is because the company has been looking for the most efficient memory technology, and because it licenses its DRAM controller and physical interface (PHY), it has some flexibility in choosing the exact memory type. In fact, Lien says Tenstorrent is also developing its own memory controller for future products, but for the 2023-2024 solutions it intends to use third-party MCs and PHYs. For this reason, Tenstorrent does not currently intend to use any exotic memory, such as HBM.
04 Business model: selling solutions and licensing IP
While Tenstorrent has five different CPU IPs (albeit based on the same microarchitecture), its only AI/ML products in the pipeline use either SiFive's X280 or Tenstorrent's eight-wide Ascalon CPU cores (not counting fully configured servers). So it is reasonable to ask why it needs so many CPU core implementations.
The short answer is that Tenstorrent has a unique business model that includes licensing IP (as RTL, hard macros, or even GDS), selling chiplets, selling add-in ML accelerator cards or ML solutions combining CPU and ML chiplets, and selling fully configured servers containing those cards.
Companies that build their own SoCs can license Tenstorrent's RISC-V cores; the broad range of CPU IP enables them to address solutions requiring different levels of performance and power.
Server vendors can use Tenstorrent's Grayskull and Wormhole accelerator cards or its Blackhole and Grendel ML processors to build their machines. Meanwhile, entities that do not want to build hardware at all can buy pre-built Tenstorrent servers and deploy them.
This business model looks somewhat controversial, because in many cases Tenstorrent competes, or will compete, with its own customers. Then again, vendors such as Nvidia offer both add-in cards and pre-built servers based on those cards, and companies like Dell or HPE do not seem to worry much about it, since they offer solutions for specific customers, not just building blocks.
05 Summary
About two years ago, Tenstorrent came into the spotlight with the hiring of Jim Keller. Within two years, the company has recruited a group of first-class engineers who are developing high-performance RISC-V cores for data-center-grade AI/ML solutions and systems. The team's achievements include what may be the world's first eight-wide-decode general-purpose RISC-V CPU core, along with the system-level hardware architecture appropriate for AI and HPC applications.
The company has a comprehensive roadmap, including high-performance CPU chiplets based on RISC-V and advanced AI accelerator chiplets, which promise to provide powerful solutions for machine learning. Keep in mind that AI and HPC are major trends promising explosive growth, and offering both AI accelerators and high-performance CPU cores looks like a very flexible business model.
The AI and HPC markets are highly competitive, so to compete with established players (AMD, Intel, Nvidia) and emerging ones (Cerebras, Graphcore), you have to hire some of the world's best engineers. Like the big chip developers, Tenstorrent has both its own general-purpose CPU and its own AI/ML accelerator hardware, which is a unique advantage. At the same time, because the company uses the RISC-V ISA, it currently cannot address some markets and workloads, at least on the CPU side.