Lunar Lake Overview: In-Depth With Lead Architect and Design Mgr. | Talking Tech | Intel Technology

Captions
- Hi, and welcome to "Talking Tech". I'm your host, Alejandro Hoyos, and I had the great pleasure to meet Arik and Yaron, who are the lead engineers who have been working on Intel's latest processor, Lunar Lake. (bright music) - So, I'm a Lunar Lake lead architect. I'm actually managing the SoC architecture team in IDC that works on the SoC definition for the client products, mobile as well as desktop products. - I'm today the general manager of the client organization in Intel. I was also the Lunar Lake design manager. And this is what we own: creating the good products for the PC market. - So let's start talking about Lunar Lake and what you guys had in mind when you were first designing it. So for example, when we were doing Raptor Lake and Alder Lake, we were thinking, hey, these two processor architectures are gonna span, like, mobile and desktop. What was the thought behind it when you were designing Lunar Lake? - When we started to design Lunar Lake, we had in mind that we needed to be more power efficient. And we also wanted to target the low power envelope, the low power segment. And we said, let's think out of the box. What do we do to reduce power? With this in mind, we looked at all the vectors: how to achieve power efficiency, how to reduce the power, how to optimize, how to build a customized product, and not just try to scale it up to all the segments, like from desktop to the, I don't know, 65 watt to the 8 watt. So this is what we wanted to achieve, and we worked it out, you know, to make it happen. - Usually, you know, when I did Alder Lake, back then we started with desktop and then we scaled down to the smaller, low power parts. But in Lunar Lake it was different. We wanted to aim at a specific power envelope. We started from fanless, then a bit more, like higher power envelopes. And it was a dedicated part for this segment. And we wanted to make it best in class. - We wanted to build a premium, low power optimized product.
- What were the goals? So you said a premium low power product. When you sit down with the fellow architects and designers, like, all right, these are kind of our goals, then what do we have to do to reach that goal? - So first of all, you know, it's to identify every transistor that is not needed and remove it. That's the high level. And not to have a superset product. So it's really customized. Usually we're not doing that. So we did it in Lunar Lake. - First, we said it's a premium product. So everything there is premium. We put in the best graphics, we put in the best AI engine, the NPU. Our cores will be optimized for this specific usage, specifically for light usages. And we'll have a significant improvement in power consumption, which will lead to much better battery life. And we want all of it to be working together as a single piece. Together with that, we added memory on package to have the best DRAM power consumption, as well as a luxury in the power delivery rails with new PMICs on the platform, on the motherboard. So they can deliver the best rails to supply the chip. So, it'll be very power efficient. - You added the DRAM in there to keep the power lower. Why is it lower when you have it on package? - The footprint is much smaller. So the lines are very short, and then you consume much lower power because you route a much smaller distance. - So smaller distances, that means less power to send a signal. - And it's also optimized just for this specific DRAM type. Yeah, so the PHY could be much better. - Okay, so it's a specific RAM that we're optimizing for. - Correct. - Basically, with this memory on package, we could achieve a 40% PHY power reduction. And the short traces are also able to meet the higher frequency with less effort, let's say. - Yeah, yeah, that makes sense. So the PHY is the physical layer in which it actually transmits.
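The trace-length argument can be sketched with a first-order dynamic-power model (P = alpha * C * V^2 * f, with trace capacitance proportional to length). All numbers here are illustrative assumptions, not Lunar Lake figures:

```python
# Hedged sketch: first-order dynamic-power model for a DRAM interface,
# illustrating why shorter on-package traces cut PHY/IO power.
# Capacitance-per-mm, voltage, and trace lengths are assumptions for
# illustration only, not Lunar Lake specifications.

def io_dynamic_power(activity, cap_pf_per_mm, trace_mm, v, freq_hz, lanes):
    """Dynamic power in watts: alpha * C * V^2 * f, summed over lanes."""
    cap_f = cap_pf_per_mm * trace_mm * 1e-12  # total trace capacitance (F)
    return activity * cap_f * v * v * freq_hz * lanes

# Motherboard-routed DRAM vs. memory-on-package (assumed trace lengths).
p_board = io_dynamic_power(0.5, 0.15, 40.0, 1.1, 4.25e9, 64)  # ~40 mm traces
p_mop   = io_dynamic_power(0.5, 0.15, 5.0,  1.1, 4.25e9, 64)  # ~5 mm traces

savings = 1 - p_mop / p_board  # fraction of dynamic IO power avoided
```

A naive linear model overstates the saving (87.5% here for 40 mm vs. 5 mm traces) because real PHYs also spend fixed power on clocking, termination, and logic, which is consistent with the more conservative 40% figure quoted above.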
And so saving 40%, and like you said, because you have shorter traces, you have higher speed. So what speeds are we able to reach now on Lunar Lake? - [Yaron] We can reach up to 8.5 gigatransfers per second, which is an impressive number. - That is. Yeah, that is pretty impressive. So for Meteor Lake, we had four different tiles. What does it look like for Lunar Lake? How many tiles do we have in Lunar Lake? - In Lunar Lake, we have two tiles. The big one is the CPU tile, and we have the PCT, the platform controller tile. So we tried to reduce the number of tiles, of course, to increase the power efficiency. - The compute tile puts together all the XPUs, all the compute engines. So the cores, the graphics, the NPU, the imaging, and display and media as well. And together with that, the memory components, the memory subsystem: the caches, the memory controllers, the DRAM PHY. All of it together makes it very optimized and efficient. - Before we dive into each one of those blocks: on Meteor Lake we had those spread across different tiles. What was the reasoning for bringing all those back to just one compute tile? - Yeah, first we wanted to put all the elements on a leading process. It was very important for us. We wanted to get the best from those transistors. This means they will be optimized in power and will get the best speeds. And the second was to optimize latency. Latency is important for performance, and therefore if they all reside together, we can get the best latency we possibly can get. - So let's talk about the different blocks and the different speeds they have in there. So we have, I guess just like we had before, performance cores and E-cores. Let's talk about the performance cores. How many are there? Have they changed? - Yeah, we have four P-cores, four performance cores. It's the new Lion Cove core. The big thing there is that it's a single thread core. It has only one thread. But it's improved in its IPC.
It also improved its performance per watt and per area because it's a single thread core. It has a bigger cache, and it also resides inside the LLC structure, the same structure as we had in the past, which boosts it even more in terms of latency and performance. Overall, a very powerful unit. - And the reason, to see if I understand correctly, the reason that it is just single thread is because we were designing for better PnP, power and performance, right? - That's correct. And when you look at the evolution of the hybrid architecture, we started in Alder Lake with a performance hybrid architecture. Now it's a bit different. The hybrid evolution led us to the thought that we don't really need the single thread, sorry, the SMT support, the hyper-threading support, inside the P-core, because we actually scale in multi-thread using more and more E-cores as we go up in the power envelopes. And therefore it makes sense to go with a single thread performance core that is optimized, while if you want to scale to more threads and have higher multi-thread performance, you just add more E-cores. And this hybrid architecture is a bit different in the sense that the efficient cores, maybe we'll talk about them, are much more performance capable. - Right, right. That's actually what I wanna touch on. So the fact that they're called efficient cores doesn't mean that they're less powerful than the P-cores. Actually, there are a lot of changes that you could probably let us in on. - So you know, efficiency is the amount of performance you can have per power envelope. So performance per watt and efficiency are somewhat the same. And yeah, indeed, the efficient cores are increasing their performance and they're also improving their power, and therefore the performance per watt is better. When you have a power limitation, if you are better in efficiency, if you are better in performance per watt, the absolute performance that you can get to in a specific power envelope is higher.
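That power-envelope argument can be sketched numerically. The two perf-vs-power curves below are invented for illustration (they are not measured Lunar Lake data); the point is only that under a small power budget the core with better performance per watt delivers more absolute performance:

```python
# Hedged sketch: under a fixed power budget, the core with better
# performance-per-watt wins on absolute performance. Both curves are
# made-up illustrations with diminishing returns, not Intel measurements.

def perf_e(power_w):
    """Assumed E-core-style curve: strong at low power."""
    return 3.0 * power_w ** 0.5

def perf_p(power_w):
    """Assumed P-core-style curve: scales further but costs more power."""
    return 2.0 * power_w ** 0.7

low_budget, high_budget = 2.0, 20.0   # watts, illustrative budgets
e_wins_low  = perf_e(low_budget)  > perf_p(low_budget)   # E-core wins at 2 W
p_wins_high = perf_p(high_budget) > perf_e(high_budget)  # P-core wins at 20 W
```

With these made-up curves the crossover sits near 7.6 W: below it the efficient-core curve delivers more performance, above it the performance-core curve does.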
And this is why they are efficient cores. They're also outperforming the P-cores if you go to a low power envelope. And that's something that we actually use quite a lot in Lunar Lake. We want most of the usages to be handled by the efficient cores, because you usually need casual types of performance, whereas if you really need some intense workload performance, then you switch to the P-cores and you get those peak performance numbers. - And can you tell us a little bit more about some of the changes? Because I know on Meteor Lake we had the E-cores, but there were some on the compute tile and there were also some other ones on the SoC tile, on the low power island, and there have been some changes to that. - Right, so Meteor Lake had three types of cores. In the compute tile there were the P-cores, the performance cores. There were the E-cores, also residing inside the compute tile together with them, on the same caching structure as well as the fabric. And there was the low power island, which is yet another flavor of the E-cores, which were aimed for low power. We actually changed the concept a bit. First, we only have performance cores residing on the big cache structure with the ring. And second, we took the low power island and we improved it to still work at low power, but to also scale to high power envelopes and be much more efficient. We did a couple of things. We changed the structure from two cores to four cores. We increased the cache from two megabytes to four megabytes. We added a memory side cache, which contributes a lot to the instructions per cycle of the E-core, because it's yet another cache that the E-core can use. And we added a dedicated power delivery, so you can shift the voltage and frequency of the E-cores separately from everything else: separately from the other cores, or from the GPU, or NPU, or the memory subsystem. And the last thing is that it's on a leading node, so that gives value as well.
So all of those things make the E-cluster, as we call it today in Lunar Lake, which was previously the low power island, a much more efficient structure. - Okay, so what is the main idea of separating the performance cluster and the efficient cluster? - The main idea of separating them is that in Lunar Lake, we created a strong efficient cluster, doubling the frequency, doubling the number of cores, and also putting the compute tile on a leading-edge process. And once you have this strong, efficient cluster, you can run a lot of applications on this cluster, which is connected directly to the interconnect, what we call the network on chip. And on the other side you have the P-core cluster, which is connected to the ring, which is more power consuming, that you can use for more performance-demanding applications. So, the main idea is to try to run most of the applications on the efficient cluster and save the power of the P-core cluster, and only when you need performance do you activate the P-core cluster. And this is one of the big things that we did in Lunar Lake. - And I guess that's one of the reasons why the micro-architecture on the E-cluster was completely revamped and there were so many changes done. - Correct. - That was so you can have more workloads being run on them rather than having to go and activate the P-core. - Correct. Every time you activate the P-core, you consume the ring fabric and its power. So basically, you need to do software optimization to make it efficient, but once you have the good recipe, you know how to contain tasks to the E-core cluster. And the performance-demanding tasks will go to the performance cluster. - So besides the L2 cache within the E-cluster, that memory side cache is completely new? - Yes. - And the reason, if I understand correctly, that it saves power is because we have a bigger cache.
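The saving from that cache can be sketched with a simple hit/miss energy model: every hit in the memory side cache replaces a DRAM access with a cheaper on-package access. The per-access energies and hit rates below are illustrative assumptions only:

```python
# Hedged sketch: a bigger cache raises the hit rate, and each hit avoids a
# DRAM access. The 20 pJ / 100 pJ per-access energies and the hit rates are
# invented for illustration, not Lunar Lake measurements.

def memory_energy_pj(accesses, hit_rate, cache_pj=20.0, dram_pj=100.0):
    """Total access energy: hits pay the cache cost, misses pay cache + DRAM."""
    hits = accesses * hit_rate
    misses = accesses - hits
    return hits * cache_pj + misses * (cache_pj + dram_pj)

# Same workload (1M accesses), two assumed hit rates.
small_cache = memory_energy_pj(1_000_000, hit_rate=0.3)
big_cache   = memory_energy_pj(1_000_000, hit_rate=0.6)
```

With these assumed numbers, doubling the hit rate cuts total memory access energy by a third, on top of the latency benefit of staying on-package.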
So it saves us from having to go to the DDR anymore to fetch the data. - Correct. So similar to other caches. This is a memory side cache, so it works a bit differently, but it does similar things. So, first, it's close to the silicon, so the latency is much better. So you get more performance. And second, by accessing this cache, you do not go out to the DRAM. So you also save power. - In Lunar Lake we did this for the first time. You know, we have the hardware, we have the capabilities; it requires software optimization, of course. So, this is something that we see benefit from. - And can it only be accessed by the E-cores, or can it also be accessed by other blocks? - No, it can be accessed by other blocks, and actually it's used by other blocks, like the other engines, like the AI engines and the media engines. We do leverage it for, you know, buffering and reducing the accesses to the memory, as I said. - It's accessible by any agent in the system. It's not equally valuable for every agent in the system. Therefore, we are doing cache way allocation, dynamic cache way allocation, in which we can put or allocate some ways for a specific engine when we find it beneficial. So we allocate some of the ways to the Atom cores, to the E-core cluster. We allocate some for the P-cores, we allocate some for media, and it's dynamic. It does contribute to the E-core cluster more than it contributes to the P-core cluster. And it does also contribute to some devices as well. - That's pretty amazing, because you have to do all this traffic controlling, like, okay, who's gonna access the memory and who's gonna be able to fetch and retrieve the data? So, that seems like it's very complex, which is pretty neat. I just... - Yeah, it is. It is what we did for the first time. It is complex, and again, we are working to improve it as we go. - You were mentioning, when we were talking about the E-core cluster, that you can actually turn it on or off.
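The dynamic cache way allocation described here can be sketched as a proportional split of the cache's ways among the agents that benefit. The way count and benefit weights are hypothetical, not Intel's actual allocation policy:

```python
# Hedged sketch: partition a set-associative cache's ways among agents
# proportionally to how much each benefits. TOTAL_WAYS and the weights are
# hypothetical illustration values, not the real Lunar Lake configuration.
TOTAL_WAYS = 8

def allocate_ways(benefit_weights):
    """Split TOTAL_WAYS proportionally to each agent's benefit weight."""
    total = sum(benefit_weights.values())
    return {agent: max(1, round(TOTAL_WAYS * w / total))
            for agent, w in benefit_weights.items()}

# E-cluster weighted highest: as noted above, the cache contributes more to
# the E-core cluster (which sits off the big ring LLC) than to the P-cores.
ways = allocate_ways({"e_cores": 4, "p_cores": 2, "media": 1, "npu": 1})
```

In hardware the weights would be updated dynamically from observed traffic, which is the "traffic controlling" complexity the interviewer remarks on.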
So does that mean that it has, like, its own power rail? - Yeah, as I said, one of the changes that we made is that it has an independent power rail, and this is what the PMIC actually allows us to do. So, you are connecting a completely separate rail, it's only for the E-core cluster, and then you can turn it off completely, including its cache, while the P-cores can work. Same goes vice versa. You can completely turn off the P-core cluster and let the E-cores work efficiently. So you do not need to pay for the leakage or the power of each of the blocks while you're working. You can also flush those caches on each side, and therefore it's more efficient. - So how many independent voltage rails do we have there? - Oh, we have quite many. I think it's a record in Lunar Lake. - Okay. - We have, like, more than 15. - But the major ones, I think, because there are four PMICs, right? So which rails do those control? - So, the big ones are: we have the P-core one, we have the E-core one, we have the system agent one, and we have the graphics one. Those are the four big ones. Besides that, we have plenty of others. We have also added variable rails, not only fixed rails. So for example, the Atom rail is variable. We added some variable rails for the DRAM itself. We wanted to make everything efficient. So the PMIC allowed us to add those capabilities. Therefore, it's premium. - These PMICs allow us better resolution, better accuracy in the voltage supply, and also give us the flexibility to have a luxury of rails. So, we're also using the telemetry indication to control them and consume power more efficiently. So, basically this is why we use the PMICs: again, for a premium product, power efficient, energy efficient, that is the main purpose. - And who, you said that from the telemetry, who's providing the telemetry, or who's controlling... - The PMIC itself.
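The independent-rail idea can be sketched as a small model: each cluster sits on its own PMIC rail, its caches are flushed before the rail is gated, and one cluster can be powered off while the other keeps running. This is an illustrative model of the behavior described, not firmware code:

```python
# Hedged sketch: per-cluster power rails with cache flush before gating.
# Rail names mirror the four big rails mentioned above; the class itself is
# an illustration, not Intel's power-management implementation.

class Rail:
    def __init__(self, name):
        self.name = name
        self.on = True          # rail energized: pays leakage + dynamic power
        self.cache_dirty = True # assume the local caches hold dirty lines

    def flush_cache(self):
        # Write dirty lines back to DRAM so power can be cut safely.
        self.cache_dirty = False

    def power_off(self):
        if self.cache_dirty:
            self.flush_cache()
        self.on = False         # no more leakage or dynamic power on this rail

rails = {n: Rail(n) for n in ("p_cluster", "e_cluster", "system_agent", "gfx")}
rails["p_cluster"].power_off()  # gate the P rail; the E-cluster keeps working
```

The same sequence works in the other direction (gate the E-cluster rail while the P-cores run), which is the "vice versa" case in the answer above.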
The PMIC chips support telemetry that we read and use for better control of the power. - And that way, with four PMICs for four different power rails, you can actually turn them on and off, or provide more accurate or different voltages, so you can have better power savings. - Correct. - Let's change gears here a little bit. Let's talk about the changes that were done on the graphics side. What is new when it comes to the graphics side for Lunar Lake? - Yeah, first, it's a big graphics engine. Okay? So it gives value by allowing creators and gamers a much better user experience. - So, our GPU also has a new micro-architecture, Xe2. Xe2 brings a significant performance increase and also improves our AI capabilities. It integrates XMX engines, which take care of the AI usages. So overall, you know, a performance increase, scaling to AI, and a new micro-architecture. - So, we have the graphics side. And you said we also have a new media engine and display engine. Can you tell us a little bit more about that? - Yeah, so actually the media engine and the display engine are part of the Xe2 architecture. It integrates HDMI 2.1 as well as DP 2.1. It also has a low power eDP 1.5 channel, which enables good output as well as a low power output. - Do you know how many display pipelines it has? Like, how many monitors? - We have three display pipelines. And actually I didn't mention it, but we are also supporting H.266 VVC decoding, which is a major upgrade in our decode capabilities. - And we still support, because in Meteor Lake, we were supporting AV1 encode and decode. - Correct. We still support it, yes. - Okay, that's great. So that was pretty cool when it comes to the graphics side. Let's talk about the NPU, the neural processing unit. What's new in there? - Yeah, it's a major upgrade. We actually increased the number of tiles. We have six neural processing engines and 12 SHAVE DSPs.
Both of them improve AI acceleration significantly, with nine megabytes of cache, and the architecture actually improves power efficiency significantly. So the big claim for the NPU is that it gets a lot of TOPS at very low power, so it's very efficient. And you know, by increasing the number of engines, we can run it lower for cases in which we are running background AI tasks or don't need high throughput, and therefore it's very efficient. - Okay, so we increased the number of neural core engines and therefore we can do all those things. - Correct. - Okay. So those are all the different engines that we have in the compute tile. What can we find in the platform control tile? - So the platform control tile we can divide into two main functions: security and connectivity. - Yeah, we have three engines for security. We have the Intel Partner Security Engine, which is a new engine. We have the Intel Silicon Security Engine, which is responsible for all the authentication stuff. And we have the CSME, which is the legacy engine, which is used for securing data from power-up. - And from a connectivity perspective, we have integrated Wi-Fi 7; this is the first time we're integrating Wi-Fi 7 inside the chip. We have USB4 and USB3 ports. We have PCIe Gen5 and PCIe Gen4 ports. And we have Bluetooth and Thunderbolt and some other enhancements that we did. But basically these are the main functions of the PCT. - So we have the compute tile, the platform controller tile, and we have the DRAM, and they're all on the same package, and they're connected through... - Correct. - Through Foveros, or? - Yeah, so we have the compute tile and the PCT tile on a single Foveros chip, and we have the DRAM packaged on the same package with them. - So compared to Meteor Lake, there was also a network on chip, right? That's the name.
There have been a lot of changes in Lunar Lake that you were talking about, and one of those changes was that now there's, like, a unified protocol that it can actually talk? Was that different on Meteor Lake? - So in Lunar Lake we made a lot of changes in the interconnect. We created a unified protocol and separated the network layer and the protocol layer. We also improved the interconnect power efficiency. And the second thing we did is what we call extended scalability. We have the ability to move IPs between dies. This is also something that we created in Lunar Lake. We have this flexibility. The idea is that you can scale this architectural goodness to the next generation, regardless of the cutline between the dies. So you have fabric in the CPU die, you have fabric in the PCT die, but in general you can switch IPs between them, because these are the same fabrics. And this is something that we call extended scalability. It's like, you know, for other segments, for other purposes, for a different mix and match between process and cost, you can have this flexibility to scale IPs. And by the way, we are taking Lunar Lake goodness to the next generations. - That's pretty cool. Because that means that you're not attached to one tile. You can move that block or that IP to a different tile, and depending on the process, so you're not tied to one process, you can just move it around. It's like a Lego building block. Just like, oh, let me move it here, put it over there. - Yeah, so you can think about it as IP agnostic and partitioning agnostic. - Yaron, thank you very much. Appreciate it. Learned a lot today. - You're welcome. - Arik, I learned a lot today. Thank you so much. I appreciate it, man. This has been great. - Thank you very much. (upbeat music)
Info
Channel: Intel Technology
Views: 17,998
Id: hmlDCAiD1bA
Length: 23min 41sec (1421 seconds)
Published: Tue Jun 04 2024