- Hi, and welcome to "Talking Tech". I'm your host Alejandro Hoyos, and I had the great pleasure
of meeting Arik and Yaron, the lead engineers who have been working on
Intel's latest processor, Lunar Lake. (bright music) - So, I'm a Lunar Lake lead architect. I'm actually managing the
SoC architecture team in IDC that works on the SoC definition
for the client products, mobile as well as desktop products. - Today I'm the general manager of the client organization in Intel. I was also the Lunar Lake design manager. And this is what we do: we are creating great
products for the PC market. - So let's start talking about Lunar Lake and what you guys had in mind when you were first designing it. So for example, when we were doing Raptor
Lake and Alder Lake, we were thinking, like, hey, these two processor
architectures are gonna span mobile and desktop. What was the thought when you were designing Lunar Lake? - When we started to design Lunar Lake, we had in mind that we need
to be more power efficient. And we also wanted to target to the low power envelope,
to the low power segment. And we said, let's think out of the box: what do we do to reduce power? With this in mind, we looked
at all the vectors, how to improve power efficiency,
how to reduce the power, how to optimize, how to build
a customized product, and not just try to scale
it up across all the segments, from desktop at, I don't know, 65 watts down to 8 watts. So this is what we wanted to achieve and we worked, you
know, to make it happen. - Usually, you know, when I did Alder Lake back then we started with desktop and then we scaled down to
the smaller, low power parts. But in Lunar Lake it was different. We wanted to aim at a specific power envelope. We started from fanless, then a bit more, like
higher power envelopes. And it was a dedicated
part for this segment. And we wanted to make it best in class. - We wanted to build a premium,
low power optimized product. - What were the goals? So you said a premium low power product. What kind of things did you discuss when you sat down with the
fellow architects and designers, like, all right, these are kind of our goals, then what do we have to
do to reach that goal? - So first of all, you know, it's to identify every transistor that is not needed and remove it. That's the high level. And not to have a superset product. So it's really customized.
Usually we're not doing that. So we did it in Lunar Lake. - First, we said it's a premium product. So everything there is premium. We put the best graphics, we put the best AI engine, the NPU. Our cores will be optimized
for this specific usage, specifically for light usages. And we'll have a significant improvement in power consumption, which will lead to much
better battery life. And we want all of it
to be working together as a single piece. Together with that we
added memory on package to have the best DRAM power consumption, as well as a luxury of
power delivery rails with new PMICs on the
platform, on the motherboard, so they can deliver the best rails to supply the chip. So, it'll be very power efficient. - You added the DRAM in there
to keep the power lower. Why is it lower when
you have it on package? - The footprint is much smaller. So the lines are very short and then you consume much lower power because you need to route
a much shorter distance. - So smaller distances, that means less power to
send a signal and reach that. - And it's also optimized just
for these specific DRAM types. Yeah, so the PHY could be much better. - Okay, so it's a specific RAM
that we're optimizing for. - Correct. - Basically with this memory on package, we could achieve 40% PHY power reduction. And also the short traces are then able to meet the higher frequency
with less effort, let's say. - Yeah, yeah, that makes sense. So PHY is the physical layer
in which it actually transmits. And so saving 40%, and like you said, because you have shorter
traces, you have higher speed. So what speeds are we able
to reach now on Lunar Lake? - [Yaron] We can reach up to
8.5 gigatransfers per second, which is an impressive number. - That is. Yeah, that
is pretty impressive. So for Meteor Lake, we
had four different tiles. How does it look for Lunar Lake? How many tiles do we have in Lunar Lake? - In Lunar Lake, we have two tiles. The big one is the compute tile
and we have the PCT, the platform controller tile. So we try to reduce the
number of tiles, of course, to increase the power efficiency. - The compute tile puts
together all the XPUs, all the compute engines. So the cores, it has the
graphics, it has the NPU, it has the imaging, it has
display and media as well. And together with that,
the memory components, the memory subsystem. The caches, the memory
controllers, the DRAM PHY, all of it together makes it
very optimized and efficient. - Before we kind of dive into
each one of those blocks, on Meteor Lake we had those
spread across different tiles. What was the reasoning behind
bringing all those back to just one compute tile? - Yeah, first we wanted
to put all the elements on a leading process. It was very important for us. We wanted to get the best
for those transistors. This means that they will
be optimized in power, and they will get the best speeds. And the second was to optimize latency. Latency is important for performance, and therefore if they all reside together, we can get the best latency
we possibly can get. - So let's talk about the different blocks and the different speeds
they have in there. So we have, I guess
just like we had before, we have performance cores
and we have E-cores. Let's talk about the performance cores. How many are they? Have they changed? - Yeah, we have four P-cores,
four performance cores. It's a new Lion Cove core. The big thing there is that
it's a single thread core. It has only one thread. But it's improved in its IPC. It also improved its
performance per watt per area because it's a single thread core. It has a bigger cache and it also resides
inside the LLC structure, same structure as we had in the past, which boosted even more in terms
of latency and performance. Overall, a very powerful unit. - And the reason, to see if I understand correctly, the reason that it is just single thread is because you were designing for better PnP, power and performance, right? - That's correct. And when you look at the evolution of the hybrid architecture, we started in Alder Lake with a performance hybrid architecture. Now it's a bit different. The hybrid evolution led us to the thought that we don't really
need the single thread, sorry, the SMT support, the hyper-threading
support inside the P-core, because we actually scale in multi-thread using more and more E-cores as we go up in the power envelopes. And therefore it makes sense to go with a single thread performance
core that is optimized, while if you want to scale to more threads and have higher multi-thread performance, you just add more E-cores. And this hybrid architecture
is a bit different in a sense that the efficient cores, maybe we'll talk about them, are much more performance capable. - Right, right. That's
actually what I wanna touch on. So just because they're called efficient cores doesn't mean that they're less
powerful than the P-cores. Actually, there's a lot of changes that you could probably let us in on. - So you know, efficiency
is the amount of performance you can have per power envelope. So performance per watt and
efficiency is somewhat the same. And yeah, indeed, the efficient cores are
increasing their performance and they're also improving their power, and therefore the performance
per watt is better. When you have a power limitation, if you are better in efficiency, if you are better in performance per watt, your absolute performance
that you can get to in a specific power envelope is higher. And this is why they are efficient cores. They're also outperforming the P-cores if you go to a low power envelope. And that's something that we actually use quite a lot in Lunar Lake. We want most of the usages to
be handled by the efficient cores because you usually need
casual type of performance, whereas if you really need some
intense workload performance, then you switch to the P-cores and you get those peak
performance numbers. - And can you tell us a little bit more about some of the changes? Because I know on Meteor
Lake we had the E-cores, but there were some on the compute tile and there were also some
other ones on the SoC tile, on the low power island, and there have been some changes to that. - Right, so Meteor Lake
had three types of cores. In the compute tile they had the P-cores, the performance cores. The E-cores also resided inside the compute tile together with them, on the same caching structure
as well as the fabric. And they had the low power island, which is yet another flavor of the E-cores, aimed at low power. We actually changed the concept a bit. First, we only have performance cores residing on the big cache
structure with the ring. And the second, we took
the low power island and we improved it to
still work in low power, but to also scale to high power envelopes and be much more efficient. We did a couple of things. We changed the structure
from two cores to four cores. We increased the cache from
two megabytes to four megabytes. We added a memory side cache, which contributes a lot to
the instructions per cycle of the E-core because
it's yet another cache that the E-core can use. And we added a dedicated power delivery so you could shift the
voltage and frequency of the E-cores separately from everything, separately from the other cores, or from the GPU, or NPU,
or the memory subsystem. And the last thing is that
it's on a leading node, so it gives value as well. So all of those things
make the E-cluster, what we call it today in Lunar Lake, which was previously the low power island, a much more efficient structure. - Okay, so what is the main idea of separating the performance cluster and the efficient cluster? - The main idea of separating
them is that in Lunar Lake, we created a strong efficient cluster, doubling the frequency, doubling the number of cores, and also putting the compute tile on a leading-edge process. And once you have this
strong, efficient cluster, you can run a lot of
applications on this cluster, which is connected directly
to the interconnect, what we call the network on a chip. And on the other side you
have the P-core cluster, which is connected to the ring, which is more power consuming, and which you can use for more
performance demanding applications. So, the main idea is to try to run most of the applications
on the efficient cluster and save the power of the P-core cluster and only when you need performance you will activate the P-core cluster. And this is one of the big
things that we did in Lunar Lake. - And I guess that's
kinda one of the reasons why the micro-architecture
on the E-cluster was completely revamped and
there were so many changes done. - Correct. - That was so you can have more
workloads being run on them than having to go and activate the P-core. - Correct. Every time you activate the P-core, you consume the ring fabric and the power. So basically, you need
to do software optimization, optimization
to make it efficient, but once you have a good recipe, you know how to control it, how to contain
tasks to the E-core cluster. And the performance demanding tasks will go to the performance cluster. - So besides the L2
cache within the E-cluster, that memory side cache is completely new? - Yes. - And the reason, if I
understand correctly, that it saves power is because
we have a bigger cache. So it saves us, we wouldn't have to go to the DDR anymore and fetch the data. - Correct. So similar to other caches, this is a memory side cache, so it works a bit differently, but it does similar things. So, first it's close to the silicon, so the latency is much better. So you get more performance. And the second is, by accessing this cache, you do not go out to the DRAM. So you also save power. - In Lunar Lake we did it for the first time, you know; we have the hardware, we have the capabilities, it requires a software
optimization, of course. So, this is something that
we see benefit from. - And can it only be
accessed by the E-cores, or can it also be
accessed by other blocks? - No, it can be accessed by other blocks, and actually it's used by other blocks, like the other engines, like the AI engines and the media engines. We do leverage it for, you know, buffering and reducing the
accesses to the memory, as I said. - It's accessible by
any agent in the system. But it's not equally valuable for
every agent in the system. Therefore, we are doing
some cache ways allocation, dynamic cache ways allocation, in which we can put or allocate some ways for a specific engine when
we find it beneficial. So we allocate some of the
ways to the Atom cores, or to the E-core cluster. We allocate some for the P-cores, we allocate some for
media, and it's dynamic. It does contribute to the E-core cluster more than it contributes
to the P-core cluster. And it does also contribute
to some devices as well. - That's pretty amazing, because you have to do all
this traffic controlling, like traffic of, okay, who's gonna access the memory and who's gonna be able to fetch
data and retrieve the data? So, that seems like it's very complex, which is pretty neat. I just...
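The dynamic way allocation described here can be sketched in miniature. This is an illustrative toy model, not Intel's implementation; the `WayAllocator` class, the agent names, and the way counts are invented for the example:

```python
# Illustrative toy model of dynamic cache way allocation, as described
# above. Not Intel's implementation: the class, agent names, and way
# counts are invented for this sketch.

class WayAllocator:
    """Partition the ways of a shared cache among requesting agents."""

    def __init__(self, total_ways):
        self.total_ways = total_ways
        self.allocations = {}  # agent name -> ways currently held

    def free_ways(self):
        return self.total_ways - sum(self.allocations.values())

    def allocate(self, agent, ways):
        """Grant `agent` up to `ways` ways; return how many were granted."""
        granted = min(ways, self.free_ways())
        self.allocations[agent] = self.allocations.get(agent, 0) + granted
        return granted

    def release(self, agent):
        """Reclaim an agent's ways, e.g. when its workload goes idle."""
        return self.allocations.pop(agent, 0)


# A hypothetical 8-way memory side cache shared by several agents:
cache = WayAllocator(total_ways=8)
cache.allocate("e_cluster", 4)  # background work favors the E-cluster
cache.allocate("media", 2)      # buffering for the media engine
cache.allocate("p_cores", 4)    # only 2 ways remain, so only 2 granted
```

Because the allocation is dynamic, releasing one agent's ways immediately makes them available to another.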
- Yeah, it is. It is what we did for the first time. It is complex, and again, we are working
to improve it as we go. - You were mentioning when we were talking about
the E-core cluster, that you can actually turn it on or off. So does that mean that it
has like, its own power rail? - Yeah, as I said, one of
the changes that we did is that it has an independent power rail and this is what actually
the PMIC allows us to do. So, you are connecting a
completely separate rail, it's only for the E-core cluster, then you turn it off
completely, including its cache, while the P-core can work. Same goes vice versa. You can completely turn
off the P-core cluster and let the E-core work efficiently. So you do not need to pay for the leakage or the power of each of the
blocks while you're working. You also can flush those
caches on each side and therefore it's more efficient. - So how many independent voltage rails do we have there? - Oh, we have quite many. I think it's a record in Lunar Lake. - Okay.
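The benefit of those independent rails can be illustrated with a toy power model. The cluster names and wattages below are made up for the sketch, not Lunar Lake data; the point is that a rail that can be switched off entirely eliminates even the leakage of an idle cluster:

```python
# Toy model of independent power rails (made-up numbers, not Lunar Lake
# data): a cluster whose rail can be switched off entirely stops paying
# even leakage power while the other cluster keeps working.

LEAKAGE_W = {"p_cluster": 0.5, "e_cluster": 0.2}  # idle leakage, watts
ACTIVE_W = {"p_cluster": 6.0, "e_cluster": 1.5}   # active power, watts

def package_power(active, gated):
    """Sum cluster power: active clusters pay full power, powered-but-idle
    clusters pay leakage, and gated clusters (rail off) pay nothing."""
    total = 0.0
    for cluster in LEAKAGE_W:
        if cluster in gated:
            continue                     # rail switched off: no leakage
        if cluster in active:
            total += ACTIVE_W[cluster]   # running a workload
        else:
            total += LEAKAGE_W[cluster]  # powered but idle
    return total

# Light workload on the E-cluster only:
shared_rail = package_power(active={"e_cluster"}, gated=set())
separate_rail = package_power(active={"e_cluster"}, gated={"p_cluster"})
# With its own rail, the idle P-cluster's leakage is eliminated entirely.
```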
- We have like, more than 15. - But the major ones, I think, because there's four PMICs, right? So which rails do those control? - So, the big ones are, we have the P-core, we
have the E-core one, we have the system agent one, and we have the graphics. Those are the four big ones. Except for that we have plenty of others. We have also added variable
rails, not only fixed rails. So for example, the Atom rail is variable. We added some variable
rails for the DRAM itself. We wanted to make everything efficient. So the PMIC allowed us to
add those capabilities. Therefore, it's premium. - These PMICs allow us
better resolution, better accuracy in the voltage supply, and also give us the flexibility
to have a luxury of rails. So, we're also using
the telemetry indication to control them and consume
power more efficiently. So, basically this is why we use the PMIC, again, for a premium product, power efficient product, energy efficient, that is the main purpose. - And who, you said
that from the telemetry, who's providing the telemetry,
or who's controlling- - The PMIC itself. The PMIC chips support
telemetry that we read and use for controlling
the power with better resolution. - And that way, so four PMICs for four
different power rails, so you can actually turn on and off, or provide more accurate
or different voltages so you can have better power savings. - Correct. - Let's change gears here a little bit. Let's talk about the changes that were done on the graphics side. What is new when it comes to
the graphics side for Lunar? - Yeah, first it's a big
graphics engine. Okay? So it gives value by allowing creators and gamers a much better user experience. - So, our GPU is also a new
micro-architecture, Xe2. Xe2 brings a significant
performance increase and also improves our AI capabilities. It's integrating XMX engines, which take care of the AI usages. So overall, you know, performance increase, scale to AI, and new micro-architecture. - So, we have the graphics side. And you said we also
have a new media engine and display engine. Can you tell us a little
bit more about that one? - Yeah, so we have the, actually the media engine
and the display engine are part of the Xe2 architecture. It integrates HDMI 2.1 as well as DP 2.1. It also has a low power eDP 1.5 channel, which gives us good output
as well as a low power output. - Do you know how many
display pipelines it has? Like, how many monitors? - We have three display pipelines. And actually I didn't mention, but we are also supporting
the H.266 VVC decoding, which is a major upgrade
in our decode capabilities. - And we still support,
because in Meteor Lake, we were supporting AV1 encode and decode. - Correct. We still support it, yes. - Okay, that's great. So that was pretty cool when
it comes to the graphics side. Let's talk about the NPU,
the neural processing unit, and what's new in there? - Yeah, it's a major upgrade. We actually increased the number of tiles. We have six neural processing
engines and 12 SHAVE DSPs. Both of them improve AI
acceleration significantly, with nine megabytes of cache, and actually the architecture improves power efficiency significantly. So the big claim for the NPU is that it gets a lot of
TOPS at very low power, so it's very efficient. And you know, by increasing
the number of engines, we can run them at lower frequency for cases in which we are running a background AI task or don't need high throughput, and therefore it's very efficient. - Okay, so we increased the
number of neural core engines and therefore we can do
all those things.
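The trade-off Yaron describes, more engines at lower clocks, follows from dynamic power scaling roughly as f * V^2, with the supply voltage able to drop alongside frequency. The constants below are hypothetical, chosen only to illustrate the scaling, not NPU specifications:

```python
# Toy model (hypothetical constants, not NPU specifications) of why more
# engines at lower clocks is efficient: dynamic power scales roughly
# with f * V^2, and a lower frequency permits a lower supply voltage.

def npu_power(engines, freq_ghz, volts, cap=1.0):
    """Dynamic power, P = N * C * f * V^2, with per-engine capacitance C."""
    return engines * cap * freq_ghz * volts ** 2

def npu_throughput(engines, freq_ghz, ops_per_cycle=256):
    """Aggregate throughput: each engine retires ops_per_cycle per clock."""
    return engines * freq_ghz * ops_per_cycle

# "Narrow and fast" versus "wide and slow", tuned for equal throughput:
fast = dict(engines=3, freq_ghz=2.0)
wide = dict(engines=6, freq_ghz=1.0)
same_throughput = npu_throughput(**fast) == npu_throughput(**wide)

# Assume, hypothetically, that half the clock can run at 0.7 V, not 1.0 V:
p_fast = npu_power(**fast, volts=1.0)
p_wide = npu_power(**wide, volts=0.7)  # roughly half the power, same work
```

Under these assumptions the wide configuration delivers the same throughput at roughly half the dynamic power, which is the efficiency argument for scaling out engines instead of clocking them up.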
- Okay. So those are all the different engines that we have in the compute tile. What can we find in the
platform control tile? - So in the platform control tile, we can divide it into two main functions: security and connectivity. - Yeah, we have three engines of security. We have the Intel Partner Security Engine, which is a new engine. And we have the Intel
Silicon Security Engine, which is responsible for all
the authentication stuff. And we have the CSME,
which is the legacy engine, which is used for securing data from power-up. - And from a connectivity perspective, we have integrated Wi-Fi 7, which was the first time
we're integrating Wi-Fi 7 inside the chip. We have USB4 and USB3 ports. We have PCIe Gen5
and PCIe Gen4 ports. And we have Bluetooth and Thunderbolt and some other enhancements that we did. But basically these are the
main functions of the PCT. - So we have the compute tile,
the platform controller tile, and we have the DRAM, and they're all connected where
they're on the same package and they're connected through- - Correct. - Through Foveros, or? - Yeah, so we have the compute tile and the PCT tile on a single Foveros chip, and we have DRAM packaged on
the same package with them. - So on Meteor Lake there was also a
network on chip, right? That's the name. There have been a lot of changes done when it comes to Lunar
Lake that you were talking about, and one of those changes was that now there's a unified protocol that
it can actually talk over? Was that different on Meteor Lake? - So in Lunar Lake we did a lot of changes in the interconnect. We created the unified protocol and separated the network layer and the protocol layer. We also improved the
interconnect power efficiency. And the second thing we did is what we call extended scalability. We have the ability to
move IPs between dies. This is also something that
we created in Lunar Lake. We have this flexibility. The idea is that you can scale
this architectural goodness to the next generation, regardless of the
cutline between the dies. So you have fabric in the CPU die, you have the fabric in the PCT die, but in general you can
switch IPs between them because these are the same fabrics. And this is something that
we call extended scalability. It's like, you know, for other
segments, for other purposes, for a different mix and match
between process and cost, you can have this
flexibility to scale IPs. And by the way, we are taking Lunar Lake
goodness to the next generations. - That's pretty cool. Because that means that you're
not attached to one tile. You can move that block or
that IP to a different tile and depending on the process, so you're not tied to one process, you can just move it
around and you can just, it's like a Lego building block. Just like, oh let me move
it here, put it over here. - Yeah, so you can think about it as IP agnostic and partitioning agnostic. - Yaron, thank you very
much. Appreciate it. Learned a lot today.
- You're welcome. - Arik, I learned a lot today. Thank you so much. I appreciate it, man. This has been great.
- Thank you very much. (upbeat music)