(energetic electronic music)
- Hi, and welcome to "Talking Tech." I'm your host, Alejandro Hoyos, and today we have a very special guest with us. His name is Darren, and he's going to be talking about a new building block that we have in Meteor Lake called the Neural Processing Unit. The NPU is completely new for Meteor Lake, and I'm very eager to learn all about it because it's completely new for me too. So please tell us a little bit more about it.
- Sure, I can use this diagram here to explain it. First, the motivation: the number of use cases for AI is exploding, and a lot of those use cases are ones we want to bring to the client. Today those algorithms run on the CPU, so you're limited by the amount of efficient compute you have. You could make an algorithm better, but then it burns too much power. The NPU is basically a power-efficient way to do AI: you can take those algorithms and improve them, and they'll take more compute, but now you have power-efficient compute to run them on.
- At Intel, we think about it in three different chunks. We have AI that is low-power, optimized for always-on AI: that's our NPU, a very optimized piece for delivering good performance at very low power. We also have the CPU. When you're doing light or occasional AI and you need very low latency, you can just run it on the CPU; people have been doing this forever. And we also have the GPU, which is better organized for large batches of AI, so if you're doing some kind of content creation or some kind of filtering of data, all of that runs really well on the GPU. Three different IPs, three different software paradigms for bringing AI to the client. The cool thing is we're working on the software layers that let developers access all three of those IPs very easily. We're using industry-standard API layers like DirectML, ONNX Runtime, or even our own OpenVINO. That means developers can pick and choose not just which IP they want to run on, but also which API they want to use to develop their AI applications. For us, it's back to that big Intel strategy: build a hardware platform that's awesome, and then be open in our approach to software. So we support industry-standard APIs, and we also have our own optimized SDK in OpenVINO.
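To make that concrete, here is a minimal sketch of how a developer might target one of those three IPs through OpenVINO's Python API. It assumes a recent OpenVINO release; the model file name is hypothetical, and the devices actually reported depend on the installed drivers.

```python
import openvino as ov

core = ov.Core()
# On a Meteor Lake machine with the NPU driver installed, this typically
# reports something like ['CPU', 'GPU', 'NPU'].
print(core.available_devices)

# Read a model in OpenVINO IR format (hypothetical file name) and compile it
# for a specific IP; changing the device string is all it takes to move the
# same model between CPU, GPU, and NPU.
model = core.read_model("my_model.xml")
compiled = core.compile_model(model, device_name="NPU")
```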
- Essentially, this is the NPU section, and here's the rest of the system. We basically have two components: one is the host interface and the control for the NPU, and the rest is the compute portion. The host interface communicates with the host, which does all the scheduling and memory management; the host interface also handles scheduling on the device, power management, those kinds of things. Everything below that is compute. Really, the heart of the NPU is our fixed-function compute block, which we're calling the inference pipeline. When you have a neural network, the majority of the compute boils down to matrix multiplication, so we have what we're calling the MAC array in the inference pipeline block, and it does all of that matrix multiplication. Now, there are actually a lot of neural network operations: if you look at OpenVINO, for example, there are 160-plus operators we have to support, and not all of them are matrix multiplications or activation functions.
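As a purely illustrative aside (plain Python, not NPU code), this is the triple loop of multiply-accumulates that a matrix multiply reduces to; the MAC array's job is to run enormous numbers of these inner steps in parallel.

```python
import numpy as np

def matmul_macs(A, B):
    """Naive matrix multiply; every inner step is one multiply-accumulate (MAC)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    macs = 0
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[m, k] * B[k, n]  # one MAC: multiply, then accumulate
                macs += 1
            C[m, n] = acc
    return C, macs

A = np.random.rand(8, 16)
B = np.random.rand(16, 4)
C, macs = matmul_macs(A, B)
print(macs)                      # 8 * 16 * 4 = 512 MACs
assert np.allclose(C, A @ B)     # matches the library matmul
```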
- AI is really just a computational problem. You're running lots of these things called MACs, which are multiply-accumulates; AI algorithms are generally large trees of multiply-accumulates across big data sets. What we've found over the years is that the precision of that calculation typically isn't that critical. So instead of running at full 32-bit precision, AI algorithms often work very well with smaller, less precise math called INT8, which is just an eight-bit integer operation. If you do smaller math, you can do more of it and get higher performance. That's what DP4A does: it takes our 32-bit SIMD register and divides it up into eight-bit chunks so that we can improve performance by 4x for AI.
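Here is a rough sketch of what one DP4A step does conceptually, written as plain Python rather than the actual instruction: four int8 pairs are multiplied and summed into a 32-bit accumulator, so each 32-bit lane carries four times as much math as a single fp32 value would.

```python
import numpy as np

def dp4a(a4, b4, acc):
    """Conceptual DP4A: dot product of four int8 pairs, accumulated into int32."""
    assert len(a4) == len(b4) == 4
    for a, b in zip(a4, b4):
        acc += np.int32(a) * np.int32(b)   # widen before multiplying to avoid int8 overflow
    return acc

# Four int8 values packed where a single 32-bit lane used to hold one fp32 value.
a = np.array([12, -3, 127, -128], dtype=np.int8)
b = np.array([-5, 44,   2,    1], dtype=np.int8)
print(dp4a(a, b, np.int32(0)))   # -66
```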
- For the typical sequence of operations you have in a neural network, we basically have these three blocks: the MAC array for the matrix multiplication, an activation function block, and then a data conversion block. We also have a DSP. The DSP is fully programmable, so it can support essentially everything, but if you did matrix multiplication on it, it just wouldn't be nearly as fast as our fixed-function hardware. So if we have a lower-compute operation, or an operation that just doesn't happen very often, we run it on the DSP. The key things for neural network execution, and for neural network power, are really how many times you read and write data and how efficient your matrix multiplication is. Through this design we get a lot of data reuse: we have internal register files inside the MAC array where we can reuse data heavily, and that helps reduce the power consumption.
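Below is a toy NumPy sketch of the three fixed-function stages just described, with a deliberately simplified quantization step; it is only meant to show the shape of the data flow, not how the hardware actually implements it.

```python
import numpy as np

def npu_layer(x_int8, w_int8, scale):
    """Toy pass through the three stages: MAC array, activation, data conversion."""
    # 1. MAC array: int8 x int8 products accumulated into a wide int32 accumulator.
    acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32)

    # 2. Activation function block: ReLU applied to the accumulator.
    act = np.maximum(acc, 0)

    # 3. Data conversion block: rescale and narrow back to int8 for the next layer.
    return np.clip(np.round(act * scale), -128, 127).astype(np.int8)

x = np.random.randint(-128, 128, size=(1, 64), dtype=np.int8)
w = np.random.randint(-128, 128, size=(64, 32), dtype=np.int8)
print(npu_layer(x, w, scale=1.0 / 512).shape)   # (1, 32)
```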
- Okay, so we've covered the hardware side, but what about the driver and software side?
- There are really two parts to the software. The first is the driver model for the device, which covers how we do power management, how we do memory management, and how the security works. For that we have a driver model called MCDM, the Microsoft Compute Driver Model. The other part is how the developer programs the device: what's the programming interface, what's the API? We've tried to have a common API across the different hardware we support, so whether it's GPU, CPU, or NPU, we're supporting DirectML, WinML, OpenVINO, and ONNX Runtime. We're trying to ease the developer experience; that's really the key for adoption.
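As an illustration of that "common API across hardware" idea from the application side, here is a hedged ONNX Runtime sketch that picks an execution provider from whatever the installed build exposes. The provider names and model path are assumptions; which providers actually exist depends on the onnxruntime package and drivers present.

```python
import onnxruntime as ort

# Each backend shows up as an "execution provider"; which ones are available
# depends on the onnxruntime build installed (e.g. onnxruntime-directml or
# onnxruntime-openvino) and on the drivers present.
available = ort.get_available_providers()
print(available)

# Prefer a hardware-backed provider when it's there, otherwise fall back to CPU.
preferred = [p for p in ("DmlExecutionProvider",
                         "OpenVINOExecutionProvider",
                         "CPUExecutionProvider") if p in available]
session = ort.InferenceSession("segmentation.onnx", providers=preferred)
```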
- This is great. Can you give us an example of applications?
- Let's say you're doing video conferencing and you want to blur the background. You wouldn't have exactly this kind of network; you'd have what's called a segmentation network, and in that kind of network you'd have different filter values. Say you're running OpenVINO: you would load the model, which has a description of what the neural network structure looks like along with the filter values. You'd put that through the compiler, and the compiler would compile essentially the machine code for what's going to run here. Then you put in an input image, the picture with your face and all of your background in it, and what the neural network does is produce an output that says, for every pixel, foreground or background. From there it might go to the GPU to do the blur: the GPU takes that foreground-background mask and the image, applies a blur to every pixel marked as background, and leaves every pixel marked as foreground untouched.
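Here is a hedged end-to-end sketch of that flow using OpenVINO and OpenCV. The model name, input layout, and output interpretation are all assumptions (real segmentation models differ in preprocessing and outputs), and the Gaussian blur simply stands in for the GPU post-processing step described above.

```python
import cv2
import numpy as np
import openvino as ov

core = ov.Core()
# Hypothetical person-segmentation model in OpenVINO IR format, compiled for the NPU.
compiled = core.compile_model(core.read_model("person_seg.xml"), "NPU")
out_key = compiled.output(0)

frame = cv2.imread("frame.jpg")                    # H x W x 3, BGR
n, c, h, w = compiled.input(0).shape               # assumes a static NCHW input
blob = cv2.resize(frame, (w, h)).transpose(2, 0, 1)[None].astype(np.float32)

# Run inference; assume the output is a per-pixel foreground probability map.
prob = compiled([blob])[out_key].squeeze()
mask = cv2.resize(prob, (frame.shape[1], frame.shape[0])) > 0.5

# Blur everything, then keep the sharp pixels wherever the mask says "foreground".
blurred = cv2.GaussianBlur(frame, (31, 31), 0)
composite = np.where(mask[..., None], frame, blurred)
cv2.imwrite("blurred_background.jpg", composite)
```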
- How are we expanding AI onto the client? Because nowadays most AI is done in the cloud.
- Yeah, it's really interesting. Starting at the top: why do you want AI on the client at all? If it's in the cloud, why do you need it on the client? Well, there are lots of reasons. One is that if you have AI running locally, you don't have to share your data with all these different services. So if you're working with financials, or your private pictures, or whatever it is, keeping that local is very important to a lot of people. The second thing is you can effectively get lower latency. If you're using AI for facial recognition, for some kind of fingerprint authentication, or maybe even for content creation, the latency of going up to the cloud and back can actually degrade the experience. So having high-performance, low-power AI capability on the client is critical going forward.
- For you as an architect, as a design engineer, how do you predict the future? Do you have a little crystal ball? How do you design for the future? How do you know what's coming up next?
- The exciting thing about the field is that everything moves so fast, and a lot of this is incremental. I think we have the right base architecture, so we're making incremental tweaks. Let's say a new paper comes out with a new kind of network architecture. We take it and analyze it: we run a simulation to see what our performance is, and then we look at what the bottleneck is. What's taking the most time? Or are we not getting good enough efficiency? Then we decide: should we add some new fixed-function hardware in the inference pipeline, should we tweak the way we're processing data in the inference pipeline, or should we add new instructions to the SHAVE DSP? In Meteor Lake, for example, we added a few new instructions based on new activation functions. There were new activation functions coming out, we wanted to make those faster, and our analysis showed they were a bottleneck, so we added new vector instructions to the DSP.
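A toy illustration of that kind of bottleneck analysis (all operator names and timings here are invented): rank where the time goes in a simulated run, and whatever dominates becomes the candidate for new fixed-function hardware or new DSP instructions.

```python
# Invented per-operator timings (microseconds) from a hypothetical simulation run.
op_time_us = {
    "MatMul":    410.0,
    "Conv":      380.0,
    "GELU":       95.0,   # a newer activation function currently handled by the DSP
    "Softmax":    40.0,
    "Transpose":  12.0,
}

total = sum(op_time_us.values())
for op, t in sorted(op_time_us.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:<10} {t:7.1f} us  {100 * t / total:5.1f}%")
```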
- Well, if you think about Intel's history and strength, we're all about scale. Meteor Lake is going to ship hundreds of millions of devices over the next five years, and that install base will attract many different types of applications; we can only imagine what this platform is going to do. Our strategy is to build the hardware, build the software infrastructure, and then work with ISV partners to deliver the best experiences. That includes folks like Microsoft and Adobe and others that are building AI-enabled applications today.
(energetic electronic music)
- As you can see, Meteor Lake has a lot to offer, both from an architecture point of view and from a product point of view: new AI, new graphics, new process technology, and a completely new type of architecture where we go from monolithic to disaggregated, which brings a lot of new features. Thank you for watching, and please stay tuned for more videos coming your way.
(cheerful electronic music)