Meteor Lake: AI Acceleration and NPU Explained | Talking Tech | Intel Technology

Captions
- Hi, and welcome (energetic electronic music) to "Talking Tech." I'm your host, Alejandro Hoyos, and today we have a very special guest with us. His name is Darren, and he's gonna be talking about this new building block that we have in Meteor Lake, which is called the Neural Processing Unit. So the NPU, or the Neural Processing Unit, is completely new for Meteor Lake, and I am very eager to learn all about it 'cause this is completely new for me. So please tell us a little bit more about it.

- Sure, yeah, I guess I can use this diagram here to explain it. Well, first off, just to describe the motivation, you know, the reason we have this is, you know, we see the number (transition whooshing) of use cases for AI seems to be exploding, right? And a lot of these use cases we want to bring to the client, (transition whooshing) but these algorithms are actually running on the CPU and you're kind of limited by the amount of, you know, efficient compute you have. You know, you can actually make the algorithm better, but then it's gonna burn too much power. So when we're bringing the NPU, it's basically kind of a power-efficient way to do AI. So you can take those algorithms, you can actually improve them, they'll take more compute. But we basically have, you know, power-efficient compute.

- We think about it at Intel in three different chunks. We have AI that is low-power, optimized for always-on AI, and that's our NPU, and that's a very optimized piece for delivering good performance at very low power. We also have the CPU. So when you're doing light AI or occasional AI and you need very low latency, you can just run it on the CPU. People have been doing this forever. But we also have a GPU, and the GPU is actually better organized for large batches of AI. So if you're doing some kind of content creation or you're doing some kind of filtering of data, all of that would really be well run on the GPU. Three different IPs that use three different software paradigms to bring AI to the client. The cool thing (transition whooshing) is we're working on the software layers that allow developers to access all three of those IPs very easily. So we're using industry-standard API layers like DirectML or, say, ONNX Runtime, or even our own, OpenVINO. And that means that developers can pick and choose not just which IP they want to run on, but also what API they want to use to develop their AI applications. For us, it's back to that big Intel strategy, which is build a hardware platform that's awesome and then be open in our approach to software. So we're using industry-standard APIs, and we have an optimized API for ourselves, or an optimized SDK, using OpenVINO.

(graphic whooshing) - Essentially, this is the NPU section, and then here's kind of the rest of the system. So we basically have kind of two components. One is, let's say, like the host interface and kind of the control for the NPU. And then the rest is kind of the compute portion. So the host interface portion communicates with the host that does all the kind of scheduling and memory management. The host interface also controls the scheduling on the device, power management, those kinds of things. And then everything below here is kind of compute. Really, the heart of the NPU is our fixed-function compute block. We're calling it the inference pipeline. When you have a neural network, basically, the majority of the compute kind of boils down to matrix multiplication.
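To make that last point concrete, here is a toy sketch (an illustration in NumPy with made-up layer sizes, not Intel's implementation) of how a fully connected layer's forward pass reduces to a matrix multiply plus an activation, and how its cost is counted in multiply-accumulates (MACs):

```python
# Toy NumPy sketch (illustration only, made-up sizes; not Intel's implementation):
# a fully connected layer's forward pass is a matrix multiply plus an activation,
# i.e. mostly multiply-accumulate (MAC) operations.
import numpy as np

batch, in_features, out_features = 8, 1024, 4096
x = np.random.randn(batch, in_features).astype(np.float32)         # input activations
w = np.random.randn(in_features, out_features).astype(np.float32)  # filter/weight values

y = np.maximum(x @ w, 0.0)                 # matrix multiply, then ReLU activation

macs = batch * out_features * in_features  # one MAC per (input, output) pair per sample
print(f"MACs for this single toy layer: {macs:,}")  # 33,554,432

# The same math in INT8 with 32-bit accumulation (the idea behind the DP4A-style
# acceleration mentioned below): smaller numbers, so more of them per cycle.
x_q = np.random.randint(-128, 128, size=(batch, in_features), dtype=np.int8)
w_q = np.random.randint(-128, 128, size=(in_features, out_features), dtype=np.int8)
y_q = x_q.astype(np.int32) @ w_q.astype(np.int32)  # accumulate in int32 to avoid overflow
```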
So we have what we're calling the MAC array in the inference pipeline block that does all that, you know, matrix multiplication. Actually, there's a lot of neural network operations; if you look at, like, OpenVINO, for example, there's like 160-plus (transition whooshing) kind of operators we have to support. Not all of 'em are matrix multiplication or activation functions.

- AI is really just a computational problem, right? You're running lots of these things called MACs, which are multiply-accumulates. All AI algorithms are generally large trees of multiply-accumulates across big data sets. So what we found over the years is that the precision of that calculation is not that critical, typically. So instead of running at a full precision of 32 bits, often AI algorithms work very well with smaller, less precise math. And that's called INT8, which is just an eight-bit integer operation. So if you imagine, if you do smaller math, you can do more of it and get higher performance. That's what DP4A does. It takes our 32-bit SIMD register and it divides it up into eight-bit chunks so that we can improve performance by 4X for AI.

- There's a typical sequence of operations you have in a neural network. So we basically have (transition whooshing) kind of these three blocks: you know, kind of the MAC array, the matrix multiplication; we'll have the activation function block; and then a data conversion block. So we have a DSP. This DSP is basically fully programmable. It can essentially support everything, but, you know, if you did matrix multiplication on it, it's just not nearly as fast as our fixed-function hardware here. But if we have a lower-compute operation or we have some operation that just doesn't happen very often, we would run it on the DSP. You know, the key things for neural network execution, or neural network power, are really how many times you read and write data and then also how efficient your kind of matrix multiplication is. So through this, we can get a lot of, you know, data reuse. We have kind of internal register files, let's say, inside our MAC array where we can get a lot of data reuse that can help reduce the power consumption.

- Okay, so we have covered the hardware side, but what about the drivers and software side?

- So there's kind of, like, two factors in the software. The first is: what is the driver model for the device? And that basically means, you know, how do we do the power management? How do we do the memory management? How does the security work? So we have a driver model called MCDM, which is the Microsoft Compute Driver Model. Then the other part is, you know, how does the developer program the device? You know? What's the programming interface? What's the programming API? (transition whooshing) And we've tried to have kind of a common API across the different hardware we support. So, whether it's GPU, CPU, or NPU, you know, we're supporting DirectML, WinML, OpenVINO, and ONNX Runtime. So we're trying to ease the developer experience. I mean, that's really the key for adoption.

- So this is great. (transition whooshing) Can you give us an example of applications?

- Let's say you're doing video conferencing and you wanna blur the background. You wouldn't have exactly this kind of network; you'd have what's called, like, a segmentation network. (transition whooshing) So in that kind of network, you would have, you know, different filter values. Let's say you're running, you know, OpenVINO: you essentially would load the model.
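As a rough illustration of the flow described here and walked through in the next answer, a minimal sketch assuming OpenVINO's Python API might look like the following. The model file, input shape, and threshold are hypothetical, and the "NPU" device name may differ by OpenVINO release; this is not Intel's reference code.

```python
# Minimal end-to-end sketch of the background-blur flow: load a segmentation
# model, compile it for the NPU, run a frame through it, then composite.
# Assumes OpenVINO's Python API (2023.x style); paths and shapes are placeholders.
import numpy as np
import openvino as ov
from scipy.ndimage import gaussian_filter

core = ov.Core()
model = core.read_model("segmentation_model.xml")        # network structure + filter values
compiled = core.compile_model(model, device_name="NPU")  # compiler emits the NPU machine code

frame = np.random.rand(1, 3, 256, 256).astype(np.float32)  # placeholder camera frame (NCHW)
scores = compiled(frame)[compiled.output(0)]                # per-pixel foreground scores
mask = (scores.squeeze() > 0.5).astype(np.float32)          # 1.0 = foreground, 0.0 = background

# Compositing step (the part that would typically run on the GPU; shown here on
# the CPU with SciPy purely for illustration): blur only the background pixels.
image_hwc = frame.squeeze(0).transpose(1, 2, 0)              # to HxWx3
blurred = gaussian_filter(image_hwc, sigma=(5, 5, 0))
alpha = mask[..., None]                                      # broadcast mask over color channels
composited = alpha * image_hwc + (1.0 - alpha) * blurred     # keep foreground, blur background
```

Swapping `device_name` to "CPU" or "GPU" would target the other IPs through the same API, which is the portability point made above.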
(transition whooshing) That model has a description of what the neural network structure looks like, and then it has what the filter values are. You would've put that through the compiler, and the compiler would've compiled (transition whooshing) essentially the machine code for what's gonna run here. You would put in an input image. So your input image, you put in the picture, you know, your face is there and all your background is there. And then what the neural network would go do is it would give an output, and it would, again, say for every pixel, foreground or background. And then from there, it's gonna go to maybe the GPU to do the blur. So the GPU is gonna, basically, get that, you know, mask, kind of the foreground-background mask, and it's gonna take in the image, and then it's gonna apply a blur to any pixel that says it's background and not do a blur on any pixel that says it's foreground.

- How are we expanding AI onto the client side? Because nowadays, we know that most of the AI is done in the cloud, but--

- Yeah, yeah, it's really interesting. Well, I mean, starting at the top, which would be: why do you want AI in the client, right? If it's in the cloud, why do you need it in the client? Well, there's lots of reasons. One is, if you have AI running locally, then you don't have to share your data with all these different services. So if you're doing, like, financials or your private pictures or whatever you want, having that local is very important to a lot of people. The second thing you can do is effectively have lower latency. So if you're using AI for facial recognition or for some kind of fingerprint or maybe even creation, the latency to go up to the cloud and back can actually degrade the experience. So having high-performance, low-power AI capability on the client is critical going forward.

- For you as an architect, as a design engineer, how do you, per se, predict the future? Do you have a little crystal ball? Like, how do you design for the future? How do you know what's coming up next?

- Yeah, I mean, the exciting thing about the field is everything moves so fast. So, you know, and this is all incremental; I think we have the right kind of base architecture, so we're making kind of, you know, incremental tweaks. (transition whooshing) So let's say, like, a new paper comes out and there's a new kind of network architecture. We'll take that and then we kind of analyze it. We do, like, a simulation to see what's our performance, and then we look to see, you know, like, what's the bottleneck? (transition whooshing) You know, what's taking the most amount of time? Or maybe are we not having good enough efficiency? So then we go look to see, you know, should we add some new fixed-function hardware in the inference pipeline, or should we tweak, let's say, some of the way that we're processing data in the inference pipeline, or should we add new instructions in the SHAVE DSP? I mean, in Meteor Lake, we added a few new instructions based on new activation functions. There were new activation functions coming out, so we're like, "Okay, well, we wanna make those faster." In our analysis, we showed that was a bottleneck. So we added new, like, vector instructions to the DSP, for example.

- Well, I mean, if you think about Intel's history and strength, we're all about scale. So in this case, Meteor Lake is going to ship hundreds of millions of devices over the next five years.
And so that install base will attract multiple different types (bell chiming) of applications. And we can only imagine what this platform's gonna do. Our strategy is to build the hardware, build the software infrastructure, and then work with ISV partners to deliver the best experiences. That includes folks like Microsoft and Adobe and others that are basically building AI-enabled applications today. (energetic electronic music)

- As you can see, Meteor Lake, from an architecture point of view and also from a product point of view, has a lot to offer: new AI, new graphics, new process technology, and a completely new type of architecture where we go from monolithic to disaggregated. It brings a lot of new features. Thank you for watching, and please stay tuned for more videos that will be coming your way. (cheerful electronic music) (no audio)
Info
Channel: Intel Technology
Views: 7,013
Id: QSzNoX0qplE
Length: 10min 13sec (613 seconds)
Published: Mon Dec 11 2023