CUDA: New Features and Beyond | NVIDIA GTC 2024

Captions
Now let's get started with our session. This morning I'm proud to introduce Stephen Jones, an NVIDIA distinguished software architect, presenting the latest and greatest on our CUDA parallel computing platform. He's been with NVIDIA for over 14 years, joining as a senior software engineer on CUDA in 2008. He had a brief stint at SpaceX, and he may share a little about that and about meeting Elon. He holds a master's degree in aeronautical and aerospace engineering from the University of Cambridge. And with that, I'll hand over to Stephen to talk about CUDA.

Thanks very much, all, for coming. It's amazing. I'm sure everyone says this, but it's just wonderful to actually talk to people instead of to a camera above your screen when you're recording a talk. So it's really nice. Thank you all for coming.

I'm one of the architects of CUDA, so I spend my time thinking about CUDA: the platform, the language, and all the systems and tools that go with it. I also work closely with the hardware teams. Even though I'm a software guy, I spend probably half my time with them, working on what our next-generation GPU, and the one after that, is going to be. One of the magical things we get to do is combine the hardware and the software, so that we build the programming model we want and we can program the hardware that we build.

I'm going to start today with something I've learned from working with the hardware teams a lot. It really drives the way I think about what I'm talking about today, and about how hardware is shaped by the laws of physics and by the constraints we all live under, in the hardware we design as well as the software that programs it. It's possibly a contentious thing to say at NVIDIA, but accelerated computing is not all about performance. If you watched the keynote yesterday with Jensen, he was really talking about how it's all about energy. It's not just the performance, it's the performance per watt. Ultimately you've got to power these things, you've got to deliver energy into the machine, and so efficiency is the key metric you have to care about. Yes, you want to scale performance, but you've got to scale it with efficiency as well.

The obvious place I went looking for this, doing a bit of research for the introduction of this talk, was data centers. The world is building data centers at an enormous rate, standing up something like six or seven a day. But when I went looking for the number of data centers being built, nobody lists data centers by number. They list data centers by megawatt, because power is what matters in a data center. There's five gigawatts' worth of data centers in North America right now, and another three gigawatts standing up in the next year. And when you go and buy time on a data center, you rent it, and you're charged by the kilowatt per month. Nobody cares how many servers you're buying. Nobody cares how many data centers you're renting. You are renting power per month.
Because power is the metric that really matters for computing. If you look at what a data center typically is, a medium-sized data center runs you maybe 20 megawatts: a big building with a giant power feed coming in that delivers 20 megawatts. So if I build a brand-new chip that goes twice as fast but runs at twice the power, I can't magically put more power into that room. I just end up with half as many chips. I've got 20 megawatts; the question is what I can do with my 20 megawatts. Again, if you were watching what Jensen said about Blackwell at the keynote, he talked a lot about energy and power efficiency, and that is a really big focus on the hardware side. Every transistor really counts.

But it's not just the data centers, it's also my own home. My desktop machine gets its power from the wall. I can't put ten GPUs in it. I can't run that from my one and a half kilowatts if I'm in America, or three kilowatts if I'm in the UK. My maximum system power on my laptop is even smaller. Everybody is constrained by power more than anything else. So it really is all about energy.

The challenge is that the energy balance is getting worse. On the left-hand side here is the very well-known chart of Moore's law: transistor density from when Moore stated his law back around 1970, and it's pretty much been growing exponentially; it's a log plot. On the right-hand side, I went to TSMC's website and pulled up the information about their different processes, and their transistor density, the orange line, has of course been increasing exponentially as well. But something else you see in that data is the power-efficiency scaling. As I shrink my transistors, I need fewer electrons to switch them, so they take less power. But the power is not scaling as fast as the transistor count, and that's a problem in a world which is energy constrained, because as I keep adding transistors, I've got to do something about the power. So while we obviously look very closely at the hardware and its efficiency, and I say this as a software guy, it is critically important to look at the efficiency of the software as well.

So I'm going to talk about the two key places the power goes: one is data movement, and the other is computation, the two obvious consumers of electricity in one of these machines. Starting with computation, let's talk for a moment about floating-point arithmetic, because that's really the core of the computation being done. In most of these data centers running GPUs, overwhelmingly the flops and the watts are being spent on things like matrix multiplication. So I dug up a little table. On the left-hand side are all the different precisions that NVIDIA offers. There are lots of reasons for that, but I'm going to get into one particular reason, focused on power, right now.
On the right-hand side, I can break down the energy cost of multiplications at different precisions. The FMA, the fused multiply-add, is the fundamental unit of arithmetic in a computer. At the top you see the standard 32-bit single-precision flop, to which I've normalized everything at 1x. Double precision is about two and a half times the power, and half precision is about half the power. The key here is that the higher precisions don't just scale linearly: floating-point multiplication scales as the square of the mantissa length, the blue section on the left-hand side. The longer my number is, the more power it takes to compute with it, because I've got more bits to move around and multiply.

Then you look at the tensor cores, and they're completely different. They take these single operations and group them together, and with that economy of scale you see dramatic improvements in energy per flop. This is one of the reasons you see the investment in tensor cores, in these dense processing units: I've got 20 megawatts in my data center and I want to cram in all the flops I can get. And you can look at some interesting balances. The FP64 tensor core, the double-precision tensor core, and this is looking at Hopper H100 numbers, is more efficient than individual multipliers; that's the economy of scale I was talking about. And at 16 bit, instead of being one and a half times more efficient, it's four times more efficient, because again it's that square of the mantissa bits which is winning me the power.

But I want to look at something I think is more interesting: the difference between the 64-bit operations, at two and a half times the power of single precision, and the reduced-precision tensor cores, which are about a factor of 20 apart in power efficiency. Of course, there's a very big difference between a 64-bit number and a 16-bit number. But it turns out there's not as much of a difference as you would think. This is not new work, but there's been a lot going on around it recently, and I really wanted to tell you about it because to me it's a really exciting way to attack the power wall that I see coming from those silicon process curves.

A couple of years ago I told you about some work by my colleague Azzam Haidar, and there's a reference to the paper down below, using tensor cores to do an LU factorization solve. You do the heavy lifting in 16 bit, and then you have an iterative process, a GMRES refinement, that takes your 16-bit result and progressively improves the accuracy back up to 64-bit values. In the paper they look at this, and this graph shows the number of iterations along the bottom. The 16-bit tensor cores actually output a 32-bit result, and then I iterate, three or four times, against that line in the middle which marks FP64 accuracy. I get closer and closer to the final result, and finally I exceed the accuracy of a true native 64-bit solve.
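To make the structure of that idea concrete, here is a minimal NumPy/SciPy sketch of mixed-precision iterative refinement. It is not the cuSOLVER implementation or the GMRES-based algorithm from the paper: it uses float32 as a stand-in for the low-precision factorization and plain residual correction instead of GMRES, and the matrix and sizes are made up for illustration.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Toy illustration of mixed-precision iterative refinement: factorize once
# at low precision, then refine the solution using residuals computed in FP64.
rng = np.random.default_rng(0)
n = 512
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

# "Heavy lifting" at low precision: LU factorization in float32
# (standing in for the FP16 tensor-core factorization on the GPU).
lu, piv = lu_factor(A.astype(np.float32))

x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

for it in range(5):
    r = b - A @ x                                   # residual in full FP64
    dx = lu_solve((lu, piv), r.astype(np.float32))  # cheap low-precision correction
    x = x + dx.astype(np.float64)
    err = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
    print(f"iteration {it}: relative residual = {err:.2e}")
```

For a reasonably conditioned matrix, the residual typically drops to around full-precision levels within a handful of iterations, which is the behavior the convergence plot in the talk is showing.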
So this is not in any way compromised on the results; it's accurate to the bit of what you would get, except I've done the heavy lifting on those much more power-efficient tensor cores. And that's a really big deal. Here's a chart that Azzam kindly ran for me this weekend of an LU decomposition solve, the thing in that flow chart before. Both runs are on GH200: the green line is the 16-bit-plus-64-bit mixed-precision path, the middle column on that chart, and the blue line is pure native double precision. You can see what you're getting. You're not only getting the power benefit at the bottom, which is huge, almost a factor of six improvement in flops per watt, which is unbelievable: I can now do six times more flops in that power-limited data center using this technique than if I had computed the exact same result natively in double precision. At the same time, I'm going almost four times as fast. So I'm faster and I'm more efficient. This is huge. This algorithm is actually implemented in the cuSOLVER library, but I see this reaching out into everything: if I can do work faster and more efficiently, this is one way we can attack the power wall that I see coming.

And it's not just Azzam's work. My longtime friend Rio Yokota at the University of Tokyo and some colleagues wrote a paper taking a completely different approach, again using low precision, in his case integer tensor cores, to do matrix multiplication. Some of our genius folks at NVIDIA implemented it, but instead of taking one of the GH200-class, mega-powerful chips, they took the L40S, the lower-power data center part which doesn't have a native double-precision tensor core in the first place. Using the tensor cores that are in the L40S, they were able to run matrix multiplications at six or seven times the performance. There is a double-precision unit on that chip, but it's not the high-power, high-performance double-precision unit you would find in an H100. They compared it to an A100 as well, and the chart was too busy so I didn't include it, but it reaches half the performance of an A100 while using no double-precision tensor cores at all, which is absolutely incredible. This opens the door to parts with much lower power achieving 50% of that performance, which is incredible. And not only that: at the same time as you're getting a factor of six or seven in performance, you're getting a factor of seven or eight in performance per watt. So this is huge, and it fascinates me. I'm very lucky with this talk: I get to go find all these cool things going on around the company and tell you about them because I find them interesting, and this is one I think is really fascinating, because there's so much we can do with this type of technique.

Now, tensor cores themselves. A lot of people come and ask me, how do I program tensor cores? And tensor cores are a complex system.
Tensor cores have all these different precisions and different ways of using them, but there are three main flavors of access. The first is the cuBLAS math library. That's your basic workhorse; it has existed since the very beginning of CUDA, it's the linear algebra API, and when you call a matrix multiply it naturally and automatically pipes through onto the tensor cores. cuBLAS actually calls the one in the middle, cuBLASLt, which you can also access yourself. It's a public library that gives you more advanced APIs where you can control a lot more aspects of what the tensor cores are doing; the tensor cores have a lot of different configurations and modes, and it really gives you access to them. And on the right-hand side we have CUTLASS, which, if you've seen me give this talk before, I talk about probably every year, because it really is the programmer's way of getting at the tensor cores. It lets you write tensor core code inside your own kernel and get at all of the knobs and configurations that tensor cores have.

I drew this out for myself in a different way, because really there's a productivity dimension on the left with cuBLAS, where I call one API and get the peak acceleration, and a control dimension on the right if I really want to start tweaking things and meshing them with my own data. One of the things the math library teams have been working on is device extension libraries. The cuBLAS device extension library, cuBLASDx, brings the cuBLAS productivity path into your device kernel. While CUTLASS is a hierarchy of C++ classes that gives you incredibly fine-grained control, cuBLASDx takes a completely different approach: you can get the tensor cores activated inside your kernel with a single GEMM call, just as you would with cuBLAS from the CPU.

Why would you want to do this? Because sometimes you don't just want a matrix multiplication, you want to do something with the result. That's what we call a fusion: you take some data, manipulate it in some way, do a few big matrix operations, and then use the result. This chart shows a pre-processing step, two matrix multiplies, and a post-processing step all fused into one kernel. The difference between doing that in one kernel and sequencing it as a series of calls, using Thrust with cuBLAS in this case, is a factor of three in performance. So being able to take the same simplified API, put it inside your kernel, and customize it the way you want also comes out ahead on performance. And while I'm not showing performance per watt in these cases, all of them are reaching peak performance, typically at lower energy.

The same is true for FFT. I actually showed this last year, because the team has been working on the device extension library for FFT for some time. Again, it's about fusing FFTs with the rest of your operations; in this case I'm fusing three kernels into one, and again you see these speedups. A lot of this comes from fusion, where I've customized my kernels in ways that let me string lots of work together and load my data once.
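For a sense of what the productivity end of that spectrum looks like from Python, here is a minimal sketch using CuPy, which routes matrix multiplication through cuBLAS under the hood. The sizes are arbitrary, and the note about FP16 inputs being eligible for tensor core paths reflects typical cuBLAS behavior rather than a guarantee for any particular library version.

```python
import cupy as cp

# One-call "productivity path": a GEMM that CuPy hands to cuBLAS.
# With FP16 inputs, cuBLAS can typically dispatch this onto the tensor cores.
m, n, k = 4096, 4096, 4096
a = cp.random.standard_normal((m, k), dtype=cp.float32).astype(cp.float16)
b = cp.random.standard_normal((k, n), dtype=cp.float32).astype(cp.float16)

c = cp.matmul(a, b)            # single call; the library picks the kernel
cp.cuda.Device().synchronize() # wait, so any timing or inspection is meaningful

print(c.dtype, c.shape)
```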
If you remember, I said there were two reasons for power cost: one is data movement, and the other is compute. Fusion addresses the data movement, so that my compute applies densely to the data without data movement going on in between.

How does that work? The basic kernel fusion, and probably many of you in the room are aware of this, is that I'll typically have a sequence of operations: maybe a precision conversion, then a matrix multiplication, and then an activation function, a ReLU or something like that, some very standard sequence. That's what those charts were showing a moment ago. By fusing them together, you load your data once, operate on it many times, and store it out the other end, and you end up with a single fused kernel. This is a great idea, and everybody should do it if they can.

The challenge is that I don't have just one thing to do. I might have a hundred different types of things to do. I've drawn four on this slide because that's all I could fit, but even with four options at each stage I've got 64 possible combinations, and I can't build every single one of them ahead of time. If I had a hundred on every row, I'd have a million combinations. That's just not feasible. So as people build these codes that fuse things, they're also very often moving towards just-in-time compilation, runtime compilation, where you say: my program needs this, that, and the other unit; configure them precisely for what I need, compile it on the spot, and run it. I see JIT compilation becoming more and more important in people's CUDA workflows.

Our compiler team has spent, and this chart covers about 18 months, from CUDA 11.8, consistently reducing the cost of that JIT compilation. Very often, as I'm showing on the bottom left, you've got an iterative loop: you build a fused kernel, you run it, you get some data, you look at what you're doing next, and you build the next one. The compile time becomes part of your main program loop. So they've worked really hard, and this is showing the compile time for Hello World, basically pure overhead, since Hello World is the simplest program you can write: the overheads of compilation have come down by a factor of six over the last 18 months. There's a big focus on how fast you can iterate and how fast you can compile, because JIT compilation is showing up everywhere.
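To make the fusion-plus-JIT pattern concrete, here is a small sketch using CuPy's RawKernel, which compiles the CUDA C source string at runtime (via NVRTC) the first time it is launched. The fused "convert + scale + ReLU" kernel is my own made-up example of the kind of epilogue you might otherwise run as three separate elementwise kernels; it is not cuBLASDx or any NVIDIA-provided fusion.

```python
import cupy as cp

# A tiny fused epilogue, JIT-compiled at runtime: precision conversion
# (FP64 -> FP32), scaling, and ReLU in a single pass over the data
# (one load, one store) instead of three separate elementwise kernels.
fused_src = r'''
extern "C" __global__
void convert_scale_relu(const double* x, float* y, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = (float)x[i] * scale;   // convert + scale
        y[i] = v > 0.0f ? v : 0.0f;      // ReLU
    }
}
'''
fused = cp.RawKernel(fused_src, 'convert_scale_relu')

n = 1 << 20
x = cp.random.standard_normal(n, dtype=cp.float64)
y = cp.empty(n, dtype=cp.float32)

threads = 256
blocks = (n + threads - 1) // threads
fused((blocks,), (threads,), (x, y, cp.float32(0.5), cp.int32(n)))  # first call triggers the JIT

# Reference: the same computation as three separate, unfused steps.
ref = cp.maximum(x.astype(cp.float32) * cp.float32(0.5), 0.0)
assert cp.allclose(y, ref)
```

CuPy's ElementwiseKernel and cupy.fuse offer similar runtime-compiled fusion without writing the CUDA C by hand.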
Now, these compilation tools bring me back to the platform as a whole. When I think of CUDA, I'm always thinking of the entire platform: my job as one of the architects of CUDA is to think about how all of these things fit together, because nothing exists in isolation. And there's kind of an inverted pyramid here. Very, very few people write compilers. There are a few of you, and we love you, and we absolutely support you, with LLVM and all the other things you can target, but fundamentally you can probably count on both hands the number of people who really sit down and write compilers. Above that there are kernels, then libraries and host-side libraries, and then this massive universe of frameworks and SDKs at the top.

One of the things I'm thinking about a lot these days, and have paid a lot of attention to over the last several years, is Python. When I look at the world of Python developers, that pyramid suddenly gets much, much wider: instead of a million users at the top, I've got ten million. So the gap between something you build at the bottom and the impact it has at the top is even broader. And JIT compilation is incredibly important in Python, because Python is a runtime-interpreted language and you're constantly generating data dynamically, so a compiler in the loop is completely normal; the Python interpreter basically is one. The changes we make at the very bottom affect enormous numbers of people. Looking at the Python stack, you have to invest everywhere, all the way across it. I've listed a few of the places we're really looking at, but really the goal, and I put it as the subtitle of this slide, is the vision of where I think Python needs to be for all of us in CUDA: a complete NVIDIA experience for the Python developer, with the whole CUDA ecosystem available and accessible from Python.

One aspect of that is that you're seeing our libraries and tools support Python more and more. The math library teams have put a ton of work into producing a Pythonic interface which natively and naturally connects Python applications to these accelerated libraries, and I think the libraries are fundamentally the most common way people access GPU acceleration. At the bottom here, and through many of these slides, I've put links to other people's talks; this one is a link to my friends Artie and Haroon's talk, which is all about the libraries, and there's an index of all the referenced talks at the end of this presentation if you want to follow up. The Python libraries are a full stack that goes all the way from your application, through JIT compilation, through the different APIs on both the CPU side and the GPU side, down onto the underlying libraries: the GPU-accelerated ones, NVPL, the NVIDIA Performance Libraries that target the Arm processor, MKL, anything else. A universal front end for the accelerated libraries.

The other aspect of tensor cores I talked about before was CUTLASS, which gives you detailed configuration and control over the tensor cores. CUTLASS also has a Python interface. On the left-hand side I've got a couple of boxes, one showing what the C++ interface looks like and the equivalent Python below it; you can go and install it and find documentation for it. On the right-hand side, they've integrated this with PyTorch extensions, so you can emit a PyTorch extension from CUTLASS and automatically bring custom tensor core kernels written in Python into PyTorch. There's a CUTLASS talk as well; the link was actually on the previous slide.
Go and have a look at the CUTLASS talk, which will tell you a lot more about how that works. And as I said, we're not just investing in libraries, we're also investing in tools. The developer tools team for the CUDA platform, the Nsight folks, have put a lot of effort into being able to combine their output for both C++ code and Python code, at the same time, on the same timeline, and on the right I've got an example of exactly that. Likewise the code annotations we call NVTX, which let you identify code regions by annotating them, a green region here, a blue region there, so it's much easier to find the regions you care about in complicated profiler traces. It's all configurable through JSON files, and it all works nicely with Python programs. There are all of these different pieces: that pyramid I was showing, you've got to put the building blocks in all these different places so that ultimately you end up with an ecosystem that works up and down the stack.

As I said, I look around and find these amazing things people are doing, and one of the things that's really caught my eye inside NVIDIA is Warp. My friend Miles Macklin, who is normally in New Zealand but is here to give a talk about Warp this week, runs the team that built it, and it's a very special thing. Warp lets you write GPU kernels in Python, but they are differentiable kernels. It naturally and automatically takes the kernels you have written and, with JIT compilation, remember, JIT compilation is showing up everywhere, it can produce the reverse-mode, differentiated version of your flow: you run a forward pass, it records it, and you can replay it backwards. So you can write simulation code, physics code, computational code in a kernel, it compiles straight down onto the GPU and runs at full compiled GPU performance, but with this backward, differentiable pipeline available as well. The things you can do with it are incredible. There's a whole compiler chain inside, which takes the Python, turns it into PTX, and runs it on the GPU. His talk is down here; go and check it out, because first, it's incredible technology, and second, he works in the realm of computer graphics, so he's got beautiful videos and visuals as well. This example models something incredibly complicated, the plastic deformation of tearing bread apart, and the simulation and the ground truth look almost exactly the same. Through auto-differentiation you can run the simulation, the backward differential path is used to train a neural model of how that plastic deformation behaves, and then the model can very, very quickly start producing amazing computer graphics and simulation results like this. Go and check out his talk.
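To give a flavor of what a differentiable Warp kernel looks like, here is a minimal sketch based on Warp's public API (wp.kernel, wp.launch, and wp.Tape for recording the forward pass). The specific loss function is an arbitrary example of mine, not something from Miles's talk, and exact details may vary by Warp version.

```python
import warp as wp

wp.init()

@wp.kernel
def sum_of_squares(x: wp.array(dtype=float), loss: wp.array(dtype=float)):
    # Each thread accumulates its element's contribution to a scalar loss.
    i = wp.tid()
    wp.atomic_add(loss, 0, x[i] * x[i])

n = 1024
x = wp.array([0.01 * i for i in range(n)], dtype=float, requires_grad=True)
loss = wp.zeros(1, dtype=float, requires_grad=True)

# Record the forward pass on a tape, then replay it backwards to get gradients.
tape = wp.Tape()
with tape:
    wp.launch(sum_of_squares, dim=n, inputs=[x, loss])

tape.backward(loss=loss)     # reverse-mode pass generated by Warp's compiler
print(x.grad.numpy()[:4])    # d(loss)/dx_i = 2 * x_i
```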
So last year, and I very rarely reuse slides, but this one nicely summarizes it, I told you about something called Legate, and I want to tell you a bit more, because it fits into a lot of what I've been talking about. Legate is a framework which takes your basic single-threaded code and distributes it very widely across a large number of machines. These machines are getting bigger and bigger, you're processing more and more data, and programming them gets increasingly hard; that's what something like Legate is for. It's a stack: libraries on top, a runtime in the middle, and it runs across the accelerated libraries across your whole machine. Last year I showed you a basic stencil benchmark using NumPy. NumPy can talk to this thing we have called cuNumeric, which is a NumPy implementation built on Legate, and it automatically scales your NumPy program, in this case across a thousand GPUs. It's a pretty straightforward stencil computation, but it's a very powerful tool.
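For reference, this is roughly what the drop-in cuNumeric pattern looks like: the documented usage is to import cunumeric in place of numpy, and the same array code then runs under Legate. The stencil below is my own minimal example, not the benchmark from the slide.

```python
import cunumeric as np   # drop-in replacement for "import numpy as np"

# A simple 2D Jacobi-style stencil: unchanged NumPy-style code, but the
# arrays and the slicing arithmetic are executed by Legate/cuNumeric,
# which can partition the work across many GPUs (or many nodes).
n, iters = 4096, 100
grid = np.zeros((n, n))
grid[0, :] = 1.0          # boundary condition on one edge

for _ in range(iters):
    grid[1:-1, 1:-1] = 0.25 * (
        grid[0:-2, 1:-1] + grid[2:, 1:-1] +
        grid[1:-1, 0:-2] + grid[1:-1, 2:]
    )

print(float(grid.sum()))
```

To spread this across GPUs you run it under the Legate launcher (for example, something like `legate --gpus 8 stencil.py`; the exact launcher flags depend on your Legate installation).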
What they've done now is take Legate and apply it to the JAX framework, another framework for differentiable computing that many of you have probably heard of. JAX is heavily used in machine learning and AI, of course, but it's really a framework that can run more or less arbitrary simulations; it's another differentiable computing system, similar to Warp, which I was showing you a moment ago. JAX is built on the XLA compiler, which takes all the different layers of JAX and compiles them down to a particular target. So the Legate team has integrated Legate into JAX at that compiler level, at the XLA level. Your JAX program does not change; its structure stays the same. You mark up a few things, indicating with decorators and configuration what the pipeline stages of your program are, and I think they'll be able to push that into the compiler in the future. This plugin to XLA then takes your code, maps it across the Legate runtime, and lets it scale. My friend Wonchan has a talk on this where he goes into far more detail, because I only get two or three slides per topic, but comparing it against Paxml and Alpa, which are common distribution frameworks for JAX, the scaling and the ease of use are very impressive. Go and check his talk out if you're a JAX programmer, because that kind of scaling is just such a powerful thing to have.

At the same time, for scaling across these big systems, and oddly I'm reusing another slide from last year just because it's a good description, the Nsight Systems team has put an enormous amount of effort into distributed system analysis. Putting a breakpoint on a GPU is hard enough with a quarter of a million threads: figuring out how to make a tool stop a quarter of a million threads and tell me something useful is incredibly difficult. Now scale that up to thousands of machines and there's just no way to do it the old way. You need new tools, and they've really invested in them. I showed you some of those before, and I've got a quick picture again, but a key piece is that they've taken these large multi-node, distributed tools and they can now embed them not just in the Nsight Systems main viewer but in your Jupyter notebook as well, so your tools are available in the place where you're writing your code.

And again, it's all about those building blocks up and down the stack. It's amazing: they take vast amounts of data and boil it down to a picture. In this case I've got a heat map showing how GPU utilization and communication are, or are not, overlapping, so I can find compute-only zones where there are opportunities for asynchronous communication. And again, it's all about energy: if my communication and my compute are working together, everything finishes faster than if I run them one after another with the hardware at high power for twice as long.

At the other end of the scale from Legate, but still very much about large systems, is something called NVSHMEM. We've had this for quite some time; it's been around for several years and keeps evolving, with new things coming into it all the time. There's a whole talk by my friend Jiri covering all things multi-GPU programming; he's one of the best speakers I know, and his talk is absolutely worth seeing. What NVSHMEM gives you is low-latency, fine-grained control over the overlap of compute and communication. It's one of those things that sits underneath a lot of what you use without you really knowing you're using it. But what I'm actually going to tell you about is the thing that sits underneath that, because it's really interesting. On my pyramid, these NVSHMEM pieces fit down at the bottom level: something maybe a hundred people use directly, but which affects a million people through all the layers above.

One of the technologies deep down inside this is called GPUDirect, and I've put together a quick sequence explaining what it is. I've got data being produced by a GPU, and I've got to get it to the network, and the network has historically been a peripheral attached to the CPU. In the past, before GPUDirect, my GPU would generate data and I'd have to go through four steps to get it onto the network: synchronize, copy a couple of times, trigger things. Four hops to get my data from the GPU onto the network. GPUDirect came along and said: this is ridiculous, especially for the amount of data we're moving; let's move the data directly to the network device. That eliminated a hop, and with a direct copy path, GPUDirect let me generate my data and send it straight from the GPU to the network card. That's very powerful, but it still keeps the CPU in the loop. So they came up with GPUDirect Async, one of these evolutions that happen over the years as these technologies improve. Now I've got roughly a two-and-a-half-step process: the CPU does the setup, but the GPU triggers the transfer. The data moves automatically and directly, and there's a CPU proxy that handles the triggering, but it's now controlled by the GPU. The GPU program doesn't have to stop so that data can be sent; it can keep going and just signal: now send it, now send the next one. And finally there's GPUDirect kernel-initiated communication, which takes the CPU out of the picture entirely. This is a truly two-hop process, and you can never get fewer than two hops.
You've got to first prepare the transfer and tell the network it's coming, and then you stream the data out onto the network; two is the lowest number of steps you can get. So we've gone from four, to three, to two and a half, to two, and this embeds everything entirely in the kernel. The result is incredible. This is a training run of a graph neural network, and I'll explain more about those later. The middle line is the two-and-a-half-step process, and that's already 20% faster than the vanilla, non-GPUDirect path. But once you put everything on the GPU and cut the CPU out of the picture, I no longer have CPU threads waiting and polling and trying to orchestrate everything; it's all coming straight out of the GPU that is producing and sending the data. On this particular training run, that's a factor-of-two speedup end to end, and in terms of the feature transfer, the movement of the data you actually care about, it's an order of magnitude or more. The power of making communication more streamlined and more autonomous, and I don't mean power in watts in this case, the potential of it, is enormous.

And these things sit there and quietly plug into something like NVSHMEM, and into something like NCCL, which rests on top of all this. NCCL is the thing that moves all your data between GPUs when you're doing any kind of communicating multi-GPU job. Small messages are the hardest: one-byte messages are extremely difficult because you're paying a lot of overhead for a small amount of data, and on the left-hand side you can see this cuts your latency considerably. On the right-hand side you get much more bandwidth, because again you're cutting out overhead and you can communicate much more efficiently. And again, tools integration everywhere, because it's so important to be able to see what's going on: we've got NVSHMEM and NCCL traces built into the tools.
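Most people meet NCCL indirectly through a framework. As an illustration (my own minimal example, using PyTorch's standard torch.distributed API rather than anything specific to this talk), this is the kind of multi-GPU all-reduce that ends up running on NCCL underneath:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Each process drives one GPU; the "nccl" backend routes the collective
    # over NCCL (and hence NVLink / GPUDirect paths where available).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    t = torch.ones(1 << 20, device="cuda") * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # sum the tensor across all GPUs
    print(f"rank {rank}: first element = {t[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```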
So now I want to talk about the thing I've had the most questions about over the last year, which is Grace Hopper: the programming model and the way you program those machines. The philosophy of CUDA has always been that we have a single program constructed of, effectively, a GPU function annotated with __global__ and a CPU function, all in one program. It's a heterogeneous program. It's not two separate things; it's one program with functions running in two different places. And this relates to something Jensen said to me a few weeks ago: it's not that you're replacing serial work with parallel work, it's that you're extending it. You need both, and you want to do both. So the idea is that CPU code runs on the CPU and GPU code runs on the GPU, and between them, historically, we've had the PCI bus. Even though you've got these very high-speed memories on either side, the PCI bus has historically been a bottleneck. The obvious thing to do, which we did with Grace Hopper and talked about last year, is to combine them with the NVLink C2C connection, which is many, many times faster than PCIe, so my data transfers go through much better. That's what the Grace Hopper machine is. But it's not just a device with a very fast interconnect.

It can be used that way, but I think that's missing the point of what this is all about. The reason I love this thing is that you've really got one processor with two characteristics, natively two different things. I've got two memory systems, each optimized for its own processor. My CPU is a latency processor: it has deep caches and it's very fast at serial, linear operations. My GPU is a throughput machine: it has very high-bandwidth memory and very high-bandwidth caches. And the way you treat these two things is different, because the way they run code is different. On one of these Grace Hopper machines, it's a single unified memory system with two different ways of executing. If I've got something like a linked list, run it on the CPU; it's much better at that. If I've got something like a parallel reduction, run it on the GPU; that's what it's for. And I can pick and choose, just as my program is a hybrid of two things: I can run each piece in the right place, because these two systems are unified with one address space. So it's more than just the fast link. The GPU can see and modify and touch the CPU memory, and when it does, we can detect that and migrate the data over to the GPU, so the GPU gets the benefit of its very high-bandwidth caches. That can be as much as a factor of ten in performance if you're touching that data all the time. The ability to combine a single address space with intelligently moving things around while you're working on them is unbelievably powerful: it lets me put the compute and the data where they need to be. And at the same time, the migration doesn't cut off the CPU. It can still access and touch and see that data, with a little extra latency, of course, because it's going over the link. Really, it's one machine, and that's the point I'm trying to get to.

Here are some results which, very generously, I'm able to show from Thomas Schulthess's talk. He's the director of CSCS, which runs the ICON code on the newly brought-up Alps machine, a Grace Hopper system in Switzerland, and it's a fantastic example of exactly what I was talking about. There's a simulation here with an ocean model running purely as CPU code and an atmosphere model on the GPU, in green, and the coupling between them is extremely tight, so you're moving data around a lot. Historically, you've more or less been limited by the performance of the CPU code. But on a machine like this, you're really able to run both at the same time: the CPU code on the CPU, the GPU code on the GPU, with the very close coupling and exchange of data handled automatically. The result is a factor-of-three speedup, at the scale of 64 GPUs, and this is the kind of thing that determines how many days ahead my weather forecast can go, really important things that affect everybody.

At the same time, there are other great examples. My colleague Matthias has a talk about this one: fine-tuning of language models. A language model is a series of transformer layers, and as you go through those layers in the forward pass during training, you generate intermediate tensors. There can be a large number of layers, and therefore a large amount of data, so typically what we do is throw that data away and then recompute it all on the way back: we double our computation in exchange for saving memory. But with a Grace Hopper device, I can actually cache some of it instead. The small tensors aren't worth throwing away, so I keep them on the GPU, and the larger ones, the blue ones on the slide, I offload and save in the Grace memory, because, remember, it's all one giant memory system. Then on the way back, I pull them back in from Grace memory and I don't have to do the recomputation. The result is a 20% speedup. In this particular example it's a ten-million-parameter mixture-of-experts model, and on the left, the light green is offload and the dark green is recompute. The rest of the computation is of course the same for both, but on Grace Hopper, if I offload the data instead of recomputing it, I gain that time back, because I've got this very tightly coupled memory system that lets me do it.
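Frameworks already expose this offload-instead-of-recompute trade. As a hedged sketch (my own example, not Matthias's setup), PyTorch's torch.autograd.graph.save_on_cpu context manager stores saved activations in host memory during the forward pass and brings them back for the backward pass, which is exactly the kind of traffic a tightly coupled CPU-GPU memory system makes cheap:

```python
import torch

# A small stack of layers standing in for transformer blocks.
model = torch.nn.Sequential(
    *[torch.nn.Linear(4096, 4096) for _ in range(8)]
).cuda()
x = torch.randn(64, 4096, device="cuda", requires_grad=True)

# Forward pass with saved activations offloaded to (pinned) host memory
# instead of being kept in GPU memory or recomputed later.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = model(x).sum()

# Backward pass: the saved tensors are copied back from host memory on demand.
y.backward()
print(x.grad.shape)
```

On a conventional PCIe system this trade-off may or may not pay off; the point of the Grace Hopper numbers above is that the faster, coherent link shifts the balance toward offloading.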
Another example I see a lot of these days is graph neural networks. Graph neural networks are the kind of thing financial institutions use to analyze whether your credit card has been used fraudulently: massive interconnections of information. The GraphSAGE model is a primary model for using neural networks on graphs, and this is a simple walkthrough of how it works. My friend Joe Eaton has a whole talk on this, so again, he's the expert and I'm just the messenger. But basically, you sample your neighborhood, and you've got these little convolutional networks that run at all these different types of nodes. The challenge with a graph network is that it's not one single, contiguous collection of data that I'm operating on: my entire universe could be touched through any edge between any two nodes in the graph. So I have a massive pile of data which is accessed completely randomly. I might touch only 10% of it at any one time, I don't know which 10%, and it will be different on every iteration. What I need is a big pool of very fast memory that I can access randomly as I go through the flow of the GraphSAGE model. Putting this on Grace Hopper has been an incredible performance improvement. Previously I spent a lot of my time fetching data and moving things on and off the GPU; now, with this unified pool of memory, you're looking at, again, a factor-of-two speedup. These are huge. A factor of two is like a generational speedup in most codes. You can spend ages, a whole PhD, getting a 20% speedup in something. This is a factor of two because a new architecture can do new things.

So finally, from one form of graphs to another. I must admit this next part is a little personal. As an engineer, you plan something, you design it, and CUDA graphs is something I started designing several years ago. You have all these ideas, and it takes a lot longer than you think to get where you're going. The idea of CUDA graphs, which I've talked about a few times before and hopefully you know, is that you define your workflow up front, and then I can issue a single launch operation to launch an arbitrary amount of work. So it can be a very fast way of putting work onto the GPU, and you see really good improvements in launch overhead.
But it's a lot more than just a fast way to launch work. I actually went back and found my slide deck from 2018. It was for GTC, for conversations with developers, just asking: could this be useful to you? And I grabbed some of those slides because it's so interesting to see what I was thinking at the time and where it has finally ended up. There's a quick description of task graphs, where you have these nodes which can be different kinds of things, and that's largely what we built. And I had this sequence about task graph properties: they're reusable, I can launch them over and over again, I define once and run many times. But then: cyclic. I wanted a graph not to be just a straight-line flow of dependencies. Why not be able to jump back to the beginning? Why not have a dynamic graph, where node B could decide whether to go to C or D based on some data it computed? Data-dependent, dynamic control flow. And then finally hierarchy, which is a key part of any graph system. These are literally my very first slides on graphs, saying: here is what I want. And we've finally built it, six or seven years later, however long it's been. So let me tell you about the thing we built, because it is everything I had in my mind about how these things would be used, and I think it opens the door to a lot of potential.

What I've got on the left is an incredibly trivialized version of conjugate gradient, a gradient-descent kind of method and a very standard way of solving a system of linear equations; it's just pseudocode. The key part is that there's an iterative loop: a main loop where I do some work, and I run that loop over and over again until I have my solution. Traditionally with CUDA graphs, the idea is that I take that loop body and turn it into a graph, and then I run that graph many times. My program starts looking very simple: instead of issuing all of these different operations, I have one launch call. This is great, this is how people use graphs today, and it speeds things up very effectively. But the challenge is the data-dependent execution: iterate until converged is an almost universal pattern, and the iteration requires reading the result back and deciding whether to run the while loop again. I keep having to stop my program and copy data back just to evaluate "while residual is greater than epsilon" before I can do the next launch.

So now we're moving that data-dependent execution to the GPU. I take the main loop, and I create a graph using new node types, two of them, an if node and a while node, and I'll tell you about them in just a moment. Now I can put the while on the GPU: the convergence check is done without coming back to the CPU, and my program no longer has a main loop at all. The main loop has moved entirely, dynamically, onto the GPU, and I just launch a conditional graph, if you want to call it that. My program is much simpler, and my CPU is out of the picture. I can run a hundred of these independently, all at the same time, because I no longer need CPU threads to manage them.
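For contrast, here is roughly what the pre-conditional-node pattern looks like from Python, sketched with PyTorch's CUDA graph API (torch.cuda.CUDAGraph). This is my own toy fixed-point iteration, not the conjugate gradient code from the slide: the loop body is captured once and replayed cheaply, but the CPU still has to synchronize and read the residual back each iteration to decide whether to keep going, which is exactly the round trip the new if/while nodes remove.

```python
import torch

# Toy fixed-point iteration x <- (b + x) / 2, standing in for a real solver body.
n = 1 << 20
b = torch.randn(n, device="cuda")
x = torch.zeros(n, device="cuda")
residual = torch.zeros((), device="cuda")

def body():
    # One iteration of the "solver", plus a residual for the convergence test.
    x_new = 0.5 * (b + x)
    residual.copy_((x_new - x).abs().max())
    x.copy_(x_new)

# Warm up on a side stream, then capture the loop body into a CUDA graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    body()
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    body()

# CPU-driven convergence loop: replay the graph, then sync and check.
eps, iters = 1e-6, 0
while True:
    g.replay()                      # launch the whole captured body in one call
    iters += 1
    if residual.item() < eps:       # .item() synchronizes and copies back to the CPU
        break

print(f"converged after {iters} iterations")
```

With the conditional while node described next, that convergence check itself lives inside the graph, so the CPU-side while loop and its synchronization disappear entirely.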
The way it works is that a conditional node is just another type of graph node, but it's a node that is either an if or a while. Inside an if node, the graph evaluates the condition and then either runs a subgraph or doesn't. Remember, graphs are hierarchical; that was one of the things on my very early slides. So I've got these conditional nodes which encapsulate what to do if the condition is true. And because graphs are hierarchical, you can nest them: I can have a conditional node inside a conditional node, to any depth I want. So I can have an if node that says, if something happens, go and run this, and inside it a while node that iterates continuously, and all of that can be described 100% inside my task graph. A lot of people ask me why we made graphs use control dependencies instead of data dependencies, and this is the reason. This is why we built it with control-flow dependencies: because you want to be able to say things like while and if, which dataflow does not let you do. And there are other constructs you can build. The thing on the right is like a switch: multiple ifs, if x, if y, if z, like a switch with cases. If and while are the fundamental building blocks, and maybe we'll optimize switch itself later to make it more efficient. But fundamentally, you can now describe a fully dynamic workflow on the GPU without having to return to the CPU to do the hand-holding control. And that's very much the theme of where we're moving things: reduce the amount of communication, keep the GPU busy, and keep the power and the computation as efficient as we possibly can. And this, finally, after six years, came out a few weeks ago with CUDA 12.4. It's so nice to be able to stand up and show you this thing that has been in my head forever; it turns out you have to build a lot of other things before it can work.

And so that's it. That's what I've got. Here's the list of references for everything I've told you about, because I'm just the messenger: all I do is tell you about the amazing work everybody else around the company is doing, and I get to stand up here and talk about it. This list is shared in the PDF of the slides, so if you want to go back and stream some of these talks, or attend them in person, these people are really worth listening to. Thank you very much.

That's pretty awesome, Stephen. We have a few minutes to take some Q&A. We have microphones up at the front here, so if you have any questions, feel free to ask. We also have a couple of online questions, so I'm going to start with the first one: are there CUDA APIs available for us to measure power consumption as our programs run on GPUs, and to break down how much is due to compute, memory access, or networking?

That's a hard question. Power consumption is a system-level thing, so you need system-level APIs. What we have is a monitoring system called DCGM, which lets you monitor your data center and all of its nodes in real time and see the utilization and the power of all these different parts. But you have to use that to collect the data across your system and then attribute it back yourself; there's no way to extract the power of a single CUDA function purely from that, because power is a system-wide quantity that depends not just on compute but on memories and buses and all sorts of other things.
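For per-GPU numbers at the process level, the NVML library that DCGM builds on is also accessible from Python via the pynvml bindings. Here is a small sketch of polling board power while your code runs elsewhere; note this reports whole-board power, not a per-kernel or per-subsystem breakdown.

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Poll board power (reported in milliwatts) and utilization once per second.
for _ in range(5):
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"power: {power_w:6.1f} W   gpu util: {util.gpu:3d}%   mem util: {util.memory:3d}%")
    time.sleep(1.0)

pynvml.nvmlShutdown()
```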
Great, we have questions here in the room; go ahead. I have a question about effectiveness: for a company, for example, if we combine multiple workstations, CPU and GPU, everything into one server, how much energy can we save? So the energy saving is going to be very algorithm dependent, of course. But typically the most expensive thing in any system is communication; it's moving electrons around. So the more you combine into a single, localized space, and this is why you see density increasing in data center racks, the less energy it takes, because moving electrons a few inches costs much less than moving them meters, if you'll forgive me mixing my units. So in general you will save energy, but exactly how much really depends on your algorithm. It's hard to predict; you would need a model of your system.
Info
Channel: NVIDIA Developer
Views: 4,559
Id: pC0SIzZGFSc
Length: 50min 8sec (3008 seconds)
Published: Tue Apr 09 2024