Now let's get started
with our session. This morning I am proud to
introduce Stephen Jones, an NVIDIA distinguished software
architect, presenting the latest and greatest on our
CUDA parallel computing platform. He's been with NVIDIA for
over 14 years, starting as a software engineer, then senior software engineer, on CUDA in 2008. He had a brief stint at SpaceX, where he met Elon, and he may share a little bit about that; it was a good diversion. He also brings his aeronautical and aerospace engineering background, with a master's degree from the University of Cambridge. And with that, I'm going
to introduce Stephen to talk about CUDA. Thanks very much all for coming. It's amazing. I'm sure everyone
says this, but I haven't been to a talk yet. It's just amazing to actually
talk to people, instead of to a camera above your screen
when you're recording a talk. So it's really nice. Thank you all for coming here. I'm one of the architects
of CUDA, so I spend my time thinking about CUDA, the platform,
the language, all the systems and stuff that goes with it. And also, I work closely with
the hardware teams. I spend probably half my time, even
though I'm a software guy, working with the hardware teams, working
on what our next-generation GPU, and the one after that, is going to be. Because one of the magical things
we get to do is we get to combine the hardware and the software
so that we build the programming model we want, and that we can
program the hardware that we build. Now, I'm going to start
today with something which I've learned working with
the hardware teams a lot. And I think it really drives
a lot of the way of thinking about the thing which I'm
talking about today. And also, just a lot of the
way I think about how hardware is driven by the laws of physics
and by the constraints that we are all living under in terms of the hardware that we design, as well as the software that programs it. So, this is possibly a contentious
thing to say at NVIDIA that accelerated computing is not
all about performance. But if you're in the keynote, if
you watched the keynote yesterday with Jensen, he was really talking
about how it's all about energy. It's not just the performance, it's
about the performance per watt. Because ultimately, you've
got to power these things, you've got to provide the
energy into the machine, and so the efficiency is really the key
metric that you have to care about. Yes, you want to be scaling
performance, but you've got to scale it with efficiency as well. And the obvious place I went to
look for this, just doing a bit of research for the introduction of
the talk, you know, I went around looking at data centers, right? They're building data centers at
an enormous rate in the world; they're standing up something like six or seven data centers a day. But I went looking for the number of how many data centers they build, and nobody lists data centers by number, right? They list data centers by megawatt. Because power is the thing that is
important in a data center, right? There's five gigawatts worth of
data centers in North America right now, there's going to
be another three gigawatts standing up in the next year. And when you go and buy time on a data center, you rent it, and you're charged by the kilowatt per month. Nobody cares how many
servers you're buying. Nobody cares how many data
centers you're renting. You are renting power per month. Because power is the metric
that really matters for computing, right? And if you look at what a
data center typically is, a medium-sized data center
runs you maybe 20 megawatts and that's because I've got this
big building with a giant power connector coming in and that power
provides 20 megawatts of power. So if I build a brand new
chip that goes twice as fast but it runs at twice the power,
I can't magically put more power into this room, right? I end up with half the number of chips. I've got 20 megawatts. The question is what can I
do with my 20 megawatts? And again, if you were watching what Jensen was saying about Blackwell at the keynote, he talked a lot about the energy, the power efficiency,
and that is a really big focus. And on the hardware side,
that's something that they're all thinking about. Every transistor really counts. But it's not just in the
data centers, it's also in my own home, right? My desktop machine gets power from the wall. I can't put 10 GPUs in it. I can't run that out of my one and a half kilowatts if I'm in America, or three kilowatts if I'm in the UK. My maximum system power on my
laptop is even smaller, right? So everybody is constrained
by power more than anything else. So it is really all about energy. And the challenge we have is that
the energy equation, the energy balance, is getting worse, right? On the left-hand side here is the very well-known chart, if you like, of Moore's law; it's a bunch of numbers which looks at the transistor density from when Moore stated his law back in about 1970. And it's pretty much been
going exponentially. It's a log plot on
the left-hand side. And on the right-hand side, I
went to TSMC's website and I just pulled up all the information about their different chips and their transistor density, and that line has been, of course, increasing exponentially as well. But something else you see
when you look through that data is that if you look
at the power efficiency scaling, as I shrink my transistors, I need
fewer electrons to turn them on. And so they take less power,
but the power is not scaling as fast as the transistor count. And that is a problem in a world
which is energy constrained, because when I keep adding
transistors, I've got to do something about the power. And so while obviously we look
very closely at the hardware and the efficiencies of hardware,
it's critically important. I mean, obviously I say this,
I'm a software guy, but it is critically important to
look at this in terms of the efficiency of the software as well. And so I'm going to be talking
about a couple of really key pieces where power comes from. One is data movement and the
other is the computation, the two obvious users of electricity
and power in one of these machines. Right. And so starting with
computation, let's talk for a moment about floating point
arithmetic, because that's really the core of computation
that is being done, right? In fact, most of these data
centers running GPUs, probably overwhelmingly a lot of the
flops and a lot of the watts are being spent on things
like matrix multiplication. And so I went and I dug up a
little table here, you know, on the left hand side, we've
got all these different precisions that NVIDIA offers. And there's lots of reasons
for that, but I'm going to get into a particular one
focused on power right now. And on the right hand side, I can break this down into the energy cost of doing multiplications at different precisions. The FMA is a fused multiply-add; it's the fundamental unit of arithmetic operation in a computer. And if you look at the top,
you see the standard flop, the 32-bit single-precision flop, to which I've normalized everything at 1x; double precision is about two and a half times the power, and half precision is about half the power. The key here
is that the higher precisions don't just scale linearly: floating-point multiplication scales as the square of the mantissa length. That's the blue section on the left hand side, right? So the longer my number is,
the more power it takes to compute that number because
I've got more decimal places, I've got more bits to move around. And then you look at the
tensor cores, right? The tensor cores are
completely different. They take these single operations and group them together with an economy of scale, and you see dramatic changes and improvements in energy per flop, right? So this is one of the reasons
you see these investments in these tensor cores, in
these dense processing units is because I've got 20 megawatts
in my data center, and I want to cram in all the flops that I can get. And this is the kind of place it's coming from. And so you can look at these
interesting balances, right? The tensor core at FP64, the double
precision tensor core, and this is looking at Hopper H100 numbers,
is more efficient than individual multipliers and that's the economy
of scale I was talking about. But if you look at the difference
between say 16 bit, right? Instead of being one and a
half times more efficient, it's four times more efficient
because again, it's that square of the mantissa bits, which
is winning me the power. But I'm going to look at something
which I think is a more interesting thing to look at, which is
the difference between these 64 bit operations, which is
two and a half times the power of single precision, and these tensor cores at reduced precision, which sit about a factor of 20 away in power efficiency. But of course, there's a very
big difference between a 64 bit precision and a 16 bit
precision number. So it turns out there's not as much
of a difference as you would think. And this is not new work,
but there's a lot that's been going on about it recently. I really wanted to tell you
about this because I think this to me is a really exciting
way that we are attacking the power wall, the power problem
that I see from those curves, from the silicon processes, right? A couple of years ago, I was telling you about some work from my colleague Azzam Haidar (there's a reference to the paper down below), using tensor cores to do an LU factorization matrix solve, where you do the heavy lifting in 16-bit and then you have this iterative process, this GMRES-based refinement, to take your 16-bit results and progressively improve the accuracy back up to 64-bit values.
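To make the idea concrete, here's a minimal sketch of that refinement loop in plain C++ (my own toy example, not Azzam's code): the low-precision solve is stood in for by a float Gaussian elimination, while the residual and correction are done in double.

```cpp
// Mixed-precision iterative refinement sketch: do the "heavy" solve in low
// precision (float here, standing in for FP16 tensor-core math), then correct
// in high precision (double) until the residual is at FP64 levels.
#include <cstdio>
#include <cmath>

const int N = 3;

// Naive Gaussian elimination in float: the low-precision "workhorse" solve.
void solve_lowprec(const float A[N][N], const float b[N], float x[N]) {
    float M[N][N + 1];
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) M[i][j] = A[i][j];
        M[i][N] = b[i];
    }
    for (int k = 0; k < N; ++k)
        for (int i = k + 1; i < N; ++i) {
            float f = M[i][k] / M[k][k];
            for (int j = k; j <= N; ++j) M[i][j] -= f * M[k][j];
        }
    for (int i = N - 1; i >= 0; --i) {
        float s = M[i][N];
        for (int j = i + 1; j < N; ++j) s -= M[i][j] * x[j];
        x[i] = s / M[i][i];
    }
}

int main() {
    double A[N][N] = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};
    double b[N]    = {1, 2, 3};
    double x[N]    = {0, 0, 0};
    float  Af[N][N], rf[N], dxf[N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) Af[i][j] = (float)A[i][j];

    for (int iter = 0; iter < 5; ++iter) {
        // High-precision residual r = b - A*x
        double r[N], rnorm = 0;
        for (int i = 0; i < N; ++i) {
            r[i] = b[i];
            for (int j = 0; j < N; ++j) r[i] -= A[i][j] * x[j];
            rnorm += r[i] * r[i];
        }
        printf("iter %d  residual %.3e\n", iter, std::sqrt(rnorm));
        if (std::sqrt(rnorm) < 1e-14) break;
        // Low-precision correction solve: A * dx = r
        for (int i = 0; i < N; ++i) rf[i] = (float)r[i];
        solve_lowprec(Af, rf, dxf);
        // High-precision update
        for (int i = 0; i < N; ++i) x[i] += (double)dxf[i];
    }
    printf("x = %.15f %.15f %.15f\n", x[0], x[1], x[2]);
    return 0;
}
```

Each pass recomputes the residual at high precision and solves only for the correction at low precision, which is why the cheap arithmetic can still deliver full FP64 accuracy.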
In the paper they look at this, and this graph shows along the bottom the number of iterations. I take these 16-bit tensor cores, which actually output a 32-bit result, and then I iterate progressively three, four times; there's that line in the middle showing what the FP64 accuracy is, and I get closer and closer to the final result, and finally I exceed the accuracy of a true native 64-bit number. So this is not in any way
compromised on the results, this is exactly accurate to the bit of what
you would get, except I've done the heavy lifting using those much more
power efficient tensor cores, and that's a really big deal, right? Because here's a chart of
something Azzam kindly ran for me this weekend: running, in this case, an LU decomposition solve, the thing that was in that flow chart before. They're both GH200 results: the green line is the 16-bit-plus-64-bit approach, the middle column on this chart, and the blue line is pure native double precision. And so you can see what you're
getting, you're not only getting the power benefit on the bottom,
which is huge, right? Almost a factor of six
improvement in flops per watt, which is unbelievable. I can now do six times more
flops in that power-limited data center using this technique
than I could if I had natively got the exact same result
using double precision numbers. And at the same time I'm going
almost four times as fast, right? So I'm faster and
I'm more efficient. This is amazing, this is huge. This algorithm is actually implemented in the cuSOLVER library, but I see this reaching
out into everything, right? If I can do work faster and
more efficiently, this is one way we can attack this
power wall that I see coming. And it's not just Azzam's work; there are other people. My longtime friend Rio Yokota, at the University of Tokyo, and some colleagues wrote a paper looking at a completely different approach, but again using low precision, in his case integer tensor cores, to perform matrix multiplication. And some of our genius guys
at NVIDIA implemented it, but what they did here, instead
of just taking one of the GH200, like mega powerful
chips, they took the L40 or the L40S, and that's sort of the lower
power data center part which does not have a native double precision
tensor core in the first place. And using the 16-bit tensor cores that are in the L40S, they were able to run matrix multiplications at six, seven times the performance, without having a proper double-precision unit; well, there is one, but it's not the high-power, high-performance double-precision unit that you would normally find in an H100.
In fact, they compared it to an A100, and the chart was too busy so I didn't put it on, but it's half the performance of an A100 while using no double-precision tensor cores at all, which is absolutely incredible, right? This is opening the door to parts with much lower power being able to achieve, you know, 50% of the performance, which is incredible, right? And not only that, but the
power savings, right? At the same time as you're getting a factor of six or seven in performance, you're getting a factor of seven or eight in power efficiency, in performance per watt. So this is huge, and so this
fascinates me, like, I'm very lucky with this talk, I get to just
find all these cool things that are going on around the company
and tell you about them because I just find them interesting, and
this is one of the things I think is really fascinating, because there are so many things we can do with this type of technique. Now, tensor cores themselves:
a lot of people come and ask me, how do I program tensor cores? And tensor cores are a complex system, right? They have all these different precisions, they have these different ways of using them. But there are three main flavors of ways you get access to the tensor cores. First, there's the cuBLAS math library; that is your basic workhorse that has existed since the very beginning of CUDA. It's linear algebra APIs: you call a matrix multiply, and that naturally, automatically pipes through onto the tensor cores. cuBLAS actually calls the one in the middle, called cuBLASLt, which you can also access yourself; it's a public library, and it gives you advanced APIs where you can control a lot more aspects of what the tensor cores are doing. The tensor cores have a lot of different configurations, a lot of different modes, and you can really get access to them. And on the right hand side, we have something called CUTLASS, which, if you've seen me give this talk before, I talk about probably every year, because it really is the programmer's way of getting at the tensor cores. It lets you write tensor core code inside your own kernel and get access to all of the different knobs and configurations that tensor cores have.
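As a concrete example of that first, productivity path, here's a minimal sketch of a mixed-precision GEMM through cuBLAS (device buffers for A, B, and C are assumed to already exist, and error checking is omitted); with FP16 inputs and FP32 compute, the library routes this onto the tensor cores for you.

```cuda
// One cuBLAS call and the GEMM lands on the Tensor Cores: FP16 inputs,
// FP32 accumulation and output (column-major, no transposes).
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16(const __half* A, const __half* B, float* C, int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,      // FP16 input matrix A
                 B, CUDA_R_16F, k,      // FP16 input matrix B
                 &beta,
                 C, CUDA_R_32F, m,      // FP32 output matrix C
                 CUBLAS_COMPUTE_32F,    // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);

    cublasDestroy(handle);
}
```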
So I drew this out for myself in a different way, because really there's a productivity dimension with cuBLAS on the left, where I call one API and I get the peak acceleration, and then there's a control dimension on the right hand side if I really want to start tweaking it, and merging it, and meshing it with my data. And so one of the things that the
math libraries have done is they've been working on what are called device extension libraries. The cuBLAS device extension library, cuBLASDx, brings the cuBLAS on the left hand side, the productivity path, into your device kernel. So while CUTLASS is a hierarchy of C++ classes which gives you incredibly fine-grained control, there's a completely different approach on the cuBLASDx side, where the idea is that you can get your tensor cores activated in your kernel with just a single GEMM call, just like you would do with cuBLAS from the CPU. And so why do you want to do this? Well, you want to do this because
sometimes you don't just want a matrix multiplication, you want to
then do something with the result. That's what we call a fusion. You take some data, you manipulate
it in some way, you do a few big matrix operations, and then
you use the result in some way. And by fusing all of these together (this is a chart of taking a pre-processing step, two matrix multiplies, and a post-processing step, and fusing them all into one kernel), the difference between doing that in one kernel and sequencing it as a series of calls, using Thrust in this case with cuBLAS, is a factor of three in performance. So being able to take the
same simplified API, put it inside your kernel, customize
it in the way that you want, also comes out with performance. And again, I'm not showing
perf per watt in these cases, but all of these cases are
reaching peak performance, but typically at lower energies. Same is true for FFT. I actually showed this last
year because they've been working on the device extension
libraries for FFT for some time. FFT also, again, fusions of FFTs
with the rest of your operations. In this case, I'm fusing three
kernels into one. And again, you see these speedups. And so a lot of this comes
from this fusion thing where I've really customized my kernels
in ways that give me the ability to string lots of work together. I load data once. If you remember, I said there
were two reasons for power cost. One is data movement, and
the other is compute. This is solving the data movement
so that my compute densely applies to it without the
data movement going on in between. And so how does that work?
Well, with basic kernel fusion, and probably many of you in the room are aware of this, typically I'll have a sequence of operations. Maybe I'll do some precision conversion, I'll multiply things in a matrix multiplication, and then I'll run an activation function, a ReLU or something like that, on it. A very standard sequence of operations. And this is what those charts were showing just a moment ago. By fusing them together, you load your data once, you operate on it many times, and then you store your data out the other end, and you end up with a single fused kernel. And this is a great idea, and everybody should do it if they can.
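Here's a tiny sketch of the idea (my own toy example, with the matrix multiply swapped for a simple scale so it stays short): the conversion, the arithmetic, and the activation all happen in one kernel, so the data makes one round trip through memory instead of three.

```cuda
// Fused elementwise kernel: convert FP16 -> FP32, scale, then ReLU, in one
// pass, so global memory is read and written exactly once.
#include <cuda_fp16.h>
#include <cstdio>

__global__ void fused_convert_scale_relu(const __half* in, float* out,
                                         float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __half2float(in[i]) * scale;   // convert + scale
        out[i] = v > 0.0f ? v : 0.0f;            // activation (ReLU)
    }
}

int main() {
    const int n = 1 << 20;
    __half* d_in;  float* d_out;
    cudaMalloc(&d_in,  n * sizeof(__half));
    cudaMalloc(&d_out, n * sizeof(float));

    // One kernel instead of three (convert, scale, relu): one trip through
    // memory instead of three.
    fused_convert_scale_relu<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);
    cudaDeviceSynchronize();

    printf("done\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```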
The challenge is that I don't just have one thing to do, right? I might have a hundred different types of things to do. And so I've drawn four
on this slide because that's all I could fit. But even with four, I've got
64 possible combinations, and I can't build every single
one of them all ahead of time. If I had a hundred on every
row, I'd have a million different combinations. That's just not feasible. So
what I'm seeing is that as people build these codes which fuse things, they're also very often moving towards just-in-time compilation, runtime compilation, where you say: my program needs this, that, and the other units; configure them precisely for what I need, and then compile it on the spot and run it.
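A minimal sketch of what that looks like with NVRTC, the runtime compilation library in the CUDA toolkit (error checking omitted, and the kernel source here is just a stand-in for whatever fused operation you would generate):

```cuda
// Build kernel source as a string at runtime, compile it to PTX with NVRTC,
// and load it with the CUDA driver API.
#include <nvrtc.h>
#include <cuda.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // In a real fusion workflow this string would be generated from the
    // exact combination of operations the program needs right now.
    std::string src =
        "extern \"C\" __global__ void scale(float* x, float s, int n) {\n"
        "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "  if (i < n) x[i] *= s;\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "scale.cu", 0, nullptr, nullptr);
    const char* opts[] = {"--gpu-architecture=compute_80"};
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // Load the freshly compiled kernel through the driver API.
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoadData(&mod, ptx.data());
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");
    printf("JIT-compiled kernel loaded\n");
    cuCtxDestroy(ctx);
    return 0;
}
```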
And so I see JIT compilation being more and more important in people's workflows inside CUDA. And so our compiler team has
spent, well, this chart covers about 18 months, I think, from CUDA 11.8, consistently reducing the cost of JIT compilation, because very often, as I'm showing down there on the bottom left, you've
got this iterative loop, right? You build a fused kernel, you
run it, you get some data, you look at what you're doing
next, you build the next one, and you've got this iterative thing. The compile time becomes part
of your main program loop. And so they've worked really hard, and this is showing the compile time for Hello World, so it's basically just overheads. Hello World is the simplest
program you can possibly write. And so the overheads of compilation
have come down by a factor of six over the last 18 months. And so really, there's this
big focus on how fast can I iterate, how fast can I
compile, because JIT compilation is showing up everywhere, right? Now, JIT compilation, these
compilation tools, I talk often, if you see me talk
about this, about when I think of CUDA, I'm always thinking
of the entire platform, right? My job as one of the architects
of CUDA is to think about how all of these things fit together,
but nothing exists in isolation. And there's kind of an
inverted pyramid here. Very, very few people are writing
and programming compilers. There's a few of you, and we love
you, and we absolutely support you, and we have LLVM and all these
other things that you can target. But fundamentally speaking, you can
probably count on both hands the number of people who really sit down and start writing compilers. Above that, there are kernels, libraries, host-side libraries, and then this massive universe of frameworks and SDKs at the top, right? Now, one of the things which I'm
thinking a lot about these days, and that I pay a lot of attention
to, certainly over the last several years, is Python, right? Because when I look at the
world of Python developers, I think my pyramid is suddenly
much, much wider. Instead of having a million
users at the top, I've got 10 million users at the top. And so the gap between something that you can build at the bottom and the impact that it has at the top is even broader, right? So consider a change to the compilers,
like JIT compilation. JIT compilation is incredibly
important in Python because Python is this very runtime interpreted
language, and you're constantly generating data dynamically. And so a compiler in the loop
is completely normal. In fact, the Python interpreter
basically is one of those. And so these changes we make
at the very bottom affect enormous ranges of people. And so looking at the Python
stack, you have to invest everywhere all the way across it. And so I've listed a few things
here in terms of places that we are really looking at. But our goal, and I put it as the subtitle of this slide, is the vision that we have of where all of us in CUDA think Python needs to be, which is, as I say, a complete NVIDIA experience for the Python developer, right? The whole CUDA ecosystem
available and accessible to Python programming. One of the aspects of that
is that you're seeing our libraries and our tools start
supporting Python more and more. And so the math library teams have put a ton of work into producing a Pythonic interface, which natively and naturally connects Python applications to these accelerated libraries; I think, fundamentally, the libraries are the most common way that people access GPU acceleration. And at the bottom here, by the way,
through many of these slides, I've got links to other people's talks. And this is a link to my friend
Artie and Haroon's talk, where they're talking about everything to do with the libraries. And this is a big piece of it. And so if you ever want to
know more, there's an index list at the end of this as well. You can just go and follow up
and see what all the different talks are, which I've drawn from
the material in this presentation. But the Python libraries,
it's a full stack, which goes all the way from your application
through JIT compilation, through the different APIs, both CPU
side and GPU side, all the way down onto the underlying libraries: the GPU-accelerated ones, NVPL (the NVIDIA Performance Libraries, which target the Arm processor), MKL, anything else, right? So a universal front end for
the accelerated libraries. The other aspect of Tensor Cores that I was talking about before was CUTLASS, which gives you detailed configuration and control over the Tensor Cores. And CUTLASS has a Python interface as well. And on the left hand side
here, I've got just a couple of boxes, one showing what the
C++ interface looks like and the equivalent Python interface below. And you can go install it, you
can go and find documentation for it and so on. On the right hand side, they've integrated this with PyTorch extensions. And so you can emit PyTorch extensions from CUTLASS, and automatically bring custom CUTLASS Tensor Core kernels, written in Python, into PyTorch. There's a CUTLASS talk; it
was on the previous slide, actually the link for it. Go and have a look at the
Cutlass talk, which is going to tell you a lot more about
how this type of thing works. And as I said, we're not just
investing in libraries, we're also investing in tools. And so the developer tools
team for the CUDA platform, the Nsight folks, have been putting a lot of effort into being able to combine their output for both C++
code and for Python code, all at the same time in the same timeline. And so here on the right, I've got
an example of doing exactly that. Likewise, there are the code annotations, what we call NVTX, which let you identify code regions by annotating them. And so you can have a green region, a blue region, so it's much easier for you to find the regions that you want in complicated profiler traces. This is all configurable through JSON files, and it all works nicely with Python programs.
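For reference, here's what an NVTX annotation looks like on the C++ side (the Python nvtx package exposes the same idea); the named ranges are what show up as labeled, colorable regions in the Nsight Systems timeline. The function names here are just placeholders.

```cuda
// NVTX v3 is header-only in the CUDA toolkit: push a named range, do some
// work, pop the range, and the region appears in the profiler timeline.
#include <nvtx3/nvToolsExt.h>

void preprocess() { /* ... */ }
void train_step() { /* ... */ }

int main() {
    nvtxRangePushA("preprocess");   // open a named region
    preprocess();
    nvtxRangePop();                 // close it

    nvtxRangePushA("train_step");
    train_step();
    nvtxRangePop();
    return 0;
}
```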
There are just all of these different pieces, that pyramid that I was showing; you've
got to start putting the building blocks in all these different
places, so that ultimately you end up with an ecosystem that
works up and down across the board. As I said, I look around and
find these amazing things that people are doing. And one of the things
that's really caught my eye inside NVIDIA is Warp. My friend Miles Macklin, who is normally in New Zealand but is up here to give a talk about Warp this week, runs the team that built it, and it's a very special thing. It lets you write GPU kernels in Python, but they are differentiable kernels. And it naturally and automatically
takes the kernels that you have written. And with JIT compilation,
again, remember, as I said, JIT compilation is showing up everywhere, it can automatically produce the reverse-mode, differentiated version of your flow, so you can have a forward pass, it records it, and you can replay it as a backward pass. And so you can construct
simulation code, physics code, computational code in GPU-accelerated kernels; this compiles straight down onto the GPU and runs at full compiled GPU performance, but with this backward, differentiable pipeline available as well. And the things you can do
with it are incredible, right? So there's a whole compiler
chain inside of here, which takes in the Python and turns
it into PTX and runs it on the GPU. But it lets you do these things,
these amazing simulation things. His talk is down here, go
and check it out, because first, it's incredible technology. Second, he's doing it in the
realm of computer graphics, so he's got beautiful videos
and visuals as well. But this is an example of modeling something incredibly complicated, like the plastic system of tearing bread apart. And in the big one, the simulation and the ground truth look almost exactly the same. Being able to do this lets you teach a neural model to follow how something like this, some plastic deformation, functions and works correctly. Through auto-differentiation, you can run the simulation, the backwards differential path is used to train the model, and then the model can very, very quickly start producing amazing computer graphics like this, and amazing simulation results like this. Go and check out his talk. So last year, and I very rarely
reuse slides, but this slide nicely summarizes, I told
you about something called Legate. And I want to tell you a bit more, because it fits into a lot of the stuff that I've been talking about. Legate is a framework which takes
your basic single-threaded code and distributes it very widely across
a large number of machines, right? These machines are getting
bigger and bigger, you're processing more and more data. Programming these things gets increasingly hard. And this is what something
like Legate is for. It's a layer, it's a stack where
you have libraries on top, a runtime in the middle, and it runs
across the accelerated libraries across your whole machine. And last year, I showed
you the basic stencil benchmark using NumPy. Your NumPy code can switch to this thing we have called cuNumeric, which is a NumPy implementation based on Legate. And it automatically scales your
NumPy program, in this case, across a thousand GPUs, right? It's a pretty straightforward
stencil computation, but it's a very powerful tool. And so what they've done with
this is they've taken Legate and they've applied it to the
JAX framework, another framework for differentiable computing. Many of you have
probably heard of it. And the JAX framework, it's
heavily used, of course, in machine learning and AI, but it actually
is a framework that can run more or less arbitrary simulations. It's another differentiable
computing thing, similar to Warp, which I was showing you a moment ago. And JAX is based on the XLA
compiler, which takes in all the different layers of JAX
and compiles it down to a particular target. So what the Legate guys have
done is they've integrated Legate into JAX at that compiler
level, at the XLA level. So your JAX
program does not change. The structure of your JAX
program is the same. You mark up a few things and
you indicate with decorators and configurations about what
the pipeline stages are of your program, which I think
they'll be able to put in the compiler in the future. And then this plugin to XLA, the
compiler for JAX, then takes your code, maps it across all the Legate
runtime and allows it to scale. And so what they've done with
that, and my friend Wonchan has a talk on this where he goes into way more detail, because I only get to give you two or three slides on every single topic, is to just run it, comparing it against PaxML and Alpa, which are common distribution frameworks inside of JAX, and the scaling and the ease of use are very impressive. So go and
check his talk out if you're a JAX programmer because scaling
is really just such a powerful thing to be able to do. At the same time, the scaling
across these big systems, and again, oddly, I'm reusing another slide from last year just because it's a good description, the Nsight Systems team has spent an enormous amount of effort working on their distributed system analysis, right? Putting a breakpoint on a GPU
is hard enough with a quarter of a million threads and figuring
out how to make a tool break a quarter of a million threads
and tell me useful information is incredibly difficult. And now I scale this up to
thousands of machines, there's just no possible way. So you need new tools and they've
really invested on these new tools. And I showed you some of those
before and I'll show you, I've got a quick picture again, but
a key piece that they've done with this is they've taken these large
distributed tools, these multi-node tools that they've got, and they can now embed them not just in the Nsight Systems main viewer, but in your Jupyter notebook as well. And so your tools are available
at the place that you're writing this code. And again, it's all about
those building blocks across, up and down the stack, right? And it's amazing, they take
vast amounts of data and they can boil it down to a picture. In this case, I've got a
heat map showing how the GPU utilization and communication
are or are not overlapping. So I can find compute only
zones where I have opportunities for asynchronous communication. And again, it's all
about energy, right? If I have my communication and my compute working together, everything moves faster than if I'm doing them one after another, with high power draw running for twice as long. So at the other end of the
scale from Legate, but still a very large system scale,
is something called NVSHMEM. And this is something we've had for quite some time; NVSHMEM has been around for several years. It keeps evolving, and a lot of new things come into it all the time. There's a whole talk by my
friend, Jiri, who talks about all things multi-GPU programming. And he is one of the best
speakers I know, and his talk is absolutely worth going and seeing. But what NVSHMEM does is give you low-latency, fine-grained control over the overlap of compute and communication, with GPUs reading and writing each other's memory directly from device code. It's one of those things that sits underneath a lot of the stuff that you use without you really knowing you're using it.
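To give a flavor of it, here's a minimal NVSHMEM-style sketch (assuming the standard NVSHMEM host and device APIs; launch and bootstrap details vary by system): each GPU, each PE, runs a kernel that writes a value directly into its neighbor's symmetric memory from device code, with no CPU copy loop in the middle.

```cuda
// Each PE puts its rank into the next PE's symmetric buffer, straight from
// inside a kernel.
#include <nvshmem.h>
#include <cstdio>

__global__ void put_to_neighbor(int* dest, int my_pe, int n_pes) {
    if (threadIdx.x == 0) {
        int peer = (my_pe + 1) % n_pes;
        nvshmem_int_p(dest, my_pe, peer);   // one-element put into peer's memory
    }
}

int main() {
    nvshmem_init();
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();

    int* dest = (int*)nvshmem_malloc(sizeof(int));   // symmetric allocation
    put_to_neighbor<<<1, 32>>>(dest, my_pe, n_pes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();                           // wait for all puts

    int received;
    cudaMemcpy(&received, dest, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", my_pe, received);

    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}
```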
But what I'm going to be telling you about is actually the thing that sits underneath that,
because it's really interesting. These things, these NVSHMEM
things on my pyramid, they fit down at the bottom level, right? This is something that maybe
a hundred people use, but which affects a million people
through these different layers. And one of the technologies which
is deep, deep down inside of this is something called GPU direct. And I've told a quick sequence
explaining what it is, because when I've got data being produced
by a GPU and I've got to get to the network, and the network
has historically been a peripheral attached to the CPU. In the past, before I had GPU
direct, my GPU would generate data and I'd have to go through
four different steps to get that data out onto my network. I'd have to synchronize, copy
a couple of different times, trigger some things. So there were four hops to
go through in order to be able to get my data out of
my GPU and onto my network. And so GPUDirect came along and said, this is ridiculous, especially for the amount of data that I'm moving; let's just move my data directly to the network device. And so I eliminated my fourth hop. And now, with a direct path for the copy, GPUDirect allowed me to generate my data and then send it directly from GPU to network card. And that's very powerful, but it
still keeps the CPU in the loop. So they came up with a thing
called GPUDirect Async. And these are the evolutions that happen over the years as they work on and improve these technologies. And so now I've kind of got a
two and a half step process. What the GPUDirect-Async does
is the CPU can do the setup, but it lets the GPU trigger it. And so the data moves automatically
and directly, and there's some CPU proxy that handles
the triggering, but it's now fully controlled by the GPU. So the GPU program doesn't have
to stop so that data can be sent. It can keep on going and just
signal: now send it, now send it, now send the next one. And now, finally, they've got this thing called GPUDirect Kernel-Initiated. And this is where you take the
CPU out of the picture entirely. This is a true two-hop process. You can never get
fewer than two hops. You've got to first prepare it and
tell the network that it's coming. And the second thing is to
stream all the data off and onto the network. Two is the lowest number that you can get here. So we've gone from four to
three to sort of two and a half to two, and this embeds
everything entirely in the kernel. And the result is
incredible, right? This is a run of training
on a graph neural network; I'll explain more about that later, actually. The middle line is that two-and-a-half-step process, which is still 20% faster than the vanilla, non-GPUDirect process. But once you put everything on
the GPU and you cut the CPU out of the picture, I no longer have
CPU threads waiting and polling and trying to orchestrate everything. It's all coming straight out
of the GPU that is producing the data and sending the data. On this particular training
run, we're talking a factor two speed up end to end. And in terms of the transfer,
the feature transfer, the movement of the data that you're
caring about, we're looking at an order of magnitude speed up more. So the power of being able
to make the communication more streamlined, more autonomous. I don't mean the power in
watts in this case, but the potential of that is so enormous. And these things sit there and quietly and silently plug into something like NVSHMEM. They plug into something like NCCL; NCCL rests on top of this. And for NCCL, on the left-hand side, NCCL is the thing that moves all your data between GPUs when you're doing any kind of communicating multi-GPU job. Small messages are
the hardest, right? One byte messages are extremely
difficult because you're sending a lot of overhead
for a small amount of data. And here, what this does is
cuts your latency considerably. And on the right-hand side,
you've got much more bandwidth and potential because again,
you're cutting out the overhead and you can really communicate
much more efficiently. And again, tools
integration everywhere. It's so important to be able
to see what's going on. We've got NVSHMEM and NCCL traces built into the tools. So I want to talk about the
thing that I think I've had the most questions about over the
last year, which is Grace Hopper and the programming model and
the way you program those machines. The philosophy of CUDA has always been that we have a single program, constructed effectively of a GPU function annotated with __global__ and a CPU function. And it's all in one program. It's a heterogeneous program, right? It's not two separate things. It's one program with functions running in two different places, right?
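In miniature, that single heterogeneous program looks something like this (a trivial sketch, error checking omitted): one source file, a __global__ function running on the GPU, and ordinary C++ running on the CPU.

```cuda
#include <cstdio>

__global__ void add_one(int* data, int n) {        // runs on the GPU
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {                                        // runs on the CPU
    const int n = 1024;
    int* d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    add_one<<<(n + 255) / 256, 256>>>(d_data, n);   // CPU launches GPU work
    cudaDeviceSynchronize();

    int first;
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    printf("first element = %d\n", first);          // back on the CPU
    cudaFree(d_data);
    return 0;
}
```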
And this relates to something that Jensen was saying to me a few weeks ago. It's not that you're replacing
serial work with parallel work. It's that it extends it. You need both and you want
to do both, right? And so the idea is that CPU
code runs on the CPU and GPU code runs on the GPU, right? And between them, historically,
we've had this PCI bus. And so even though you've
got these very high-speed memories going on, the PCI bus
has historically been a bottleneck. And so the obvious thing to
do, which we did with the Grace Hopper and we talked
about this last year, is that you can combine them together
with this thing called the NVLink C2C connection, which is many,
many, many times faster than PCI. And so my data transfer goes
through much better, right? And this is called Grace Hopper. This is what the machine is. But it's not just a device
with a very fast interconnect. In fact, it can be that, but
I think that's really just missing the point of what
this is all about. The reason that I love this
thing is that you've got really one processor with
two characteristics, natively two different things, right? I've got two memory systems
each optimized for its own processor. One is optimized for latency, right? My CPU is a latency processor. It has deep caches. It runs linear operations very fast. My GPU is a throughput machine. It has these very high bandwidth
memories and it has very high bandwidth caches. And the way that you treat these
things is different because the way these things run code is different. And so on one of these Grace
Hopper machines, it's a single unified memory system on two
different ways of executing. And if I've got something like a
linked list, run it on the CPU. It's much better. If I've
got something like a parallel reduction, run it on the GPU. That's what it's for. And I can
pick and choose, just like my program was a hybrid of two things. I can literally run whatever
I want at the right place for it because these two systems
are unified with one address space. So it's more than just the fast link. It's the fact that the GPU can see and modify and touch the CPU memory. And when it does, we can detect that and move the data over to the GPU, so the GPU gets the benefit of its very high bandwidth memory and caches. That can be as much as a factor of 10 improvement in performance if you're touching that data all the time.
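As a sketch of what that enables, assuming a Grace Hopper class system where GPU kernels can dereference ordinary system-allocated pointers, the same malloc'd buffer can be touched from both sides with no explicit copies:

```cuda
// The same plain malloc'd buffer is written by CPU code and updated by a GPU
// kernel; migration between the two memories is handled by the system rather
// than by explicit cudaMemcpy calls. (Requires hardware/driver support for a
// unified address space, as on Grace Hopper.)
#include <cstdio>
#include <cstdlib>

__global__ void scale(double* data, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;            // GPU touching CPU-allocated memory
}

int main() {
    const int n = 1 << 20;
    double* data = (double*)malloc(n * sizeof(double));  // plain system malloc
    for (int i = 0; i < n; ++i) data[i] = 1.0;            // CPU writes it

    scale<<<(n + 255) / 256, 256>>>(data, 3.0, n);        // GPU reads/writes it
    cudaDeviceSynchronize();

    printf("data[42] = %f\n", data[42]);                  // CPU reads the result
    free(data);
    return 0;
}
```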
And so the ability to both combine the single address space and also intelligently move things around while we're working on it is unbelievably powerful. That lets me put the compute and
the data where it needs to be. And at the same time,
of course, the migration doesn't affect the CPU. It can still access and touch
and see that data. It's a little bit of extra latency. Of course, it's going over the bus. But really, it's
really one machine. And that's kind of the point
I'm trying to get to. And, very generously, I'm able to show some results from Thomas Schulthess' talk. He's the director of CSCS, which runs the ICON code on the newly brought-up Alps machine, a Grace Hopper machine in Switzerland. And this is just a fantastic
example of exactly what I was talking about. There's a simulation here where
you've got an ocean simulation running purely on CPU code. And you've got an atmosphere
simulation on the GPU in green. And the coupling
is extremely tight. And so you're moving
data around a lot. And so historically, you've
more or less been limited to the performance of the CPU code. But when you move to something like Grace Hopper, you're really able to run both of these things at the same time: the CPU code on the CPU, the
GPU code on the GPU, the very close coupling and exchange
of data, automatic. And the result is, this is
a factor three speed up. This is unbelievable. And this
is at the scale of 64 GPUs. And this is the kind of thing
that is going to affect the number of days I can forecast
in my weather forecast, and really important things like
that, which impact everybody. At the same time,
other great examples. My colleague Matthias has a talk about this, just looking at fine-tuning
of language models. A language model is a series
of transformer layers. And when you go through your
transformer layers, as you're processing these in your forward
pass training it, you generate these intermediate tensors. And there can be a large number
of layers, and therefore a large amount of data. So typically, what we do is
we throw away the data. And then on the way back,
we recalculate it all. So we double our computation in
exchange for saving some memory. But with the Grace Hopper device,
I can actually cache some of that. Instead of throwing it away, I'll
keep some of it around on the GPU. The small things are not
worth throwing away. The blue things, instead,
I will cache them and save them on the Grace memory. Because remember, memory is
just one giant memory system. And then on my way back, I can recall it back in to the GPU. And so I don't have to do that recomputation. And the result is a 20% speed up. And in this particular example,
this is taking a 10 million parameter mixture of expert model. And you can see on the left,
the light green is offload. And the dark green is recompute. The recompute time is, of
course, the same for both. But if I'm doing on Grace
Hopper, if I do the offload of data instead of the recompute,
I'm gaining in time. Because I've got this very
tightly coupled memory system that lets me do it. Another example, which I
see a lot of these days, is graph neural networks. Graph neural networks are the
kind of thing which financial institutions use to analyze whether your credit card has been used fraudulently; things like that. Massive, massive interconnections of information. And the GraphSAGE model is a
primary model for using neural networks to solve graphs. And so this is a simple
walkthrough of how it works. And my friend Joe Eaton has
a whole talk on this. So again, he's the expert. I'm just the messenger. But basically, you sample
your neighborhood. You've got these little
convolutional networks that run at all these different types of nodes. The challenge with a graph
network is that it's not just one single collection of data
that I'm operating on. My entire universe could be
touched on any edge between any two nodes in the graph. So I have a massive pile
of data which is completely randomly accessed. I might access only 10% of it at any one time. I don't know which 10%, and
it's going to be different on every iteration. So what I need is just a big
pool of very fast memory so I can randomly access and
touch it as I go through the flow of the GraphSAGE model. And putting this on Grace Hopper
has just been an incredible performance improvement, where
previously I spent a lot of my time fetching data and moving things
in and out and on and off the GPU. Now, with this unified pool of memory, you're looking at, again, a factor of two speed up. These are huge. A factor of two
speed up is like a generational speed up in most codes. You will spend ages, a
whole PhD getting a 20% speed up in something. This is a factor of two because
now you have a new architecture that can do new things. So finally, from one form of
graphs to another, and this is, I must admit, a little bit of just... As an engineer, you plan something,
you design it, and CUDA graphs is something I started designing
several years ago. And you have all these ideas,
and it takes a lot longer than you think to get where
you're going to go. And so the idea of CUDA graphs,
which I've talked about a few times, and hopefully you
know, the idea is you define your workflow up front, and then I can
issue a single launch operation to launch an arbitrary amount of work. So it can be a very fast way of
putting work onto the GPU, and I can see really good improvements
and speed ups to launch. But it's a lot more than just
a fast way to launch work. So I actually went back and I
found my slide deck from 2018. And this was for GTC, for
conversations with developers, just saying, could this
be useful to you? And I just thought I'd grab
some of my slides from then because it's so interesting
to see what I was thinking at the time, and where it's finally going. And so, you know, a quick
description of task graphs, where you had these nodes,
and they could be different things, and this is largely what we built. And I had this sequence where
you say, you know, the task graph properties, they're
reusable, I can launch them over and over again, I define
it once, and I run it many times. But then: cycles. I wanted a graph to not just be a straightforward flow of dependencies. Why not be able to jump back
to the beginning? Why not be able to
have a dynamic graph? Something where node B could
decide it wanted to go to C or D, based on some data
that it came up with, right? Data-dependent
dynamic control flow. And then finally, Hierarchy, which
is a key part of any graph system. But these are literally my
very first slide deck of graphs saying, here is what I want. And finally, we've built it,
six years or seven years later, however long it's been. And so, let me tell you about
this thing we built, because it is everything that I had
in my mind about how these things would be used. And it opens the door to a
lot of potential, I think. So, what I've got on the left
here is an incredibly trivialized version of something called
Conjugate Gradient. It's like a gradient
descent type of thing. It's a very, very standard
way of solving a system of linear equations, and it's just
pseudocode on the left-hand side. But the key part about it is
there's an iterative loop, right? There's a main loop where I
do something, and I run that main loop over and over and over
again until I have my solution. And traditionally with CUDA graphs, the idea is that I'm going to take that loop body and turn it into a graph, and then I'm going to run that graph many times. So my program starts looking very simple: instead of having all of these different things that I have to do, I have one launch call. And this is great. This is how people use graphs today, and it speeds things up very efficiently.
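Roughly, that traditional pattern looks like this (solve_step and compute_residual are hypothetical stand-ins for the real loop body): capture the body into a graph once, relaunch it cheaply, but copy the residual back to the CPU every iteration to decide whether to continue.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for the real conjugate-gradient loop body.
__global__ void solve_step(float* x) { /* one iteration of the solver */ }
__global__ void compute_residual(const float* x, float* r) {
    if (threadIdx.x == 0 && blockIdx.x == 0) *r *= 0.5f;  // placeholder: residual shrinks
}

void run_solver(float* d_x, float* d_r, cudaStream_t stream) {
    // Capture the loop body into a graph once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    solve_step<<<1, 256, 0, stream>>>(d_x);
    compute_residual<<<1, 256, 0, stream>>>(d_x, d_r);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);

    // ...then relaunch it, but the CPU stays in the loop to test convergence:
    // every iteration copies the residual back and synchronizes.
    float residual = 1.0f;
    while (residual > 1e-6f) {
        cudaGraphLaunch(exec, stream);
        cudaMemcpyAsync(&residual, d_r, sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);   // round trip to the CPU every time
    }

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```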
But the challenge is this data-dependent execution, which is very common: iterate until converged, right? It's almost a universal pattern. The iteration requires reading the result back and deciding if I'm going to do my while loop again. So I keep having to stop my program and copy the data back in order to evaluate my "while residual is greater than epsilon", and then I can go back and do another launch. And so now we're moving
data-dependent execution to the GPU, right? So, I take the main loop,
and now I create a graph with these new nodes. We create two new node types,
and I'll tell you about them in just a moment, an if node
and a while node. And now I can put the
while on the GPU. So, the convergence check, the
while check, is done without having to come back to the CPU. And my program no longer has
a main loop at all. The main loop is now completely
moved dynamically to the GPU, and I can just launch a conditional
graph, if you want to call it that. And my program is much simpler. So, now my CPU is
out of the picture. I can run a hundred of these
independently, all at the same time, because I no longer
need CPU threads to manage them. And the way it works is we've taken
one of these conditional nodes. It's just another type of graph
node, but it's a graph node that's either an if or a while. And inside the if node of the
graph, it evaluates the condition. It either runs a
subgraph or it doesn't. Remember, graphs are hierarchical. That was one of the things on
my very, very early slides. And so, I've got these conditional
nodes, which encapsulate what to do if the condition is true. Now, because graphs are
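Here's a rough sketch of the while form, following the conditional-node API that arrived in CUDA 12.4 (treat it as an outline rather than production code; error checking omitted): a kernel inside the body graph sets the condition handle, so the GPU itself decides whether to iterate again.

```cuda
#include <cuda_runtime.h>

// The body kernel does one iteration's work and then sets the condition:
// non-zero means "run the while body again", zero means "stop".
__global__ void body_and_check(float* r, cudaGraphConditionalHandle handle) {
    if (threadIdx.x == 0) {
        *r *= 0.5f;                                   // placeholder "work"
        cudaGraphSetConditional(handle, *r > 1e-6f);  // device-side loop control
    }
}

void build_and_run(float* d_r) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Handle the device code writes to; default value 1 so the body runs
    // at least once per graph launch.
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph, 1,
                                     cudaGraphCondAssignDefault);

    // Add a WHILE conditional node; its body is itself a graph.
    cudaGraphNodeParams params = {};
    params.type = cudaGraphNodeTypeConditional;
    params.conditional.handle = handle;
    params.conditional.type   = cudaGraphCondTypeWhile;
    params.conditional.size   = 1;
    cudaGraphNode_t whileNode;
    cudaGraphAddNode(&whileNode, graph, nullptr, 0, &params);

    // Populate the body graph with the iteration kernel.
    cudaGraph_t body = params.conditional.phGraph_out[0];
    void* args[] = { &d_r, &handle };
    cudaKernelNodeParams knp = {};
    knp.func = (void*)body_and_check;
    knp.gridDim = dim3(1);
    knp.blockDim = dim3(32);
    knp.kernelParams = args;
    cudaGraphNode_t kernelNode;
    cudaGraphAddKernelNode(&kernelNode, body, nullptr, 0, &knp);

    // Instantiate and launch: the loop now runs entirely on the GPU.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);
    cudaGraphLaunch(exec, 0);
    cudaDeviceSynchronize();
}
```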
Now, because graphs are hierarchical, you can nest these. I can have a conditional node
inside a conditional node, to any depth that I want. So I can have an if node that says, if something happens, go and run this, and this contains a while node, which iterates continuously. And all of this can be described 100% inside my task graph. A lot of people ask me, why did you
make graphs control dependencies instead of data dependencies? And this is the reason. This is why we built it, with
control flow dependencies. Because you want to be able to say
things like while and if, which dataflow does not allow you to do. And there are other
constructs you can do. This thing on the right
is like a switch. It's like multiple ifs.
If x, if y, if z. That's like a switch with cases. All of these types of things,
if and while, are the key fundamental building blocks. And maybe we'll optimize
switch ourselves later to make it more efficient. But fundamentally, you can now describe a fully dynamic, control-flow-driven workflow on the GPU without having to return to the CPU to be the hand-holding controller. And that's very much a theme
for the kind of way we're moving things, to reduce the
amount of communication, to keep the GPU busy, to keep
the power as efficient, and the computation as efficient
as we possibly can get it. And this, finally, after six
years, came out a few weeks ago with CUDA 12.4. So it's really nice to be able to stand up and show you this thing that has been in my head forever. We've just not been able to get to it until now. Turns out you have to build a lot
of things before this can work. And so that's it.
That's what I've got. Here's the list of all the
references of everything that I've told you about. Because I'm just the messenger. All I do is tell you about
amazing work that everybody else around the company is doing. And I just, I get to stand up
here and tell you about it. And so here is the list of
all the fascinating stuff that I've dug up to find. This is shared in the
PDF of the slides. And so if you want to go back
and stream some talks or even attend them in person, these people
are really worth listening to. Thank you very much. That's pretty awesome, Stephen. We have a few minutes to take some Q&A. We have microphones up at the front here. If you have any questions, feel free to ask. We also have a couple of online questions for you. So I'm going to start
with the first one. Are there CUDA APIs available for
us to measure power consumption as our programs run on GPUs, and to break down how much is due to compute, memory access, or networking? That's a hard question. The power consumption is a
system level thing. So you need different system APIs. And so what we have, we have
a monitoring system called DCGM, which allows you to
monitor your data center and all these nodes in real time and
see what the utilization is, the power is of these different things. But you have to use that to collect
the data across your system and extrapolate a single CUDA function. There's no way to identify
just the power purely from that, because power is an external
factor that depends on not just compute, but memories and buses
and all sorts of things like that. Great. We have questions here. Go ahead. I have a question
about effectiveness. In the case of a company, for example, if we combine all our workstations into one server, how much energy can we save? If you combine what? For example, if we have multiple workstations, CPU and GPU, and we put everything in one server, CPU and
GPU, how much energy can we save? So the energy saving is
going to be very algorithm dependent, of course. But typically the most
expensive thing in any system is communication. It's moving electrons around. And so the more you combine into
a single localized space, this is why you see density increasing in
data center racks, because it takes much less energy to move electrons
a few inches instead of meters; I just mixed units there. But in general, yes, you will be saving energy; it really depends on your algorithm
exactly how much. I think it's hard to predict that. You would need a
model of your system.