We are here to talk to you about PyTorch
and how it's a great tool for you to scale your deep learning models
in a cheap, fast, and efficient way. So when we're scaling neural networks up,
what that means is you are adding more and more layers to your neural network. And what this does is it allows your model
to capture more nuance in your data and to perform
more complex tasks. Well, all of this doesn't come for free. The model will start to require
more memory and more compute. Take the Llama model, for example. It has four variants, ranging from 7 billion to 70 billion parameters. The smallest variant, 7 billion, was trained on 2 trillion tokens and required roughly 200,000 GPU hours. That's a long time. So PyTorch, along with offering the modular building blocks
to build your neural networks, it also offers some utilities
to distribute this training workload. Let's take a look at one of them, called Distributed Data Parallel (DDP). When you're training your model with DDP, there are three steps that need to happen. First is the forward pass,
where you take the data, pass it through the model, and compute the loss. Second, that loss is backpropagated through the model, which gives you the gradients. The third step, where we update the model's
weights is preceded by a synchronization,
where all the computed gradients from each of these replicas
are communicated with each other.
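Under the hood, that synchronization is an all-reduce over the gradients. As a rough illustration of what DDP automates for you, and not the library's actual internals (which also bucket and overlap the communication), a manual version might look like the sketch below, assuming a torch.distributed process group has already been initialized:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average the gradients across all replicas (roughly what DDP automates)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient across every replica...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide so every replica ends up with the same averaged gradient.
            param.grad /= world_size
```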
Now, the hallmark of DDP, and really of all distributed training, is the overlap of computation and communication. Essentially, what that means is that simultaneously
we are doing the backpropagation while communicating
all of the calculated gradients. This saves us time
and keeps the GPUs running at near 100% utilization. At a very high level, here is what that looks like: you first divide the model into buckets. Each replica calculates the gradients of the first bucket, and while it is calculating the gradients of the second bucket, the first bucket's
gradients are synchronized. So the communication for the first bucket happens simultaneously with the computation of the gradients
in the second bucket. And similarly,
as each bucket is being calculated, the preceding bucket's
gradients are being communicated. This ensures that all the GPUs are running at full utilization
and you don't have any idle workers. They're all working
very hard to train your model.
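In code, DDP is just a wrapper around your model; the bucketing and the overlapped all-reduce happen behind the scenes. Here is a minimal sketch, using a toy model and random data as stand-ins for a real network and DataLoader:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                      # one process per GPU
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# A toy stand-in for your real network.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).to(local_rank)

# DDP replicates the model on every GPU and all-reduces gradients bucket by bucket.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for _ in range(100):                                 # stand-in for iterating over a real DataLoader
    inputs = torch.randn(32, 1024, device=local_rank)
    targets = torch.randint(0, 10, (32,), device=local_rank)
    loss = loss_fn(ddp_model(inputs), targets)       # step 1: forward pass computes the loss
    loss.backward()                                  # step 2: backward pass, overlapped with gradient all-reduce
    optimizer.step()                                 # step 3: weight update, after gradients are synchronized
    optimizer.zero_grad()

dist.destroy_process_group()
```

You would launch this with something like torchrun --nproc_per_node=8 train.py (the script name is just an example), which starts one such process per GPU on the node.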
Now, all of this is the case where the model fits on a single GPU. For example, the 7 billion parameter model can fit on a single cloud GPU today. But when you start scaling
that up, for example to the 70 billion or even the 30 billion parameter models, it becomes
very difficult to fit them in one GPU. So in those regimes, the DDP approach does not work. Now I'll call upon Raghu to talk to us
about what in PyTorch allows us to train models which are larger than what
a single GPU can accommodate. All right, so you've heard all about distributed data parallel,
and how you can scale a small model across a very large number of GPUs
and reduce your training times. So now let's look at what happens when we have a model
that does not fit in a single GPU. And that is
where FSDP comes to your rescue. FSDP stands for "Fully Sharded Data Parallel" and what fully sharded
data parallel does is take the model and break it down into
what are called units; all of these units are then sharded across GPUs. Think of it as shredding the model
and then each GPU owns small portions of this model. So that's where
shards come in. Now, what happens after that is pretty much
like what DDP is doing, except that instead of the whole model,
you operate on one unit at a time. To reconstruct a unit from its shards, you do an AllGather, and this happens in the forward pass. So you're gathering
the unit, you're computing on top of it. And this happens across all the units. So as soon as your unit is computed,
you free that unit's memory, then you go to the next unit,
and so on. That's the forward pass. Once the entire forward pass is done, your loss is computed. Then you're going to do an AllGather again in the backward pass. And during the backward pass,
what is happening is that, very much like in the forward pass,
you are gathering each unit and computing the backpropagation for it,
just in the reverse order. Once you have computed the gradients, then those are, again, very much
like what you did with DDP: they are synchronized across all the GPUs
that are responsible for holding that particular portion of the model. So once you are synchronized,
that completes your entire step, and then you keep repeating all of this.
And with FSDP, very much like DDP, you are going to utilize overlap
in a significant way. Where DDP had only a
single synchronization step, FSDP has many more opportunities
for overlap, keeping those GPUs continuously busy
while you are doing your computation. And that is how you achieve scale with these models.
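In PyTorch this is exposed as the FullyShardedDataParallel wrapper. The sketch below is minimal and makes the same assumptions as the DDP example (one process per GPU, toy model, random data); the size-based auto-wrap policy is just one illustrative way of splitting the model into units:

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# A toy stand-in for a model that would normally be too big for one GPU.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)])

# The wrap policy decides how the model is split into FSDP "units"; each unit's
# parameters are sharded across all GPUs and gathered on the fly (AllGather)
# during the forward pass and again during the backward pass, then freed.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy, device_id=local_rank)

optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

for _ in range(100):                                  # stand-in for a real DataLoader
    inputs = torch.randn(8, 4096, device=local_rank)
    loss = fsdp_model(inputs).sum()                   # units gathered, computed, and released one by one
    loss.backward()                                   # units re-gathered; gradients reduced back to their owners
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```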
So what are the typical ways in which people train these large models? Most of the world knows these setups
as HPC systems: very large scale systems
with very high speed interconnects and state-of-the-art GPUs and servers
for training these models. So what happens if I do not have an HPC
system" that, you know, I have a good node. So let's say that instead of this GPU,
we call this a node. Typically nodes have eight, maybe 16 GPUs in them. And many of these, the typical HPC system has an HPC interconnect things like the interconnect where people may have heard the term InfiniBand. So very high speed
interconnect between nodes. People may
have heard the term Ethernet. Ethernet is far more common,
whether in cloud or on-prem environments,
whereas InfiniBand is less common. So there is a common misconception
that these larger models will always need InfiniBand. That is true for some of the largest models,
and InfiniBand is faster. But with Ethernet, too, you can train
these models in an efficient manner. And that is where IBM worked
with the PyTorch community and introduced the concept of a rate
limiter. What the rate limiter does
is balance the trade-off between overlapping
communication and computation by managing the memory on the GPU better. That is how you reduce the amount of communication per GPU compute time step.
You are increasing the amount of compute while keeping
that communication constant. And that is how you can get away
with training on Ethernet while achieving
similar kinds of benefits as with InfiniBand.
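In recent PyTorch releases, this rate limiter surfaces in the FSDP API as the limit_all_gathers flag, which stops the CPU from issuing AllGathers too far ahead of the GPU. A hedged sketch, reusing the model and wrap_policy from the FSDP example above:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# limit_all_gathers=True turns on FSDP's rate limiter: the CPU is throttled so it
# cannot issue AllGathers too far ahead of the GPU, which caps memory pressure and
# keeps the compute/communication overlap manageable on slower interconnects
# such as Ethernet. (model, wrap_policy, and local_rank come from the sketch above.)
fsdp_model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=local_rank,
    limit_all_gathers=True,
)
```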
So, Raghu, we saw quite a few things about how PyTorch helps in scaling up your deep
learning workloads and your AI workloads. We saw FSDP,
we saw the rate limiter API, and they do a pretty good job. Absolutely. And we also saw how interconnects play a role in
how you scale up, and so on, right? But what happens inside a single node,
between the CPU and the GPU? Tell me a little bit more about that. Yeah, you know, it's an interesting question
you raise, because there are some more efficiency gains that we
can opt into. Let's take a look at
what's happening on the board. One of the reasons why PyTorch is so popular
is its programming paradigm, called "eager mode". Eager mode allows developers to take a very Pythonic approach
to their programs. It allows dynamism:
you can have your if-else blocks, you can have for-loops
within them. It's very flexible. Yeah. And we already spoke about the CPU and GPU. Yeah. So I'm going to draw a
small schematic over here, which sort of illustrates
what's happening in eager mode. Let's say this is your CPU queue and this is your GPU queue. A queue is essentially the order of instructions
that each chip is launching. Mm hmm. Now, taking a step back, your AI programs, your AI models, are essentially
a sequence of operations. Small AI models, small deep networks,
have comparatively fewer instructions, and larger models have many more. Mm hmm. So every line of code that I write in my Python program
basically translates to an instruction that goes to the GPU? Is that essentially what's happening? Yes. Almost every line is an independent instruction that your CPU
is deriving from the program. And so we can think
of the execution of your model
as a sequence of instructions over here. Mm hmm. And because you're using
a hardware accelerator like a GPU, you're going to queue up
those instructions as the relevant operations
specific to the backend that you are using. Okay. So let's say you have CUDA operations on an Nvidia GPU. So this is how eager mode works. This is the paradigm
that allows you to iterate or interact with your program
in a very 1-to-1 way. Okay.
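As a tiny illustration of that 1-to-1 mapping (the shapes here are arbitrary), each line below is dispatched to the GPU as its own kernel launch, one after another:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
w1 = torch.randn(1024, 4096, device="cuda")
w2 = torch.randn(4096, 1024, device="cuda")

h = x @ w1          # kernel launch 1: first matmul
h = torch.relu(h)   # kernel launch 2: activation
y = h @ w2          # kernel launch 3: second matmul

# The CPU queues each of these operations one by one; in between launches,
# the GPU can sit idle while the CPU works through the rest of the Python program.
```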
So what is happening in between these instructions, in these gaps? What happens there? Well, you've hit the nail on the head: these empty spaces over here are actually times
when the GPU is not working. They are idle. And that's something we do not like,
because GPUs are a very expensive resource. We don't want them to be idle; we want them working all the way through, to maximize utilization. And these gaps get more and more frequent as you scale the model size, right? Exactly. As your models get larger,
the queue starts getting much longer. And as the queue gets longer,
the number of these idle spaces increases quite a lot. Yeah. And these essentially translate
to a lot of costs, unnecessary costs. You're burning GPU hours without
actually getting any bang for your buck. Okay, so how do we address this? It poses a pretty interesting
engineering challenge: we want that flexibility, it makes programming fun, but we also want that efficiency. So last year at the PyTorch conference,
we announced 2.0. So PyTorch 2.0 packs
in a very interesting new paradigm: essentially a new compiler stack, with a graph capture component called TorchDynamo. And it still allows you to program your PyTorch code,
as you always have. It's still got that flexibility,
that interactivity. But what it does
differently is that, instead of having separate instructions queued up like in eager mode, your program is essentially converted
into a graph of operations. It's almost a static graph. And so all of these instructions
are sort of merged together. Mm hmm. So it's a many-to-one mapping instead of 1-to-1. Mm hmm. And sometimes you might have graph breaks,
where you fall back to the eager
mode paradigm. So in this example, your entire program
has been translated into two blocks of instructions over here, instead of many separate instructions. And on the GPU, again,
you're queuing up only two instructions instead of all n instructions
over here. Yeah. And quite often
you won't even have a break, so your entire program, your entire model,
is one block of instructions that executes seamlessly on the GPU. That is pretty cool. And how do you actually achieve this
from a programming standpoint? What do I, as a developer, need to do to get this? One line of code. That's wonderful. All of this goodness
in one API called torch.compile.
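For example (the function and tensor shapes below are just placeholders), wrapping a function or an nn.Module is all it takes:

```python
import torch

def mlp_forward(x, w1, w2):
    # The same style of eager code as before; nothing about it changes.
    return torch.relu(x @ w1) @ w2

# The one line: hand a function (or an nn.Module) to torch.compile.
compiled_forward = torch.compile(mlp_forward)
# torch.compile(mlp_forward, fullgraph=True) would instead raise an error on graph
# breaks rather than falling back to eager for those pieces.

x = torch.randn(1024, 1024, device="cuda")
w1 = torch.randn(1024, 4096, device="cuda")
w2 = torch.randn(4096, 1024, device="cuda")

y = compiled_forward(x, w1, w2)   # first call captures the graph and compiles it; later calls reuse it
```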
So that's super easy for me as a developer to pick up. What kind of speed-ups
have you seen with torch.compile? Let's talk just about training. Oh, torch.compile primarily targets training workloads, and depending on the architecture
and the model it varies, but we have seen speed-ups of at least
40 percent. Wow. So that is a significant improvement
over eager mode. Okay, wonderful. And at IBM, what we have been doing
is to really work with the community and to look at torch.compile
from an inference lens. The beauty of compile, I feel, is that it
not only works for training, giving that kind of speed-up, but
it actually works for inferencing as well. The same kinds of gains that you're seeing for training,
we see being provided for inference too. Are you saying the same fundamental
principle applies? I understand,
so it's not exclusive to training. It's not exclusive to training. So I think torch.compile
is the new paradigm that is going to change efficiency completely, across
both training and inferencing. So, fantastic times for PyTorch. And for the users. Yes, indeed. And what else are we doing
to get PyTorch to be more community oriented, to grow the adoption of larger
models? Yeah, I mean, it's no secret
that we're living right now in the age of language
models, large language models, which are just awe-inspiring
in what simple algorithms are able to do. Mm hmm. Well, not so simple,
but you're probably familiar with Llama. It's a language model from Meta.
Yeah. And a lot of that code is open sourced. The model itself
is there, available to the community under a very permissive license. Moreover, there are also a lot of scripts available
on how you can use these models downstream. Okay. There is one GitHub repository
that you should check out. It's called llama-recipes; it's under facebookresearch/llama-recipes on GitHub. Oh, we'll add a link to that in the video description. And you'll notice, quite interestingly,
that they also use FSDP there. Okay. Especially for fine-tuning Llama,
you know. Yeah. You want to adapt all of that knowledge
and that model to a particular domain
or to a particular topic of interest. Yeah.
So you want to fine tune it. Mm hmm.
And of course, we already saw how [crosstalk] really works. Yeah. And how FSDP helps with
a lot of that through an easy-to-use API. And you can see that in practice in the llama-recipes
repository. Wonderful! So I think everything is coming together with PyTorch: the models,
the way we train them and tune them, and I'm sure, you know, very soon
it'll be the same around inferencing tools. So this is fantastic
news for the PyTorch community. Yeah, I'm really looking forward
to what the community comes up with and
how they start using this. Absolutely! It's probably
a new step in this age. Absolutely, looking forward to that. Thank you, Suraj. Thanks, Raghu. Thanks for watching. And if you like the video, please click
like and subscribe to our channel. And if you want to learn more
about PyTorch, about the concepts we spoke about in this video,
and even Llama, do check out the links
below in the video description.