Scaling AI Model Training and Inferencing Efficiently with PyTorch

Captions
We are here to talk to you about PyTorch and how it's a great tool for you to scale your deep learning models in a cheap, fast, and efficient way. So when we're scaling neural networks up, what that means is we are adding more and more layers to your neural network. And what this does is it allows your model to capture more nuance in your data and to perform more complex tasks. Well, all of this doesn't come for free. The model will start to require more memory and more compute. Take the Llama model, for example. It has four variants ranging from 7 billion parameters to 70 billion parameters. The smallest variant, 7 billion, was trained on 2 trillion tokens and required roughly 200,000 GPU hours. That's a long time. So PyTorch, along with offering the modular building blocks to build your neural networks, also offers some utilities to distribute this training workload. Let's take a look at one of them, called Distributed Data Parallel (DDP).

So when you're training your model with DDP, there are three steps that need to happen. First, the forward pass, where you take the data and pass it through the model to compute the loss, which is then backpropagated through the model, which gives you the gradients. The third step, where we update the model's weights, is preceded by a synchronization, where the computed gradients from each of the replicas are communicated with each other. Now, the hallmark of DDP, and really of all distributed training, is the overlap of computation and communication. Essentially, what that means is that we are doing the backpropagation while simultaneously communicating the calculated gradients. This saves us time and keeps the GPU running at near 100% utilization. At a very high level, what that looks like is this: you first divide the model into buckets. Each replica calculates the gradients of the first bucket, and while it is calculating the gradients of the second bucket, the first bucket's gradients are synchronized. That synchronization happens simultaneously with the computation of the gradients in the second bucket. And similarly, as each bucket is being calculated, the preceding buckets' gradients are being communicated. This ensures that all the GPUs are running at full utilization and you don't have any idle workers. They're all working very hard to train your model.

And this is the case where the model fits in one GPU. For example, the 7 billion parameter model can fit on any of the cloud GPUs available today. But when you start scaling that up, for example to the 70 billion or even the 30 billion parameter models, it's very difficult to fit them in one GPU. So in those regimes, the DDP approach does not work. Now I'll call upon Raghu to talk to us about what in PyTorch allows us to train models which are larger than what a single GPU can accommodate.

All right, so you've heard all about Distributed Data Parallel, and how you scale a small model on a very large number of GPUs and reduce your training times. So now let's look at what happens when we have a model that does not fit in a single GPU. And that is where FSDP comes to your rescue. FSDP stands for "Fully Sharded Data Parallel", and what Fully Sharded Data Parallel does is take the model, break it down into what are called units, and all of these units are then sharded across GPUs. Think of it as shredding the model, so that each GPU owns small portions of the model. That's where the shards come in.
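To make the DDP flow described above concrete, here is a minimal sketch of a DDP training loop, assuming a torchrun launch across the GPUs of a node; the model, data, and hyperparameters are placeholders for illustration, not anything taken from the video.

```python
# Minimal DDP sketch. Launch with, e.g.:
#   torchrun --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder network standing in for your real model.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()

    # DDP replicates the model on every GPU and overlaps the bucketed
    # all-reduce of gradients with the rest of the backward pass.
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # In practice each rank reads a different shard of the dataset
        # (e.g. via DistributedSampler); random tensors keep the sketch short.
        x = torch.randn(32, 1024, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()   # gradients all-reduced bucket by bucket, overlapped with compute
        optimizer.step()  # every replica applies the same averaged gradients

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each rank holds a full replica of the model, so this pattern only applies while the model fits on a single GPU, which is exactly the limitation the next section addresses.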
Now, what happens after that is very much like what DDP is doing, except that instead of the whole model, you are thinking of it one unit at a time. During the initial construction of a unit from its shards, you do an AllGather, and this happens in the forward pass. So you gather the unit, you compute on top of it, and this happens across all the units. As soon as a unit's computation is done, you release that unit's memory and move on to the next unit, and so on. That's the forward pass. Once the entire forward pass is done and your loss is computed, you do an AllGather again in the backward pass. During the backward pass, very much like before, you gather the unit and compute the backpropagation, just in the reverse order. Once you have computed the gradients, those are, again very much like in DDP, synchronized across all the GPUs that are responsible for holding that particular portion of the model. Once they are synchronized, that completes your entire step, and then you keep repeating all of this. And FSDP, very much like DDP, utilizes overlap in a significant way, because whereas in DDP there was only one synchronization step, in FSDP there are many more opportunities for overlap, keeping those GPUs continuously busy while you are doing your computation. And that is how you achieve scale for these models.

So what are the typical ways in which people train these large models? Most of the world knows about them as HPC systems: very large scale systems with very high speed interconnects, state of the art GPUs and servers. So what happens if I do not have an HPC system, but I do have a good node? So let's say that instead of a single GPU, we talk about a node. Typically nodes have eight, maybe 16 GPUs in them. The typical HPC system has a high-speed interconnect between nodes; people may have heard the term InfiniBand. People may also have heard the term Ethernet. Ethernet is far more common in many environments, whether cloud or on-prem, whereas InfiniBand is less common. So there is a common misconception that these larger models will always need InfiniBand. It is true that InfiniBand is faster and that some of the larger models need it, but with Ethernet you can also train these models in an efficient manner. And that is where IBM worked with the PyTorch community and introduced this concept of a rate limiter. What the rate limiter does is really balance the trade-off between overlapping communication and computation by managing the memory in the GPU better. That is how you reduce the amount of communication per GPU computation time step: you increase the amount of compute while keeping the communication constant. And that is how you can get away with training on Ethernet while achieving similar kinds of benefits as InfiniBand.

So Raghu, we saw quite a few things about how PyTorch helps in scaling up your deep learning workloads and your AI workloads. We saw FSDP, we saw the rate limiter API, and they do a pretty good job. Absolutely. And we also saw how interconnects play a role in how you scale up and so on, right? But what happens inside a single node, between the CPU and the GPU?
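As a rough illustration of the sharded-unit idea, here is a minimal FSDP sketch, assuming the same torchrun-style launch as the DDP example; the stack of linear layers and the size-based wrap policy are stand-ins chosen only to show where the "units" come from, not the configuration used in the video.

```python
# Minimal FSDP sketch; launch with torchrun as in the DDP example.
import functools
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for a large model that would not fit on one GPU.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(48)])

    # The wrap policy decides how the model is split into FSDP units;
    # each unit's parameters are sharded across all ranks and gathered
    # (AllGather) just in time for that unit's forward/backward compute.
    wrap_policy = functools.partial(size_based_auto_wrap_policy,
                                    min_num_params=1_000_000)
    fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy, device_id=local_rank)

    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
    for step in range(100):
        x = torch.randn(8, 4096, device="cuda")
        loss = fsdp_model(x).sum()   # forward: gather each unit, compute, free it
        loss.backward()              # backward: gather again, then sync (reduce) gradients
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because each unit is gathered and released on its own, no single GPU ever has to hold the full model at once, which is what lets FSDP go beyond the single-GPU limit of DDP.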
Tell me a little bit more about it. Yeah, you know, it's an interesting question you raise because there is some more efficiency gains that we can we can opt in. And let's take a look at, you know, what's what's happening on the board. So the reason-- one of the reasons --why PyTorch is so popular is because of its programing paradigm called "eager mode". And eager mode allows developers to have a very Pythonic approach to their programs. It allows dynamism, like, you know, you can do your if-else blocks. You're going to have for-loops within that. It's very flexible. Yeah. And we already spoke about CPU and GPU. Yeah. So I'm going to draw a small little schematic over here, which sort of illustrates what's happening in eager mode. Let's say this is your CPU queue and this is your GPU queue. And queue is essentially the order of instructions that each chip is launching. Mm hmm. Now, taking a step back, your AI programs, your AI models are essentially a sequence of operations. Small AI models, small deep networks have lesser instructions comparatively. And larger models have many more. Mm hmm. Mm hmm. So every line of code that I write in in my Python program basically translates to going to the GPU. Is that essentially what's happening? Yes. Almost every line is an independent instruction that your CPU is deriving from the program. And so we can probably think of your model, the execution of your model, as a sequence of instructions over here. Mm hmm. And because you're using a hardware accelerator like a GPU, you're going to queue up those instructions onto the relevant operations specific to the back end that are using. Okay. So let's say you have CUDA operations on an Nvidia GPU. So this is kind of how eager mode works. This is the appropriate paradigm that allows you to iterate or interact with your program in a very, you know, 1-to-1 way. Okay. So what is happening between, you know, these times when-- what happens there? And that's, you know, if you've hit the nail on the head, these empty spaces over here are actually times when the GPU is not working. These are idle. And, you know, that's what we do not like about our GPU is because they're a very expensive resource. We don't like them to be idle. We want them to be working all through, like you want to maximize the utilization over there. And these get more and more as you scale. These are the model size, right? Exactly. Like as you as your models get larger, the skew starts getting much longer. And as the queue starts getting longer, the number of these idle spaces start increasing quite a lot. Yeah. And these essentially translate to a lot of costs, unnecessary costs. You're burning GPU hours without actually getting any bang for your buck. Okay, so how do we address this? So here we have an interesting tradeoff. It poses a pretty interesting engineering challenge. We want that you got more, like, flexibility. It makes programing fun. Mm hmm. But we also want that efficiency. Mm hmm. So earlier last last year at the PyTorch conference, we announced 2.0. So PyTorch 2.0 packs in a very interesting new paradigm. So it's essentially a new compiler called TorchDynamo. And this still allows you to program your PyTorch code, as you always have. It's still got that flexibility, that interactivity. But what it does differently is instead of having separate instructions queued up like in eager mode, your program is essentially converted to a graph of operations. It's almost a static graph. 
And so all of these instructions are sort of merged together. Mm hmm. So instead of one-to-one, it's many-to-one. Mm hmm. And sometimes you might have graph breaks, you know, where you come back to the eager mode paradigm. So your entire program has been translated to two instructions over here instead of many different instructions. And on the GPU, again, you're queuing up only two instructions instead of all n instructions. Yeah. And quite often you won't even have this break, so your entire program, your entire model, is one block of instructions which executes seamlessly on the GPU. That is pretty cool. And how do you actually achieve this from a programming standpoint? Like, what do I, as a developer, need to do to get this? One line of code. That's wonderful. All of this goodness in one API called torch.compile. So that's super easy for me as a developer to adopt. What kind of speedups have you seen with torch.compile, let's say just for training? Oh, torch.compile primarily targets training workloads, and depending on the architecture and the model it varies, but we have seen speedups of at least 40 times. Wow. So that is an order of magnitude of improvement over eager mode. Okay, wonderful.

And at IBM, what we have been doing is working with the community to look at torch.compile from an inference lens. The beauty of compile, I feel, is that it not only works for training, giving that kind of speedup, but it actually works for inferencing as well. The same kind of gains that you're seeing, we see being provided there too. Are you saying the same fundamental principle applies, and it's not exclusive to training? It's not exclusive to training. So I think torch.compile is the new paradigm that is going to change efficiency completely, across both training and inferencing. So, fantastic times for PyTorch. And for the users. Yes, I know. And what else are we doing to get PyTorch to be more community oriented, to grow that adoption of larger models?

Yeah, I mean, it's no secret that we're living right now in the age of language models, large language models, which are just awe-inspiring in what simple algorithms are able to do. Mm hmm. Well, not so simple, but you're probably familiar with Llama. It's a language model from Meta. Yeah. Yeah. And a lot of that code is open sourced. The model itself is available to the community under a very permissive license. Moreover, there are also a lot of scripts available on how you can use these downstream. Okay. There is one GitHub repository that you should check out. Okay. It's called llama-recipes; it's under facebookresearch/llama-recipes on GitHub. Oh, we'll add a link to that in the video. And you'll notice, quite interestingly, that they also use FSDP there too. Okay, especially for fine-tuning Llama, you know. Yeah. You want to adapt all of that knowledge and that model to a particular domain or to a particular topic of interest. Yeah. So you want to fine-tune it. Mm hmm. And of course, we already saw how [crosstalk] really works and everything. Yeah. Yeah. And how FSDP helps resolve a lot of that with an easy-to-use API. And you can see that in practice in the llama-recipes repository. Wonderful! So I think everything is coming together with PyTorch: the models, the way we train them, tune them, and I'm sure, you know, very soon it'll be around inferencing tools.
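The "one line of code" looks roughly like the sketch below; the toy model and shapes are placeholders, and the same compiled module can be used inside a training loop or for inference.

```python
# torch.compile sketch: the model is captured as a graph (via TorchDynamo)
# and handed to a backend compiler, so many small eager kernels become a
# few larger, fused launches on the GPU.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024)).cuda()

compiled_model = torch.compile(model)   # the one-line change

x = torch.randn(64, 1024, device="cuda")
out = compiled_model(x)   # first call triggers compilation; later calls reuse the compiled graph
```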
So this is fantastic news for the PyTorch community. Yeah, I'm really looking forward to what the community comes up with and how they start using this. Absolutely! It's probably a new step in this age. Absolutely. Looking forward to that. Thank you, Suraj. Thanks, Raghu. Thanks for watching. And if you liked the video, please click like and subscribe to our channel. And if you want to learn more about PyTorch, about the concepts we spoke about in this video, and even Llama, do check out the links in the video description below.
Info
Channel: IBM Technology
Views: 5,037
Keywords: IBM Cloud, pytorch, deeplearning, deep learning, machine learning, ML, AI, Artificial Intelligence, fully sharded data parallel, FSDP, GPU, CPU, distributeddataparallel, DDP, infiniband, ethernet, facebook, meta, meta ai, Facebook ai, Eager Mode, Eagermode, PyTorchConf
Id: 85RfazjDPwA
Length: 18min 29sec (1109 seconds)
Published: Mon Oct 16 2023