(music) - Ever wondered what it takes
to run the infrastructure that hosts ChatGPT with over
100 million users today, and how this infrastructure can work for your workloads in Azure at any scale? Well, keep watching as we break down the AI supercomputer
infrastructure built to run ChatGPT and other large language models. And joining us to go
deep on all the details, is Azure CTO Mark Russinovich. Welcome back to Mechanics! - Thanks for having me
back on the show, Jeremy. It's been a long time. - It's been really, really too long, so thanks for coming back. You know, especially now, we're seeing a major shift in technology with AI and large language models. Almost overnight, these have now defined a brand new way of interacting with your apps and data. And this really includes
Microsoft services, from things like Bing chat to the Copilot content creation experiences across the Microsoft
365 portfolio of apps, natural language coding
with GitHub Copilot and a lot more. And now, these large language models, they're all running on Azure. But for this to be possible there had to be some
significant breakthroughs in our AI infrastructure, right? - Yeah, well we've been working on AI at Microsoft for decades now, but there's certainly
been a huge acceleration in AI capability over the last 10 years, driven largely by the rise of GPUs and cloud scale infrastructure. And we've contributed to that in Azure. Over the last few years, we've built a specialized
hardware and software stack to support the training and inference of large language models, or LLMs. Our AI supercomputer in
Azure can train models that are hundreds of billions
of parameters in size. The original GPT-3 was
175 billion parameters, but we've built larger models than that since; for example, Microsoft's Megatron-Turing natural language generation model
has 530 billion parameters. And as newer generations
of LLMs are released, they're going to be even larger. The thing we had to support with LLMs like GPT is their self-supervised learning, where they learn about a language by examining billions of pages of information over and over again.
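To make that self-supervised idea concrete, here's a minimal sketch of next-token-prediction training, assuming PyTorch and the Hugging Face transformers library rather than the actual GPT training pipeline:

```python
# Minimal sketch of self-supervised (next-token prediction) training.
# Assumes PyTorch and Hugging Face transformers; not Azure's actual pipeline.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Raw text is its own training signal: the labels are just the input tokens,
# which the model shifts internally to predict each next token.
batch = tokenizer(["Azure hosts large language models at scale."],
                  return_tensors="pt")
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
outputs.loss.backward()   # cross-entropy on next-token predictions
optimizer.step()
```

Scale that same loop out to billions of pages of text and you get the kind of training job the rest of this discussion is about.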
That directed how we approached building a specialized hardware and software stack that can efficiently run models at that massive scale. - Right, and the challenge
with something like this is that they're super resource intensive and also pretty expensive to run. - An efficient infrastructure
for running things at that size is really critical. You have to deal with failures that are happening on a regular basis. Every few days, and even
sometimes more frequently when you're at large scale, something's going to fail. Like a server's going to fail or a network link's
going to start to flap. So you've got to be able to minimize the rate of failures. When one happens, you've got to be able to diagnose it very quickly and then you've got to be
able to fix it very quickly. A lot of things had to come together to build a hardware and software stack that could support that kind of scale. We've invested a lot in our
datacenter infrastructure, with state-of-the-art hardware that we run in Azure globally
for raw compute horsepower. One of the key things
we do is cluster GPUs with a high-bandwidth network to provide connectivity required
to efficiently train them. We layer on top of that Azure software platform optimizations
where we have, for example, enlightened the platform and the hypervisor to be able to access those networked GPUs so that the performance is comparable to running on bare metal, but with a full
manageability layer on top. As we move up the stack, we've also developed and integrated open-source frameworks like ONNX for model portability, and the DeepSpeed framework that came out of Microsoft Research, which helps componentize these models so that they can train across interconnected GPUs at greater scale and speed.
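As a rough illustration of what that componentization looks like in code, here's a minimal DeepSpeed ZeRO sketch; the model and config values are illustrative assumptions, not what's used for GPT training:

```python
# Minimal DeepSpeed sketch: ZeRO partitions optimizer state and gradients
# across GPUs so larger models fit and train faster. Values are illustrative.
import torch
import deepspeed

model = torch.nn.Sequential(torch.nn.Linear(1024, 4096),
                            torch.nn.GELU(),
                            torch.nn.Linear(4096, 1024))

ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # shard optimizer state + gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine that handles the distributed details.
engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)

x = torch.randn(8, 1024).to(engine.device).half()
loss = engine(x).float().pow(2).mean()  # dummy loss for illustration
engine.backward(loss)                   # DeepSpeed-managed backward pass
engine.step()                           # optimizer step + gradient sync
```

Launched across many InfiniBand-connected GPU servers, the same few lines drive a model that no single GPU could hold.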
In fact, DeepSpeed has become the de facto framework for distributed machine learning training. - Right, let's just put this into context, because for the OpenAI service, the supercomputer that we built for OpenAI back in 2020 comprises more than 285,000 InfiniBand-connected AMD CPU cores. And separately, there are 10,000 NVIDIA V100 Tensor Core GPUs that
are also InfiniBand connected. - Right, so one of the techniques used to train those models is called data parallelism, where you're actually training many, many instances of the model at the same time on small batches of data. And after each batch is processed, the GPUs exchange information and then proceed to the next batch. And that's why you require systems that are that large.
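Here's a hedged sketch of that data-parallel pattern using PyTorch's DistributedDataParallel, one common way to do it rather than the exact stack used for these models: each rank trains its own replica on its own mini-batches, and gradients are averaged across all GPUs after every batch.

```python
# Sketch of data parallelism with PyTorch DDP: every GPU holds a full model
# replica, processes its own mini-batch, then gradients are all-reduced so
# all replicas stay in sync before the next batch. Launch with torchrun.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")            # NCCL over InfiniBand/NVLink
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model)                          # wraps the gradient all-reduce
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        x = torch.randn(32, 1024).cuda()        # each rank gets its own batch
        loss = model(x).pow(2).mean()
        loss.backward()                         # gradients exchanged here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```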
By the way, that system that we built back in 2020 was the fifth largest supercomputer in the world at the time and the largest supercomputer ever built in the public cloud. It's really the result
of a collaborative effort between Microsoft engineering,
Microsoft Research, the OpenAI organization, and NVIDIA. - One of the more impressive things that I've found with this is just how we've been able to
network everything together. And I know that we've
actually used InfiniBand to achieve this, but what
else did you have to do to optimize the hardware side? - Well, we've actually
been using InfiniBand for a decade in our HPC clusters. It has better cost performance than Ethernet, and Azure is unique compared to every other cloud in that respect. It's one of the key reasons
OpenAI actually came to us; we've been battle-testing
this for many years. And of course, we've since
built even bigger systems than the one we talked
about publicly back in 2020. We collaborated with NVIDIA to deliver purpose-built AI infrastructure
with NVIDIA GPUs. The multiple GPUs required to train LLMs don't all operate on the same board or even the same rack. They're spread out due to the
power consumption and heat, and we want to network and cluster as many GPUs as we can. And the new H100 VM series in Azure, powered by NVIDIA H100 Tensor Core GPUs, lets you choose one or more GPUs, cluster up to eight GPUs
per VM using NVLink, and scale out to thousands
if your demands grow, using NVIDIA Quantum-2
InfiniBand networking. That's up to 30x higher
performance in inferencing and 4x higher for training compared to NVIDIA's previous
A100 generation of GPUs. These are designed specifically for multi-tenant cloud compute, giving us the ability to
isolate customers sharing the same server from one another, and with that we can achieve
the elastic scale that we need. - Right, and it's also
critical for something like this when training models of this size, but I'd like to go into throughput. So how much have we
improved the throughput then for our GPUs? - Yeah, so from a throughput perspective, starting with a single node using NVSwitch and NVLink 4.0, bisection bandwidth can
reach 3.6 Terabytes per second between eight local GPUs. And to put that in
perspective, at that speed, you could download about 80 4K videos in less than a second.
- Just wish I could do that. (both laugh)
- Everybody does. Then as you scale out with more nodes, using a non-blocking fat-tree
network architecture, the InfiniBand links deliver 400 gigabits per second per GPU, for a total of 3.2 terabits per second per server.
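As a quick back-of-the-envelope check on those per-server numbers (just arithmetic on the figures mentioned here):

```python
# Back-of-the-envelope math on the per-server InfiniBand numbers above.
gpus_per_server = 8
per_gpu_gbps = 400                           # gigabits per second per GPU
total_gbps = gpus_per_server * per_gpu_gbps  # 3,200 Gb/s
print(f"{total_gbps} Gb/s = {total_gbps / 1000:.1f} Tb/s per server")
print(f"which is roughly {total_gbps / 8:.0f} GB of data per second, per server")
```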
- Okay, so we've got the raw horsepower then in the hardware, but now as a service provider, where we host these models in Azure, how would we make sure
that we can run everything as reliably as we need to? - Well, like I mentioned,
models of this size will typically run for days, weeks, or in some cases months for
the largest model sizes. So, you don't want to
have to restart training from scratch and lose progress
if something goes wrong, like if a server fails or a
network link starts to flap, which will happen on a regular basis over that kind of time period
and that kind of scale. And the way we address that in Azure is with a containerization
and global scheduler service that we've been working
on called Project Forge, which is designed specifically to help run Microsoft's global scale AI workloads and maintain really high
levels of utilization. Project Forge introduces
transparent checkpointing, where it periodically saves the state of a model incrementally, without the model's code
needing to do anything. That way, if anything fails, it can quickly resume from the most recent checkpoint.
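Project Forge does this transparently and incrementally, but as an illustration of what checkpoint-and-resume buys you, here's a minimal manual version in PyTorch, a sketch under simple assumptions rather than Project Forge's actual mechanism:

```python
# Manual checkpoint/resume sketch in PyTorch. Project Forge does the
# equivalent transparently, with no changes to the model's code.
import os
import torch

CKPT = "checkpoint.pt"
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
start_step = 0

# Resume from the most recent checkpoint if one exists.
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 1000):
    x = torch.randn(64, 512)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if step % 100 == 0:  # periodically save the full training state
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT)
```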
We combine this with our integrated global scheduler that pools GPU capacity from
regions around the world. So, if you need to pause a
job to prioritize another one, that allows us to migrate that paused job to another region if necessary and available, with minimal impact on its progress. - So, that lets you kind of load balance all the different jobs
we have at global scale. - That's right, so for example, we can load balance between Bing chat or Copilot services and OpenAI. The global scheduler provides a single view of GPU capacity to ensure resources are not left idle in fixed clusters; instead, jobs are assigned to virtual clusters, and then those virtual clusters get mapped to physical resources when
they're available and needed. - And this is all really pivotal in terms of being able to run
these jobs uninterrupted, so is this something
that the people watching, Azure customers, can use right now? - Right now, Project Forge is
being used for our own services. For example, if you're
using GitHub Copilot, you're using Project Forge under the hood. But we're working on making
it directly available to customers in the near future. I've got a Jupyter Notebook that has a machine learning model
that we pulled off of GitHub. It's open source and has no model checkpointing as part of it. I've connected that Jupyter Notebook to Project Forge's special Python kernel and I've executed the first cell, which initializes the GPUs. So on the right side, I've SSH'd into the GPU server, where you can see the GPU's utilization and its memory consumption.
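For reference, one way to watch GPU utilization and memory like this from an SSH session is nvidia-smi, or programmatically with the NVIDIA management library; this is an assumed tooling sketch, not what's shown on screen:

```python
# Quick sketch: poll GPU utilization and memory the way nvidia-smi reports it,
# using the NVIDIA management library (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU on the server

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU util: {util.gpu}%  memory: {mem.used / 2**30:.1f} / "
          f"{mem.total / 2**30:.1f} GiB")
    time.sleep(2)

pynvml.nvmlShutdown()
```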
You can see that utilization is zero and there's a minimal amount of memory consumption, but now I'm going to kick
off that training job and when I do, you see
the memory usage spikes, you can see a little
bit of GPU utilization. It's still pretty low though, because this model's a tiny model compared to the amount of capacity
that's on this A100 GPU. So we're going to train the model which just takes a short period of time and now the GPU is going to go idle. And at that point Project
Forge sees that it's gone idle and automatically takes a checkpoint, because it knows that I'm
not busy using the GPUs. Under normal circumstances, I would've been hogging the GPU at that point and preventing it from being used for
somebody else's workload. But now that GPU can be repurposed. You can see there's the checkpoint, memory usage goes to zero, GPU utilization goes to zero. Now when I come back though, if I've got priority or a GPU is available, Project Forge can reload the checkpoint, and so that's what I'm going to do. And I can pick up exactly where
I left off in that notebook. So that next cell there depends on the training run I just did, and I can execute it and it'll succeed, because behind the scenes Project Forge has connected the GPU back up with the kernel. - And just to reiterate, you didn't have to modify your code or manually execute that checkpoint, it just ran in the background? - Completely transparently.
- Perfect. - And something that will
help make that possible is our work with our hardware partners, AMD and NVIDIA which
helped implement CRIU, or Checkpoint/Restore In Userspace, for their GPUs. That enables checkpointing and restoring the memory in the GPU and making that consistent with the checkpoints we do on the CPU. - Right, and these are
all really key areas in terms of where our
software and hardware all work together to speed
things up for greater efficiency. So, are there other examples
that you've worked on? - Yeah, another great example of that is what we do for the
process where you want to customize one of
these foundational models for a specific domain or
for a custom data set. So for example, healthcare
data or your enterprise data. And that process of customizing it is called fine-tuning. We can make it great at knowing
how to program, for example. - Right, it's kind of like
in the movie the Matrix where Neo learns Kung Fu. - That's right, would you like a pill? - Gimme the red one this time. - You took the blue one last time. - I did. - I mentioned that GPT-3 model earlier, which has 175 billion parameters. So, if you just fine-tune that using the basic fine-tuning technique, it would require 96 GPUs, and each checkpoint during that fine-tuning training would be about a terabyte in size. To make that more efficient, we focus only on the parts of the model that need to improve by using a technique we developed in Microsoft Research called Low-Rank Adaptation, or LoRA, fine-tuning. Then only that part of the model is updated with changes. With a model like GPT, you would only need to fine-tune far less than 1% of the total parameters. So instead of needing 96 GPUs, you might only need 24. And instead of a terabyte-size checkpoint, it's only about 200 megabytes in size.
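To see why so few parameters end up training, here's a minimal LoRA-style layer in PyTorch, a hedged sketch of the low-rank idea rather than the production implementation:

```python
# Minimal LoRA-style adapter: the big pretrained weight is frozen, and only
# two small low-rank matrices A and B are trained, so the update is B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze pretrained weights
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")   # well under 1%
```

Only the small A and B matrices need to be stored per fine-tune, which is why the checkpoints shrink from a terabyte to a couple hundred megabytes.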
Because we can quickly switch LoRA fine-tuned parameters on top of a loaded base foundational model, we can also optimize serving lots of fine-tuned models on top of a smaller number of GPUs. - That's really a big part in terms of how we've been able to
apply large language models in our own services at Microsoft. On the other hand though, for everyone who's watching, how can everyone use Azure's
supercomputer capabilities for their own workloads in Azure, even if maybe they're coming
from a smaller workload or job? - Well, our AI infrastructure
supports everything from very small jobs like the one you just saw, a pretty tiny job right there, to very large ones at OpenAI scale. So if you just want to take advantage of our optimized hardware infrastructure, you can of course just
use the virtual machines and do it yourself, taking advantage of our
hardware infrastructure, the best in class GPUs,
the InfiniBand networking, the fact that we know how
to tune those networks to maintain them with a high degree of utilization and efficiency. But if you want to focus on
building and running models, you can take advantage of our frameworks and ML Ops services. So for example, you can
use the ONNX Runtime that I mentioned, or use our DeepSpeed training framework for more efficient distributed training.
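For instance, here's a small sketch of the ONNX portability path, exporting a PyTorch model and running it with ONNX Runtime; the file and tensor names are placeholder assumptions:

```python
# Sketch: export a PyTorch model to ONNX, then run it with ONNX Runtime.
# File and tensor names here are placeholders, not a prescribed convention.
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(128, 10).eval()
dummy = torch.randn(1, 128)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"])

session = ort.InferenceSession("model.onnx",
                               providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(1, 128).astype(np.float32)})
print(outputs[0].shape)   # (1, 10)
```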
And you can use Azure Machine Learning PaaS services that natively integrate with our AI infrastructure to build your models. And with AML, you can
easily fine-tune models like GPT-4 and deploy the fine-tuned models
own applications to use. - So, how are we seeing
organizations today that leverage Azure's AI Supercomputer for their own solutions? - Well, there are a lot
of really interesting use cases where customers
are using training and inference on top of our
large scale infrastructure. One of them that's one
of my favorites is Wayve. It's a company in the UK that is a leader in autonomous driving technologies. So what they're doing is specializing in distributed AI-powered
autonomous systems for self-driving vehicles using primarily vision-based machine learning. And they're using our AI supercomputer to gather, manage, and train those models on millions of hours of
driving data per year. That's petabytes of data, consisting of images, GPS data,
and data from other sensors. Each individual training sample can be hundreds of megabytes in size; by comparison, a training sample for an LLM is typically just a few kilobytes in size. They're also training models with billions of parameters and using Azure Machine Learning and AKS for large-scale distributed deep learning and orchestration. And that's key, because Wayve requires a lot of linear scaling, optimal memory utilization, and a fast GPU interconnect with minimal latency and high throughput. And that's where we excel. - Right, and Azure is literally the best place to run something like this. But let's switch gears though, because last time you were on, we actually covered
Confidential Computing, and this is something that
you've helped actually develop as part of the solution set in Azure. So, there's a lot more importance here with Confidential Computing
with AI workloads, right? - That's right. When you talk about AI workloads, you're talking about the data you use to train them, which can be very sensitive data. The model that you train
actually can have a lot of IP. Obviously OpenAI has a
lot of IP in their models, and so do many other customers, like Wayve. And then finally, for the
consumers of those models, like people that are using Azure OpenAI: when they send their queries in, that data is also very sensitive. So with Confidential Computing, you can protect all of
that data end-to-end in a trusted execution environment, which is provided by hardware with a hardware root of trust that only you have access to. That means the data in memory
is protected and encrypted. Data is only in the clear while
running inside of that TEE, and it's re-encrypted and unreadable when it leaves that TEE. That prevents third parties and processes, including Azure operators, from accessing the data and models. That enables confidential
multi-party collaborations in what we call confidential
data cleanrooms, where parties can combine
their datasets for training and each party is unable to
access anyone else's data. In the example of autonomous vehicles, that would allow multiple
vehicle manufacturers, GPS providers, and AI
developers to work together to share the data needed to build intelligent self-driving solutions. And confidential GPU computing is an area we're investing in heavily today with NVIDIA and other hardware partners. We announced at GTC recently that NVIDIA is coming out with confidential Hopper GPUs. - Right, so this is something
that's evolving quickly, this whole space, so what's
next then on the horizon? - Everything we do to build these systems pays dividends to anybody using Azure, whether using AI or not. It means they're going to have lower cost, higher scale, more efficient data centers, more powerful software frameworks. Right now, in the area of AI, we're in the process of rolling
out those NVIDIA H100 GPUs for AI and HPC, and you're
going to see us make Project Forge available
for everyone to use as you train and run your AI models like I mentioned earlier. And there's lots more innovation, on an ongoing basis to always make sure we're best in class to run
and train your AI workloads. - Really such a great deep dive, and I learn so much when you're on the show. Thanks so much for joining us today. And by the way, if you want to leverage Azure AI capabilities for yourself, you can start now by using
our GPU enabled compute in Azure along with the Azure
Machine Learning service. And there are lots of options available to get started really at any scale. So check out aka.ms/AzureAIInfrastructure and also keep watching Microsoft Mechanics for all the latest updates. Thank you so much for watching and we'll see you next time.