(music) - Ever wondered what it takes
to run the infrastructure that hosts ChatGPT with over
100 million users today, and how this infrastructure can work for your workloads in Azure at any scale? Well, keep watching as we break down the AI supercomputer
infrastructure built to run ChatGPT and other large language models. And joining us to go
deep on all the details, is Azure CTO Mark Russinovich. Welcome back to Mechanics! - Thanks for having me
back on the show, Jeremy. It's been a long time. - It's been really, really too long, so thanks for coming back. You know, especially now, we're seeing a major shift in technology with AI and large language models. Almost overnight, these have now defined a brand new way of interacting with your apps and data. And this really includes
Microsoft services, from things like Bing chat to the Copilot content creation experiences across the Microsoft
365 portfolio of apps, natural language coding
with GitHub Copilot and a lot more. And now, these large language models, they're all running on Azure. But for this to be possible there had to be some
significant breakthroughs in our AI infrastructure, right? - Yeah, well we've been working on AI at Microsoft for decades now, but there's certainly
been a huge acceleration in AI capability over the last 10 years, driven largely by the rise of GPUs and cloud scale infrastructure. And we've contributed to that in Azure. Over the last few years, we've built a specialized
hardware and software stack to support the training and inference of large language models, or LLMs. Our AI supercomputer in
Azure can train models that are hundreds of billions
of parameters in size. The original GPT-3 was
175 billion parameters, but we've built larger models than that since; for example, Microsoft's Megatron-Turing natural language generation model
has 530 billion parameters. And as newer generations
of LLMs are released, they're going to be even larger. The thing we had to support with LLMs like GPT is their self-supervised learning, where they learn about a language by examining billions of pages of information over and over again.
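To make that self-supervised idea concrete, here's a minimal sketch of next-token-prediction training, assuming PyTorch and the Hugging Face transformers library rather than the actual GPT training pipeline:

```python
# Minimal sketch of self-supervised (next-token prediction) training.
# Assumes PyTorch and Hugging Face transformers; not Azure's actual pipeline.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Raw text is its own training signal: the labels are just the input tokens,
# which the model shifts internally to predict each next token.
batch = tokenizer(["Azure hosts large language models at scale."],
                  return_tensors="pt")
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
outputs.loss.backward()   # cross-entropy on next-token predictions
optimizer.step()
```

Scale that same loop out to billions of pages of text and you get the kind of training job the rest of this discussion is about.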
That directed how we approached building a specialized hardware and software stack that can efficiently run models at that massive scale. - Right, and the challenge
with something like this is that they're super resource intensive and also pretty expensive to run. - An efficient infrastructure
for running things at that size is really critical. You have to deal with failures that are happening on a regular basis. Every few days, and even
sometimes more frequently when you're at large scale, something's going to fail. Like a server's going to fail or a network link's
going to start to flap. So you've got to be able to minimize the rate of failures. When one happens, you've got to be able to diagnose it very quickly and then you've got to be
able to fix it very quickly. A lot of things had to come together to build a hardware and software stack that could support that kind of scale. We've invested a lot in our
datacenter infrastructure, with state-of-the-art hardware that we run in Azure globally
for raw compute horsepower. One of the key things
we do is cluster GPUs with a high-bandwidth network to provide connectivity required
to efficiently train them. We layer on top of that Azure software platform optimizations
where we have, for example, enlightened the platform and the hypervisor to be able to access those networked GPUs so that the performance is comparable to running on bare metal, but with a full
manageability layer on top. As we move up the stack, we've also developed and integrated open-source frameworks like ONNX for model portability, and the DeepSpeed framework that came out of Microsoft Research, which helps componentize these models so that they can train across interconnected GPUs at greater scale and speed.
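As a rough illustration of what that componentization looks like in code, here's a minimal DeepSpeed ZeRO sketch; the model and config values are illustrative assumptions, not what's used for GPT training:

```python
# Minimal DeepSpeed sketch: ZeRO partitions optimizer state and gradients
# across GPUs so larger models fit and train faster. Values are illustrative.
import torch
import deepspeed

model = torch.nn.Sequential(torch.nn.Linear(1024, 4096),
                            torch.nn.GELU(),
                            torch.nn.Linear(4096, 1024))

ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # shard optimizer state + gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine that handles the distributed details.
engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)

x = torch.randn(8, 1024).to(engine.device).half()
loss = engine(x).float().pow(2).mean()  # dummy loss for illustration
engine.backward(loss)                   # DeepSpeed-managed backward pass
engine.step()                           # optimizer step + gradient sync
```

Launched across many InfiniBand-connected GPU servers, the same few lines drive a model that no single GPU could hold.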
In fact, DeepSpeed has become the de facto framework for distributed machine learning training. - Right, let's just put this into context, because for the OpenAI service, the supercomputer that we built for OpenAI back in 2020 comprises more than 285,000 InfiniBand-connected AMD CPU cores. And separately, there are 10,000 NVIDIA V100 Tensor Core GPUs that
are also InfiniBand connected. - Right, so one of the techniques used to train those models is called data parallelism, where you're actually training many, many instances of the model at the same time on small batches of data. And after each batch is processed, the GPUs exchange information and then proceed to the next batch. And that's why you require systems that are that large.
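Here's a hedged sketch of that data-parallel pattern using PyTorch's DistributedDataParallel, one common way to do it rather than the exact stack used for these models: each rank trains its own replica on its own mini-batches, and gradients are averaged across all GPUs after every batch.

```python
# Sketch of data parallelism with PyTorch DDP: every GPU holds a full model
# replica, processes its own mini-batch, then gradients are all-reduced so
# all replicas stay in sync before the next batch. Launch with torchrun.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")            # NCCL over InfiniBand/NVLink
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model)                          # wraps the gradient all-reduce
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        x = torch.randn(32, 1024).cuda()        # each rank gets its own batch
        loss = model(x).pow(2).mean()
        loss.backward()                         # gradients exchanged here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```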
By the way, that system that we built back in 2020 was the fifth largest supercomputer in the world at the time and the largest supercomputer ever built in the public cloud. It's really the result
of a collaborative effort between Microsoft engineering,
Microsoft Research, the OpenAI organization, and NVIDIA. - One of the more impressive things that I've found with this is just how we've been able to
network everything together. And I know that we've
actually used InfiniBand to achieve this, but what
else did you have to do to optimize the hardware side? - Well, we've actually
been using InfiniBand for a decade in our HPC clusters. It has better cost performance than Ethernet, and Azure is unique compared to every other cloud in that respect. It's one of the key reasons
OpenAI actually came to us; we've been battle-testing
this for many years. And of course, we've since
built even bigger systems than the one we talked
about publicly back in 2020. We collaborated with NVIDIA to deliver purpose-built AI infrastructure
with NVIDIA GPUs. The multiple GPUs required to train LLMs don't all operate on the same board or even the same rack. They're spread out due to the
power consumption and heat, and we want to network and cluster as many GPUs as we can. And the new H100 VM series in Azure, powered by NVIDIA H100 Tensor Core GPUs, lets you choose one or more GPUs, cluster up to eight GPUs
per VM using NVLink, and scale out to thousands
if your demands grow, using NVIDIA Quantum-2
InfiniBand networking. That's up to 30x higher
performance in inferencing and 4x higher for training compared to NVIDIA's previous
A100 generation of GPUs. These are designed specifically for multi-tenant cloud compute, giving us the ability to
isolate customers sharing the same server from one another, and with that we can achieve
the elastic scale that we need. - Right, and it's also
critical for something like this when training models of this size, but I'd like to go into throughput. So how much have we
improved the throughput then for our GPUs? - Yeah, so from a throughput perspective, starting with a single node using NVSwitch and NVLink 4.0, bisection bandwidth can
reach 3.6 Terabytes per second between eight local GPUs. And to put that in
perspective, at that speed, you could download about 80 4K videos in less than a second.
- Just wish I could do that. (both laugh)
- Everybody does. Then as you scale out with more nodes, using a non-blocking fat-tree
network architecture, the InfiniBand links deliver 400 gigabits per second per GPU, for a total of 3.2 terabits per second per server.
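As a quick back-of-the-envelope check on those per-server numbers (just arithmetic on the figures mentioned here):

```python
# Back-of-the-envelope math on the per-server InfiniBand numbers above.
gpus_per_server = 8
per_gpu_gbps = 400                           # gigabits per second per GPU
total_gbps = gpus_per_server * per_gpu_gbps  # 3,200 Gb/s
print(f"{total_gbps} Gb/s = {total_gbps / 1000:.1f} Tb/s per server")
print(f"which is roughly {total_gbps / 8:.0f} GB of data per second, per server")
```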
- Okay, so we've got the raw horsepower then in the hardware, but now as a service provider, where we host these models in Azure, how would we make sure
that we can run everything as reliably as we need to? - Well, like I mentioned,
models of this size will typically run for days, weeks, or in some cases months for
the largest model sizes. So, you don't want to
have to restart training from scratch and lose progress
if something goes wrong, like if a server fails or a
network link starts to flap, which will happen on a regular basis over that kind of time period
and that kind of scale. And the way we address that in Azure is with a containerization
and global scheduler service that we've been working
on called Project Forge, which is designed specifically to help run Microsoft's global scale AI workloads and maintain really high
levels of utilization. Project Forge introduces
transparent checkpointing, where it periodically saves the state of a model incrementally, without the model's code
needing to do anything. That way, if anything fails, it can quickly resume from the most recent checkpoint.
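Project Forge does this transparently and incrementally, but as an illustration of what checkpoint-and-resume buys you, here's a minimal manual version in PyTorch, a sketch under simple assumptions rather than Project Forge's actual mechanism:

```python
# Manual checkpoint/resume sketch in PyTorch. Project Forge does the
# equivalent transparently, with no changes to the model's code.
import os
import torch

CKPT = "checkpoint.pt"
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
start_step = 0

# Resume from the most recent checkpoint if one exists.
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 1000):
    x = torch.randn(64, 512)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if step % 100 == 0:  # periodically save the full training state
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT)
```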
We combine this with our integrated global scheduler that pools GPU capacity from
regions around the world. So, if you need to pause a
job to prioritize another one, that allows us to migrate that paused job to another region if necessary and available, with minimal impact on its progress. - So, that lets you kind of load balance all the different jobs
we have at global scale. - That's right, so for example, we can load balance between Bing chat or Copilot services and OpenAI. The global scheduler provides a single view of GPU capacity to ensure resources are not left idle in fixed clusters; instead, jobs are assigned to virtual clusters, and then those virtual clusters get mapped to physical resources when
they're available and needed. - And this is all really pivotal in terms of being able to run
these jobs uninterrupted, so is this something
that the people watching, Azure customers, can use right now? - Right now, Project Forge is
being used for our own services. For example, if you're
using GitHub Copilot, you're using Project Forge under the hood. But we're working on making
it directly available to customers in the near future. I've got a Jupyter Notebook that has a machine learning model
that we pulled off of GitHub. It's open source and has no model checkpointing as part of it. I've connected that Jupyter Notebook to Project Forge's special Python kernel and I've executed the first cell, which initializes the GPUs. So on the right side, I've SSH'd into the GPU server, where you can see the GPU's utilization and its memory consumption.
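For reference, one way to watch GPU utilization and memory like this from an SSH session is nvidia-smi, or programmatically with the NVIDIA management library; this is an assumed tooling sketch, not what's shown on screen:

```python
# Quick sketch: poll GPU utilization and memory the way nvidia-smi reports it,
# using the NVIDIA management library (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU on the server

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU util: {util.gpu}%  memory: {mem.used / 2**30:.1f} / "
          f"{mem.total / 2**30:.1f} GiB")
    time.sleep(2)

pynvml.nvmlShutdown()
```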
You can see that utilization is zero and there's a minimal amount of memory consumption, but now I'm going to kick
off that training job and when I do, you see
the memory usage spikes, you can see a little
bit of GPU utilization. It's still pretty low though, because this model's a tiny model compared to the amount of capacity
that's on this A100 GPU. So we're going to train the model which just takes a short period of time and now the GPU is going to go idle. And at that point Project
Forge sees that it's gone idle and automatically takes a checkpoint, because it knows that I'm
not busy using the GPUs. Under normal circumstances, I would've been hogging the GPU at that point and preventing it from being used for
somebody else's workload. But now that GPU can be repurposed. You can see there's the checkpoint, memory usage goes to zero, GPU utilization goes to zero. Now when I come back though, if I've got priority or a GPU is available, Project Forge can reload the checkpoint, and so that's what I'm going to do. And I can pick up exactly where
I left off in that notebook. So that next cell there depends on the training run I just did, and I can execute it and it'll succeed, because behind the scenes Project Forge has connected the GPU back up with the kernel. - And just to reiterate, you didn't have to modify your code or manually execute that checkpoint, it just ran in the background? - Completely transparently.
- Perfect. - And something that will
help make that possible is our work with our hardware partners, AMD and NVIDIA which
helped implement CRIU, or Checkpoint/Restore In Userspace, for their GPUs. That enables checkpointing and restoring the memory in the GPU and making that consistent with the checkpoints we do on the CPU. - Right, and these are
all really key areas in terms of where our
software and hardware all work together to speed
things up for greater efficiency. So, are there other examples
that you've worked on? - Yeah, another great example of that is what we do for the
process where you want to customize one of
these foundational models for a specific domain or
for a custom data set. So for example, healthcare
data or your enterprise data. And that process of customizing it is called fine-tuning. We can make it great at knowing
how to program, for example. - Right, it's kind of like
in the movie the Matrix where Neo learns Kung Fu. - That's right, would you like a pill? - Gimme the red one this time. - You took the blue one last time. - I did. - I mentioned that GPT-3 model earlier, which has 175 billion parameters. So, if you just fine-tune that using the basic fine-tuning technique, it would require 96 GPUs, and each checkpoint during that fine-tuning training would be about a terabyte in size. To make that more efficient, we focus only on the parts of the model that need to improve by using a technique we developed in Microsoft Research called Low-Rank Adaptation, or LoRA, fine-tuning. Then only that part of the model is updated with changes. With a model like GPT, you would only need to fine-tune far less than 1% of the total parameters. So instead of needing 96 GPUs, you might only need 24. And instead of a terabyte-size checkpoint, it's only about 200 megabytes in size.
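To see why so few parameters end up training, here's a minimal LoRA-style layer in PyTorch, a hedged sketch of the low-rank idea rather than the production implementation:

```python
# Minimal LoRA-style adapter: the big pretrained weight is frozen, and only
# two small low-rank matrices A and B are trained, so the update is B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze pretrained weights
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")   # well under 1%
```

Only the small A and B matrices need to be stored per fine-tune, which is why the checkpoints shrink from a terabyte to a couple hundred megabytes.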
Because we can quickly switch LoRA fine-tuned parameters on top of a loaded base foundational model, we can also optimize serving lots of fine-tuned models on top of a smaller number of GPUs. - That's really a big part in terms of how we've been able to
apply large language models in our own services at Microsoft. On the other hand though, for everyone who's watching, how can everyone use Azure's
supercomputer capabilities for their own workloads in Azure, even if maybe they're coming
from a smaller workload or job? - Well, our AI infrastructure
supports everything from very small jobs like the one you just saw, a pretty tiny job right there, to very large ones at OpenAI scale. So if you just want to take advantage of our optimized hardware infrastructure, you can of course just
use the virtual machines and do it yourself, taking advantage of our
hardware infrastructure, the best in class GPUs,
the InfiniBand networking, the fact that we know how
to tune those networks to maintain them with a high degree of utilization and efficiency. But if you want to focus on
building and running models, you can take advantage of our frameworks and ML Ops services. So for example, you can
use the ONNX Runtime that I mentioned, or use our DeepSpeed training framework for more efficient distributed training.
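For instance, here's a small sketch of the ONNX portability path, exporting a PyTorch model and running it with ONNX Runtime; the file and tensor names are placeholder assumptions:

```python
# Sketch: export a PyTorch model to ONNX, then run it with ONNX Runtime.
# File and tensor names here are placeholders, not a prescribed convention.
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(128, 10).eval()
dummy = torch.randn(1, 128)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"])

session = ort.InferenceSession("model.onnx",
                               providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(1, 128).astype(np.float32)})
print(outputs[0].shape)   # (1, 10)
```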
And you can use Azure Machine Learning PaaS services that natively integrate with our AI infrastructure to build your models. And with AML, you can
easily fine-tune models like GPT-4 and deploy the fine-tuned models
own applications to use. - So, how are we seeing
organizations today that leverage Azure's AI Supercomputer for their own solutions? - Well, there are a lot
of really interesting use cases where customers
are using training and inference on top of our
large scale infrastructure. One of them that's one
of my favorites is Wayve. It's a company in the UK that is a leader in autonomous driving technologies. So what they're doing is specializing in distributed AI-powered
autonomous systems for self-driving vehicles using primarily vision-based machine learning. And they're using our AI supercomputer to gather, manage, and train those models on millions of hours of
driving data per year. That's petabytes of data, consisting of images, GPS data,
and data from other sensors. Each individual training sample can be hundreds of megabytes in size; by comparison, a training sample for an LLM is typically just a few kilobytes in size. They're also training models with billions of parameters and using Azure Machine Learning and AKS for large-scale distributed deep learning and orchestration. And that's key, because Wayve requires a lot of linear scaling, optimal memory utilization, and a fast GPU interconnect with minimal latency and high throughput. And that's where we excel. - Right, and Azure is literally the best place to run something like this. But let's switch gears though, because last time you were on, we actually covered
Confidential Computing, and this is something that
you've helped actually develop as part of the solution set in Azure. So, there's a lot more importance here with Confidential Computing
with AI workloads, right? - That's right. When you talk about AI workloads, you're talking about the data you use to train them, which can be very sensitive data. The model that you train
actually can have a lot of IP. Obviously OpenAI has a
lot of IP in their models, and so do many other customers, like Wayve. And then finally, for the
consumers of those models, like people that are using Azure OpenAI: when they send their queries in, that data is also very sensitive. So with Confidential Computing, you can protect all of
that data end-to-end in a trusted execution environment, which is provided by hardware with a hardware root of trust that only you have access to. That means the data in memory
is protected and encrypted. Data is only in the clear while
running inside of that TEE, and it's re-encrypted and unreadable when it leaves that TEE. That prevents third parties and processes, including Azure operators, from accessing the data and models. That enables confidential
multi-party collaborations in what we call confidential
data cleanrooms, where parties can combine
their datasets for training and each party is unable to
access anyone else's data. In the example of autonomous vehicles, that would allow multiple
vehicle manufacturers, GPS providers, and AI
developers to work together to share the data needed to build intelligent self-driving solutions. And confidential GPU computing is an area we're investing in heavily today with NVIDIA and other hardware partners. We announced at GTC recently that NVIDIA is coming out with confidential Hopper GPUs. - Right, so this is something
that's evolving quickly, this whole space, so what's
next then on the horizon? - Everything we do to build these systems pays dividends to anybody using Azure, whether using AI or not. It means they're going to have lower cost, higher scale, more efficient data centers, more powerful software frameworks. Right now, in the area of AI, we're in the process of rolling
out those NVIDIA H100 GPUs for AI and HPC, and you're
going to see us make Project Forge available
for everyone to use as you train and run your AI models like I mentioned earlier. And there's lots more innovation, on an ongoing basis to always make sure we're best in class to run
and train your AI workloads. - Really such a great deep dive, and I learn so much when you're on the show. Thanks so much for joining us today. And by the way, if you want to leverage Azure AI capabilities for yourself, you can start now by using
our GPU enabled compute in Azure along with the Azure
Machine Learning service. And there are lots of options available to get started really at any scale. So check out aka.ms/AzureAIInfrastructure and also keep watching Microsoft Mechanics for all the latest updates. Thank you so much for watching and we'll see you next time.