[MUSIC] [MUSIC] SPEAKER 1:
We are in a world of opportunity to create experiences that delight
customers more than ever before. To make split-second
business decisions, smarter. To deliver impact, faster. At Microsoft Azure, we believe in the power
of technology to drive limitless innovation and make
your vision possible with a technology partner who has the experience and
expertise you can trust. We are dedicated to your
security and responsibility, ready to meet you wherever you
are in your Cloud journey. With purpose-built AI
supercomputing infrastructure, combined with
industry-leading AI services, creating game-changing
experiences is simpler and easier. With a suite of tools all
working together, and the freedom to use whatever
language and tools you prefer, you can collaborate, design, and accelerate, building
intelligent apps. Innovate faster with
technology that is unified and efficiently
streamline operations, consistently manage
solutions, unify data, and get the insights
you need to make intelligent decisions to
drive your business forward, all from one place, in Azure. Your success has always
been our business. So you can continue to trust that your data
is always safe, securely stored, and
completely owned by you. And you can take
your business to new heights with confidence, by embracing technology
with built-in tools to help you use AI responsibly. As you build today's
solutions and explore the edges of
what's possible tomorrow, we'll continue to innovate to support you every
step of the way. [MUSIC] [MUSIC] SETH JUAREZ:
Please welcome Azure Chief Technology Officer and Technical Fellow,
Mark Russinovich. (applause) MARK RUSSINOVICH:
Hello, everybody. Good afternoon. How's Ignite going?
That good, huh? How's Ignite going? (audience cheering) It's the best Ignite
so far this year. (laughter) Welcome to Inside
Microsoft AI Innovation. Today, what I'm going to talk
about is, surprisingly, AI. I'm going to talk about the
entire stack in innovation that we've put into the development of the
stack from the very bottom, our data centers, up
into our servers, up into our accelerators, up into the software
services that are the platform for where we do
AI training and inference, up into how we
develop our products, including some of the
technical capabilities for how we actually make sure that
these systems act reliably. Then I'll talk about some of the innovation that spans
from top to bottom. I'm going to start
with talking about our sustainable data centers. By the way, because
this is an AI talk, and because AI is sometimes
used to generate jokes, I thought it'd be funny to see what ChatGPT-- what jokes it would come up with
for these sections. I had to filter through
a bunch of them, but I picked a few that
I thought were funny, and I'll just start
with this one, which I thought was funny. Why did the sustainable
data center get an award? ...because it was the
coolest place in town, and I'm not just talking
about the temperature. (laughter) ChatGPT, there's one thing
it's good at, it's bad jokes. (laughter) Well, let's talk a
little bit about the evolution of
our data centers. If you take a look
at that evolution, it goes back into where
Microsoft started, which was with Colo
facilities back in 1989. The metric that you
used to measure the efficiency of a
data center is PUE, which is a metric for how much IT energy comes out for every watt of energy that
goes into the data center. A 1.0 is the best you can do. So you can see when we started, we were at 2.0. In 2007, you can see
that we were at 1.5-1.8, and this is the typical
range that, even today, most enterprise Colo
facilities operate at. Through multiple generations, exploring different
architectures. You can see there that
we had modular data centers. Back in 2012, we
decided that that wasn't providing the efficiency that we really were going for, and so in 2015, we started to build these
hyperscale data centers, and you saw in the
video a picture of one. We've proceeded up now into the 2020-2023 timeframe where we've taken new steps
forward in simplifying the electrical
mechanical facilities inside to make them more
reliable and more efficient. We've also made them operate
at higher temperature, and we've also made them
support liquid-to-air cooling, as well as
liquid-to-liquid cooling. One of the other things
we've done is that we're operating them with what's
called flex capacity, which means we oversubscribe the backup power in
the data center. What that means is if there's a failure for utility power, that part of the data center
is going to lose power. That part of the data
center is where we put our three nines
workloads which are highly reliable multi-regional
workloads that can tolerate a data center in
one region going offline. We don't put production
customers in that flex capacity. But it allows us overall,
because we've optimized our workloads to support that, to operate
more efficiently. Our latest data centers
have 2.4 megawatts of flex capacity in a
nine-megawatt footprint. Now, one of the
other things we're looking at for sustainability is not just the utility power
inside the data center, but how we do that backup power. Up to now, we've been
using diesel generators. That's the standard for how you make backup utility power
for your data centers. When the utility goes offline, you fire up the
diesel generators, which are not obviously very
environmentally friendly. We've been exploring
different ways, fuel cells, to power our data centers as
backup since 2013. In 2018, we turned our attention
to hydrogen fuel cells. In hydrogen fuel cells, what you do is take in hydrogen, you split off the
proton and electron. You have a membrane, which is why it's called a proton
exchange membrane. But the protons are allowed
to go through the membrane. The electrons go around,
power the servers, come back, meet the protons, mix with
air and you produce water. So it's extremely green. No combustible materials
at all in the process. We've done various pilots. So back in 2018, when we
started exploring this, we started with 10 kilowatts. What made this challenging,
is that this has been a very expensive
process to develop. There was no off-the-shelf
solutions we could go to. So, working with the
hardware suppliers, we started to build
out these systems. Our goal is to drive
this cost down, to standardize this, and make it at scale, which will make it possible for everybody to take
advantage of this. After the 10 kilowatt run, we powered 50 kilowatts of
server with the backup. Then we moved to 250 kilowatts. In June 2020, this
250-kilowatt system, we actually had powering 10 racks, which is a row of servers, for a total of
48 consecutive hours, which is the target for diesel
generator backup power. With those successful
pilots behind us, we decided to move on to
a larger-scale system, and this is a
three-megawatt system, which we bought
online last year. We completed a full test
of taking the servers, powering it fully with
this backup 3 megawatts of hydrogen fuel cell. There you can see,
that's the exhaust, which is just water
vapor coming off. There you can see how
we refill the system. That's just one example
of how we're trying to make our data centers
more and more green. Now going inside
the data center, we need to get to the
servers and make sure that they're efficiently cooled. One of the challenges
we've been facing is the rising thermals of CPUs. Where you can see
this chart here, as time has gone on, the CPUs are consuming
more and more energy, more watts, which
means more to cool. The same thing has
been happening in GPUs, even more drastically. You can see the jump from A100s, little under 450 watts per an A100 chip to 750
watts for an H100 chip. The Hopper chip, which
is the current one. Air cooling is not
that efficient. You need to bring
in large volumes of air to cool a watt of energy. We've been exploring
liquid cooling. If you've come to my Azure
Innovation talks in the past, you've seen me talk
about various types of liquid cooling
we've been exploring. What we're focused on right
now is cold plate cooling, and you saw this in
Satya's Keynote, the Sidekick sitting next to the Maia chassis, which is liquid-cooling
those Maia parts. I want to take you in and show you a little
underneath the hood. This is one of the Maia
boards there, and you can see the accelerator modules there at the top right.
There's four of them. You can see that
the water cables are coming up from
the bottom left there to cool those. That would be coming from the Sidekick supply. And if we zoom in a
little bit on those, there's a water intake
and a water outlet. The cool water goes in, the hot water comes out. This is a closed-circuit system, so very water efficient as well. If you take a look at the accelerator
module itself, here's a even closer look of how the water comes
into the system through that module, with the top taken off, and there's
where it plugs in. That's an example of
water cooling. It's traditional cold
plate water cooling. There's cold plates
on top of the part, as the part heats up, the plate gets hot
and the water's flowing over it to
take the heat away. But we think that there's
a dramatically better way to cool using cold plates. It's actually by combining cold plates with,
actually, engineering of the parts themselves. If you take a look at the
cold plate on the left, where you can see
the part there and the waters coming in and going
out through those tubes, those pipes, what we're exploring is something called
microfluidics cooling. The idea with microfluidics
cooling is that we actually etch in
the silicon, ways, places where the water can flow directly touching the part instead of going
through a cold plate. When you take a look
at microfluidics cooling cross-section here, you can see this interposer
there at the bottom, you can see that gap
there, that inlet, that's where the water comes in. You can see those red layers, those are actually the
logic, whether it's a CPU, GPU, accelerator or an FPGA. Using this technology, we can actually stack them as well and have the water flow over them through these
microfluidics channels. When we break that
under a 3D look, this is what it looks
like from the side. Those black areas in the top view, that's what are
called micro-pin fin heat sinks. As those heat sinks heat up, because they're taking
power off of the part, the fluid flows in through
the coolant inlet, flows over them across
those heat sinks, and out through the other side. With this kind approach, we get two to three
times the performance of cold plates and supports flux values as much as 1,000
watts per square centimeter. Here's another look at
this, this is actually a Core i7-80700 CPU, which is a 95-watt part. We're able to overclock it
to have it produce 215 watts of power and still cool
it with microfluidics. This is decreasing the
thermal resistance by a staggering 44.5 percent against the
original heat sink design that comes with coal plates. This is an extremely promising
direction, really taking and supercharging cold
plate cooling technologies. Now let's turn our
attention inside from the servers into
the parts inside the servers when it comes to AI. There's a couple parts in the server, one of
them, of course, is the AI-specific server as a whole. You've seen
some of the stats, Satya talked a little
bit about this. The AI supercomputers we've
been building for OpenAI, for them to train
their models on. The first one we built, we launched back in 2020. This one had 10,000 V100 GPUs. We estimated that if
we'd submitted it to the top 500
supercomputer list, it would have ended up in the
top five supercomputers in the world, and the largest
one in the public cloud. Now you know that Satya talked about the latest generation of the supercomputer,
which is actually two generations after the one on the far left,
because GPT-4 was trained on another generation
of A100 supercomputers. But this H100 generation of supercomputer that
we'll be building out to support the training
of the next generation, or the current generation
that OpenAI is working on, know that we've submitted
that to the top 500, actually formally.
It's 14,400 H100 GPUs. It came up number 3 in
that top 500 list of largest supercomputers
in the world, and the largest supercomputer
in the public cloud. Now, Satya talked about this as a fraction of the system that
we're building for OpenAI. What is the whole system? What is the size of that system? Where would it end up
within the top 500? I worked hard working with the AI Ops program to see what
I could publicly tell you, and I'm pleased to say, that they've allowed
me to tell you this. (laughter) (applause) Now one of the things that
makes these supercomputers so useful for these
large-scale workloads is that they're using InfiniBand back-end networks that
connect all the servers together with very high bandwidth and
extremely low latency. With InfiniBand, you can
get latencies of about 1.5 microseconds
between the servers, and up to 3.2
terabytes per second off of each individual
VM. We support all the standard
primitives for MPI, NCCL as a library for AI
synchronization of weights and parameters
coming across those GPUs. Now one of the things that
I think is interesting is showing you actual
pictures of hardware. You might have seen
something like this before, this is the back view of the chassis running
those H100 systems. You can see the light
blue-colored wires, those are the InfiniBand cables, which, as Satya mentioned, 20,000 miles of InfiniBand
cable in a data center, which is enough to wrap
the world a few times. Then here you can see that one of the chassis
pulled open in those 8 H100s per server
inside of that. Now what makes Azure somewhat unique, is that those
supercomputers we're building for OpenAI, with
those H100 servers, with that massive InfiniBand network, is exactly what we make
available to our customers, for you to use if you want to do training or inference
on the public cloud. The only difference is the size of the
InfiniBand network. Our public clusters
don't require the scale of an
OpenAI training job, and so they're smaller
InfiniBand networks of a few thousand servers instead of tens of
thousands of servers. But otherwise, it's
exactly the same parts, and in the H100 case, we offer that through
the ND H100 v5 VMs. You can see that you can
have a single GPU, you can have eight GPUs connected
with NVLink, or you can have that InfiniBand
back-end network and use that
across your servers. That kind of system, we formally
submitted to another benchmark, MLperf and Satya talked
a little bit about this. Some more detail behind that. The previous record
for training of BERT, which is a 350 million-
parameter model, it's an older large
language model, was set back in March of 2020. 5.4 minutes to train that
model was the world record. Now, GPT-3, which is a 175
billion-parameter model, the record for training that was set earlier this year
at 10.9 minutes. Our run on the same hardware
that I just talked about, trained GPT-3 in four
minutes, and this is across 1,300 ND H100 VMs. Now, that is the world record,
four minutes to train something that, back in 2020,
on that supercomputer I showed you back then,
took weeks to train. Now four minutes. And we're virtualized
as a public cloud, but we did a run
with Nvidia jointly. That run produced GPT-3
training in 3.92 minutes, so two percent performance overhead from the virtualization that we've got in
our public cloud. We're offering the same
thing, Satya talked about, through a partnership
with a AMD. Bringing into market AMDs MI300X accelerators for
training and inference. You can see the
architecture here, some of the unique
characteristics of this 192GB of HBM3 memory. That high-bandwidth
memory is what allows you to do large
language models, which, these very large models
require a huge amount of memory and that has to be extremely low latency
and high bandwidth, which is what HBM provides you. You can see the same kind
of configuration option, up to eight of them in the
same server that you can use. Also connected on
back-end networks with InfiniBand, 400GB
dedicated links per GPU. We're also looking at other types of accelerators. One of the places
that we've seen, is a huge demand for
video transcoding. If you think about
it, it's obvious. All those Teams calls
are video transcoding. What we want to do is AI processing on top of
those video streams. We want to be able to do effects like you're always
looking at the camera instead of asleep when
you're in the meeting. We want to make it so that the AI model can understand what people are
doing, who's in the image. That means that we
need to efficiently process those images on a GPU with models that
can process those images. Now the challenge is that that GPU is an expensive part and
it consumes a lot of power. It's not designed specifically
for video transcoding, although you can use it
for video transcoding. But what we're going after is
very efficient transcoding. By efficiency we mean, providing the transcoding with the appropriate
latency envelope, but also with extremely low power. We've been working on our own custom AI video accelerator. And this video
accelerator can handle both the decode and
the encode tasks on either side of the AI processing that
might happen on an image. Let's go take a quick look to see just how efficient
this thing is. Here on the left side,
I've got an H100, and that's what I'm going
to be using to process ffmpeg video stream while I'm measuring the
amount of energy it consumes. On the right side, I'm using our custom video
accelerator transcoder. I'm using the same
benchmark on it. You can see that the
power consumption measured in frames per second, per watt, on the
H100, is about 30. On the custom video
accelerator, it's about 150. Roughly five times more
efficient with a custom part. This is an example of, when
you get to enough scale, it totally makes sense
for you to get into a custom accelerator
rather than just use something that's designed for a different
type of workload. Now, next kind of accelerator
I'm going to talk about is one
that Satya announced, which is the Maia
part, the M100. This part, 105
billion transistors on a five nanometer process, making it one of the
largest processors created at that node, and just created in general. This has obviously
been designed custom for Microsoft
AI workloads. Now, one of the things that
I'm going to show you is behind the scenes of what one
of those parts looks like. This is what's
called a probe table, where we're developing the part. One of the things that
the system engineers need to do is to stick
probes in so they can measure the
electrical signals on those parts and the
system-on-a-chip, or SOC, before we actually go put this in a server chassis and
get it online and running. As you can see, I've got
one, a live one, here. This is a Maia M100
sitting here. This is the part here,
underneath this. Now you can see that this
isn't liquid cooled. We don't see a sidekick here like you saw on the stage in Satya's demo. We've
got a bunch of fans. In fact, when I saw
this, it reminded me of my college dorm
room on a hot day. There's me in the middle. You can see that
there's the Maia part. Over here are fins on the power. Here you can see debug ports for the developers to be able
to debug the whole system. Then down here on
the lower left, this is the security processor. The security module
that controls security of the system as it boots
up the firmware. Now that is the
hardware side of Maia. The software side of
Maia looks like this. There's models and applications, of course. There's frameworks, PyTorch being the one that
is one of the most popular. There's the ONYX Runtime, which is used to accelerate and abstract some of the
underlying hardware. Then you can see the Maia SDK, which includes programmer
models, compilers, developer tools and libraries
that we've been developing. Then a runtime for that, sitting on top of those
Maia accelerators. This is a representation of
the Maia-specific view of our overall vision for AI development down at
the hardware level. You've got models
and applications, you've got the frameworks, but one of the challenges
with frameworks and the diversity of different accelerators
that you see, Nvidia parts, AMD parts, now Maia, is having to
write custom kernels, they're called, to optimize AI operations on top of a
particular piece of hardware. We've been partnering with OpenAI on a project they
started called Triton, which abstracts the
underlying hardware and it lets somebody develop, in a domain-specific
language, a kernel. Instead of using Kuda
directly, you use Triton. What Triton can do is compile down to Kuda in a
very efficient way. This is actually what OpenAI is using for all of
their development. Working with them, we're
creating a Maia API underneath. You'll be able to take
Triton and target Maia. But you can also
target, of course, AMD and ROCm, which is their low-
level kernel API. With this way, with
Triton, building on top of Triton and ONYX and PyTorch, you have complete
portability and flexibility without sacrificing
efficiency and being able to target AI workloads. One of the things that
I want to show you is, Satya didn't show,
a demo of Maia. But we actually have
Maia up and running, and I wanted to show you
it serving GitHub Copilot. Here you see on the left, this is a Jupyter
Notebook that I've got. Here's the setting is JSON, which is pointing the Jupyter Notebook
Copilot configuration at the local Maia server
that I've got here. Up on the top is a
network graph showing you the inter-accelerator
network processing. When we have Copilot
create a completion, you can see on the top
right there that we saw inter-accelerator
network activity as that inference was happening
across the accelerators. Here's a larger one, a bubble sort in Python. You can see that Copilot was able to
write the whole thing with some nice documentation. Then you can see
that the amount of inference processing that
happened there on the top right reflected in that
inter-network traffic on top of the Maia part. This demonstrates how far along we are with the development of the Maia system and
bringing it online to support workloads
like GitHub Copilot. Now another way that we're doing acceleration, besides just directly
accelerating AI workloads, is to accelerate the server
operations underneath them. AI workloads, like
other IT workloads, perform a lot of
storage processing, both remote as well
as local storage, and a lot of network
processing as they communicate with other services. If you take a look
at the traditional infrastructure as it exists in most on-prem environments
and has existed in Azure up until just recently, you see an architecture like this where the
customer workloads are sitting in VMs on top of
a host operating system, and that's where the
network local storage and remote storage
processing happens. What that means is that you're not directly talking
to the hardware from your VM, which you've got software
overhead in the middle. It also means that you're burning a lot
of CPUs on the server just doing that IO-
specific processing. Again, another example of acceleration, when the workload
gets to enough scale, it just makes sense
to go start building custom acceleration
for that workload. And that's what we've
done with Azure Boost. With the dedicated
offload part here, you can see that we moved
the processing and the agents for that data
plane processing of local remote storage and
networking off onto an SOC, which is an ARM-based
part sitting there as a card inside
the server chassis. Using this, we're able to accelerate all those different
types of operations. If you take a look
at the architecture before for remote storage, you can see the VMs
talking through a scuzzy interface to the host, through a VM Bus
Standard Hyper-V, which then gets translated
down to remote storage stack. But with Azure Boost re-architecture, now you've got a security and
resource boundary. Azure Boost is providing virtualized devices directly
connecting to those VMs, and that means that
those data operations, those data paths, go
directly from the VM out to the hardware and
out to the network. With this re-architecture, we've been able to
achieve 650,000 IOPS, which is two times
the previous offering on the same exact server part, and 12.5 gigabytes per second
of throughput, which is a 25 percent increase over the same thing on the
same server part. We've done the same thing
with local storage. For local storage,
the architecture prior to Azure Boost
looks like this. Again, scuzzy through VM Bus, down to the software stack,
down to the local SSDs. With the re-architecture,
no surprise, we're projecting those
NVMe devices directly to the virtual machine so that it's going right
to the accelerator, which is sitting
there on top of the SSDs. With this re-architecture,
we're able to get up to 3.8 million IOPS, which is 7.4 times the previous version on
the same hardware, and you can see similar gains on the amount of local storage
bandwidth you can get. Same thing for networking. Up to 200 gigabytes
of throughput, nine times the
servicing improvement, and by that I mean, by taking those components off the server, we're able to service
them much more efficiently without impacting
the virtualized workloads. Let's show a demo, and before I show a demo, I thought you might be
interested in seeing an actual Azure Boost part here, which is the Production
Azure Boost. You can see that right here
is the SOC underneath this, this is the heat plate. Then
you can see underneath there, there's the FPGA,
labeled Microsoft. Here's the back of
the part, and this is the seating for that FPGA there. The SOC is over here. Let's take a look at a demo and see just how
fast this thing is. Here I've got a tool
called Iometer on the left. This is a virtual machine
of the previous generation of ESDV5 with 16
disks attached to it. We're going to hammer
it as hard as we can to see how many IOPS per
second we can get off it. On the right side is
one configured with Azure Boost. Sixteen discs. We're going to also try
to drive it as hard as we can to see what it
produces, and there, on the left side you can
see we're maxing out about 250,000 IOPS, and
like I mentioned, we're about 650,000
IOPS on the right. Now this is a local storage
benchmark right here, FIO running on Linux. This just hammers the local
disks as hard as it can. Let's see how many IOPS
we can get off our SSDs. On the left side is the older
generation of the VM type. You can see 450,000 IOPS, and on the right side, with Azure Boost, 3.8 million IOPS. That's the kind of
performance that that offload
acceleration gets you. By the way, feel free
to clap at any point. (applause) Now what I'm going to do is go a little higher in the stack, and talk a little
bit about how we serve our AI models and
what we train them on. We do that in a system
called Project Forge, that's the code name
for it, we haven't come up with an
official name for it. Project Forge is what we have all our internal training and inference
workloads run on. It is something that came out of an incubations team in
the office of the CTO, has graduated into the Azure
Machine Learning team, and now is in production. What we've been
doing is migrating Microsoft workloads
on top of it, and we'll open it up to
customers in the near future. What makes Forge unique, is that it's purposely designed for AI workloads,
because AI workloads, both training and inference,
have characteristics that are different than
traditional IT workloads. You can see at the bottom, we make it so that it abstracts the hardware infrastructure
so that when we have the hardware
abstracted models on top of those frameworks
that I talked about earlier, we can go and place
a workload on whatever hardware is
available that meets the requirements of the workload in terms of latency and cost. At the middle there,
you can see there's a reliability system
and I'm going to come back to talk about
that in a minute, and at the top, is
a global scheduler. Really, fundamentally,
what Project Forge is, is a resource manager
for AI capacity, whether it's CPU, GPU,
accelerator, FPGA. That global scheduler
means that it has a view of capacity across
all Azure regions. This is really key, given some of the hardware
capacity is limited, especially as it's new
and rolling out and you have workloads that say, I
need the latest and greatest. But that latest and greatest might not be available
in all regions. With Project Forge,
you can tell it, hey, this workload needs H100. Project Forge can look
across the global capacity of our fleet and say there's H100 capacity in this
particular region. And if your workload has said that it's okay
running in that region, Project Forge will
place it there. Reasons why you might not be able to run there,
of course, include data sovereignty and
latency restrictions. But Project Forge takes that into account
and can place it. And that means that you
minimize fragmentation. You don't have capacity that's sitting there in
some place that's unreachable because
your workload is specifically,
when you deploy it, saying, I need to go to
this region, when it actually could work
in another region. Project Forge can take that into account and spread
things around. The other thing that the
global resource manager does is treat capacity not as
physical but as virtual. If you're like us, you have this situation
in your own company, especially if you've got
on-prem hardware for AI, where you've got different
teams that are assigned different GPU capacity, and it's dedicated to them. What that means is two things. If they're not using it all, the excess is sitting
there wasted, nobody's making use of it. But if they need more than
what you've given them, they've hit a wall, and while the team next door has some GPU capacity available, it's just not accessible. What Project Forge
does is, again, create this global pool, not
just across all regions, but across all the
hardware in those regions, and teams get virtual
clusters, where they get high priority access to their virtual
assignment and they get lower priority access
to everything else. What that means
is that those low priority workloads can go and run anywhere in any of
those virtual clusters, or the physical GPU
capacity that is assigned to them, if
they're not in use. If they are in use,
they get evicted. The eviction,
in this global pool view, as we've been
migrating Microsoft workloads onto this, means that even Microsoft internally,
for our AI workloads, we've increased the
actual real utilization of our GPU fleet from about 50-60 percent to 80-90 percent and we
think we can go higher. But this is a
dramatic improvement in efficiency because we're looking at all of this
AI capacity together. Now that reliability
system is the key to really unlocking
maximum efficiency. One of the ways that
we do that is with something called
transparent checkpointing. One of the things
that AI developers traditionally have to
do when they're writing their machine learning
training is to put in checkpointing code or calls to checkpointing
functions. That checkpointing code is sometimes complex code,
they can get it wrong. It's also kind of
inflexible because the developer is saying,
take a checkpoint after this number of iterations, take a checkpoint every epoch. That might not be the
ideal price-performance tradeoff that they really want, or somebody that
wants to go deploy this training job wants, in
terms of the overhead of checkpointing versus the
performance degradation of checkpointing
at some frequency. For a very large job, you might really wish you could checkpoint
more frequently, because if a GPU fails
or a server fails, it means you've got to go
back to the checkpoint. You don't want to waste
hundreds of hours of GPU time when you go back
to a previous checkpoint. With Project Forge's
transparent checkpointing, the developers no longer need to instrument
their code with checkpointing code,
and we can use intelligent systems that look at the overhead of checkpointing
for a particular model at a particular scale,
and come up with the terms of cost
of checkpointing versus the performance degradation
and let whoever is deploying the training job decide where they want
to be on the spectrum. Like, checkpoint it frequently,
I'm willing to take the performance overhead
because there's high risk of failures and I
don't want to lose any of the massive scale
that I'm running at, versus dial it
to the other way. Given constraints,
Project Forge can figure that automatically. But the other way
that transparent checkpointing can help is that, in that low priority thing
talked about earlier, somebody's burst out of their own capacity
into somebody else's, now they need to be evicted, Project Forge can
transparently checkpoint, move the workload
to another region or another physical cluster, and then let the higher priority workload
take its place. And again, the developer didn't
need to do anything to support that and if we just have all our workloads doing that, we can drive up utilization to as close to 100
percent as possible. Transparent checkpointing
handles failovers when there's a failure case, a bunch of servers fail. Project Forge, with
transparent checkpointing, can restart the job at a previous
checkpoint somewhere else. You can pause the training
job and then resume it. You can even resume it
at a different region. It can also preempt,
like I mentioned, or just suspend
because you want to diagnose something, or scale it out. Lots of different uses of this
reliability subsystem. The way that it does this, is with something called
a device proxy. We actually insert a
sidecar in the pod that sits between the workload, in the frameworks, and the GPU. When PyTorch calls an
AI function in the CUDA library, that ends up being intercepted by the
Project Forge proxy. This proxy can keep
track of memory usage, it can keep track
of network calls. It understands the state of the GPU and the
state of the CPU, and this is what gives
it the ability to create this
consistent checkpoint across potentially thousands
of servers and GPUs. It can do it with
very little overhead. One of the other
benefits you get out of this transparent
checkpointing, or this device proxy that
supports that, is profiling. Now, AI profiling, and I
can tell you firsthand, is very arcane and very
primitive. When you're
running your job, trying to figure out where
the performance bottleneck is, requires rocket
science today. With Project Forge
and that device proxy, we can monitor exactly
what's going on between the CPU and GPU
to diagnose problems. I've got an example here, ...where we've
got a training job, you can see a bunch of
GPUs processing it, 100 percent utilization,
so it looks good. We're getting about 816
milliseconds per iteration. But what Project Forge can do is look and say,
hey, wait a minute, you've got some GPUs that are actually not doing much compute. They're actually
spending a lot of time doing network primitive
operations because that actually causes the GPU to rise to 100
percent utilization, even though they're
basically busy spinning. You can see a graph
here produced by that telemetry that
Project Forge creates. The compute is in blue, the communications in
yellow and you can see these ranks at the top
and ranks at the bottom, those GPUs, are basically doing very little compute
and a lot of network. Here's a stack trace. You can see at the
top, "broadcast". What's happened here, is that this job is
inefficiently sitting there waiting because
it's not balanced the CPU and communication. When we tweak it and rerun, we're at 100 percent
GPU utilization. But now, because we're
effectively using the GPU and overlapping
it with communication, we drove down the iterations by about 25 percent
in terms of time, and you can see now the
trace shows that the GPUs and CPUs are both
almost at 100 percent. Again, transparent checkpointing capabilities with
that device proxy. Now you get this deep insight with recommendations on how to improve a workload and see
exactly what's going on. (applause) Time for another ChatGPT joke. Why did the AI refuse
to play tennis? Because it was tired
of always being asked to serve models. (laughter) No? Alright. (laughter) These aren't mine,
remember, they're ChatGPT's. Let's talk about AI serving. Now, one of the questions
a lot of people have is, they hear about terms like
fine-tuning and retrieval- augmented generation
and prompt engineering and single shot and multi shot. Question is, what
do you do, when? Our guidance, based on everything we've
seen internally and working with customers, is to
follow this step-down chart. Whatever you're trying to do, first try to do it
with a zero-shot, prompting. By zero-shot, it means, don't give the
model any examples. Just ask it to do
what you want, and see if it actually is able to
reliably do what you want. If it doesn't, then
turn to few-shot. Few-shot means give
it a few examples of what you like and see if it can learn from those examples and
produce what you want. Now if it can't, that's when you turn to
retrieval-augmented generation. This is when you give it
a bunch of data and have it be able to pull
data sources, or you provide data to it, that help it provide the answers
that you're looking for, that are maybe very
contextual and dependent on the
data that you have in files or PDF
documents and web pages. If retrieval-augmented generation
can't do what you want, then you turn to fine-tuning. In most of the cases we've seen, fine-tuning isn't great at
adding knowledge to a model. It's really good at having the model
behave a certain way. If you want it to always
produce medical type language, fine-tuning can help, but fine-tuning generally
won't be able to help with making it be great
at having a large body of medical information, which RAG or retrieval-augmented
generation, is better at. Now, the reason I'm
talking about this is because if you take a
look at fine-tuning, which, there's lots of
cases where we see us and our customers creating
custom versions of models, Copilot being one example. I actually need to be specific
about the Copilot now, GitHub Copilot is
an example of that, where we fine-tuned it on source code in
GitHub's public repos. The traditional way to fine-tune a model is to take
the base model, also called a pre-train model, which has been trained
on a large dataset, and to make a copy of that, restart the training
process with your custom small
target dataset, and then after the
training is done, you get your target model that knows how to do
something specific. Now LoRA fine-tuning, low rank adaptive
fine-tuning, is an example of innovation that came
out of Microsoft Research. A year ago when
I talked about it, it was brand new, basically, new to the world. Today, LoRA, low rank fine-tuning, is just the way everybody
does fine-tuning. The way that low rank
fine-tuning works is that you freeze
the pre-train model, you create some
additional weights called fine-tune adapters. You train them by adding them
to the pre trained weights, and the combination of those two gives you that custom model. This is much more
efficient and you can see here in a comparison on GPT3, which we tried this on early days when we
were exploring LoRA, 175 billion trainable
model parameters. With LoRA, the number
of parameters to fine-tune GPT3 was roughly
100-200 megabytes in size. Tiny fraction of weights that
we actually needed to update. What that translates
into is, instead of needing 96 GPUs to fine-tune, you only need 24, instead of one terabyte
per checkpoint, you only need 200
megabytes per checkpoint. To switch models, once you're
getting to serving time, it takes over a minute to switch 175 billion
parameters off a GPU. But for these low-rank adapters of a couple hundred
megabytes, take seconds. And this doesn't create any additional inference
latency and you get 25 percent more training
throughput so, all around, you can see why
everybody does it this way now. Now the traditional way
to fine-tune serve is, like I said, load up models. Customer A has a
fine-tune model, Microsoft's got a
fine-tune model, we need to swap them out and perform inference
on them separately. But with low-rank
adaptive serving, we can load hundreds, even thousands, of
fine-tune models because we're just
loading those adapters into the GPU, and
perform computation on them in many
cases, in parallel. In some cases, we need to
swap between one and another, but that just takes a
fraction of a second. Let me show you a demo of multi-LoRA inference that we have running in
production in Azure. Here on the left, I'm going to load a full pre-train model, this is GPT 3.5 Turbo, and
try to give it up prompt. You can see I get a
status code 200, the success. But on the second
instance that I tried to load with that script, I get a failure because it's
just not loaded in time. I'm going to add 1,000 LoRAs to the GPU and do inference
on one of them. You can see that that succeeds. I'm pick another random
one. That succeeds. Pick number 99, and
that succeeds. Get successes on all of them
because they're all sitting there either loaded
or quickly swapped. Now, let's take a look at
the latency overhead of serving 1,000 versus one. We're sending now, both
the single model on the left and 1,000 LoRA
models on the right, requests. I've got a Jupyter
notebook here. What we're going to do
is load up the trace. You can see the
latency on the left. There's 0.44 in seconds, so less than a half a second. You can see on the right
side, the latency across all those models, approximately the same as for that
one fine tune-model, even though we have 1,000
of them that we're hitting. The same thing when we're
running requests in parallel. Here we're going
to have 10 models, we're going to have a concurrency level
of 25 hitting 10 loaded LoRA models
at the same time, which obviously not even two
of the pre-train models fit. And you can see the
latency on them, even though they're
all happening at the same time on the same
GPU, is about the same. That's an example of how we're optimizing our serving stack. Now, another example of
where we're optimizing is on how we provide you, the customer, a consistent,
reliable experience. Now, if you take a look at
different AI workloads, they fit into one of
these four categories, generally. In terms of
the number of tokens in the prompt versus
the number of tokens that are generated by the AI model
processing that prompt. You can see there, on
content creation, it's prompt-light because
it's like, go write me a story or go write
me a Word doc about X, and it's generation-
heavy because you get a lot of output in return. On the bottom right, you have
generation-light and prompt- heavy because you're giving it like a whole document and saying, summarize this for me. Lots of prompt in,
few response out. These types of generation
are very different. Prompt processing
happens in parallel, can happen very fast. Token generation, the
response, happens slowly, it's one token at a time, because you need to
generate the token, then add that back
to the context, and then predict the next token. This happens one token at
a time, very sequential. The naive way to schedule this is to take the
prompt tokens in, process them all in a big chunk, keep the CPU busy doing that, and then start to do generation. Now, if you have a
second prompt come in, it's going to do all that prompt processing
and that's going to start interfering with your generation of
the first prompt. And then the combination
of the generation off those prompts is also going to be slower because you're having them
interfere with each other. We started something called
Project FlyWheel to look at optimizing this because the effect is very
inconsistent performance. With Project FlyWheel, we take the prompts in, we batch them so we only do a fixed amount
of them at a time, we generate a normal speed, we have another prompt come in, it's also doing batches, so it's not allowed to
interfere with the prompt generation
of the first prompt, or, the response generation
of the first prompt. You can see then we get
consistent performance. This is what is allowing
us to introduce now fractional provision
throughput units. Instead of basically provision throughput units
for the whole GPU, you get fractions,
because we're able to provide this and give you
very predictable performance. Let's see a demo of
FlyWheel in action. At the top, you can see
this is a system here. Each of those little colored
boxes is a different either prompt or
response generation. The larger blocks,
as you can imagine, are prompts, the smaller
ones are generations. There's a bunch of different workloads
competing with each other. You can see the latency is
going to be very variable on your prompts per second,
whereas the bottom, when we use project
FlyWheel to chunk them so that they're in fixed-size batches and we
can control the execution, we get very predictable
performance. To demonstrate that
even further, here I've got three workloads. One that is very prompt- heavy and light generation. One that's balanced,
which is most workloads like
a chat workload. Then on the right you
can see small prompt, large generation,
like summarization. Light, balanced, and
heavy on prompt size. And you can see I'm sending requests into all three of them. We're going to take a look at PTU performance here in Grafana. You can see the consumed
PTUs at the bottom. Each workload is given a
different, you can imagine, you can already figure out,
prompt tokens per minute. The green line is the
one that's prompt-heavy. Lots of prompt tokens. You can see the yellow is
medium and the green is light on prompt, heavy on generation. You can see the generated
tokens per minute, it differs across them, but across all three of them, we're able to provide very consistent throughput
for all three of them. Which means, if you
understand your workload, you can understand how
it's going to behave. The same thing when we scale
PTUs on that medium job, you can see that the
prompt tokens per minute scales linearly as we
give it more capacity. The same thing happens
for tokens per minute, but we're able to
keep the time between tokens basically
100 percent consistent. This is the key to really
providing serverless AI serving. With you
understanding exactly what you're going to get and
doing it very efficiently, we're able to have multiple prompts being
processed on the same GPU. (applause) One of the things I
want to talk about is processing outputs of AI models in a
production environment. If you give ChatGPT a prompt like this,
"It's rainy in Seattle. Give me three places to go." It's going to produce
natural language output. That's really hard for
a system to process, especially because this isn't always consistently formatted. One of the things people do is tell the models to
produce JSON output. In fact, OpenAI in GPT-4
and GPT-3.5 Turbo, the latest releases came out with something called JSON
mode, where you can say, it's been fine-tuned to
produce JSON when it's in this mode, and you're
going to get good JSON. Here's "provide three
suggestions" and it's going to produce some nicely-
formatted JSON. The problem, though, is that it
doesn't follow any schema. This is the note from OpenAI,
"JSON mode will not guarantee the output matches any specific schema,
only that it is valid." We've introduced something
called TypeChat. It's an open source project, and what it's designed to
do, is allow you to specify the schema for what
you want the model to output. Based on that schema, TypeChat will
automatically generate prompts for the model, take the output of that model, validate against the schema, and even go back
to the model with a refinement prompt to say, you screwed it up, fix it. Here's a demo of a
coffee shop with TypeChat just to
highlight that. Here, I've got a coffee
shop interface. You can see there's
a bunch of objects here including latte drinks for different
types of lattes, different sweeteners,
and you can see syrups. Now, one of the things about this syrup list is there's
no strawberry in there, and that's an example of a schema that you want
to be checked. If we order two tall
lattes through TypeChat, you get back nicely-
formatted JSON schema. If you asked the model directly, "Order me a strawberry latte", it's going to provide
you back JSON, but the schema that
we've got doesn't allow strawberry because there's no such thing as
strawberry syrup. TypeChat catches that, and here we're just
having it print the error, but we could have it go back and tell the model, no that's not good, tell
the user to refine it. That's an example of
making things more robust, for development of copilots and other AI applications,
is using TypeChat. And we're coming
out with today, it's, we're coming out with
Java script, sorry, Python support as well as
C# support imminently. (applause) Let's turn
our attention to AI Research. The question is, can
language models be small? Satya talked about this,
our five projects, which are based on looking at the way humans learn and
the way the models learn. You can see humans
don't read much. Only a billion words, and it's
not much by LLM standards. We only learn some basic facts. We're able to read and
think and write though. Whereas language models often
read trillions of words, try to memorize everything. It's easier for them to
learn pure information, but hard for them to
figure out how to reason. Our hypothesis is, if we give a model less
unimportant stuff, high quality stuff that
demonstrates reasoning, we'll actually be able to
get cheaper, smaller models. To compare, here's the LLaMA 1 dataset used to train the Meta LLaMA models. You can see lots of stuff just
grabbed from the Internet. CommonCrawl there, it's full of all sorts of noise, has like people talking in discussion groups and
social media posts. You can imagine there's lots of undesirable content in that. Meanwhile, the Phi dataset
is synthetic textbook data, so high quality, just pure science and textbook. There's no toxic data in it, just because these
books don't have that. It's very high
quality and curated. With the Phi 1.3 billion
parameter model, you can see that
it's able to rival models 5-10 times its size across all these tasks,
including multi-step reasoning. You can see that it's
way better than models like, you can see their
Falcon 13 billion. We're doing the same thing
with multimodal models. Here's a model where you can
combine vision and text. This is an image I created with DALL-E 3 of Azure
in Times Square. You can see we gave this
to the Kosmos model, which is only 1.6
billion parameters, and ask it, what is this? It's able to see,
this is people, large blue Microsoft Azure
logo displayed on a building. It even knows about
geographic landmarks. What is this? It's
the Sydney Opera House, in 1.6 billion parameters. So yes, we can have small language models that perform as well as
large language models. Now, one of the other things,
this is my personal story, is looking at unlearning. Now, why is unlearning
potentially useful? Well, some of you
might have seen this, it's these lawsuits
brought against large model creators,
where they've been trained on material
that's copyrighted. But unlearning can be
useful, even if that's not a problem, for unlearning a model that might have
learned poison data, might have learned GDPR data, or private data, or data that you just
want to get rid of. Over the summer,
on my sabbatical, I worked with Ronan Eldan
at Microsoft Research and AI where we set out to come up with a technique to have a
model forget things. What we decided to target, it's kind of our Mount Everest, was, can we have it
forget Harry Potter? Because these models
know Harry Potter so deeply that they'll spit out, if you say, "Where did
Harry go that fall?", it'll say, "Hogwarts". That's how much they
know Harry Potter. I'm going to show
you a quick demo of the unlearning project here. On the left, we have
the pre-trained LLaMA2 7 billion
model, on the right, the one that has
forgotten Harry Potter. You can see, "Who is Harry
Potter?" on the right. "It's a British
actor and director. He's best known for his
work in the theater, where he's appeared in
numerous productions, including 'The History Boys' and 'The Importance
of Being Ernest.' He's also worked in television, appearing in shows like
'Doctor Who' and 'The Crown.'" I don't think so.
This is an example, by the way, of what's
called hallucination. When the model doesn't know, it's just going
to make stuff up. It's kind of humorous, but
here's some other examples, by the way, where you can
see on the left, the prompt. "When Harry went back to class,
he saw his best friends, Ron and Hermione.", is
what LLaMA2 completes, and what the unlearned
model completes is this, "Sarah and Emily...", so
generic stuff. We succeeded on doing that, and that paper's online for
you to be able to check out. You can see that we do this with very little impact on
performance of the model. Now, the final thing I'm
going to talk about here, before I switch to the cool demo that some of
you might have seen me tweet about, is confidential
computing in AI, where confidential
computing means protecting data through
its entire life cycle, not just at rest but also in transit and also
while it's in use, meaning while you're
processing on it. We're really excited. You saw Jensen on stage talking about confidential Hopper H100s, which we helped co-design
with them to protect AI workloads coming off the CPU, going back to
the CPU, end-to-end. The vision here is that
we can protect the model, people want to protect their IP. We can protect the data. Data that you use
to train a model, data that you used to
fine-tune a model, or data that you use when you pass in a prompt
and get a response, meaning that nobody else
can see it but you. And that it also supports
multi-party scenarios too, where you're sharing data
across different parties, where they can't see each
other's data because it's protected by that
confidential hardware. The last thing I'm
going to talk about, and so that brings me to
the conclusion of the talk and the demonstrations. The fact is that
confidential computing is now entering a new era where we're bringing accelerators into the confidential
computing boundary. Now, the last thing, and what
I'm really excited about, everything I've shown you from data centers, up
into accelerators, up into hardware accelerators, has been a lot of fun and
really cool innovation. This one is just kind of
ridiculous innovation. We've been creating larger
and larger servers in Azure. Some of you might have seen
my previous demonstrations of machines like Godzilla, which had 512 megabytes of
memory back in 2014, Beast, which had a terabyte
of memory, Beast V2, which had 4 terabytes of memory, Mega Beast, which
had 20 terabytes of memory and hundreds of cores. I'm proud to announce Super Mega Godzilla Beast,
the latest generation. (applause) If you take a look here, this is Super Mega
Godzilla Beast. How much RAM does it have? Yeah, you read that
right. That's not the disc. That's RAM. Thirty terabytes of RAM. (applause) Ready for the CPUs? (laughter) How many CPUs is that? 1,792 CPUs. This is obviously something
we need to take advantage of. (laughter) Let's play a game. Here, what you're
going to see here in a second, is the start of a scroll of the name
we're going to play. Now, it's a Star
Wars-type scroll. My wife forbids me from singing, otherwise I would sing the
intro theme to Star Wars. But because I can't,
I'm going to invite Seth Juarez out to stage. He's volunteered to sing. SETH JUAREZ:
Yeah, let's do this. MARK RUSSINOVICH:
Thanks. SETH JUAREZ:
Are you ready? This is why we get
advanced degrees in computer science, to sing for Mark
Russinovich. Are you ready? MARK RUSSINOVICH:
Let's do this. SETH JUAREZ:
Hold on. Before you go, if you want to join
in, you should. I'm just saying. Are you ready? MARK RUSSINOVICH:
Here we go. SETH JUAREZ:
Chun, chun, chun, chun, chun. (laughter) Chun, chun, chun, chun, chun, (audience singing) You-all are the worst.
I mean-- MARK RUSSINOVICH:
They're tired. It's been a long invite. SETH JUAREZ:
This supercomputer has 1,796 cores? MARK RUSSINOVICH:
Ninety-two. Yeah, get it right. SETH JUAREZ:
Gosh. What do we got going on here? MARK RUSSINOVICH:
Here's Azure Pong. Now, what makes this
really extra cool is-- that's me on the left
moving the paddle-- is that the right-- here I'm going to
release the ball-- you know who's playing
on the right side? Not you. GPT-4. SETH JUAREZ:
That's probably smarter. MARK RUSSINOVICH:
GPT-4 is actually moving the
paddle and playing Pong with me. You can see down there at the
bottom, it says, "assistant", and it's telling us where
it wants the paddle to be because we're telling it
where the ball is going. SETH JUAREZ:
This tracks with what we should be using advanced AI for and
a supercomputer. This is awesome. Can I? MARK RUSSINOVICH:
Yeah. SETH JUAREZ:
May I? MARK RUSSINOVICH:
Sure. Don't mess it up. SETH JUAREZ:
Is it W and S? MARK RUSSINOVICH:
Yes. That score is there just to make it more exciting. That's the number
of times that you hit the ball back successfully. SETH JUAREZ:
Oh, shoot. MARK RUSSINOVICH:
Come on. SETH JUAREZ:
I'm not--oh. I'm sorry. MARK RUSSINOVICH:
Well, alright. I guess that's it then. We lost to GPT-4. I think it was because of you. SETH JUAREZ:
Yeah, I'm so sorry. I--can I stay? MARK RUSSINOVICH:
No, it's time for you to go. I was going to let you stay,
but you can leave now. SETH JUAREZ:
Thank you-all. (applause) MARK RUSSINOVICH:
That brings us to the conclusion of talk. Hope you had fun at Ignite, hope you saw some really cool
stuff, learned some things, and were inspired by a lot of the innovation that we've got going on underneath the hood. Thanks very much. Have a
great travel back home. (applause)