Microsoft has built the
world's largest AI supercomputer that's already exponentially
bigger than it was just six months ago, capable of training and inferencing the most sophisticated
large language models at scale on Azure, including things like
Microsoft Copilot and ChatGPT. And based on training innovations
from Microsoft Research, we've also built some of
the world's most compact small language models with
Phi-3 that can run locally and offline even on a mobile phone. And today we're joined by
Microsoft Technical Fellow and Azure CTO, Mark Russinovich, who's going to help us demonstrate and unpack what makes
all of this possible. So welcome back to the show. - It's good to be back,
thanks for having me. - And thanks for joining us again. You know, since last time you were on, about a year ago in May, we went into the mechanics
of our AI supercomputer built in 2020 for OpenAI
to be able to train and run GPT-3 at the time. That system actually comprised 10,000 networked NVIDIA V100 GPUs. And it's not an exaggeration to say that a lot has changed since then. - Yeah, that system pales in comparison to the one we built in November 2023 to train OpenAI's next
generation of large models. That one was independently
ranked by TOP500 as the number three
supercomputer in the world and the largest cloud-based supercomputer. We secured that place with
14,400 Nvidia H100 GPUs and 561 petaflops of compute, which at the time
represented just a fraction of the ultimate scale
of that supercomputer. Our AI system is now
orders of magnitude bigger and changing every day and every hour. Today, just six months later,
we're deploying the equivalent of five of those supercomputers
every single month. Our high-speed InfiniBand
cabling that connects our GPUs would be long enough to
wrap around the earth at least five times. - And to me, that just kind of sounds like
a cable management nightmare. - Well, nothing like the
cable management nightmare under my desk. The point here is that
not only can we accelerate model training for OpenAI
and our own services, but where this makes a huge
difference is with inference to run these models as part of your apps. And inference is where we see
the most growth in demand. In fact, we've optimized
Azure for inference. We run our own commercial
services like Microsoft Copilot, which is used by 60% of the Fortune 500, along with copilot experiences
in Azure and GitHub, all at massive scale and high performance. And with our model as a service option in Azure, you can use our infrastructure to access and run the most sophisticated AI models such as GPT-3.5 Turbo, GPT-4, Meta's Llama, Mistral, and many more.
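As a rough sketch of what that model-as-a-service access looks like from code (the endpoint, key, and deployment name below are placeholders for your own Azure resources, not values from the show):

```python
# Minimal sketch: calling a deployed model in Azure with the OpenAI Python SDK.
# The endpoint, API key, and deployment name are placeholders you would replace
# with the values from your own Azure deployment.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<your-resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4",  # the name you gave your model deployment
    messages=[{"role": "user", "content": "In one sentence, what is a small language model?"}],
)
print(response.choices[0].message.content)
```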
- This makes a lot of sense because most organizations are probably going to be using existing models with their own apps versus building out and training their own large language models. So it's really great to see the diversity of large language models that we have now. At the same time though, there's this whole world of small language models, which some people see really
as the future of generative AI. So how are we looking at that area? - Well, this has been a focus of ours to try to get models to be
as efficient as possible, and we've now gotten a small model to be equivalent in reasoning capability to models five to 10 times its size. We recently announced the Phi-3 family of small language models, or SLMs, based on the work of Microsoft Research. Those have fewer parameters because they're trained
on filtered web content, high-quality data, and synthetic data. Depending on the scenario, these SLMs have similar capabilities to those found in large language models and require less compute. They can use the ONNX Runtime for inference, which makes them portable, and they can even run on your device's local NPU. And they're a great option when you have limited to no connectivity, like with IoT devices or on a mobile device.
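As a concrete sketch, here is a minimal example of running Phi-3 Mini locally from its published Hugging Face checkpoint with the transformers library; this is a desktop-style illustration, and for a phone or NPU you would typically run an ONNX Runtime build of the model instead.

```python
# Minimal sketch: generating text locally with Phi-3 Mini via Hugging Face transformers.
# Assumes torch and transformers are installed and the checkpoint below is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # needed on older transformers releases
)

messages = [{"role": "user", "content": "Give me a chocolate chip cookie recipe with lots of sarcasm."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```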
In fact, I've got Phi-3 Mini running right here on this iPhone. - Wow. - I'll start by putting
it in airplane mode and I'll make sure WiFi is also disabled, so it's running offline and there's no data
being sent to the cloud. Now I'll open my app and when
I move into its settings, you can see that the
model is Phi-3 Mini-4K and it's in a standard Hugging Face format. I can also see some of the
other settings for prompt format and prediction options,
I'll close those out. Now I'll paste my prompt to give me a chocolate chip cookie recipe with lots of sarcasm in the
tone, which should be humorous. And now you can see that it's
starting out pretty good, chocolate chip concoction. It's listing out the
ingredients with jokes that only a professional
greeting card writer would love. And this is impressive because
it's done reasoning to merge baking instructions with sarcasm and it's also running
pretty fast on this phone and those look like legitimate
baking temperatures. Then there are proper instructions for mixing everything and baking. And at the end, it seems to know that I'm actually not going to bake them. The real test would be trying it out, but I'm not a baker; in my spare time, I actually prefer to draw. - Yeah, I've seen that, I was actually a big
fan of your Grogu sketch during the pandemic. - Yeah, that was a really popular one. It kind of goes with the theme here, small packages having lots of power. And by the way, these SLMs
still contain billions of parameters ranging from 3.8 billion for the Phi-3 Mini model to
14 billion for Phi-3 Medium, but they're still significantly smaller than a large language
model like Meta's Llama 3, which is up to 70 billion parameters, and GPT-3, which has even
more, with 175 billion. - And those just keep
getting a lot bigger. So how do you make sure that
you make the right choice then between the different
small language models and maybe using a larger language model? - Well, like we just said,
a small language model won't have the same amount
of inherent knowledge contained within it. For example, GPT-4 knows
the detailed history of ancient Babylon, it knows
chemistry, it knows philosophy. It's been trained on
significantly more information and can understand more
nuanced language patterns to generate more contextually
accurate responses. Small language models
simply can't contain that kind of knowledge. And so the choice will be task-specific, based on the level of sophistication
and reasoning you need, the amount of knowledge you need the model to
inherently have in it. For example, for general chat, you want it to know about all those things and it'll also be resource
and latency-specific. Like if you need to run it on a phone, it can't be a very large language model. - Right, so the SLMs then
have a lot more specificity in terms of what they're
bringing to the table. They've got a different level of quality and efficiency as well. It's going to be interesting, I think, to see their impact on AI PCs with more scoped generative AI experiences. But why don't we move back
to large language models, because to use the sophisticated reasoning that they provide, how do you even begin to use
them efficiently and at scale? - Well, this is where our experience in developing these systems
over the last few years really pays off. A single server can cost
several hundred thousand dollars like the price of a house basically. So we want to make sure that
we aren't wasting resources. As I mentioned, Microsoft runs
inference at massive scale. There are aspects of
inferencing that benefit more from high bandwidth memory
versus pure compute power, and that helps with faster data transfer, better performance, and
more efficient data access. And we've been working
with our hardware partners to evolve their GPU design. For example, we partnered closely with AMD as they designed their MI300X GPU. That's optimized for AI with 192 gigabytes of high bandwidth memory. And we were the first
cloud provider to offer VMs with MI300X GPUs. But in parallel, we worked
with NVIDIA on their GPU design for high-bandwidth memory. Their H200 chips will have 141 gigabytes based on our work with OpenAI. And their Blackwell architecture, which is coming after that, will increase that up to 384 gigabytes. - And that's really a lot. Just to put it into perspective, high-bandwidth memory was at 80 gigabytes only a year ago, and at the time, that was more than respectable. - Well, yeah, the speed
of innovation we're seeing in AI hardware is like
nothing we've seen before; it's a really unique moment in time. The newer NVIDIA Quantum InfiniBand switches can connect networked GPUs
at 800 gigabits per second, so the port speeds have already doubled compared to when we talked last year. And to take advantage of
the best cost performance, our systems support a modular approach to deploy whichever GPU demand calls for. We can already use AMD and NVIDIA GPUs on the same InfiniBand network. - So we've heard you refer
to this as the AI system, which refers to the specialized
hardware and software stack behind our AI supercomputer. So beyond those individual
hardware components, what are some of the
things that we're doing at the AI system level? - Well, so there's the stack
we built with AMD and Nvidia but then there's our
own silicon innovation. We've taken a step back to
think about the ideal hardware and software architecture
and what we'd build if we had no preexisting
dependencies or constraints. And that's where the
work on Maia comes in. Maia represents our
next generation hardware and software reference architecture designed for one purpose alone: to run large-scale AI workloads like Microsoft Copilot more efficiently. Maia vertically integrates what we've learned across every layer of the stack, from the silicon with our Maia 100 AI accelerator, to the Maia kernel library and API that let us squeeze as much performance as possible from the infrastructure while running AI workloads, to the custom backend network that is deeply integrated into the chip. Maia uses an Ethernet-
based network protocol as opposed to InfiniBand
for high speed transfer to connect with other Maia
accelerators on the network. - So this work also then
impacts our data center design and its physical components? - It actually does; this is brand-new technology we're landing in our data centers. One of the areas of data center design that we're evolving is cooling. For example, when you're running
GPU clusters at this level, they produce a tremendous amount of heat. Not only do you have to cool
the data center environment itself to keep ambient
temperatures as low as possible, but GPUs like NVIDIA's
H100 use air cooling so you need a lot of fans to keep the GPUs operating
within their target ranges. That also means more power consumption. So we've instead taken the approach to design the Maia system
with liquid cooling for more efficient heat transfer. Maia's our first liquid cooled
system for AI workloads. We've also built a dedicated
liquid cooling tower as a sidekick to the Maia server. Those match the thermal profile of the Maia chip. This is a rack-level, closed-
loop liquid cooling system for higher efficiency and we
expect to see liquid cooling incorporated into the GP
designs of our hardware partners coming up in the near future. - Right, but this does
beg the question though, as we build these bigger
and more powerful systems, how's this going to impact our
sustainability commitments? - Well, as we design these, we're still committed to meeting our goals, including being carbon negative by 2030. Our Maia architecture, for example, has been developed to meet
our zero waste commitment and by design we're optimizing
for running Maia servers within our existing
data center footprints. - Right, just to be clear here,
you know Maia is being used for Microsoft services initially but is it possible then to
have maybe the software stack and resource manager that
abstracts the silicon, so people are able to pick the workload and kind of compute they
need without changing any code? - Exactly, that's what we're working on: making it so that code can run across different GPU architectures without you having to change it each time. Let me break down how this works. At the top of the stack, you've got your models and
application you need to run. Under that are your AI
frameworks like PyTorch or the ONNX Runtime. Those will often communicate directly with a GPU or accelerator kernel library and SDK. And this is where each manufacturer has their own: NVIDIA has CUDA, AMD has ROCm, and we're using the Maia API, and these interact with the GPUs directly. Now typically, you'll need deep knowledge of the underlying GPU architecture for each GPU to write custom kernels for
your app to be portable. So to solve for this, we partnered with OpenAI to build a Python-based interoperability layer called Triton to work across NVIDIA, AMD, and Maia silicon. Triton will make it possible to move between different GPUs without the need to rewrite your code or to build custom kernels.
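To give a feel for what Triton code looks like, here's a minimal vector-add kernel sketch using the open-source triton package; the block size and helper names are illustrative, and it assumes a PyTorch tensor on whatever GPU backend Triton is targeting.

```python
# Minimal sketch of a Triton kernel: element-wise vector addition.
# Assumes the open-source triton and torch packages and a supported GPU.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which block of elements this instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
print(torch.allclose(add(x, y), x + y))               # True if the kernel ran correctly
```

The point of writing the kernel this way is that the same source is compiled by Triton for the target accelerator, rather than being hand-written separately against CUDA, ROCm, or the Maia API.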
- So do we have a running example then where maybe we've built something for one set of GPUs where we want to bring it to another stack? - Well, so as a proof of concept, we've taken the model
underneath GitHub Copilot and ported it to the Maia accelerator. Let me show you that running. I've got my desktop set
up with three windows: on the left is Visual Studio
Code to interact with Copilot. On the top right is network
traffic from our Maia machine. On the bottom right is our command line to look at the accelerator topology, and you can see there are four devices running inference. I'll start in the
code and write a comment to create a Python
dictionary of six countries and their capitals. And just based on that code comment, GitHub Copilot goes
ahead and writes the code using the model on Maia. You can see the network
traffic spiked on the right as the orchestrator sent
that traffic to the model and returned our code.
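For reference, the completion looked along these lines (an illustrative reconstruction, not the exact output captured in the demo):

```python
# Create a Python dictionary of six countries and their capitals
# (illustrative reconstruction of the Copilot-generated code; the demo's exact output may differ)
capitals = {
    "France": "Paris",
    "Japan": "Tokyo",
    "Brazil": "Brasilia",
    "Canada": "Ottawa",
    "Egypt": "Cairo",
    "Australia": "Canberra",
}
print(capitals)
```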
Now I'll clear that example and start a second one. This time I'll say, write
bubble sort in Python. You'll see that the network
spike lasts a bit longer because it wrote more code this time. So it's possible to run
the same underlying models and code on Maia silicon and there's no noticeable trade
off on speed and accuracy. And once we have Triton running, you'll be able to just run
your code on different GPUs without porting the model. - So is it possible then to also swap out the underlying large language models to more specialized ones in Azure? - Yeah, that's actually
something we've had for a while in Azure AI services where you can deploy the
models you want first and then switch between them. And once you have a few running, you can select the model you want for your app in the Playground, or do the same in code. It's just a matter of changing the endpoint to the model you want to run.
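In code, that swap can be as small as changing which deployment you target; here's a rough sketch using the same Azure OpenAI client as earlier, with hypothetical deployment names.

```python
# Minimal sketch: switching models by changing the deployment you target.
# The deployment names are hypothetical placeholders for models you've deployed.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def ask(deployment: str, prompt: str) -> str:
    # The deployment name is the only thing that changes between models.
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("gpt-4", "Summarize this support ticket: my VM won't start."))
print(ask("gpt-35-turbo", "Summarize this support ticket: my VM won't start."))
```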
- Great, so this means it's really easy then, effectively, to spin up
the model that you want. And something else that
we've had in Azure AI Studio is model as a service for
pay-as-you-go inference and fine-tuning models with your own data. So are we having to spin up different models and different compute every time somebody fine-tunes their model? That seems expensive. - Yeah, well, spinning up
your own model instance and infrastructure would be too expensive. Last time I introduced the concept of Low-Rank Adaptation, or LoRA, fine-tuning, where you can add new skills by fine-tuning a small set of parameters with a targeted dataset instead of retraining the entire model. So you're only adding maybe 100 megabytes of data to a base model that's several hundred gigabytes in size, for example.
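As a rough sketch of that idea with the open-source peft library (the base model and LoRA hyperparameters here are illustrative, not the configuration Azure uses internally):

```python
# Minimal sketch: attaching a LoRA adapter to a base model with the peft library.
# The base model and LoRA hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,                          # scaling factor applied to the update
    target_modules=["qkv_proj", "o_proj"],  # which layers get adapters (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
# Only the small adapter matrices are trainable; the base weights stay frozen.
model.print_trainable_parameters()
```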
- Right, we kind of compared it to Neo in "The Matrix" learning a new skill like kung fu. - Right, so now imagine
experts being able to teach Neo multiple new skills simultaneously. For fine-tuning LLMs, we can achieve this with a shared model instance and an approach called Multi-LoRA: we serve one base LLM on the same server cluster and let different customers fine-tune that base model specific to their needs, with each customer's fine-tuning isolated and used only by them. We're able to attach hundreds or thousands of fine-tuned models as adapters that run simultaneously, isolated from each other, on the base model. This gives you a secure way to fine-tune an LLM with additional skills without having to spin up your own compute-intensive infrastructure, which is a massive cost savings.
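A simplified picture of the Multi-LoRA idea, again using the open-source peft library rather than Azure's internal serving stack: one frozen base model stays in memory while several small adapters are loaded beside it and selected per request (the adapter paths here are hypothetical).

```python
# Minimal sketch: one shared base model with multiple LoRA adapters attached,
# using the open-source peft library. The adapter paths are hypothetical; this
# illustrates the idea, not Azure's internal Multi-LoRA serving stack.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Load the first customer's fine-tuned adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base, "adapters/customer-a", adapter_name="customer-a")
# Additional adapters share the same base model in memory.
model.load_adapter("adapters/customer-b", adapter_name="customer-b")

# Route each request by activating the adapter that belongs to that customer.
model.set_adapter("customer-a")
# ... run generation for customer A ...
model.set_adapter("customer-b")
# ... run generation for customer B ...
```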
- Right, so now we've covered all the major updates since the last time you were on the show. Now, if we were to look at
one or two years from now, given how available AI is now becoming, what do you think the future looks like? - Well, I think two
things you're going to see happen: agentic systems are going to evolve, where you've got a high-reasoning LLM kind of as the core brain talking to lots of other LLMs and SLMs that are task-specific, including multimodal models, each performing their own tasks as part of a larger workflow. The other thing you're going to see is Azure just continuing to
always offer the best, latest and greatest frontier models,
as well as small models, as well as open models and
closed models on infrastructure that is continuously
improving in efficiency. - Really great to hear the vision from the man himself, Mark Russinovich. Always great to have you on the show. Hopefully, next time you're
on a year from now or so, we'll even have more momentum to share with everyone watching. So until then, keep
watching Microsoft Mechanics for all the latest AI updates. Thanks for watching
and we'll see you soon.