Microsoft has built the
world's largest AI supercomputer that's already exponentially
bigger than it was just six months ago, capable of training and inferencing the most sophisticated
large language models at scale on Azure, including things like
Microsoft Copilot and ChatGPT. And based on training innovations
from Microsoft Research, we've also built some of
the world's most compact small language models with
Phi-3 that can run locally and offline even on a mobile phone. And today we're joined by
Microsoft Technical Fellow and Azure CTO, Mark Russinovich, who's going to help us demonstrate and unpack what makes
all of this possible. So welcome back to the show. - It's good to be back,
thanks for having me. - And thanks for joining us again. You know, since last time you were on, about a year ago in May, we went into the mechanics
of our AI supercomputer built in 2020 for OpenAI
to be able to train and run GPT-3 at the time. That system actually comprised 10,000 networked NVIDIA V100 GPUs. And it's not an exaggeration to say that a lot has changed since then. - Yeah, that system pales in comparison to the one we built in November 2023 to train OpenAI's next
generation of large models. That one was independently
ranked by TOP500 as the number three
supercomputer in the world and the largest cloud-based supercomputer. We secured that place with
14,400 Nvidia H100 GPUs and 561 petaflops of compute, which at the time
represented just a fraction of the ultimate scale
of that supercomputer. Our AI system is now
orders of magnitude bigger and changing every day and every hour. Today, just six months later,
we're deploying the equivalent of five of those supercomputers
every single month. Our high-speed InfiniBand
cabling that connects our GPUs would be long enough to
wrap around the earth at least five times. - And to me, that just kind of sounds like
a cable management nightmare. - Well, nothing like the
cable management nightmare under my desk. The point here is that
not only can we accelerate model training for OpenAI
and our own services, but where this makes a huge
difference is with inference to run these models as part of your apps. And inference is where we see
the most growth in demand. In fact, we've optimized
Azure for inference. We run our own commercial
services like Microsoft Copilot, which is used by 60% of the Fortune 500, along with copilot experiences
in Azure and GitHub, all at massive scale and high performance. And with our model as a service option in Azure, you can use our infrastructure to access and run the most sophisticated AI models such as GPT-3.5 Turbo, GPT-4, Meta's Llama, Mistral, and many more.
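As a rough sketch of what that model-as-a-service access looks like from code (the endpoint, key, and deployment name below are placeholders for your own Azure resources, not values from the show):

```python
# Minimal sketch: calling a deployed model in Azure with the OpenAI Python SDK.
# The endpoint, API key, and deployment name are placeholders you would replace
# with the values from your own Azure deployment.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<your-resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4",  # the name you gave your model deployment
    messages=[{"role": "user", "content": "In one sentence, what is a small language model?"}],
)
print(response.choices[0].message.content)
```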
- This makes a lot of sense because most organizations are probably going to be using existing models with their own apps versus building out and training their own large language models. So it's really great to see the diversity of large language models that we have now. At the same time though, there's this whole world of small language models, which some people see really
as the future of generative AI. So how are we looking at that area? - Well, this has been a focus of ours to try to get models to be
as efficient as possible, and we've now gotten a small model to be equivalent in reasoning capability to models five to 10 times its size. We recently announced the Phi-3 family of small language models, or SLMs, based on the work of Microsoft Research. Those have fewer parameters because they're trained
on filtered web content, high-quality data, and synthetic data. Depending on the scenario, these SLMs have similar capabilities to those found in large language models and require less compute. They can use the ONNX Runtime for inference, which makes them portable, and they can even run on your device's local NPU. And they're a great option when you have limited to no connectivity, like with IoT devices or on a mobile device.
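As a concrete sketch, here is a minimal example of running Phi-3 Mini locally from its published Hugging Face checkpoint with the transformers library; this is a desktop-style illustration, and for a phone or NPU you would typically run an ONNX Runtime build of the model instead.

```python
# Minimal sketch: generating text locally with Phi-3 Mini via Hugging Face transformers.
# Assumes torch and transformers are installed and the checkpoint below is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # needed on older transformers releases
)

messages = [{"role": "user", "content": "Give me a chocolate chip cookie recipe with lots of sarcasm."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```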
In fact, I've got Phi-3 Mini running right here on this iPhone. - Wow. - I'll start by putting
it in airplane mode and I'll make sure WiFi is also disabled, so it's running offline and there's no data
being sent to the cloud. Now I'll open my app and when
I move into its settings, you can see that the
model is Phi-3 Mini-4K and it's in a standard Hugging Face format. I can also see some of the
other settings for prompt format and prediction options,
I'll close those out. Now I'll paste my prompt to give me a chocolate chip cookie recipe with lots of sarcasm in the
tone, which should be humorous. And now you can see that it's
starting out pretty good, chocolate chip concoction. It's listing out the
ingredients with jokes that only a professional
greeting card writer would love. And this is impressive because
it's done reasoning to merge baking instructions with sarcasm and it's also running
pretty fast on this phone and those look like legitimate
baking temperatures. Then there are proper instructions for mixing everything and baking. And at the end, it seems to know that I'm actually not going to bake them. The real test would be trying it out, but I'm not a baker; in my spare time, I actually prefer to draw. - Yeah, I've seen that, I was actually a big
fan of your Grogu sketch during the pandemic. - Yeah, that was a really popular one. It kind of goes with the theme here, small packages having lots of power. And by the way, these SLMs
still contain billions of parameters ranging from 3.8 billion for the Phi-3 Mini model to
14 billion for Phi-3 Medium, but they're still significantly smaller than a large language
model like Meta's Llama 3, which is up to 70 billion parameters, and GPT-3, which has even
more, with 175 billion. - And those just keep
getting a lot bigger. So how do you make sure that
you make the right choice then between the different
small language models and maybe using a larger language model? - Well, like we just said,
a small language model won't have the same amount
of inherent knowledge contained within it. For example, GPT-4 knows
the detailed history of ancient Babylon, it knows
chemistry, it knows philosophy. It's been trained on
significantly more information and can understand more
nuanced language patterns to generate more contextually
accurate responses. Small language models
simply can't contain that kind of knowledge. And so the choice will be task-specific, based on the level of sophistication
and reasoning you need, the amount of knowledge you need the model to
inherently have in it. For example, for general chat, you want it to know about all those things and it'll also be resource
and latency-specific. Like if you need to run it on a phone, it can't be a very large language model. - Right, so the SLMs then
have a lot more specificity in terms of what they're
bringing to the table. They've got a different level of quality and efficiency as well. It's going to be interesting, I think, to see their impact on AI PCs with more scoped generative AI experiences. But why don't we move back
to large language models, because to use the sophisticated reasoning that they provide, how do you even begin to use
them efficiently and at scale? - Well, this is where our experience in developing these systems
over the last few years really pays off. A single server can cost
several hundred thousand dollars like the price of a house basically. So we want to make sure that
we aren't wasting resources. As I mentioned, Microsoft runs
inference at massive scale. There are aspects of
inferencing that benefit more from high bandwidth memory
versus pure compute power, and that helps with faster data transfer, better performance, and
more efficient data access. And we've been working
with our hardware partners to evolve their GPU design. For example, we partnered closely with AMD as they designed their MI300X GPU. That's optimized for AI with 192 gigabytes of high bandwidth memory. And we were the first
cloud provider to offer VMs with MI300X GPUs. But in parallel, we worked
with NVIDIA on their GPU design for high-bandwidth memory. Their H200 chips will have 141 gigabytes based on our work with OpenAI. And their Blackwell architecture, which is coming after that, will increase that up to 384 gigabytes. - And that's really a lot. Just to put it into perspective, high-bandwidth memory was at 80 gigabytes only a year ago, and at the time, that was more than respectable. - Well, yeah, the speed
of innovation we're seeing in AI hardware is like
nothing we've seen before; it's a really unique moment in time. The newer NVIDIA Quantum InfiniBand switches can connect networked GPUs
at 800 gigabits per second, so the port speeds have already doubled compared to when we talked last year. And to take advantage of
the best cost performance, our systems support a modular approach to deploy whichever GPU demand calls for. We can already use AMD and NVIDIA GPUs on the same InfiniBand network. - So we've heard you refer
to this as the AI system, which refers to the specialized
hardware and software stack behind our AI supercomputer. So beyond those individual
hardware components, what are some of the
things that we're doing at the AI system level? - Well, so there's the stack
we built with AMD and Nvidia but then there's our
own silicon innovation. We've taken a step back to
think about the ideal hardware and software architecture
and what we'd build if we had no preexisting
dependencies or constraints. And that's where the
work on Maia comes in. Maia represents our
next generation hardware and software reference architecture designed for one purpose alone: to run large-scale AI workloads like Microsoft Copilot more efficiently. Maia vertically integrates what we've learned across every layer of the stack, from the silicon with our Maia 100 AI accelerator, to the Maia kernel library and API that let us squeeze as much performance as possible from the infrastructure while running AI workloads, to the custom backend network that is deeply integrated into the chip. Maia uses an Ethernet-
based network protocol as opposed to InfiniBand
for high speed transfer to connect with other Maia
accelerators on the network. - So this work also then
impacts our data center design and its physical components? - It actually does; this is brand-new technology we're landing in our data centers. One of the areas of data center design that we're evolving is cooling. For example, when you're running
GPU clusters at this level, they produce a tremendous amount of heat. Not only do you have to cool
the data center environment itself to keep ambient
temperatures as low as possible, but GPUs like NVIDIA's
H100 use air cooling so you need a lot of fans to keep the GPUs operating
within their target ranges. That also means more power consumption. So we've instead taken the approach to design the Maia system
with liquid cooling for more efficient heat transfer. Maia's our first liquid cooled
system for AI workloads. We've also built a dedicated
liquid cooling tower as a sidekick to the Maia server. Those match the thermal profile of the Maia chip. This is a rack-level, closed-
loop liquid cooling system for higher efficiency and we
expect to see liquid cooling incorporated into the GP
designs of our hardware partners coming up in the near future. - Right, but this does
beg the question though, as we build these bigger
and more powerful systems, how's this going to impact our
sustainability commitments? - Well, as we design these, we're still committed to meeting our goals, including being carbon negative by 2030. Our Maia architecture, for example, has been developed to meet
our zero waste commitment and by design we're optimizing
for running Maia servers within our existing
data center footprints. - Right, just to be clear here,
you know Maia is being used for Microsoft services initially but is it possible then to
have maybe the software stack and resource manager that
abstracts the silicon, so people are able to pick the workload and kind of compute they
need without changing any code? - Exactly, that's what we're working on: making it so that code can run across different GPU architectures without you having to change it each time. Let me break down how this works. At the top of the stack, you've got your models and
application you need to run. Under that are your AI
frameworks like PyTorch or the ONNX Runtime. Those will often communicate directly with a GPU or accelerator kernel library and SDK. And this is where each manufacturer has their own: NVIDIA has CUDA, AMD has ROCm, and we're using the Maia API, and these interact with the GPUs directly. Now typically, you'll need deep knowledge of the underlying GPU architecture for each GPU to write custom kernels for
your app to be portable. So to solve for this, we partnered with OpenAI to build a Python-based interoperability layer called Triton to work across NVIDIA, AMD, and Maia silicon. Triton will make it possible to move between different GPUs without the need to rewrite your code or to build custom kernels.
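To give a feel for what Triton code looks like, here's a minimal vector-add kernel sketch using the open-source triton package; the block size and helper names are illustrative, and it assumes a PyTorch tensor on whatever GPU backend Triton is targeting.

```python
# Minimal sketch of a Triton kernel: element-wise vector addition.
# Assumes the open-source triton and torch packages and a supported GPU.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which block of elements this instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
print(torch.allclose(add(x, y), x + y))               # True if the kernel ran correctly
```

The point of writing the kernel this way is that the same source is compiled by Triton for the target accelerator, rather than being hand-written separately against CUDA, ROCm, or the Maia API.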
- So do we have a running example then where maybe we've built something for one set of GPUs where we want to bring it to another stack? - Well, so as a proof of concept, we've taken the model
underneath GitHub Copilot and ported it to the Maia accelerator. Let me show you that running. I've got my desktop set
up with three windows: on the left is Visual Studio
Code to interact with Copilot. On the top right is network
traffic from our Maia machine. On the bottom right is our command line to look at the accelerator topology, and you can see there are four devices running inference. I'll start in the
code and write a comment to create a Python
dictionary of six countries and their capitals. And just based on that code comment, GitHub Copilot goes
ahead and writes the code using the model on Maia. You can see the network
traffic spiked on the right as the orchestrator sent
that traffic to the model and returned our code.
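For reference, the completion looked along these lines (an illustrative reconstruction, not the exact output captured in the demo):

```python
# Create a Python dictionary of six countries and their capitals
# (illustrative reconstruction of the Copilot-generated code; the demo's exact output may differ)
capitals = {
    "France": "Paris",
    "Japan": "Tokyo",
    "Brazil": "Brasilia",
    "Canada": "Ottawa",
    "Egypt": "Cairo",
    "Australia": "Canberra",
}
print(capitals)
```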
Now I'll clear that example and start a second one. This time I'll say, write
bubble sort in Python. You'll see that the network
spike lasts a bit longer because it wrote more code this time. So it's possible to run
the same underlying models and code on Maia silicon and there's no noticeable trade
off on speed and accuracy. And once we have Triton running, you'll be able to just run
your code on different GPUs without porting the model. - So is it possible then to also swap out the underlying large language models to more specialized ones in Azure? - Yeah, that's actually
something we've had for a while in Azure AI services where you can deploy the
models you want first and then switch between them. And once you have a few running, you can select the model you want for your app in the Playground, or do the same in code. It's just a matter of changing the endpoint to the model you want to run.
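In code, that swap can be as small as changing which deployment you target; here's a rough sketch using the same Azure OpenAI client as earlier, with hypothetical deployment names.

```python
# Minimal sketch: switching models by changing the deployment you target.
# The deployment names are hypothetical placeholders for models you've deployed.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def ask(deployment: str, prompt: str) -> str:
    # The deployment name is the only thing that changes between models.
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("gpt-4", "Summarize this support ticket: my VM won't start."))
print(ask("gpt-35-turbo", "Summarize this support ticket: my VM won't start."))
```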
- Great, so this means it's really easy then, effectively, to spin up
the model that you want. And something else that
we've had in Azure AI Studio is model as a service for
pay-as-you-go inference and fine-tuning models with your own data. So are we having to spin up different models and different compute every time somebody fine-tunes their model? That seems expensive. - Yeah, well, spinning up
your own model instance and infrastructure would be too expensive. Last time I introduced the concept of Low-Rank Adaptation, or LoRA, fine-tuning, where you can add new skills by fine-tuning a small set of parameters with a targeted dataset instead of retraining the entire model. So you're only adding maybe 100 megabytes of data to a base model that's several hundred gigabytes in size, for example.
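As a rough sketch of that idea with the open-source peft library (the base model and LoRA hyperparameters here are illustrative, not the configuration Azure uses internally):

```python
# Minimal sketch: attaching a LoRA adapter to a base model with the peft library.
# The base model and LoRA hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,                          # scaling factor applied to the update
    target_modules=["qkv_proj", "o_proj"],  # which layers get adapters (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
# Only the small adapter matrices are trainable; the base weights stay frozen.
model.print_trainable_parameters()
```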
- Right, we kind of compared it to Neo in "The Matrix" learning a new skill like kung fu. - Right, so now imagine
experts being able to teach Neo multiple new skills simultaneously. For fine-tuning LLMs, we can achieve this with a shared model instance and an approach called Multi-LoRA: we serve one base LLM on the same server cluster and let different customers fine-tune that base model specific to their needs, with each customer's fine-tuning isolated and used only by them. We're able to attach hundreds or thousands of fine-tuned models as adapters that run simultaneously, isolated from each other, on the base model. This gives you a secure way to fine-tune an LLM with additional skills without having to spin up your own compute-intensive infrastructure, which is a massive cost savings.
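A simplified picture of the Multi-LoRA idea, again using the open-source peft library rather than Azure's internal serving stack: one frozen base model stays in memory while several small adapters are loaded beside it and selected per request (the adapter paths here are hypothetical).

```python
# Minimal sketch: one shared base model with multiple LoRA adapters attached,
# using the open-source peft library. The adapter paths are hypothetical; this
# illustrates the idea, not Azure's internal Multi-LoRA serving stack.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Load the first customer's fine-tuned adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base, "adapters/customer-a", adapter_name="customer-a")
# Additional adapters share the same base model in memory.
model.load_adapter("adapters/customer-b", adapter_name="customer-b")

# Route each request by activating the adapter that belongs to that customer.
model.set_adapter("customer-a")
# ... run generation for customer A ...
model.set_adapter("customer-b")
# ... run generation for customer B ...
```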
- Right, so now we've covered all the major updates since the last time you were on the show. Now, if we were to look at
one or two years from now, given how available AI is now becoming, what do you think the future looks like? - Well, I think two
things you're going to see happen: agentic systems are going to evolve, where you've got a high-reasoning LLM kind of as the core brain talking to lots of other LLMs and SLMs that are task-specific, including multimodal models, each performing their own tasks as part of a larger workflow. The other thing you're going to see is Azure just continuing to
always offer the best, latest and greatest frontier models,
as well as small models, as well as open models and
closed models on infrastructure that is continuously
improving in efficiency. - Really great to hear the vision from the man himself, Mark Russinovich. Always great to have you on the show. Hopefully, next time you're
on a year from now or so, we'll even have more momentum to share with everyone watching. So until then, keep
watching Microsoft Mechanics for all the latest AI updates. Thanks for watching
and we'll see you soon.