How to self-host and hyperscale AI with Nvidia NIM

Video Statistics and Information

Captions
Recently, somehow, I got access to an overpowered H100 GPU, and on it I was able to self-host and scale my own army of AI agents thanks to a new tool called NIM. Ten years from now, the workforce will look nothing like it does today. Bill Gates once said that most people overestimate what they can do in one year and underestimate what they can do in ten years. AI models like Llama 3, Mistral, and Stable Diffusion have already changed the world, but they've barely even penetrated the mainstream consciousness over the last year. In today's video, we'll fast-forward ten years into the future to a magical time when any job that can be done by a robot will be done by a robot. Some experts think we'll create a sci-fi AGI, an all-in-one jack-of-all-trades in a black box that can do every intellectual job better than we humans do, but that's highly speculative. Perhaps a far more realistic vision is a network of highly specialized AI agents running on Kubernetes.

If you're an indie hacker, an entrepreneur, or even a massive enterprise, and you want to build an AI workforce that includes a doctor, a lawyer, and a programmer, you'll quickly run into a massive technical challenge. Even if your AI models are smart enough to do these jobs, a model is quite literally just a file with weights and biases, AKA numbers. In order to run inference with it, like generating text and images, you'll need a massive amount of RAM and the parallel-computing magic of a GPU to do all that linear algebra. And if your app ever goes viral, it'll quickly grind to a halt, because scaling up this technology is extremely difficult.

Well, not anymore, thanks to NVIDIA NIM, the sponsor of today's video. NVIDIA was kind enough to give me access to an H100 GPU to try out their NVIDIA NIMs, which are inference microservices. What they do is package up popular AI models along with all the APIs that you need to run them at scale, including inference engines like TensorRT-LLM, as well as data-management tools for authentication, health checks, monitoring, and so on. All these APIs, along with the model itself, are containerized and run on Kubernetes. That means you can deploy it to the cloud, on-prem, or even on your local PC, and that's going to save you weeks, if not months, of painful development time.

We'll look at a real example in just a minute, but what's cool about this platform is that there's a playground where you can play around with these NIMs right now. It has all the popular large language models like Llama, Mistral, Gemma, and so on. It can do image and video with Stable Diffusion and others, along with a bunch of other specialized models for healthcare, climate simulation, and more. These models are hosted by NVIDIA, and you can use them right now in the browser, or you can access them via the API, and they've been standardized to work with the OpenAI SDK. In addition, because it's containerized, you can also pull it with Docker and run it in your local environment, or configure it in the cloud to scale to any workload.

And now we can start to see what the future workforce might look like. Imagine you work for Dinosaur Enterprises, and your CEO, Chainsaw Jeff, wants to cut the human headcount by 90% so he can increase his bonus by 4%. How is he going to do that for the shareholders? First, let's get rid of customer service agents by deploying one NIM that can recognize speech, along with a large language model to generate text. We might also want to replace warehouse workers with superior autonomous forklift drivers, and a custom-trained NIM hosted on-prem is perfect for that. We also have hundreds of worthless product managers who do nothing but post day-in-the-life TikToks, so let's add a Stable Diffusion NIM to generate product mockups and website designs to get rid of them. Now, these websites aren't going to build themselves. Well, no, actually they are, if we deploy a NIM that can code. And finally, for the last 10% of humans working here, we can deploy a mental-health NIM to ensure their continued well-being.

Now, obviously I'm joking here, and humans will continue to thrive in the artificial intelligence age, but the main takeaway is that NIMs allow anyone to scale AI in any environment, and it's all about augmenting human work as opposed to replacing it. My personal goal is to someday create a billion-dollar business as a solo developer, and NIMs are the perfect tool to make that dream a little more realistic. They simultaneously reduce development time while facilitating the deployment of tools that augment my own limited human capabilities.

But now let's take a look at how it works from a programming perspective. Like I mentioned before, NVIDIA gave me access to an H100 for a few days, which is their 80 GB GPU used in data centers. These things go for about 30 grand on the street, if you can even get your hands on one, and it was just way more horsepower than I even knew what to do with. As you can see here, I have SSHed into a server, which is also conveniently running VS Code in the terminal. You'll notice we've pulled a Docker image, and I'm also running nvidia-smi to check the status of the GPU. There's also a running process for Kubernetes that will allow this microservice to automatically scale when traffic increases and automatically heal when things break. Most importantly, though, everything is configured to work out of the box; you don't actually have to touch Kubernetes yourself.

All we have to do is write a little bit of Python to run the model. You could do this in a Python notebook, but I'm just going to write a Python script here in this app.py file. The actual API to access the model is running on localhost:8000, and we can use the requests library in Python to send HTTP requests to it. The first thing we might want to do is see which models are available in this environment; we currently have access to Llama 3. Now that we have that piece of information, we can make a POST request to the chat completions endpoint, and most importantly, we have an array of messages here, which provide the LLM with context for the conversation. In my case, I want to ask it what the best JavaScript framework of all time is. From there, we can define some configuration options like the model name, max number of tokens, temperature, et cetera, and finally, to get a response, all we have to do is make a POST request with this data.

Now let's run this code by pulling up the terminal and entering python app.py. You'll notice we get a full response almost instantaneously. Under the hood, these NIMs use tools you would expect, like PyTorch, but also other tools you may not know about, like Triton, to maximize performance on inference, which is awesome, because that means you don't need to figure out how to make things fast on your own, and I would say latency is probably the number one killer for people starting their own AI SaaS products. Oh, and in addition, we can also monitor the hardware: here I can see the GPU temperature jumped after I asked it that question, and we can also keep an eye on the CPU and memory usage.

Now, of course, when I ask Llama 3 for the best JavaScript framework, it's going to respond with React, even though that's clearly a lie, so I changed my prompt to ask it what the worst JavaScript framework is, and of course it threw shade at its archnemesis, Angular, which is the real best JavaScript framework ever invented. One other thing I'll mention about the code here is that instead of requests, we could also use the OpenAI SDK, which is extremely popular and has become somewhat of an industry standard.

The bottom line, though, is that we now have an API that can scale up to an infinite number of GPUs. Those GPUs could live on AWS, they could live in your own data center, or it could be the one in your PC right now, which is pretty awesome. If you want to try out NIMs for yourself, I'd recommend going directly to the API catalog at build.nvidia.com, where you can easily try them out, or check out NVIDIA AI Enterprise if your goal is to operate at a massive scale. Thanks for watching, and I will see you in the next one.
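The app.py workflow described in the video (list the available models, then POST to the chat completions endpoint with a messages array and some config options) can be sketched roughly like this. This is a minimal sketch, not NVIDIA's exact demo code: it assumes a NIM container is already serving its OpenAI-compatible API on localhost:8000, and the model id shown in the comment is a placeholder, so query `/v1/models` first to see what your container actually serves.

```python
import requests

# Assumption: a NIM container is running locally and exposing an
# OpenAI-compatible REST API on port 8000, as shown in the video.
BASE_URL = "http://localhost:8000/v1"

def parse_model_ids(payload: dict) -> list:
    """Pull model ids out of an OpenAI-style GET /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def list_models() -> list:
    """Ask the local endpoint which models are available."""
    resp = requests.get(f"{BASE_URL}/models", timeout=10)
    resp.raise_for_status()
    return parse_model_ids(resp.json())

def build_chat_request(prompt: str, model: str,
                       max_tokens: int = 256,
                       temperature: float = 0.7) -> dict:
    """Payload for POST /v1/chat/completions; the messages array
    gives the LLM the context for the conversation."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt: str, model: str) -> str:
    """Send the chat request and return the generated text."""
    resp = requests.post(f"{BASE_URL}/chat/completions",
                         json=build_chat_request(prompt, model),
                         timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Example usage (requires a running NIM), e.g. with a model id like
# "meta/llama3-8b-instruct":
#   model = list_models()[0]
#   print(chat("What is the best JavaScript framework of all time?", model))
```

Because the endpoint is OpenAI-compatible, the same calls could instead go through the OpenAI SDK by pointing its `base_url` at `http://localhost:8000/v1`, which is the industry-standard alternative the video mentions.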
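The hardware-monitoring step (watching GPU temperature and memory while the model answers) can be approximated with nvidia-smi's documented `--query-gpu` fields. A small sketch, assuming the `nvidia-smi` CLI is on PATH and returning None when it is not:

```python
import shutil
import subprocess

# Comma-separated query fields documented by nvidia-smi.
FIELDS = "temperature.gpu,utilization.gpu,memory.used,memory.total"

def parse_smi_line(line: str) -> dict:
    """Parse one CSV row like '61, 12 %, 2048 MiB, 81920 MiB'."""
    temp, util, used, total = (field.strip() for field in line.split(","))
    return {"temp_c": int(temp), "util": util,
            "mem_used": used, "mem_total": total}

def gpu_stats():
    """Return one dict per GPU, or None on machines without NVIDIA tooling."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_smi_line(line) for line in out.strip().splitlines()]
```

Polling this in a loop (or just rerunning it after a request) is enough to reproduce the "temperature jumped after I asked it that question" observation from the video; heavier setups would use something like DCGM instead.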
Info
Channel: Fireship
Views: 282,534
Keywords: webdev, app development, lesson, tutorial, nvidia, nvidia nim, nvidia nim explained, nvidia nim tutorial, ai, coding, programming, slef hosting, self hosted, ai agents, nim, web development, nim lesson, nim explaned, nim tutorial, nvidia coding, fireship, Nvidia H100 GPU, How to scale AI models, containerization, nvidia nim self hosting, nvidia nim ai, self hosted ai, nvidia nim demo, artificial intelligence, nvidia ai, nvidia artificial intelligence, nvidia nim ai deployment, code
Id: ZgGss27IfwA
Length: 6min 43sec (403 seconds)
Published: Tue Jul 09 2024