Setup vLLM with T4 GPU in Google Cloud

Captions

Mastering the art of easy, fast, and cheap LLM serving for everyone. Easy, fast, and cheap: that's like the perf... the flawless trifecta. But let's see how this works, or what I'm doing wrong; maybe you guys can help me out here.

So I'm just going to create a VM instance on Google Cloud. We're going to add some GPUs to it, a couple of T4s. We're going to choose a system with some resources to shorten our lifespan, or increase our lifespan, by shortening the time we're wasting on it. Give myself a little bit of space to work with, and choose one of these Debian deep learning images. Going back to the installation, we need Python 3.8 and CUDA 11, so let's see here, something that says Python and CUDA... Python 3.10 and CUDA 11, so there we go. Select Allow HTTP, because I want to use this for the OpenAI API drop-in plug-in, and go.

Now that we have an instance, we'll grab the IP address and SSH into it. It's probably not ready quite yet, but we'll try. Go go gadget. My patience for this is infinite. There it is. All right, we're going to go ahead and let it install the resources that it thinks it needs to install. Perfect. Now let's see, we need CUDA 11, so nvcc --version: CUDA tools 11.3. That should be fine, because it's between 11.0 and 11.8.
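The version check done by eye above can be sketched in a few lines of Python. This is only an illustration, assuming the bounds mentioned in the video (Python 3.8 or newer, CUDA between 11.0 and 11.8) and the usual "release X.Y" wording in `nvcc --version` output; the helper names are made up for this example.

```python
import re
import sys


def parse_cuda_release(nvcc_output):
    """Pull (major, minor) out of `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_in_range(version, low=(11, 0), high=(11, 8)):
    """True when the CUDA version sits inside the range quoted in the video."""
    return low <= version <= high


def python_ok(version_info=sys.version_info, minimum=(3, 8)):
    """True when the interpreter is at least the minimum quoted in the video."""
    return tuple(version_info[:2]) >= minimum


# Sample line in the shape nvcc prints; the build suffix here is illustrative.
sample = "Cuda compilation tools, release 11.3, V11.3.109"
cuda_version = parse_cuda_release(sample)
print(cuda_version, cuda_in_range(cuda_version), python_ok((3, 10)))
```

On a real machine the text would come from running `nvcc --version` (for example through `subprocess.run`) rather than from a hard-coded sample string.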
Get our Python version: 3.10. It's higher than 3.8, so let's roll. Copy and paste, because I'm a super haxor. Wait, it said Python 3.10; I bet we should change that to Python 3.10. Let's cancel it. Die. Yeah, see if that works. Jump into my environment and then pip install vllm. It says it takes five to ten minutes; start your clocks.

Yay, it's ready to go, theoretically. So now that we've got this built, let's try out the example for offline inference. "Hello, my name is Joel. I'm from Massachusetts and live in Melbourne." These seem sane. Let's take the next step.

This is what I'm looking to accomplish: we're going to skip the API server and go straight to the OpenAI-compatible server. The model here is the same one that's used in the example, so we're just going to paste this in here. Model... we're actually going to do port 5152 for fun, host 0.0.0.0. Seems to be running.

We're going to bring in our Postman collection, which is just the OpenAI Postman collection. Go to our VM instances, grab the IP address of this instance, go over to the collection. There's a base URL in there; I'm going to edit that. This IP address needs to be with port 5152. I just chose that because that's what I predetermined to be passed through the firewall. Save changes to that, and then do a completion: "Write a limerick about APIs." Oh yeah, text-davinci, an incorrect model, of course. That's probably the default model in the Postman OpenAI collection. "You'll get more useful insight than writing one." I agree with this. Oh, that's actually not horrible. It's not horrible.

All right, well, let's try chat completion. User: "Who won the World Series?" All right, this must be the default, so let's just... ah, fastchat not found. So let's install fastchat, run our open API... OpenAI API server. It's running. And now: who won the World Series in 2020? The Dodgers did. Where was it played? The answer is: that was my kids' show last year. Last year I earned the Alec Baldwin... I learned that Alec Baldwin was a World Series champion with 10 second swim. What is this? The
Mariners won last fall. Where was it played? The Orioles won last year. What is this? What... this is complete gibberish. Change the temperature to 1, or 0.1, excuse me. It's ridiculous. This is ridiculous. I mean, it's fast, okay. It's easy and cheap... isn't right. I don't know if you hear that YouTube Shorts guy: "That ain't right." This is off its rocker.

So what am I doing wrong here? Why is it not just giving me a simple answer? Or am I just trying to oversimplify this tool? Or should it be simple? Maybe it's not simple. Maybe they didn't put easy... oh, easy is the first word, capital E. That's not easy. Or maybe it just doesn't know where it was played. Or was it played in Dallas, something like that, Texas somewhere? What is all of this? What is this response? I'll give it this: it's not slow. But what is this? What am I doing wrong? Help me out, guys.
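For anyone reproducing the Postman step in code, here is a minimal sketch of the completion request, using only the Python standard library. It assumes the server exposes the OpenAI-style `/v1/completions` route on port 5152 as in the video; the IP address and model name below are placeholders, not values from the video. The "incorrect model" error seen with text-davinci suggests the `model` field has to match the model the server was launched with.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, temperature=0.1, max_tokens=64):
    """Build (but do not send) a POST request for an OpenAI-style completion."""
    payload = {
        "model": model,  # should match the model name given to the server
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: substitute the VM's external IP and the served model name.
request = build_completion_request(
    base_url="http://203.0.113.10:5152",
    model="MODEL_NAME_PASSED_TO_SERVER",
    prompt="Write a limerick about APIs",
)
print(request.full_url)
```

Once the server is reachable, passing `request` to `urllib.request.urlopen` would send it; the response body is JSON.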

Info

Channel: CodeJet

Views: 4,234

Id: XKxGWN7BlMs

Length: 9min 30sec (570 seconds)

Published: Thu Aug 10 2023