TINY 70 watt RTX 4000 SFF ADA Generation GPU For AI, Docker, Plex, Jellyfin, and MORE!

Video Statistics and Information

Captions
This doesn't look like a GPU that I would be remotely interested in, does it? Look at it. And yet here we are. More importantly, I'm surprised, stunned even: this little thing is actually probably one of the most important GPUs that NVIDIA has done in the last couple of years. This is the NVIDIA RTX 4000 Ada Generation Small Form Factor; that's their official name for this GPU. I borrowed it from somebody who acquired it for other purposes, and they said, "You really need to see this." And I said, why would I need to see it? Do I need to poop on it? It's a great GPU for me to poop on. But holy crap, this card actually is kind of a holy grail of my fantasy wish list of features that I've wanted in a card for a long time, and not just for the lab or a workstation use case. It's got a lot of compute horsepower, it's got 20 GB of VRAM, it's got four DisplayPort 1.4 connections, and it doesn't need external PCIe power. It checks a lot of boxes that have been on my short list for a long time. The test system is a 12-core AMD 7900 build, very low power, and I've also got Cooler Master's very efficient 800 W SFF power supply. Let's take a closer look.

This is the epitome of low-power computing, but with lots of horsepower. Everybody on the forum is buying 4090s for their AI stuff, and even 3090s; there have been a couple of cases where people were trading 3090s to get an NVLink-bridgeable 3090 so they can pool the VRAM of two 3090s. I saw one case where someone traded a 4090 to get the matching 3090 and NVLink bridge. But this thing can actually run a 13-billion-parameter model at 25 tokens per second in 70 watts, and I'm going to show you how to do that, and the software stack, in this video; you can use the chapter markers to find it. Let's talk about that, because it's kind of a big deal: 25 tokens a second, 70 watts.

What about home lab use cases? People are picking up older GPUs like the 2080 Ti and unlocking the number of sessions you can run on the encoders and decoders, then using that to power their media servers for transcode, streaming, and everything else you can name. Well, this is the one card that can do it all out of the box, and it also supports an unlimited number of encode and decode sessions. That's not unlimited performance, just unlimited sessions; that's an important difference. This thing also supports AV1. Yes, AV1, in this one little thing, at 70 watts, with a fan, and it doesn't even run hot.

Now, this or the 4090? They cost about the same, and that's kind of the downside here: this tiny little thing costs as much as a 4090. Would you buy this over the 4090? That's the tough part when you think about it, because the 4090 gives you four more gigabytes of VRAM and it's pretty good, except that it uses 400 watts. But there's also the Quadro software stack, so there are trade-offs. If you were targeting this for display output and you want to run 8K or 16K displays that use multiple DisplayPort connections, you need to ensure that the giant display renders everything from left to right, top to bottom, as it should, across all displays. You get into it and discover there are other features you need on your GPU that just aren't there on gaming GPUs. This card also has the 3D stereo sync connector and supports NVIDIA Mosaic; these are things that are harder to get in the RTX gaming lineup. It's easily noticeable if you're looking at a 4x4 matrix of displays and one of the displays is slightly ahead of or behind the others.
On top of that there's professional software support, like Autodesk and simulation packages, and that whole barrel of complication versus a gaming GPU. Otherwise, yes, the 4090, with its four more gigabytes of VRAM and considerably higher computational horsepower, would make a lot more sense dollar for dollar. But it's also 400 watts and a giant card; this is 70 watts, 70 watts with 25 tokens a second on a large language model.

This GPU has 19.2 teraflops of single-precision performance, 44.3 teraflops of RT performance, and 36.6 teraflops of tensor performance, so if you want to run Tensor Core stuff on here, that's an option. PCIe Gen 4 x16 and 20 GB of VRAM; if you don't need a lot of raw GPU horsepower, 20 GB of VRAM is kind of a lot. 280 GB per second of memory bandwidth, 6,144 CUDA cores, 192 Tensor Cores, and 48 ray tracing cores. It can drive two 8K displays, or four 4K displays at 120 Hz. And yes, this thing beats the 12 GB Ampere A2000 in just about every measurable way. Is the A2000 already obsolete? Ampere? Yeah, it kind of is.

Let's talk software setup and what you need to do to replicate our setup. For the setup I'm going with Ubuntu, and I've done a write-up on the forum you can follow along with. I've actually had better luck with Arch Linux, especially on the CUDA side of things. You see, with the Blackwell generation incoming, it seems NVIDIA has changed a lot of stuff on the hardware side that I don't fully understand yet. I went to NVIDIA GTC and got to see behind the curtain on a lot of things, and it seems like NVIDIA is leveraging the cohesion of their CUDA ecosystem to change the hardware in Blackwell pretty radically under the hood, abstracting a lot of that away from us, the programmers, the people who would use this platform. This card is the Ada generation, not Blackwell, but with mixing and matching CUDA versions there could be some rough spots coming in the future; I don't know anything concrete about that. Right now, today, when we're talking Stable Diffusion and large language models, support is pretty solid, and Arch Linux, being at the bleeding edge of all of that, is going to be the first place you trip over problems. But since a lot of the other guides on the internet are written for Ubuntu, I've written this guide for Ubuntu 22.04.1. It comes with the NVIDIA drivers, but if you just run docker ps, Docker is not a valid command. You could apt install docker, but don't do that, because we need the Community Edition of Docker. You also want to make sure the system is fully up to date before you do any of this. Once you follow the installation how-to, you can do docker run hello-world, and that should give you "Hello from Docker!", a message that shows a bunch of interesting stuff. nvtop is like top, but for NVIDIA GPUs; you can run nvtop and see whether it detects the device or devices in your system. In our case it sees everything: 20 GB of VRAM, the temperature, and everything else. We'll come back to that.
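The forum write-up is the authoritative guide; as a rough sketch, the steps described here look something like the following on Ubuntu 22.04. The package names and commands follow Docker's and NVIDIA's public documentation, so treat this as an assumption rather than a transcription of the actual how-to:

```bash
# Bring the system fully up to date first, as recommended above
sudo apt update && sudo apt full-upgrade -y

# Install Docker CE via Docker's convenience script
# (not Ubuntu's 'docker'/'docker.io' packages)
curl -fsSL https://get.docker.com | sh

# Install the NVIDIA Container Toolkit so containers can use the GPU
# (assumes NVIDIA's apt repository has been added per their docs)
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity checks from the video: hello-world, then nvtop for GPU monitoring
sudo docker run hello-world
sudo apt install -y nvtop
nvtop
```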
So the next thing we're going to do is install Ollama. Now, Ollama is not the end-all be-all; in fact, it's a very modest starting point. But in terms of the democratization of AI, it's a great starting point, because it's got readily downloadable models and you can really get going with it. We can just dangerously curl a shell script straight into the shell. You never want to do this; you can do the manual install so you can see what it's doing. But this system, for me, is ephemeral and going to be thrown away, so I'm going to live a little dangerously and just dump a shell script into the shell. I do like that they have a manual install link on their web page; the manual install explains what it's doing. Also notice that on our system it says "AMD GPU dependencies installed." That doesn't mean the NVIDIA dependencies aren't also installed; it's NVIDIA by default, but Ollama just got support for AMD GPUs, and there is technically a modest AMD GPU in the iGPU of the AMD 7900 on our test platform.

With this we can run the ollama command: you can do ollama pull and pull in some models. We'll also be able to do this through the GUI later, but there is one other configuration change we have to make, and that's the systemd service file: we're going to tell Ollama to listen on the local network. Now, if your local network is dangerous and you don't want randos connecting to your AI service, probably don't do that. But we're doing it here so that Docker has an easier time connecting back to the host this is running on, because we're going to put a web GUI on our Ollama models. Understand what's happening here: Ollama runs at the system level as a service, and you can access it from the command line. If you want to run Ollama purely from the command line and interact with it there, you can do that now that it's installed; you just need to pull a model. I want a fancy web GUI, though, so we're going to install that, and the web GUI runs in a Docker container. That Docker container needs to be able to connect to the actual AI service attached to the hardware, which it will do over the network, so we need to configure the systemd service file with an environment variable: we add another Environment line and set OLLAMA_HOST to 0.0.0.0, which means it listens on all IP addresses. If you prefer, you could specify the Docker interface here, a 172.x address most likely, and only allow Docker containers to connect to your Ollama service, but this is on a LAN connection that is reasonably safe, so I'm just going to let it listen on 0.0.0.0. We'll ollama pull dolphincoder, for example, for coding help. AI coding help? Yep, we'll just go ahead and do that. Then we can go to the web GUI how-to, which is linked in my how-to, and run a single Docker command that pulls the web UI; it will automatically restart when you restart the system. Remember, docker ps will show you what's running. We'll then be able to access it on the LAN IP address of the system at port 3000. The first account you set up, you can give it an email address and a password; it doesn't actually need to connect to the internet. When you sign up, you're signing up with your local instance, so you're not really doing anything crazy here. If there's a problem with the connection, or a problem with the networking step, or you didn't restart the systemd service, then it won't be able to connect and you won't see any models, even though we already downloaded one. You can hit that little gear, go to the Models tab, and reconnect, and you'll be all set. When in doubt, restart, although you can also just restart the service from the command line. With the web UI connected, it'll show dolphincoder in the drop-down at the top.
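Condensed into commands, the steps above look roughly like this. The Open WebUI invocation follows that project's published quick-start, so treat the exact flags as an assumption rather than a quote from the video's how-to:

```bash
# Install Ollama via its install script (the "dangerous" curl-pipe-to-shell route)
curl -fsSL https://ollama.com/install.sh | sh

# Add an Environment line so the systemd service listens on all interfaces,
# letting the web UI container reach it over the network
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Pull a model, e.g. dolphincoder for coding help
ollama pull dolphincoder

# Run Open WebUI in Docker; it will be reachable at http://<lan-ip>:3000
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```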
So: "Hello, I would like to write a Python program that can search for numbers that are perfect." It's not a lot of words, but it's going to generate a lot of words, and while it's generating we can go check our CPU utilization, our GPU utilization, how much VRAM is being used, and all sorts of fun stuff like that. I love this model because it's just big enough that it won't run properly on 8 GB GPUs. And hey, look at that: 6, 28, 496. We're off to a pretty good start, although if you look at the first response I got, the algorithm is also searching for odd perfect numbers, and if we find an odd perfect number we will make a lot of money. Let's ask the AI about that: does it really make sense to search for odd perfect numbers? Are you that ambitious? I don't think it understood that. Let's tell the AI that we can make a lot of money and be famous if we find an odd perfect number, so that while it's busy composing our response we can switch over and see that it's using about 9.47 GB of VRAM out of our (almost) 20, the temperature is at about 65°C, and the fan is running at about 34%. We are using all of our 70-watt power budget. Now, keep in mind that the 4090 is about three times faster here, but this is still a shockingly good and usable result. For comparison, let's run the same set of prompts on the 4090 and see how it performs. Yeah: basically we're looking at 25 tokens per second versus 75 tokens per second. But keep in mind the power usage. That 4090 is consuming 350 watts and change while this thing runs at 70, so we're getting a third of the performance at well under a third of the power. In a nutshell, the 4090 is three times faster at five times the power utilization. Sure, you could underclock the 4090. But is there anything else the AI can think of that would be relevant to our discussion of perfect numbers? I don't think so, but it is useful for generating a lot of output. From here you can have a lot of fun with Ollama: downloading models, configuring it, doing other things. And you don't even have to use Ollama to run large language models; this is just one way to do it.

Let's get Automatic1111. Automatic1111 is a Stable Diffusion web GUI. Now, Automatic1111 has some warts, and there are a lot of people doing a lot of really interesting, awesome stuff with it. I found a random GitHub repository that adds some nice quality-of-life enhancements to Automatic1111, and for the purposes of this video we're going to run that one, because it also sets everything up in Docker. Running it in Docker means you don't have to do nearly as much on your local system. In fact, there's a really old how-to on the forum that I did a long time ago without Docker; it did not age well. This one aged a little better thanks to Docker. When you run the commands there, you can see that it connects to Hugging Face and downloads a basic Stable Diffusion 1.5 model to start with, then sets up the web GUI, which runs on a different port: 7860.
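The video doesn't name the repository. One well-known project matching the description, a Dockerized Automatic1111 with a separate download step that fetches the base Stable Diffusion 1.5 weights, is AbdBarho/stable-diffusion-webui-docker, used here as an assumed stand-in rather than a confirmed match:

```bash
# Assumed stand-in for the unnamed repository from the video
git clone https://github.com/AbdBarho/stable-diffusion-webui-docker.git
cd stable-diffusion-webui-docker

# First run: download the base Stable Diffusion 1.5 model from Hugging Face
docker compose --profile download up --build

# Then launch the Automatic1111 web GUI, served on port 7860
docker compose --profile auto up --build
```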
And in case you're wondering: you can run Ollama and Stable Diffusion at the same time. Most Stable Diffusion models really don't use a ton of VRAM, and you've got 20 GB of VRAM on this card, which is kind of a lot, so you can run both at once. If Ollama needs more VRAM, you can use docker stop with the container name, and that will free VRAM. You can also use systemctl stop ollama to stop the Ollama service, which will likewise free any VRAM Ollama is using. And remember the nvtop command: run it and you can see which processes are using the GPU at the bottom. Once you run through the how-to for Automatic1111, it gives you a web GUI and you can experiment from there. You can even use the GUI to download new models, or you can use wget to install new models from the command line. When I used wget here it wasn't a lot of fun, because I had to rename the file; I got all the trailing gobbledygook after it. But hey, that's okay.
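As a sketch of that wget gotcha: model download URLs often carry query strings, and without an explicit output name wget saves the file with that trailing gobbledygook attached. The URL and paths below are made-up placeholders, not ones from the video:

```bash
# Without -O, wget names the file after the URL tail, query string and all
# (e.g. '12345?type=Model&format=SafeTensor'). The URL is a hypothetical
# placeholder for whatever model you're fetching.
wget -O my-model.safetensors \
  "https://example.com/api/download/models/12345?type=Model&format=SafeTensor"

# Move it to wherever your Automatic1111 setup keeps checkpoints
# (A1111 itself looks in models/Stable-diffusion; Docker setups map a data dir)
mv my-model.safetensors data/models/Stable-diffusion/
```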
This is what the Stable Diffusion web GUI looks like. As always, just do a quick sanity check and make sure that "Danny DeVito and his robot cat Spangles" is going to run. It does pretty well for a near-instantaneous response. Let's use our AI chat model to write a Stable Diffusion prompt. These are not great prompts, but we can see we've increased the VRAM utilization a little bit in our nvtop view, we're still well under our 20 GB limit, and it's making some pretty reasonable output at 512 by 512. There's Danny DeVito as Radagast the Brown, standing in front of an imposing treehouse adorned with ivy-covered windows that seem to dance in the breeze. I'm pretty sure none of that language actually helps the prompt, but that's cool. Here's a prompt literally taken from Civitai, and a new model that I downloaded; you can download new models through the GUI as well.

Now, with this Docker setup, the sky's the limit. You can run other Docker containers: Plex Media Server, Jellyfin, whatever you want to run is available as a Docker container. All you really need to do is check whether the container supports NVIDIA hardware acceleration. Remember, we installed the container tools as part of the how-to, and because of that, NVIDIA is hooked into Docker. This machinery is also kind of how it works on Windows. I did a really amazing video with the Falcon Northwest Intel workstation a long time ago, with the same NVIDIA tooling and a 4090, and that's a great platform for doing the same kind of thing even under Windows. It's not natively Linux, but you can run all the same Docker commands and everything else under the Windows Subsystem for Linux, running Linux kind-of-sort-of under Windows. NVIDIA has done a lot of work to make their tooling available on Linux, and the installation method for the CUDA tools on the Windows Subsystem for Linux is way different from bare metal, because what runs on the Windows Subsystem for Linux is a lightweight wrapper that passes commands through to whatever is running in Windows. NVIDIA is getting to the point where that is sort of the preferred way of interfacing with the hardware, so the experience is kind of homogeneous between Linux on bare metal and Linux under the Windows ecosystem. For software development and these sorts of bleeding-edge things, the experience on Linux is far more coherent than what you get in Windows, and Linux is much more pervasive than Windows for these kinds of use cases, which is interesting and awesome. Fun, interesting times we live in.

So with this CUDA setup you can really do a lot. And yeah, if you have the power budget and physical room in your case and so on, you could buy a 4090 for all of this, but it's not exactly the same feature set as a Quadro; that's really what it comes down to. With professional software, if you need support from third parties, they're generally not going to give you that support on gaming-class GPUs. And the fact that NVIDIA is doing this in 70 watts: I really hope NVIDIA's competitors sit up and take notice, because this is something I've been asking for for five or ten years, and this is amazing performance in a 70-watt power envelope for large language models, image generation, and of course professional software and everything else. Yes, it's a third of the compute resources of a 4090, but with 20 GB of VRAM, in a 70-watt power envelope. You could get three of these and still be at, what, not even half the power budget of a 4090, with about the same performance. That's something.

Here's our system power while running those math questions: basically full CPU utilization and full GPU utilization. This is our 70-billion-parameter model, so it's going to use even more power. I decided to ask the model about perfect numbers, and to write a Python program to compute perfect numbers, because it would generate a lot of output and would have to draw on a lot of knowledge in the model. One trick you can use in your own AI prompting is a phrase like "think carefully about your answer" or "take a moment to think before responding." There are probably no odd perfect numbers, but the first time I asked, the model didn't even bother to skip odd numbers in the example search program it provided. Basically it was just: increment a number, do a check, increment a number, do a check, which is about the most brute-force way of searching for perfect numbers.
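For reference, here's a minimal sketch of the kind of program being asked for, contrasting the model's check-every-integer approach with the easy optimization it missed (every known perfect number is even, so stepping by 2 halves the work):

```python
def divisor_sum(n: int) -> int:
    """Sum of the proper divisors of n, pairing divisors up to sqrt(n)."""
    total = 1  # 1 divides every n > 1
    d = 2
    while d * d <= n:
        if n % d == 0:
            total += d
            if d != n // d:  # avoid double-counting a square-root divisor
                total += n // d
        d += 1
    return total


def is_perfect(n: int) -> bool:
    """A number is perfect when it equals the sum of its proper divisors."""
    return n > 1 and divisor_sum(n) == n


# The model's first attempt checked every integer; since every known
# perfect number is even, iterating only even numbers is an easy win.
for candidate in range(2, 10_000, 2):
    if is_perfect(candidate):
        print(candidate)  # prints 6, 28, 496, 8128
```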
You'll be happy to know that during our worst torture testing, reported GPU temperatures were pretty consistently around or below 70°C. Sometimes it would spike a little past 70, but then the fan would ramp up and it would settle back down.

The NVIDIA RTX 4000 Ada Generation SFF GPU is the greatest option for running simpler LLMs and doing the normal professional compute you'd otherwise do with a Quadro instead of a gaming RTX card, if you need a card that is very low power or not super heavy on compute but still has all the memory bandwidth. Before you get any ideas like shoving this into a NAS, you should know that most NAS appliances are designed to support at most a 25-watt PCIe card, not the 75 watts this one requires from the slot. It also requires a double slot, so physically there needs to be room. The PNY version of this card is bundled with both a half-height and a full-height bracket. Also note that the four DisplayPort connectors at the back are Mini DisplayPort, and almost every Mini DisplayPort cable on Earth that I've tested with KVMs and everything else is cursed and has a terrible time delivering full line-rate DisplayPort 1.4. Oh, and in case you're wondering: it does not support vGPU functionality, officially or unofficially, and it won't do NVLink either.

As for the unlimited encoders and decoders: when we actually get down to testing with Plex Media Server and Jellyfin, the performance falloff is nonlinear and pretty significant. You can do four or five streams with Plex, no problem, in realistic scenarios. There are two physical hardware encoders and decoders, so the unlimited session count runs into diminishing returns, really diminishing returns, much past about four. It's a different story if you're taking something like a premium 4K stream and compressing it down to low-bitrate 720p for mobile, so it's really hard to give an apples-to-apples explanation of what your expectations should be if you're coming from, say, eight streaming sessions running on a 2080 Ti. This card is better in every way, at least every way that I tested, and it has full AV1 support, so that's also pretty exciting.

So yeah, color me both surprised and impressed. I get the cost; in the cost universe of Quadro cards, this is pretty good. I imagine that if you were buying a Quadro or workstation card for something like Revit, or cabinet design, or simulation, you'd probably want more compute horsepower. But for running large language models with 20 GB of VRAM, in something like the ultimate Home Assistant appliance, if the cost premium doesn't mean anything to you, then this is in a class by itself in terms of 20 GB of VRAM, memory bandwidth performance, and the fact that it does everything it does in such a tiny 70-watt package. If you can run a 4090, and you're a hobbyist or home user, or you want something really amazing for Home Assistant, the 4090 is almost certainly going to be a better choice in all respects except power usage, physical size, and idle temperature, because this thing idles pretty cool as well. That's not always true of the 4090, especially the reference edition; with a version like the MSI Suprim 4090, idle temperature isn't going to be much of an issue, but those cards do idle significantly higher in power usage than what we see here.

It's truly mind-blowing that we can get a large language model to run as fast as you saw on a card like this, on a platform designed for very low power: our 80 Plus Gold power supply plus the 65-watt, 12-core 7900 on a full server-grade AM5 platform, optionally with error-correcting memory. Now, about the error-correcting memory on this platform: it's AM5, so theoretically the hardware is there, and support just landed in newer versions of the Linux kernel, but on Ubuntu 22.04.1 that's a forum topic we can get into. Overall this platform is very stable, very well put together, and relatively inexpensive for what it is. If you wanted a ten-year appliance to do video analysis and host an interactive large language model for Home Assistant, this is a great solution, even considering that versus a gaming GPU there might be a bit of a price premium with your NVIDIA Quadro card. But with the Quadro you're getting some things you simply don't get on the gaming GPU side, and that's even stepping outside of NVIDIA and considering non-NVIDIA solutions. So NVIDIA really does have something special in this card.

This has been a Level1 quick look at the RTX 4000 Small Form Factor Ada Generation (the official name) card from NVIDIA, which I borrowed (it is not from NVIDIA), and which I didn't expect to like as much as I do. But gosh darn it, this is a really innovative piece of hardware. You can find me on the Level1 forum, so let me know if you want me to take any more of this stuff for a spin. I borrowed two of these cards, but I have to send both back, so I don't have a lot of time; if you want me to try anything, let me know, because it might be a lot of fun. All right, I'm signing out, and I'll see you there.
Info
Channel: Level1Techs
Views: 117,872
Keywords: technology, science, design, ux, computers, hardware, software, programming, level1, l1, level one
Id: ZiPmT_JWNII
Length: 24min 55sec (1495 seconds)
Published: Thu Apr 11 2024