RTX 4090 Performance in 3D and AI Workloads

Captions
In this video we're taking a tour of the silicon doorstop known as the RTX 4090 to see what it's like in Blender and AI-related tasks. More importantly, like all my videos, I care a lot more about the experience and usability than raw numbers, and that's what we're going to focus on. I really want to give you a sense of what it's like to live with this card and get work done.

Considering how hard these are to find anywhere near MSRP, I'll consider myself lucky to have found one at MSRP: a PNY VERTO Epic-X at just over $1,989 all in. From my understanding, just about every 4090 is within spitting distance of the others performance-wise, so this really comes down to aesthetics and any brand loyalty you may have. Speaking of which, feature-wise every 4090 is identical: they all have 24 GB of RAM and just over 16,000 CUDA cores, and like all air-cooled versions, this one has a 450 W TDP. Being a 40-series card also means we have DLSS 3 frame generation, but that's something we're going to dive into in a future video. Finally, this particular card is super quiet. In fact, it's totally silent in desktop mode and emits only a hushed whir when being driven hard, though keep in mind that from what I've heard, pretty much all 4090 cards are efficient and cool.

So we already know these cards are fast and expensive. What's it like actually using them? Let's start with 3D via Blender.

For our Blender demo we're going to keep it simple. The bottom line is that the 4090 is an absolute monster when it comes to 3D, and it doesn't really need to prove itself: it's pretty much the fastest consumer card you can buy right now, and its performance is absolutely backed up by the price you pay for it. So I've created a very simple scene here to show what I think is one of the strongest aspects of that power: the real-time Cycles preview. I generated a flat plane, gave Lucy two lights to shine on her, and added a subsurface scattering shader. The real benefit of having this much hardware power is that when you switch over to Cycles, you get a real-time preview of your work. What's nice about the way Blender lets you turn this on is that it defaults to your CPU. I have an okay CPU in this machine, but as you can see, as I move around I get a low-pixel preview, which lets me place the camera where I want; as soon as I stop moving, it starts rendering out the scene, and you can see the sample count slowly climb as it builds the preview. When I switch over to GPU compute, the same pixelized version appears at first, but as soon as I stop moving, the rendering is done basically instantaneously; if you look up there, it does all those samples in a flash. As a workflow enabler this is incredible, because it makes it really easy to look around your scene and preview your lighting in real time without worrying about a slow update process as you move the scene around.
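If you'd rather script that CPU-to-GPU switch than click through the preferences, the same toggle is exposed through Blender's Python console. Here's a minimal sketch, assuming Blender 3.x or later with an OPTIX-capable Nvidia card:

```python
# Minimal sketch: switch Cycles to GPU compute from Blender's Python
# console (assumes Blender 3.x+ and an OPTIX-capable Nvidia card).
import bpy

prefs = bpy.context.preferences.addons["cycles"].preferences
prefs.compute_device_type = "OPTIX"   # or "CUDA" on older drivers
prefs.get_devices()                   # refresh the detected device list

# Enable every detected GPU device.
for device in prefs.devices:
    device.use = device.type in {"OPTIX", "CUDA"}

# Tell the scene to render on the GPU instead of the CPU default.
bpy.context.scene.cycles.device = "GPU"
```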
Now, it's not all roses in Nvidia land. Just for the sake of fairness, one thing I have seen is that with some particular scenes, there's a large amount of prep work that happens before you actually get to the rendering, and it's almost as if the 4090 is being held back. So I've turned on GPU compute and we'll go ahead and start a render here, and what you're going to notice is that there's a ton of synchronization and preparatory work; the actual render itself happens incredibly quickly. To be fair, this is all part of the rendering process. Just because we're not seeing an image doesn't mean it's not rendering things out, but it is something interesting I've seen before with Nvidia hardware on this particular scene. That's pretty much the only downside to Nvidia hardware that I've seen, or at least an area where other machines, like my M3 Max MacBook, actually get close to the rendering speed of this 4090. Other than that, the 4090 is again an absolute monster when it comes to rendering, and it will make you faster at this particular application.

Okay, let's move into AI with Stable Diffusion and LLMs. The headline here is fast, but with some caveats. First, Stable Diffusion. Using the base 1.5 model, a 20-step 512-pixel image takes around 3 seconds. However, Nvidia's optional TensorRT extension drops that to a somewhat astonishing 0.7 seconds. But this is where our first caveat comes in: the 0.7 seconds is with no negative prompt. Adding one doubles generation time to around 1.2 seconds. Of course, that's still super fast, but it highlights the key trade-off Nvidia made with some of these optimizations; in other words, we trade some flexibility and ease of use for speed. For example, in order to use TensorRT we have to generate specific engines for each model, size class, and batch count. This can take some time to do and adds complexity to the overall process.

Upscaling is a good illustration of this. Most of the time, a 512-pixel image isn't quite large enough for most output tasks. Larger base images take more memory and time, though, so a clever workaround is to use a secondary upscaling pass. However, in order to upscale with TensorRT we have to generate two engines: one for the base resolution and one for the upscale. Sadly, your first attempt to generate that second engine may fail, like it did for me, with a CUDA memory exception. To fix this I did some research and found that I had to create a specific CUDA system-memory fallback policy for the Python executable and rerun it. This worked, but if you're like me, you'll be faced with a somewhat strange circumstance: image generation with upscaling is now no faster than without the TensorRT optimizations. Worse, negative prompts now cause an error, at least they did for me.

The larger point here is that while raw numbers can be great, what actually matters is output quality and, to a lesser extent, ease of use. To go fast, we end up jumping through some pretty obscure hoops. In other words, while the raw performance of TensorRT is great, I'm not really sold on its overall user experience. That's not to say your experience will be the same; I just want to put that out there in case you see some of these numbers and want to replicate that same experience on your end.
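If you want to sanity-check timings like these outside of a web UI, a rough harness with Hugging Face diffusers looks something like the sketch below. Note this is the plain PyTorch path, not the TensorRT one, so the negative prompt shouldn't meaningfully change the time here (standard classifier-free guidance already runs an unconditional pass); the prompts are stand-ins of mine:

```python
# Rough timing sketch: 20-step 512x512 generation with the base 1.5
# model via Hugging Face diffusers (plain PyTorch, not TensorRT).
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"  # stand-in prompt
negative = "blurry, low quality"                   # stand-in negative

# Warm-up run so CUDA initialization doesn't pollute the measurement.
pipe(prompt, num_inference_steps=20, height=512, width=512)

for neg in (None, negative):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt, negative_prompt=neg,
         num_inference_steps=20, height=512, width=512)
    torch.cuda.synchronize()
    label = "with" if neg else "without"
    print(f"{label} negative prompt: {time.perf_counter() - start:.2f}s")
```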
Okay, to really drive this home, I'm adding my M3 Max MacBook Pro as a comparison device. I want to be clear here: a 4090 will mostly trounce this laptop, and on raw numbers it most certainly does, but in practice the 4090's value proposition is more complicated than I expected, and this laptop helps us understand why. On the Mac, using an app called SII, I'm able to generate 1024-pixel output in around 6 seconds. Recall the 4090 takes about three, so yes, twice as fast. But out of the box, Stable Diffusion web UI makes getting usable images much harder than this app: you have to find models, wade through a complex interface, and deal with upscaling and generation oddities. Fast is good, but ultimately not as compelling to me if it's harder to get great results. Bottom line: I hope Windows develops a more diverse ecosystem of native apps, because right now, to my surprise, I believe the Mac offers a more compelling Stable Diffusion experience for non-experts.

There's one last thing I want to show for Stable Diffusion, and that's real-time generation via ComfyUI and SDXL Turbo. This is something that really makes a 4090 shine, and I'm excited to show you. If you haven't seen this particular interface before, it's called ComfyUI, and it's a node-based interface that allows this workflow to really shine. First things first: we downloaded and installed ComfyUI, added the Stable Diffusion XL Turbo checkpoint, and loaded the real-time prompting workflow into this user space; we'll have links to all of these in the video description. Actually using real-time prompting is pretty simple: you go to the queue options over here, check Extra Options and Auto Queue, and as soon as you press the Queue Prompt button, it starts looping the queue so that as fast as we type, it tries to generate images for us. We can just start typing. What's nice about this workflow, in case it's something you could actually use, is that it very conveniently puts the noise seed up here for us, so if we don't like the overall feel we're getting, we can easily change the seed value to get something more in line with what we want, and it will then make variations based on the prompts we have here.
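For reference, the real-time trick here boils down to single-step SDXL Turbo sampling with guidance disabled. A minimal diffusers equivalent of what the ComfyUI workflow is doing, as a sketch (the prompt, seed, and output filename are my own stand-ins):

```python
# Minimal sketch of the single-step SDXL Turbo sampling that makes
# ComfyUI's real-time prompting loop possible (via diffusers).
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Fixing the noise seed is what lets you re-roll "variations" by hand,
# just like the seed widget in the ComfyUI workflow.
generator = torch.Generator("cuda").manual_seed(42)

# Turbo is distilled for 1-4 steps with guidance disabled.
image = pipe(
    "a cinematic photo of a lighthouse at dusk",  # stand-in prompt
    num_inference_steps=1,
    guidance_scale=0.0,
    generator=generator,
).images[0]
image.save("turbo.png")
```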
All right, let's check out some LLMs. The LLM story is similar to Stable Diffusion: fast, but again with some caveats. For inference, we often rate LLMs on tokens generated per second, with a token roughly mapping to a single word. The average adult reads between 5 and 10 per second, and on most small models the 4090 generates 20 or more, so yes, more than fast enough. That's honestly not that exciting, or even shocking, so the more important question is how large a model we can process with the 24 GB of RAM this card has. The answer: if you want real-time generation, stick to models whose size is less than the total VRAM of the card. For example, Code Llama 7B is 5.5 GB and processes at around 36 tokens per second; the much larger Mixtral Instruct with 4-bit quantization is 26.4 GB and plods along at 3 tokens a second. RAM matters.

Unfortunately, while I don't have a pro-spec Nvidia card handy, I do have that MacBook with 36 GB of shared memory to help show why. I loaded up LM Studio with the same Code Llama model used above. Again, the 4090 was 36 tokens a second; the Mac, 33, so definitely faster. Next, though, I loaded up a more demanding 15 GB 2-bit quantized Mixtral 8x7B on each. To my surprise, tokens per second now favored the M3 at 33 per second; the 4090 managed 23. Finally, and honestly as expected, the 36 GB of unified memory on the Mac allowed it to run that 26 GB Mixtral model at 26 tokens per second; recall the 4090 was only able to manage three. Of course, this brings up the most important point, and I truly do mean the most important point, at least as of now: the Code Llama model, while fast on both machines, produced significantly worse results than the larger Mixtral model. Let's take a look to see what I mean.

So here we are on the PC running the 4090, and we have the excellent LM Studio loaded up. If you haven't tried this application and you're interested in large language models, it's a fantastically easy way to get started. We have three models on this machine: Code Llama, which weighs in at 5.53 GB, and two variants of Mixtral 8x7B, the 2-bit quantized at 15.6 GB and the 4-bit quantized at 26 GB. That first Mixtral fits within the memory space of our card; the second doesn't. Performance-wise, we start off with Code Llama. The 5 GB model fits easily into the card's memory, and its token generation is very respectable: 34 per second, with a 0.39-second time to first token. When we move up to the 2-bit quantized Mixtral, we take a bit of a hit, but it's still pretty performant: 20 tokens per second, with just under a second to first token. The real problem, however, comes with the full 4-bit quantized model. The model is now bigger than the memory space of the card, so our tokens per second takes a massive hit, down to four tokens a second, and it takes nearly 40 seconds for output to even start. This is what we mean when we say the memory restrictions of the 4090 start becoming a problem.
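LM Studio reports these throughput numbers for you, but if you want to reproduce the measurement yourself, something like llama-cpp-python (built on the same llama.cpp engine LM Studio uses for GGUF models) can time generation with all layers offloaded to the GPU. A sketch, with a stand-in model path:

```python
# Sketch: measure tokens per second for a local GGUF model with
# llama-cpp-python, offloading every layer to the GPU.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-7b-instruct.Q5_K_M.gguf",  # stand-in path
    n_gpu_layers=-1,   # -1 = offload all layers to the GPU
    n_ctx=4096,
)

start = time.perf_counter()
out = llm("Write a Python function that validates an email address.",
          max_tokens=256)
elapsed = time.perf_counter() - start

# llama-cpp-python returns an OpenAI-style completion dict.
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```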
Let's now take a look at the Mac to see what happens when we have more memory available. Moving on to the Mac, the idea here is to give you a real-world demonstration of what having more memory available to the large language model makes possible. We start off with the same basic chat prompts as before, the first running against Code Llama Instruct. The response we get here, I would argue, is basically not usable; it's not really following our prompt, and although it's fast at over 40 tokens a second, which by the way is actually a little faster than the 4090, the result is not what I would consider usable. It's only when we get to Mixtral that we start getting results that, at least to my eyes, resemble something closer to what we want. The problem is that although this is still fast, 33 tokens a second, as we'll see in just a moment, the response we're getting here is still not correct. It's only when we go to the 4-bit quantized model, and recall this is the one the 4090 could no longer run in real time at four tokens a second, that things change: because we have the extra RAM, we're able to run it at 24 tokens a second, and this response is much, much better than what we were getting before.

Just to quickly put a visual to that, here is that 2-bit quantized model; we'll go ahead and paste its output in. The problem, again, is that although it was fast, it was generating code that had errors in it, and in fact, if I try to submit this form, it even goes to a page that doesn't exist, so this is not what I would consider usable. If, however, we compare this to the first shot we got from the 4-bit quantized model, the results are, I would argue, pretty much exactly what we asked for. I asked for a form with a single field called name that validates, and that's exactly what this does: it validates, and when I go to submit it, it prints out a message as we asked it to. That's an excellent response right there.

Just to show the power of this large model: now that we got our initial query correct, we can start adding additional prompts to it. In this case, I go one step further and say, well, now that we have this form, let's go ahead and make it look nice, basically add some CSS to it and change where the response message is printed. When we take a look at this version, you can see that it pretty much nails this as well: it adds a nice full-page responsive style, it still validates correctly, and when I submit it, it puts the response message in the correct location. Again, another excellent result. I tried one more additional prompt, basically saying, hey, let's add a field called email to this form. That's a pretty vague prompt, but look what it actually does: it properly tests the email field as an email address. I didn't say, hey, validate this as an email; I just said give me an email field, and it understood that and correctly validates it. If I try to submit without one, it complains; if I put in an invalid email address, it complains as well. So again, a really excellent result, and it even updated our response message to appropriately reflect the new field.

The bottom line for me is this: for what I use LLMs for, the 4090 is a bit of a compromise card. Currently, I simply can't run the size of models I need for the quality of results I expect. Of course, this means that looking forward to the 5000 series, it's going to be really interesting to see how the RAM story evolves.

Two quick points to close out the section. First, this is a rapidly evolving space. Case in point: Apple released a paper a few weeks back detailing significant memory and performance optimizations for LLMs on low-memory devices. If this tech makes its way to the PC, it should allow those performance-crushing models to run much better on these cards. Speaking of which, on the raw performance side, just like Stable Diffusion's TensorRT-specific code, Nvidia recently released a library aimed at accelerating LLMs called TensorRT-LLM. I poked around, and while I wasn't able to get a hard number on tokens per second, from my reading of the documentation we should expect it to nearly double performance over everything you saw here today.

So hopefully this gives you a bit of a feel for what this card is like for 3D and AI-related tasks. I think it's a clear winner on 3D, but I find the AI story a little more complex and nuanced.
Info
Channel: Matthew Grdinic
Views: 1,411
Id: SMF0HjgpuhU
Length: 16min 54sec (1014 seconds)
Published: Fri Jan 12 2024