MI210s vs A100 -- Is ROCm Finally Viable in 2023? Tested on the Supermicro AS-2114GT-DNR

Video Statistics and Information

Captions
[Music] I Can't Believe It's Not CUDA [Music]

Let me set the stage for you. One of the largest, smartest tech companies in the world has an absolutely insurmountable lead in artificial intelligence. I mean, they've essentially got an unlimited budget and a license to print all the money they could ever want. Even if that's not enough, their products are good, and you almost universally can't actually buy them. Even before their last-generation products enter availability, they're already hyping the next-generation products, and the foundries making their silicon are already going full tilt. What hope do competitors have against that? It's a David-versus-Goliath situation, right? We need a small thermal exhaust port, two meters wide, to free ourselves from the death grip on the AI ecosystem. Right? Who do you think I'm talking about? Nvidia? No — Google. Hello? Remember TensorFlow?

Okay, TensorFlow is still a thing. It's still a big thing, and a lot of the math and libraries are still a big thing — but I mean, more specifically, TensorFlow-specific hardware. Although Google's TensorFlow YouTube channel is looking pretty dead these days. It's not looking great — I mean, there's plenty of interest from people Googling TensorFlow, I guess. In the beginning you could watch some old videos from Google on TensorFlow and TPUs, and their attitude was basically: yeah, you can run this on GPUs, but really you want to run it in the cloud on a TPU, because it's just better in every way. A lot of Googlers seem to have basically written off GPUs for everything except teaching yourself about the wonders of TensorFlow and tensor processing units, and you can see that in a lot of their coverage from 2015 through 2018.
I mean, by 2016 Google was already demonstrating third-generation water-cooled TPU hardware pushing something like 100x the speed efficiency of the contemporary GPUs of the time. And yet here we are in 2023 — look around. No one noticed that the fifth generation of TPU, which was actually due in June of 2023, came and went with barely a peep from Google. Where's my 5th-gen TPU, Google? Where is it? Wow, I need it. 5th gen was something Google was still pretty excited about even as recently as April 2023, and for what Google uses them for, a v4 TPU was between one and four times faster at half to two-thirds the power of Nvidia's A100 — so Google had expected their v5 to compare favorably against Nvidia's just-launched H100.

Now, Google was able to productize their tensor processing units. I've got one here on this tiny M.2: this is a Coral.ai accelerator that can do people and object finding in security camera footage, for example. It is powerful beyond my expectations — this tiny little M.2 uses less than 15 watts and it can analyze a scary number of 4K camera video streams at once. It really is breathtaking, and you can do this yourself. It's totally democratized, no cloud. You know what part was not breathtaking? The wait: it took over a year from the time I ordered it to the time it made it here. Google had nearly a decade's lead in doing things with AI and machine learning, but from a business standpoint, all the stuff that fits under the AI umbrella is what has catapulted Nvidia to a trillion-dollar company. So how did Nvidia win and Google lose with AI?

The future hasn't actually been written yet, and the future can go in non-obvious ways — that's my point. Don't believe me? That's exactly what Clément Delangue was saying to a room full of us at the AMD data center event. He said he doesn't know what the future is going to be like: maybe AMD's VRAM is going to be an advantage, maybe AMD's hardware process is going to be an advantage, maybe Nvidia — he doesn't know what the future is going to be. I don't know what the future is going to be either; it's probably more bizarre than anything we can imagine. But I can tell you that even Nvidia knows that H100s going for over $40,000 each on eBay is ultimately no good for Nvidia. That kind of thing just speeds the adoption of something that's not forty thousand dollars for that level of performance.

Nvidia is on top now because they put together a far more developer-friendly ecosystem, and the accessibility was way higher versus TensorFlow in the early days of AI. Nvidia focused on accessibility, ease of use, and taking care of students and educators, and it really paid off. CUDA was nowhere near as open as TensorFlow and still isn't, and yet Nvidia is three-quarters of a trillion dollars up on their market cap from back then. Now, just a few short years later, AMD's ROCm is in a similar position to where Nvidia was with TensorFlow. ROCm, however, is more open even than TensorFlow, which ultimately benefits everybody, and AMD's hardware — their hardware process — is ahead of the curve, at least in my opinion, and I'm not the only one who thinks so. Today, in 2023, AMD is already in a much better position than Nvidia was when the first- and second-generation TPUs from Google launched and everyone was super excited for AI. AMD is more competitive than most people realize, and people are only just starting to wake up to that possibility. I mean, I really wouldn't bet against Lisa Su, if it wasn't already obvious from all the Top500 supercomputers based on AMD products that AMD counts as wins.

Come with me into the trenches and let me show you the state of ROCm in 2023. I literally cannot believe how far ROCm has come since even last year — it's kind of mind-blowing. Here's our Supermicro AS-2114GT-DNR: it's two servers, each with three Instinct MI210 GPUs, and an MI210 is about half of an MI250. Check out our other videos on this system and the teardown that we did with Gamers Nexus — it really is super interesting. For this system I've already got everything up and running with ROCm 5.5 and PyTorch (this is based on the nightly), and for our AI demo we're going to be using AUTOMATIC1111. Now, if you're not in the know: AUTOMATIC1111 not only doesn't support AMD GPUs at all, they're just a wee bit hostile on the GitHub issue tracker if you come asking for AMD GPU support — they don't have enough bandwidth to support all the implementations, basically anything that's not CUDA. Okay, I mean, I guess that's fair. But it actually works fine anyway. Technically, AUTOMATIC1111 is just a web GUI front end for Stable Diffusion that doesn't even really support multiple GPUs, although you can kind of hack that in by running instances in parallel on two different GPUs.

I've downloaded a model from Hugging Face and I'm running it with this prompt. We're going to generate 32 images — 16 on each of two MI210s on our AMD system — and then a single instance running everything on our comparison system, by way of a direct apples-to-apples comparison of the output. The comparison system is really a similar config: we're running a single Nvidia A100 GPU, which has 80 gigabytes of VRAM and is running at the full 300 watts. To be sure, the A100 is a higher class of GPU than the MI210, but this Supermicro system with its MI210s is the only Instinct hardware I have available, and I only enabled two of the MI210s for this demo. I mean, the A100 is faster and more power-efficient, but more than a few people have noticed just how much the gap has narrowed between these cards since the Instincts launched in 2021, and this is basically a direct like-for-like demonstration — with Danny DeVito, of course. Now, I've downloaded a model to both systems that I've called "I can't believe it's not photography.safetensors", and the setup was literally just copying the file from one system to the other.

So: 32 images in total. We've got 16 images being generated on each of two MI210s on the left system, and on the right system our single A100. The A100 is 80 gigabytes; each of our MI210s has 64 gigabytes of VRAM. It's not apples to apples, but this is literally the only Instinct system I have. I mean, two MI210s are very close to an MI250 — the MI250 is a dual-chip design and the MI210 is a single chip, but the MI210s have a higher power budget. The power budget on the MI250 is lower, so the MI250 is much more efficient; it just depends on which thing you want to buy, which thing you want to optimize for. And to be sure, the A100 is a higher class of GPU — this isn't flagship versus flagship, it's not exactly like for like, but that's not the point of this video. The point of this video is: is it viable, and how far has it come in exactly eight months? Because we looked at things eight months ago, and now, looking at this on our Supermicro system, things have come a long way.

We'll run the prompt on both systems at roughly the same time: "Danny DeVito explaining how his car ended up in the ditch while eating cereal." [Music] [Applause] Now, back here at our terminal windows, I've got nvtop going, and I'm going to ramble a little bit. Our MI210 system is going to take about twice as much time — remember, it's two instances doing 16 images each versus a single instance doing 32 images. The time will come down; it'll end up being about a minute and 20 seconds for the MI210s versus about 48 seconds for the A100, give or take. We'd need a longer run time, and it depends on the prompt and some other stuff. We can see we're using about 300 watts — 279, 280 watts, something like that — over here, and for the A100 (I said H100, sorry) we're using 300 watts, sometimes a little over; the highest I've ever seen was I think 340 watts, give or take, but mostly it's around 300 watts. This is pretty good. It's updating every second, and it's not really using a lot of video memory or anything like that. This is nvtop on this side and amdgpu_top on this side, and we can see that the tooling on the AMD GPU side of things is pretty mature. It gives us a lot of telemetry into what's going on inside the GPU; if you're a programmer, everything you need to do any kind of troubleshooting or figuring things out is already right here.

And with the work complete — now, I used the same seed and that sort of thing, so you'll get the same result. You know, this AI stuff is not as random as it seems; it's actually quite deterministic, meaning you'll get roughly the same result. Ah, but if you dig into it and look closely, it's not quite exactly, perfectly the same. The second image in our set is the easiest place to begin to notice differences between the two systems. Take a closer look at the spoon and the buttons on the shirt. There are subtle buttons there — I don't know if the YouTube compression algorithm is going to destroy it, but there are black buttons on the shirt from AMD and silver buttons from Nvidia. What's the difference? They should be exactly the same. There must be a bug, right? Ah, AMD's not ready for prime time? No — actually, in fact, when you look at the specifications for GPUs and they talk about floating-point 64 performance, for example, it's like: oh, you get so many trillion floating-point operations per second with this GPU — unless you're doing AI "TOPS" operations, and then it's insanely, way faster. Well, the "insanely way faster" part is that it's doing a kind of compression that's lossy: your tensor cores are not working with a full floating-point implementation. And most of the time, it's fine — these images are substantially the same.

AMD is not taking the same shortcuts with their implementation — at least not yet, or at least they haven't reached that level of optimization, depending on how you want to spin it. You could say that Nvidia's software solution is a lot more mature, and because of that you get radically better performance. But you could also say, on the AMD side, that what's happening under the hood is more accurate: AMD is not as sure of its solution yet — it's not as tried and true as Nvidia's — and so they're not willing to make those kinds of optimizations, because it would make future debugging a little more difficult. However you want to characterize it, however you want to spin it, the math that's happening under the hood on the AMD side is more accurate, and you get little subtle differences in the images as a result: the spoon looks more like a spoon, the buttons look more like buttons — very, very subtle details. Same with this one: we've got four buttons and maybe another one peeking out, and over here we've lost a button, and they're not in exactly the same spots. There are other little subtle differences in every single one of these images — again, around the hand and some of the background. Uh-huh, and this image shows some large differences: we ended up with a bowl of cereal, I guess, in this one, whereas there's a car and something else going on over here, I'm not really sure. The car in the background is also a little different — little details in the headlights, little details with the car and the background, and what's in the hand. Again, the buttons are a little different between the two systems, and not just that: even the cloth texture of the shirt ends up being just a little different — a little more open here, a little more buttoned-down from Team Green.

Interesting, subtle differences, all as a result of the loss of just a couple of the most insignificant bits you could ever think of. The angle of the spoon in this image, and other little subtle details, end up being just a little different between the two platforms. And this is not a negative — this is just a difference in how the accuracy is carried forward. Certainly the AMD dev teams are aware of this and are working on it — you know, "implement CUDA inaccuracy" is probably a memo on somebody's desk, just so you don't have data scientists freaking out because there are subtle differences between this system and another system they're using. But the fact that AMD's performance is as good as it is without taking those kinds of shortcuts really should tell you what you're dealing with under the hood here in terms of hardware. The hardware is second to none; the hardware engineering is second to none. We've seen that with the AMD data center event that just happened in San Francisco and with what AMD is promising for the upcoming MI300 — it's actually an MI300A and an MI300X, an APU and a GPU. But before we talk about that, we need to talk a little more about the performance side of things. The performance of AUTOMATIC1111 shouldn't be something you hinge your whole thought process on. It is the best worst example of something for AMD — and even as the best worst example, it's still pretty darn good. The fact that this is up and running on PyTorch, with very, very small changes to AUTOMATIC1111, and producing a basically identical result, really is truly breathtaking, especially considering where we were just a few months ago. And if you look at the other benchmarks in the MLPerf benchmark suite, like ResNet-50...
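Those divergent buttons and spoons trace back to a handful of low-order mantissa bits. Nvidia's tensor cores in TF32 mode, for example, carry only 10 mantissa bits instead of FP32's 23. As a rough illustration — a toy sketch in plain Python, not the actual tensor-core datapath — truncating mantissas that way is enough to make a dot product drift:

```python
import struct

def truncate_mantissa(x: float, keep_bits: int) -> float:
    """Round x to float32, then zero out its low (23 - keep_bits) mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits & mask))[0]

# Two small "activation" and "weight" vectors
a = [0.1 * i + 0.37 for i in range(64)]
w = [0.01 * i - 0.2 for i in range(64)]

dot_fp32 = sum(x * y for x, y in zip(a, w))
# TF32-style: only 10 mantissa bits survive in each operand
dot_tf32 = sum(truncate_mantissa(x, 10) * truncate_mantissa(y, 10)
               for x, y in zip(a, w))

drift = abs(dot_fp32 - dot_tf32)
print(dot_fp32, dot_tf32, drift)  # small, but not zero
```

Millions of such tiny drifts accumulate across the matrix multiplies in every diffusion step, which is why pixel-identical output across two different accelerators is not a reasonable expectation even with the same seed.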
AMD was already basically there, at parity. Remember their launch slide — "delivering performance records in high performance compute"? The MI250 — the OAM-module version, basically — was already outperforming an A100, and AMD wasn't just making that up: that's the raw math, the raw compute performance. For my own tests I ran BERT, another one of the MLPerf benchmarks — this is language manipulation, basically. Running on a single A100, it's about 2,100 examples per second between steps 100 and 200. For this dual-MI210 system, I can achieve about 1,700 to 1,800 examples per second between steps 100 and 200. That's something like 85% of the performance of the A100 in a similar-ish footprint. Again: viability, and huge strides in software. That's multiple benchmarks in the MLPerf suite where AMD is within 15% of Nvidia, and that's pretty darn good. It really shows that AMD's software ecosystem is finally becoming mature. I wouldn't expect these kinds of software gains again for the MI300-series Instinct over the lifetime of that product; I take it more as a sign that AMD is leveraging what they're learning from their supercompute customers across their whole software stack — everyone benefits, basically. Just in February there were headlines suggesting we should worry about AMD's AI performance — is it time for AMD to throw in the towel? And here we are in July, and the headlines are saying something completely different: oh, we might have overlooked AMD for AI and machine learning performance — because the AI and machine learning companies are starting to use these libraries. I mean, look: the AI stuff is just math. There's hardware to do the math, and the hardware to do the math has already been here — it's been here the whole time. The software path — at least for the non-PhDs working somewhere other than Oak Ridge — okay, maybe that part was lacking a little bit, but the hardware was here. And just like what MosaicML was saying, I'm seeing basically 75 percent of the performance of an A100 — and gen-on-gen, the MI200 series and the A100 are roughly the same generation. I mean, that's pretty good.

Oh, and speaking of AMD hesitancy, there's also this recent kerfuffle around George Hotz. He's a very famous software engineer, a brilliant mind, and he's been working in AI for almost 10 years. He founded a new company, the tiny corp, which is working on making AMD gaming GPUs work for AI — because they're so cheap, because AMD is already more open than anybody else, and because AMD's hardware performance is already there. They've already got the raw compute lead they need; they just need the software to connect it up, and he wants to do that with the tiny corp. But Nvidia has done such a good job with their software and hardware ecosystem that his expectation was that the gaming and compute-focused products from every company are largely the same — and that's not really true with AMD. AMD has RDNA for gaming and CDNA for compute, and someday they might merge back together, but that day is not today. I mean, the Instinct cards I'm showing you in this demo today — that's CDNA, compute DNA. They're what's in the supercomputers; they're already viable for supercomputers; they're the building block for what AMD is doing, and they're viable for more and more stuff with each passing day. The context of the exchange between George Hotz and Lisa Su was that, for the future George sees, he wants RDNA3 to be where CDNA2 is today in terms of the software stack. And yeah, that'll be good — maybe those products will merge together in another generation or two or three, and that might be great. Meanwhile, a lot of people have remarked that it seems like Nvidia has forgotten gamers, because they're able to sell their stuff somewhere else. But one thing I don't think Jensen realizes is that any success Lisa Su and AMD have doesn't really take anything away from Nvidia. The aforementioned forty-thousand-dollar H100 is not good for anybody.

The future hasn't been written, but I think AMD is perhaps best positioned — more so, probably, than any other company — to incorporate and productize hardware and software breakthroughs, not just in AI but anywhere. And the proof I have of that is the upcoming MI300. It can be the fastest compute GPU the world's ever seen — maybe, maybe not — or they can swap in some chiplets and have an enterprise-class APU: a single package with HBM2e, x86 cores, and GPU cores all together. I mean, AMD makes that look effortless. At Computex, Nvidia stole the show with Grace Hopper: a CPU and a GPU — they're going to bring them together and it's going to be something amazing, finally CPUs and GPUs together, like nuts and gum. But AMD is over here with their MI300A and MI300X: one is a pure GPU, as we've seen, and the other is mixed x86 and GPU chiplets with HBM2e on the same package. Isn't it obvious that that's the more advanced hardware? I mean, even Intel's Ponte Vecchio, with its tiles and packaging and everything else, is more advanced than what Nvidia is showing off. Am I the only one who notices? I mean, yay, Nvidia has Arm — that's good, that's going to help them with their product ecosystem — but the software is still closed. And how much of an advantage is closed software, really, today, when you see all this momentum building behind literally anything but that?

And just to give you another idea of how much the future hasn't been written yet: check out this submission from Neural Magic on MLCommons for that MLPerf benchmark. We've demonstrated some real, serious gains in the component benchmarks that make up MLPerf for ROCm and AMD GPUs, but this is a result running purely on CPUs. This is CPU-based inferencing: you take a neural net and you "sparsify" the matrix — that's their word. You really should check out Neural Magic. It's possible to get GPU-like performance from multi-core CPUs for neural networks using this approach. This might be the future — like I said, the future hasn't been written yet. But I also wouldn't bet against Lisa Su, because she's got all the stuff: her CPUs, AMD's packaging — it's all more advanced. And no matter what happens in the future, AMD is certainly going to be responsible for leveling the playing field by playing fair, no matter what actually comes about — that I can promise you. For AI and machine learning, and even hybrid hardware approaches, AMD has been there, done that. They've got far better capitalization and far better recognition than they used to; they're just killing it. Meanwhile, nobody even noticed that Google's v5 tensor processing unit is MIA — and you really won't believe it's not CUDA, although it really did take quite a few iterations of ROCm to get there. So congratulations to the AMD team on launching ROCm 5.6 — but, uh, don't let off the gas.

I'm Wendell. This is Level One. I'm signing out. You can find me in the Level1 forums — if you've got any experiments you want me to run, let me know, because this hardware is amazing. Thanks to Supermicro for letting me borrow this system. You should definitely check out our benchmarks and other coverage, and maybe pick one up for your lab, if for no other reason than to experiment — because, oh boy, it's fast. [Music]
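The "sparsify the matrix" idea behind that Neural Magic CPU result can be sketched in a few lines. This is a toy magnitude-pruning demo, not Neural Magic's actual engine (and in practice a pruned network is fine-tuned to recover accuracy); the point is just that once 90% of the weights are zeroed, a sparse kernel can skip roughly 90% of the multiply-adds:

```python
import random

random.seed(0)

# Toy "layer": a 100x100 weight matrix, as dense lists
n = 100
w = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

# Magnitude pruning: zero out the 90% smallest-magnitude weights
flat = sorted(abs(v) for row in w for v in row)
threshold = flat[int(0.9 * len(flat))]
w_sparse = [[v if abs(v) >= threshold else 0.0 for v in row] for row in w]

# Work comparison: a dense matvec does n*n multiply-adds;
# a sparse kernel only touches the surviving nonzeros
mults_dense = n * n
mults_sparse = sum(1 for row in w_sparse for v in row if v != 0.0)
print(mults_dense, mults_sparse)  # the sparse version does ~10% of the work
```

Real sparse inference engines pair this pruning with cache-friendly execution, so the skipped work turns into actual CPU speedup rather than just fewer theoretical FLOPs.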
Info
Channel: Level1Techs
Views: 43,658
Keywords: technology, science, design, ux, computers, hardware, software, programming, level1, l1, level one
Id: IhlL1_z8mCE
Length: 23min 30sec (1410 seconds)
Published: Mon Jul 17 2023