Build Your Own GPU Accelerated Supercomputer - NVIDIA Jetson Cluster

Video Statistics and Information

Captions
This episode is sponsored by Linode, the world's largest independent cloud computing provider. Stick around today and get $100 in credit on a new Linode account.

Hey there, my name is Gary Sims and this is Gary Explains, and today I want to look at how you can build a GPU-accelerated supercomputer cluster in your home. So if you want to find out more, please let me explain.

Hopefully you've had a chance to watch my Raspberry Pi supercomputer video, where I built a cluster of four Raspberry Pi boards showing you the basic architecture of how a much, much bigger supercomputer works. You can practice and understand that architecture using just four Raspberry Pis. The difference is that that cluster was just about CPU cores; there was no GPU involved. Today I want to look at how you can build a similar kind of supercomputer cluster using NVIDIA Jetson boards, which means you can also use GPU acceleration. In fact, most big supercomputers have a combination of CPU and GPU resources so that the work can be done as fast as possible.

When you write a computer program, the first, simplest ones are sequential programs: step one, load 5 into the value of A; step two, add 10 to the value of A; step three, and so on, step by step. Of course these programs can reach millions of lines of code, but they are executed sequentially. With the advent of multi-core processors, the concepts behind multi-threaded programming became much more important. They had existed for many years before that, but they became much more important. I have a whole video on multi-processing, multi-threading, and multitasking, and I'd really suggest you go and check that out, but basically with multi-threading you can try to do two things at once.

Now, the problem is that not everything can be done in this parallel way. For example, if you wanted to cook some fried potatoes, the potatoes need to be peeled and chopped, and then they need to be fried. You can't fry them while you are peeling and chopping them. But if you wanted some eggs alongside your potatoes, the eggs can be cooking while the potatoes are cooking, so two things can happen in parallel. The real secret to getting the most out of modern computers is not just to do things sequentially, but also to do them in parallel using multi-threading.

Of course, a normal desktop computer might have four cores, or eight, or maybe 16, but you're around there. A smartphone has eight cores; a Raspberry Pi maybe four. Even if you pick one of those really big servers (I've got a couple of videos on server chips with up to 80 cores), at some point you hit a limit on the number of cores. Supercomputers want to use thousands of cores, so the only way to do that is to join multiple computers together and get them to work together, and that's what I showed in that Raspberry Pi cluster video. Now we're going to do the same thing, but using a GPU.

GPUs are really interesting when it comes to parallel programming, because a GPU might have hundreds of cores, even thousands of cores. Even an NVIDIA Jetson Nano has 128 GPU cores, which is more than even some of the biggest server CPU chips you can get. We've already got 128 cores on a nice small board, $59 for the 2 GB version, and if you go right up to the big offerings from NVIDIA, you can get thousands of cores inside a GPU.

Here is how GPU programming works. Say I have 128 values and I want to take the square root of every value. In a normal program you might just take the square root of value one, then value two, and so on through the list. If you were doing it multi-threaded, you'd say: I've got four cores and 128 values, so let's split this into four lots of 32, and each thread works through its own share. You get four times the benefit, but each thread is still working sequentially. Because a GPU has so many cores, you can do all 128 in one go. So when you write a GPU program, what you're actually saying is: please perform this one operation, the same operation, on every single value in this block of memory.

Today I'm going to show you a program that takes the square root, then doubles it, then takes the square root again, then doubles it, and does that a few thousand times. That can show how the inaccuracies of taking a square root and doubling it can diverge, but really I'm doing it because I want a workload I can put on the GPU and then show that the GPU has done some work. Obviously there are lots of complicated things you can do: supercomputers and this kind of modelling get used for weather forecasting, molecule mapping, nuclear simulations, medicine, all kinds of things that happen in parallel. We're going to take square roots because we're simple folks here.

I'm going to be using four NVIDIA Jetson boards: two NVIDIA Jetson Nanos with 2 GB of RAM, one NVIDIA Jetson Nano with 4 GB of RAM, and an NVIDIA Jetson Xavier NX. I've done a review of all of those boards here on this channel. The great thing about NVIDIA's boards is that they're all compatible from a software point of view, so I can use all of them together in this supercomputer cluster with GPU acceleration. The software works the same on all of them; each board just brings more resources to the game and lets you increase performance by using a different board or by adding more boards.

This episode is sponsored by Linode, the largest independent cloud computing provider. Whether you're an experienced developer or just starting out, you can build on Linode: start from scratch and fully customize your server for any application, or use their one-click apps to deploy game servers, websites, personal VPNs, and much more. Whether you just need a basic website for your portfolio or a beefy GPU instance for AI, scientific computing, and computer graphics projects, Linode has the flexibility and the scalability to meet your needs. If you run into any trouble during setup, Linode comes with amazing 24/7 customer support by phone or ticket, along with hundreds of guides and tutorials to help you get started. Sign up today at linode.com/garyexplains and get $100 in credit on your new Linode account; the link is in the description.

The problem with supercomputer programming is that there is an overhead to distributing the load to the computers. Obviously a CPU and a GPU working locally, in their own memory with their own caches, can work very quickly. Sending a job over the network, having it processed, and getting the result back actually takes quite a long time in relative terms. So suppose the controller node says to node number one: please calculate the square root of 49.
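The even split described in this demo is simple arithmetic, and the per-node counts quoted later in the video (3,840,000 values on one node, 1,920,000 on two, 960,000 on four) all follow from dividing the total by the node count. Here is a minimal Python sketch of that split; the total matches the video, but the helper function name is mine, not from the actual program:

```python
# Sketch of how a controller's workload divides across cluster nodes.
# TOTAL matches the 3,840,000 values shown in the video; chunk_size is
# a hypothetical helper, not part of the program used in the demo.
TOTAL = 3_840_000

def chunk_size(total: int, nodes: int) -> int:
    """Number of values each node processes when the job is split evenly."""
    return total // nodes

for nodes in (1, 2, 4):
    print(f"{nodes} node(s): {chunk_size(TOTAL, nodes):,} values each")
# 1 node(s): 3,840,000 values each
# 2 node(s): 1,920,000 values each
# 4 node(s): 960,000 values each
```

With the Xavier given three slots, the cluster behaves like six machines, so each slot would get 640,000 values by the same arithmetic.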
The controller makes a network connection and sends the job over; the program on that node fires up, computes the square root of 49, and sends back the result; and the controller node says okay, thank you very much. That's a long time in computing terms. So the art of supercomputer programming is to make sure that when the load is sent out, the nodes are kept busy for much, much longer than the overhead of sending the data across and starting the work.

In this example we're going to start with thousands of numbers and say to one node: please use your GPU to work out, in parallel, the square root of all these numbers. Then, as we add more nodes, we'll split the numbers in half, send one half to one compute node and the other half to another, and they can run in parallel on their GPUs. We'll keep doing this until we have all four boards running, and we can see the difference in performance as we spread the load across our supercomputer cluster.

And remember, these calculations are actually happening on the GPU. The way it works is that the controller CPU says: send this off to node number one. On node number one, the CPU receives it (networking, memory allocation, all that stuff), then says to the GPU: hey GPU, go and take the square root of all these values in parallel, using your 128 cores in the case of the Jetson Nano, and give me the results. Then the CPU takes over and sends the results back to the controller node. Okay, enough talk. Let's go over to my GPU-accelerated supercomputer cluster and see this in action.

Okay, so here we are. I'm showing an overview of all the boards in my little cluster. Each one is showing its resources using a little program called jtop, as in Jetson top, which gives you an overview of everything that's going on. Very good program, highly recommended. In the top left-hand corner is the Jetson Xavier NX, this one in the top middle is the Jetson Nano with 4 GB, and the other two are the Jetson Nanos with 2 GB.

Now, I'll be using MPI, the Message Passing Interface. That's a simple library you can use to write a program that talks to many, many nodes in a cluster, and I used it when I did that Raspberry Pi program as well. First of all, I've got a thing called a cluster file, which tells MPI which nodes to talk to. We'll use cluster file 1 because it's just got one node in it. All of my nodes are at 51, 52, 53, and 54, so this is going to talk to one node in the cluster.

We're going to use the time command, because that lets us see how long something takes. mpiexec is the program for executing an MPI program, and we're going to tell it: please use cluster file 1. And what program do we want to run? simple mpi, a small program that talks to the other node and says: here's a GPU program for you to run, please run this using NVIDIA's CUDA.

So let's run this now. First of all, let's change all of these views to show GPU activity, because that's the important thing here. GPU activity across the board is pretty minimal at the moment. Let's fire this off, and you can see a couple of things straight away. The GPU is being used only on this one node; on the others, nothing is happening. Notice that we're dealing with 3,840,000 numbers, and if we quickly look at the CPU, we can see the CPU is hardly being used; it's just the GPU doing the work. This will finish after about 27, 28, 29 seconds, and there we have it: 27.1 seconds. You can see the GPU has now stopped being busy, and that activity disappears off the edge of the graph.

Okay, so that's how long it takes on one GPU: it connected to that board, said run this on the GPU please, and hardly used the CPU except for the communication part. Now, if we look at cluster file 2 (you'll be shocked: it's got two nodes in it, 53 and 54), we're going to run exactly the same program but say: please do this on two boards. Notice a few things. It says down here that it's running on two nodes, and it has reduced the amount of data it sends to each node, because we're splitting the job in half: 1,920,000 each. You can see the two GPUs are very busy on the two boards, and look at that: 14.8 seconds. Roughly half the time, which is what you'd expect, with half the data going to each node and the GPU being used on each one.

Next we can look at cluster file 4, which of course uses all of our boards: 51, 52, 53, and 54. Exactly the same thing again, and we see all four being used. Running on all four nodes, there are now only 960,000 values to be sent to each one. But look at the one in the top left-hand corner: it's finished already on the Xavier, because the Xavier has a much more powerful GPU than the Jetson Nanos. Overall it finished in eight seconds, so we've gone down from 27 seconds to eight seconds.

Because the Xavier is so powerful compared to the other three, there's another cluster file here, which I've called cluster file 6. What you can say here is: this one, 51, give it three slots, so pretend it's three machines. In total that's six machines, and because the Xavier has the much more powerful GPU, it can handle three times the amount of work. So again, same thing: run the program, but now with cluster file 6. Fire that off, and we can see the Xavier is a bit busier than it was before, with that extra work, but it still finishes earlier. Now we're down to 5.6 seconds, and all the other boards finish too. So we've gone from 27 seconds running on a single GPU on a Jetson Nano to 5.6 seconds running across four boards and using the extra power inside the Jetson Xavier.

Of course, you can imagine if you had four Jetson Xaviers, or eight nodes, or ten nodes, you could just keep scaling up, and there is a rule that tells you how far that scaling can go, called Amdahl's law. We can do a video about that if you're interested; please leave a comment below and I'll have a look at it. But look: we've gone from 27 seconds to 5.6. Imagine if it was 27 minutes for whatever calculation you were doing, down to five minutes, or 27 hours down to five hours. Obviously these are significant changes. Let's just run it again; I like seeing all those GPUs running, and notice again it's not really the CPUs being used, it's the GPUs. Look at that: 5.6 seconds. That's actually fantastic.

So there you have it: a GPU-accelerated supercomputer cluster using four Jetson boards. Of course you could do it with eight, or 16, or as many as you want, and the same topology, the same idea, could be used with PCs with much bigger GPU cards in them. Then, of course, you've got the differences in price, cooling, and power requirements, and that's why things like the Jetson Nano are great: it's $59 to buy one of the boards, you don't have to worry about cooling because it's all passive with the big heatsink, and you don't have to worry about using too much electricity. You're not going to annoy your neighbors with a huge server farm sitting in your living room, but you've actually learned the basics of GPU-accelerated supercomputing in the comfort of your own home.

Okay, that's it. My name is Gary Sims, this is Gary Explains, and I really hope you enjoyed this video. If you did, please give it a thumbs up, and if you like these kinds of videos, why not stick around by subscribing to the channel? Okay, that's it. I'll see you in the next one.
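The per-element workload described in the captions (repeatedly take the square root, then double) can be sketched on the CPU in a few lines of Python. This is a conceptual stand-in, not the actual CUDA program run in the video, which isn't shown; on the GPU, the same operation would run on every element of the array at once rather than in a loop over elements. Incidentally, the iteration x ← 2·√x has a fixed point at 4, so after a few thousand rounds every positive input settles very close to 4:

```python
import math

def kernel(x: float, iterations: int = 2000) -> float:
    """One element's workload: square root, then double, repeated
    a few thousand times, as described in the video."""
    for _ in range(iterations):
        x = 2.0 * math.sqrt(x)
    return x

# One value per CUDA core on a Jetson Nano's 128-core GPU; on the real
# cluster the array would be millions of values split across nodes.
values = [float(v) for v in range(1, 129)]
results = [kernel(v) for v in values]
print(min(results), max(results))  # both very close to the fixed point 4.0
```

A GPU version would express the loop body as a CUDA kernel launched over the whole array, which is the "same operation on every value in a block of memory" idea from earlier in the video.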
Info
Channel: Gary Explains
Views: 101,445
Keywords: Gary Explains, Tech, Explanation, Tutorial, Supercomputer, Cluster, GPU, Jetson Nano, NVIDIA, NVIDIA Jetson, Jetson Xavier NX, MPI, Message Passing Interface, Supercomputer programming, CUDA, Maxwell GPU, GPU cores
Id: 3IQuuX68-g8
Length: 15min 2sec (902 seconds)
Published: Mon Nov 02 2020