High Performance Computing - HPC - and GPU Intro - GPU Computing Tutorial Step 1

Video Statistics and Information

Captions
Hi everybody, welcome to this introduction to high-performance computing on the graphical processing unit. I will be reviewing the main concepts and aspects of high-performance computing; these underlie the techniques and patterns that are used for speeding up your calculations. In particular, we're going to focus on how the graphical processing unit, the GPU, can be used to accelerate your program. At the end of this video you should be able to broadly understand what goes into optimizing code and how the GPU compares to the CPU in that respect.

As the name suggests, the graphical processing unit is historically primarily used for generating and rendering images to your screen, and the majority of customers for those units have traditionally been gamers. So why do we want to use it for calculating stuff? Don't we have the central processing unit for that? Well, in essence the GPU does the same things that the central processing unit can do: it can take two numbers and add or multiply them, it can subtract one from another and perform divisions. Every other operation on the computer is some combination of those elementary instructions. The GPU, however, has many more cores than the CPU has, and the major difference between them is that on the GPU groups of cores perform the exact same sequence of operations at the exact same time, and each core within that group performs those instructions on a different set of data (a minimal kernel sketch below illustrates this).

Before we dive further into the hardware aspects of performance, let's see how we arrive there from the top down, because performance doesn't start with hardware architecture. The first thing to take into account is what the user and the workflow need in order to run fast or reduce time. If you are calculating and showing results in real time, that poses different constraints than when a single image is produced: you will need to continuously update the result and render it to whatever your user is looking at. Also think about whether large parts of the algorithm need to be recalculated, or whether it's possible to read standard solutions from a database. The second level is the most fundamental one: the problem that you're solving. This involves listing the quantities that describe the system and writing down how they're related to the result. The potential performance of the program depends heavily on how much detail you're trying to capture, which approximations you make, and the calculational scheme that you use. Finally, once that is settled, it will be clear which operations need to be performed, and you can start to think about how to complete these operations in the least amount of time on a given hardware architecture.

The first question you might have is: why does a core perform fewer instructions than it is capable of? To understand that, we need to look at how a processor works. For that I have two diagrams on this slide that include the parts of the CPU and GPU architecture that you need to know about. Starting with the cores, the blue boxes, which are called arithmetic logic units or ALUs: these take two numbers which are stored in the register file. Their values are in turn loaded from the cache memory; if a variable isn't available in cache, it needs to be retrieved from the main memory, or the global memory on the GPU. Each layer of memory (register, cache and main memory) takes longer to load from or to write to. As long as the numbers that you need are not present in the register file, your core will simply be waiting for input.
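To make the "same instructions, different data" idea above concrete, here is a minimal CUDA sketch (not from the video, just an illustration of the concept): every thread runs the same kernel, and its thread index selects which array element it works on.

// Minimal CUDA example: all threads execute the same instructions on different data.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index per thread
    if (i < n) c[i] = a[i] + b[i];                  // same operation, different element
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;   // enough blocks to cover all elements
    add<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);          // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}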
The CPU is designed to reduce waiting times as much as possible; this is referred to as a low-latency design. The GPU architecture relies on what's called high throughput: for efficient operation it requires a new instruction to be available to perform while it is waiting for memory, or while it is waiting for another operation to finish. Keep in mind that this implies the use of many threads per core, even though there are many more cores on the GPU than are available on the CPU.

Now that we've covered the basic architectures, let's look at what we can do to optimize our programs. There are two ways to do so: by reducing the number of operations and by improving the efficiency of your code. The efficiency is the number of operations your program actually performs divided by the number of operations your device is capable of performing in the time that your program runs. So I could count the number of instructions that my program executes and measure the time that it takes to finish. Suppose I do that using a GTX 1080 and it amounts to roughly a thousand giga floating-point operations per second, or gigaflops. Now let that number sink in for a moment: that's a 1 with 12 zeros of instructions every second, which is around 35 times faster than a four-core CPU could ever deliver. But what's the efficiency of this performance on this GPU? The 1080 has 2560 cores and a base frequency of 1.6 gigahertz. We multiply that by 2, since a core can finish a pair of instructions (a multiplication and an addition) every clock cycle, and we get a peak performance of 8192 gigaflops. That evaluates to about 12 percent efficiency, and this number suggests we could do much better. Can we speed up our calculation by that factor? Well, that depends on the type of calculation that you are performing. To find out, we need to take a better look at the mechanics of the GPU and classify our problem appropriately. Understanding what the fundamental limitations of the GPU are and how well an algorithm can perform will be the topic of the next video. For now, just take into account that for each algorithm there is a performance plateau, a maximum efficiency anybody could ever wish to achieve.

Now, having said that, let's look at the actual ways in which the program execution is optimized when we have several threads running on the CPU. After reducing the number of instructions as much as possible, you can streamline the memory retrieval. But loading memory isn't the only thing that takes time: loading the instructions to be performed can slow down your program as well, so try to prevent the program from having to load new instructions by optimizing the control flow. As a general note: just don't put if statements inside your loops. Each core on most processors today can actually perform typically up to four identical instructions with each clock cycle; this is called vectorization, and it provides a free performance boost to some programs out there. Finally, you can write your code in a way that the core is provided with a new instruction while it is still working on the previous instruction. If, on the other hand, the next instruction depends on the result of the previous one, then the core will have to wait until that operation is completed.

On the GPU side, we need to make sure that each core has enough work to do. If you have 2,560 cores and only a hundred threads, most of the GPU is just running idle, not doing anything. Then it makes sense to break down each thread into smaller tasks and assign new threads to these tasks, as in the sketch below.
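As an illustration of breaking work into enough threads to keep every core busy, here is a sketch of the common CUDA grid-stride pattern; the sizes and launch numbers are arbitrary assumptions, not figures from the video.

// Grid-stride loop: launch far more threads than there are cores so each core
// always has work queued, and let each thread handle several elements.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, float s, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= s;                        // independent work per element
}

int main() {
    const int n = 1 << 22;
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    scale<<<2048, 256>>>(x, 3.0f, n);     // ~half a million threads share n elements
    cudaDeviceSynchronize();

    printf("x[n-1] = %f\n", x[n - 1]);    // expect 3.0
    cudaFree(x);
    return 0;
}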
Sometimes the GPU is internally waiting for input to arrive from the CPU, so if we can minimize that communication, that could reduce running times drastically. Another way to achieve that is to overlap the memory transfer to the GPU with another calculation on the GPU. Once the basic setup is in place, the one thing that will bring you closer to maximum performance is figuring out how to optimally use the memory at each level. To squeeze out the last factor of two to four, you will still need to make sure that each thread can run smoothly by itself.

Okay, in general terms these are the ways to tweak performance and get the most out of your hardware, so let's see how we can find out which of those areas need attention. There's only one way to get there, and that's by measuring the running times. Suppose we have two functions running on the CPU, one after the other; they are called F and G. G takes much longer on the CPU, so we decide to implement it on the GPU: the much smaller blue box on the right. The diagrams that are plotted are called timelines, and they help us visualize how long each component takes and what that means for our overall performance. In addition to F and G, we need to transfer the input to the GPU and the result back to the CPU; this communication is indicated by the orange boxes. Let's compute the speed-up: it's the ratio of the original CPU computation time (simply the sum of the F and G running times) to the new computation time of a single iteration. In this example, function G has been sped up by a certain factor, let's say a large n. However well we do at speeding up G, our GPU implementation of the full iteration will still take at least as long as the running time of F on the CPU, since we haven't transferred that computation to the GPU yet. Without doing that, our overall speed-up can never exceed the inverse of F's portion of the running time. This means that if F took as little as 1% of the time, the maximum speed-up we could get is only a factor of a hundred, even if we had thousands of cores available to us. This principle is known as Amdahl's law, and you're bound to run into it at some point.

Now let's fill in some numbers and compute the actual speed-up in this example. F takes one millisecond on both implementations. G is sped up by a factor of 10 and only lasts half a millisecond, compared to 5 milliseconds on the CPU, but the communication (the data transfers) doubles the effective running time of G, so our effective n equals 5. We end up with a speed-up of 3, as our timeline suggests, since we needed to plot 3 GPU iterations to fill up the space of one CPU iteration. What does the timeline tell us to do next? First, implement F on the GPU so that we can unleash the power available there. Then try to eliminate or reduce the data transfers, and after that it makes sense to worry about the poor performance of function G on the GPU, since the factor of 10 doesn't seem to cut it at all.
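The numbers above can be checked against Amdahl's law (also listed in this video's keywords); the following worked version is my own sketch, assuming F stays on the CPU while a fraction p of the runtime is sped up by a factor n.

% Amdahl's law: overall speed-up when a fraction p of the run time is sped up by n
\[
  S = \frac{1}{(1 - p) + \dfrac{p}{n}}
\]
% In the example: F = 1 ms and G = 5 ms on the CPU, so p = 5/6.
% With the transfers included, G effectively runs 5x faster (n = 5):
\[
  S = \frac{1}{\tfrac{1}{6} + \tfrac{5/6}{5}} = \frac{1}{\tfrac{1}{6} + \tfrac{1}{6}} = 3
\]
% As n goes to infinity the speed-up is bounded by 1/(1-p) = 6, in line with the
% advice to move F to the GPU as well before optimizing anything else.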
Just because a certain performance is possible on the GPU in principle doesn't mean it's an easy goal. Let's assume, for argument's sake, that the maximum speed-up that can be achieved for our program is a factor of 700 compared to a single thread, indicated by the dashed line. This slide roughly shows how much effort and skill are required to achieve a certain result. Just the technical challenge of installing the drivers, setting up the tools, performing the tasks and making your program function on the GPU needs some perseverance, even without any performance considerations whatsoever, though it is definitely a doable task for most people who have some programming experience. The speed-up of such a program will be quite horrible and might even slow down your program, but after reducing the communication and implementing most parts on the GPU, a factor of 10 to 30 can be achieved with relative ease. At that point we've applied the concepts in this episode and arrived at the second level, but there's still a long way to go. To get to the third proficiency level, you need to be able to take memory that has been read from the GPU's global memory and reuse it among the cores; this is done by addressing cache-level memory called shared memory (a minimal sketch follows at the end of this transcript). The fourth proficiency level and above require you to apply memory optimizations on each memory layer and to know about the intricacies of all memory and latency aspects. As I mentioned, we have now covered the fundamental concepts needed to reach proficiency level two; the next video, on memory bandwidth and latency, will explain the concepts involved with proficiency levels three and higher. If you want me to complete this series and take you through all these stages with hands-on examples, with a focus on scientific applications, you can help me to do so by liking, subscribing and sharing this video.

To end our discussion, let's summarize the pros and cons of using the GPU to speed up your calculations. Obviously it has immense compute potential, and it stands head and shoulders above the alternatives available at this moment, albeit mostly for single precision. As an academic, that might hold you back, as it initially did for me, but more often than not single precision is in fact within the validity of your approximations, and the possible gain of reducing 15 minutes to a matter of seconds is something to take into account. Double precision is possible on most cards, but the performance impact is a factor of 24 or 32, which is just terrible in my view. There's only one card available at the moment of this recording that has sufficient double-precision performance, the P100 by Nvidia, but the price is just astronomical. The good news is that with AMD, competition is on its way, and that might make GPU computing something you simply cannot ignore. Please like, share and subscribe, and I'll see you in the next video so that I can get you up to speed.
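For reference, here is a minimal sketch of the shared-memory idea mentioned under proficiency level three: each block copies its part of global memory into fast on-chip shared memory once and then reuses it among its threads. The block-wide sum reduction shown is a standard textbook example, not one taken from the video.

// Each block loads a tile of global memory into shared memory, then its threads
// cooperatively fold that tile into a single partial sum.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[256];                    // on-chip, shared within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // one global read per thread
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction in shared memory
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}

int main() {
    const int n = 1 << 10, block = 256, grid = (n + block - 1) / block;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, grid * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int b = 0; b < grid; ++b) total += out[b];
    printf("sum = %f\n", total);                   // expect 1024.0
    cudaFree(in); cudaFree(out);
    return 0;
}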
Info
Channel: Daniel Abel
Views: 7,461
Rating: 4.9666667 out of 5
Keywords: algorithms, hpc, gpu computing, cuda, optimization, Amdahl's law, high performance computing, gpu tutorial
Id: r4HLolhkhuI
Length: 15min 54sec (954 seconds)
Published: Fri Mar 24 2017