High-Performance Computing with Python: Numba and GPUs

Captions
So far we have only used the CPUs. Piz Daint gets most of its power from its GPUs, though, and we can use Python to program them. Numba does not just compile to CPU binary code; it can also generate machine code for GPUs. The easiest way I already mentioned earlier when we looked at numba.vectorize: we can take ufuncs like our escape_time, vectorize them, and give them a target. If you replace the target 'parallel' with the target 'cuda', the function will actually be performed on the GPU, and we will do some speed comparisons in the notebook. But you can do more: CUDA Python is one of the officially recognized dialects for CUDA, just like CUDA C++ or CUDA Fortran; it is officially recognized by NVIDIA as one of the dialects for programming their GPUs.

Before we go into how to use it, let me tell you a little bit about GPUs and how they are different. Who has used a GPU before? Who has programmed for a GPU before? Okay, then for most of you this is just repetition, but bear with us.

CPU versus GPU: a CPU is kind of like a motorcycle. You can get things done quickly, you have low latency to get from A to B, but that is only true if you have a small number of people to transport: one person, great; two people, okay. If you want to get 50 people from Lugano to Milano, you will have a problem with the motorcycle. A GPU, on the other hand, is more like a bus: it is slower and will take longer to get there, but if you fit 60 people in it, your overall throughput may very well be better with the bus.

CPUs are optimized for latency: a CPU tries to execute each command you give it as quickly as possible, and it does a lot of optimization for that; caches are part of that optimization. GPUs, on the other hand, are optimized for throughput. GPUs are made for graphics: if I am rendering an image for a game at 60 frames per second, nobody cares how quickly I can calculate a single pixel; I want to calculate all 2, 4, or 8 million pixels together, because updating individual pixels usually just creates a mess. So GPUs were designed for workloads that do a lot of the same work on a lot of data; originally they were designed for data-parallel workloads.

These different goals led to different architectures and a different usage of the die. CPUs have for the longest time tried to hide their internal parallelism: a CPU can do out-of-order execution to use its different arithmetic and logic units efficiently even if your program code does not provide the needed parallelism. It will speculatively execute part of an if statement even before it knows whether the condition is true; if it guessed wrong, it has to run that part again, but if it guessed right, it can skip that step. That gives you more parallelism internally, but it requires quite a bit of control logic. The caches, with their several levels running at almost the speed of the processor, also take up quite a bit of die space. On the GPU, on the other hand, we have a lot of arithmetic and logic units with very little control logic and cache. There is so little room for control units that all these arithmetic and logic units have to do the same thing: rather than every thread doing its own thing, a whole group of them has to execute the same instruction.
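As a minimal sketch of that vectorize-to-CUDA step — assuming an escape_time ufunc along the lines of the one used earlier in the course; the signature, grid, and iteration limit here are assumptions, not taken verbatim from the notebook:

    from numba import vectorize
    import numpy as np

    # A ufunc compiled for the GPU: same escape-time logic as before,
    # but with target='cuda' instead of target='parallel'.
    @vectorize(['int32(complex128, int32)'], target='cuda')
    def escape_time(c, maxiter):
        z = 0j
        for i in range(maxiter):
            z = z * z + c
            if (z.real * z.real + z.imag * z.imag) > 4.0:
                return i
        return maxiter

    # Evaluate the whole complex grid on the GPU in one call.
    x = np.linspace(-2.0, 1.0, 1024)
    y = np.linspace(-1.5, 1.5, 1024)
    c = x[np.newaxis, :] + 1j * y[:, np.newaxis]
    m = escape_time(c, 100)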
This is the same idea as using SIMD instructions on a CPU: the vector units within the CPU, which I said can process four double-precision numbers at the same time. In many respects these GPU units are vector units that can process 32 double-precision numbers at the same time. This different approach — on the CPU I want to execute a single instruction as quickly as possible, on the GPU I want to execute a lot of them and get throughput — also led to a different execution model. GPUs use very lightweight threads, and a lot of them. The CPUs here have 12 cores and 24 hardware threads, and if each of those cores started 24 threads it would already get messy for the CPU; the GPU does not mind at all — that would actually be far too few threads for a GPU.

On a GPU we hide latency rather than avoid it. What do I mean by that? When I have a task that needs some data, the GPU starts the data request, puts that task aside, and runs the next one. The next task starts its data request; at this point the first one probably still does not have its data, so we continue, sending more and more requests to the memory transfer engine, until the first task finally gets its data. Now it can work on that data, and by the time it is done, the second task has its data and can do its work. Once we have filled the pipeline with data-load commands, we reach the point where we can execute one task after the other. So we are not trying to get rid of the latency; we are trying to hide it by keeping a lot of threads in flight — thousands of them. "In flight" here means the GPU keeps all the registers and all the internal state of those threads available and can access them within one or two cycles, so thread switching on a GPU is extremely fast.

Due to their origin, GPUs work very well on grid data: if you can map your problem to a grid, it will usually also work well on a GPU. Modern GPUs, like the newest NVIDIA generations, can even handle data with rather random access patterns quite well, but that did not use to be the case: on the Pascal cards that are here it would not work so well, while on the Volta cards that also exist in these systems it works very well even with irregular data.

When we run on a GPU, we usually talk about kernels. Kernels are small routines that are executed by every thread — just as our MPI program is executed by every MPI task and we decide what to do based on the rank, the same is true for the threads and their kernels. So let us see what a kernel might be. Here we have our Mandelbrot set again; look at these two loops over i and j. Each (i, j) pair corresponds to a particular point in the complex plane, and for each of these points I calculate the escape time and assign it to m[i, j]. These two steps are independent for every iteration, so I could put them into a function and call that function — and that function would then be the kernel.
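For reference, a sketch of the CPU version with those two loops; the names and signatures are assumptions along the lines of the earlier notebook:

    from numba import jit

    @jit(nopython=True)
    def escape_time(c, maxiter):
        z = 0j
        for i in range(maxiter):
            z = z * z + c
            if (z.real * z.real + z.imag * z.imag) > 4.0:
                return i
        return maxiter

    @jit(nopython=True)
    def mandelbrot_cpu(xs, ys, maxiter, m):
        # The two loops mentioned above: every (i, j) pair is an independent
        # point in the complex plane, so each iteration could become one
        # GPU thread.
        for i in range(m.shape[0]):
            for j in range(m.shape[1]):
                c = xs[j] + 1j * ys[i]
                m[i, j] = escape_time(c, maxiter)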
To organize these threads — I said there can be thousands or even millions of them — CUDA uses an organizational scheme based on threads, blocks, and grids. We start with a single thread: each thread executes our kernel, and each of them has an ID. The first grouping is the thread block, and in many respects this is the relevant group. A thread block always runs on the same streaming multiprocessor — those are the "cores" within a GPU, not to be confused with what NVIDIA calls a CUDA core, which is really one of the vector lanes. The streaming multiprocessors are what is most closely comparable to a CPU core. A thread block is scheduled on one of these streaming multiprocessors and stays there; it is pinned to it. That means all the threads that live within a thread block can share a local, fast storage called shared memory, and they can easily be synchronized. It has in the meantime become possible to synchronize across thread-block boundaries, but the thread block is the easiest and fastest unit, and it is really the group you want to use for the parts of the algorithm that communicate — not just the ideally parallel parts, but those with actual interactions.

Since thread blocks have to be scheduled on a streaming multiprocessor, they also have limits: for example, a thread block can only have 1024 threads — that depends on the architecture and can vary from generation to generation. Thread blocks can be one-, two-, or three-dimensional: I could have a thread block of 1024 threads in one dimension, or a thread block of 16 by 16 by 4 in three dimensions, or 8 by 8, or whatever fits my problem. I can shape the thread block so that it fits my problem: for a matrix-matrix multiplication I would probably use a two-dimensional thread block, because then I can easily tile my matrices.

The next level up is the grid. Since thread blocks are limited in size, if I want millions of threads I have to organize them somehow, and that is where the grid comes in: we have a grid of thread blocks, and every thread block in the grid has the same dimensions. In the easiest case I have a one-dimensional thread block and a one-dimensional grid; I could also have a two-dimensional thread block and a two-dimensional grid, or a two-dimensional thread block and a one-dimensional grid. It really depends on how my problem maps to a grid; most of the time, in practical applications, I have seen people use the same dimensionality for the grid and the thread blocks.

So how do I find out, within this grid of threads, which thread I am? Each thread has a thread ID. This ID can be one-, two-, or three-dimensional, and in CUDA for Python I can get it through cuda.grid(): if I pass 3 I get all three dimensions, if I pass 2 I just get x and y, and if I pass 1 I just get x. This is actually slightly easier than in CUDA C++, where you have to calculate these global IDs yourself.

So what does the kernel look like in Python? I do `from numba import cuda` to get the CUDA features I need, and then I use cuda.jit: rather than @jit, I now use @cuda.jit as my compiler. Otherwise this is a regular function: it takes several arguments, I can write regular Python, I can get the shape, and here I assign i and j from cuda.grid(2) to get my position in the Mandelbrot set. Up here I am also defining something called a device function: device functions can be called from within a CUDA kernel, but they are not CUDA kernels themselves — they are functions that a CUDA kernel can call. You will see that in more detail in the notebook. So defining a kernel is not difficult; it is just like what we did before with our JIT compiler.
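A minimal sketch of such a kernel and its device function, reusing the escape-time logic from before; the names escape_time_gpu and mandelbrot_gpu are illustrative, not the notebook's exact names:

    from numba import cuda

    # A device function: callable from inside a kernel, not launchable itself.
    @cuda.jit(device=True)
    def escape_time_gpu(c, maxiter):
        z = 0j
        for i in range(maxiter):
            z = z * z + c
            if (z.real * z.real + z.imag * z.imag) > 4.0:
                return i
        return maxiter

    # The kernel: every thread computes one point of the Mandelbrot set.
    @cuda.jit
    def mandelbrot_gpu(xs, ys, maxiter, m):
        i, j = cuda.grid(2)                        # global 2D index of this thread
        if i < m.shape[0] and j < m.shape[1]:      # guard against extra threads
            c = xs[j] + 1j * ys[i]
            m[i, j] = escape_time_gpu(c, maxiter)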
So how do I actually call it? I have to somehow tell CUDA about all these thread blocks and grids. First I have to figure out what block and grid size I want to use. Here I am using an example with a 1024 by 1024 Mandelbrot set, and I just decided to use a 32 by 32 thread block. This is a rather explicit way of calculating the grid that I need: I take my total shape, divide it by the block size in x, and if it is not evenly divisible I take one extra block. Do you know this syntax, "something if condition else something else"? In C or C++ it would be the ternary operator; in Python it is written like this. This defines a grid that will hold at least as many threads as I need to cover all of m. For this case it is trivial, because I know it is simply a 32 by 32 grid, but if I wanted to do, say, a thousand by a thousand with 32 by 32 blocks, then these extra blocks come in.

The launch configuration is then passed in square brackets before the function call, so the call looks slightly different: I have my mandelbrot_gpu — that was just my function name — then I pass the grid and the block in square brackets, and then I pass the remaining function arguments. That's it: now you have a CUDA program.

Those of you used to writing CUDA C++ might wonder where the memory transfers went: I was not doing any explicit allocations, I was not using managed memory, and I was not transferring data. This is actually handled by Numba, and right now it is handled in a way that explicitly transfers the data back and forth: every array that you pass in is transferred to the GPU and also back. This can get expensive, especially in a case like this one where you do not really need to move the data in — I do not care what was in m before the kernel, I only care what is in it afterwards — so I might want to do explicit memory management.

I can transfer data explicitly with cuda.to_device. One of the typical patterns is that you allocate memory on the GPU and then transfer some CPU data to it; this command does it all in one go — it allocates the memory on the GPU, transfers the data, and gives back a handle to it. After we have worked on the data, we can copy what it contains back to the host. I can also use cuda.device_array, which is really just a cudaMalloc: I am only allocating some space in GPU memory, I can then run my kernel and copy my data back, and there is no easy way to transfer data into that variable first.
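A sketch of the launch configuration and of the explicit transfers, assuming the mandelbrot_gpu kernel from the sketch above; the maxiter value of 100 and the variable names are assumptions:

    import numpy as np
    from numba import cuda

    n = 1024
    xs = np.linspace(-2.0, 1.0, n)
    ys = np.linspace(-1.5, 1.5, n)
    m = np.zeros((n, n), dtype=np.int32)

    # Launch configuration: 32x32 threads per block, and enough blocks to
    # cover the whole 1024x1024 image (the "one extra if not divisible" pattern).
    blockdim = (32, 32)
    griddim = (n // blockdim[0] + (1 if n % blockdim[0] else 0),
               n // blockdim[1] + (1 if n % blockdim[1] else 0))

    # Implicit transfers: Numba copies xs, ys and m to the GPU and back.
    mandelbrot_gpu[griddim, blockdim](xs, ys, 100, m)

    # Explicit memory management: copy the inputs once, allocate the output
    # on the device only, and copy the result back by hand.
    d_xs = cuda.to_device(xs)                          # allocate + copy in one go
    d_ys = cuda.to_device(ys)
    d_m = cuda.device_array((n, n), dtype=np.int32)    # just a cudaMalloc
    mandelbrot_gpu[griddim, blockdim](d_xs, d_ys, 100, d_m)
    m = d_m.copy_to_host()                             # copy the result back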
I mentioned that threads in the same block can also access shared memory. Shared memory is a user-managed cache: it is physically the same memory as the cache, and you can tell the GPU how much of that memory you want to use as cache and how much you want to manage yourself as shared memory. Shared memory is available from within CUDA for Python through cuda.shared.array; here we allocate two shared arrays, giving the shape and the type, to be used in a matrix-matrix multiplication. You also see some other CUDA commands in this example: we already had cuda.grid; you can get the thread index within a thread block through cuda.threadIdx.x and .y, and you can get the dimensions of the grid through cuda.gridDim, which is what you would usually use in CUDA C++ to calculate your global ID.

This kernel also shows another typical pattern. I said earlier that it is better to ask for forgiveness than for permission — well, in a GPU kernel that is not true. It does not forgive, and it does not forget: if you do not check your boundaries, every thread will write to the memory location it computes, whether that memory was allocated to you or not. So in CUDA it is perfectly valid to ask for permission. The return statement here means that a thread that reaches it is put aside and is done, and that is one of the best things that can happen to a thread, because it will not block any further execution. Threads are executed in groups, and if a whole thread block — or a large part of it, which is often the case — runs into this return statement, then its execution takes only minimal time. I then need to load the data into my shared memory, and further down there is a synchronization command, so that I can be sure all threads have loaded their data item into shared memory before it is used. That is how you can use shared memory from within CUDA for Python; a sketch of such a kernel follows below.

CUDA for Python does not support everything that CUDA does: texture memory, for example, is not supported, and dynamic parallelism is not supported either. But there are not that many such items; other than that, most things work, and the exceptions are listed in the documentation.
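A sketch of the shared-memory matrix multiplication pattern described above, loosely following the tiled example in the Numba documentation; the tile size, the names, and the restriction to matrix sizes that are multiples of the tile size are assumptions:

    import numpy as np
    from numba import cuda, float32

    TPB = 16  # tile size: threads per block in each dimension

    @cuda.jit
    def matmul_shared(A, B, C):
        # Two user-managed caches holding one tile of A and one tile of B.
        sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
        sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

        x, y = cuda.grid(2)        # global element this thread is responsible for
        tx = cuda.threadIdx.x      # position within the thread block
        ty = cuda.threadIdx.y

        # "Asking for permission": threads outside the result matrix retire
        # immediately instead of writing into memory they do not own.
        # (Sizes are assumed to be multiples of TPB, so a block is either
        # fully inside or fully outside and the syncthreads below stays safe.)
        if x >= C.shape[0] or y >= C.shape[1]:
            return

        tmp = 0.0
        for i in range(A.shape[1] // TPB):       # walk over the tiles along k
            # Each thread loads one element of the current A and B tiles.
            sA[tx, ty] = A[x, ty + i * TPB]
            sB[tx, ty] = B[tx + i * TPB, y]
            cuda.syncthreads()                   # wait until the tiles are complete
            for j in range(TPB):
                tmp += sA[tx, j] * sB[j, ty]
            cuda.syncthreads()                   # wait before overwriting the tiles
        C[x, y] = tmp

    # Example launch (sizes chosen as multiples of TPB):
    n = 512
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    C = np.zeros((n, n), dtype=np.float32)
    matmul_shared[(n // TPB, n // TPB), (TPB, TPB)](A, B, C)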
Info
Channel: cscsch
Views: 8,792
Keywords: CSCS, HPC, Python, Lugano, High-Performance Computing, Numba
Id: NQr3p7NWIq4
Length: 25min 28sec (1528 seconds)
Published: Wed Jul 24 2019