Make Python code 1000x Faster with Numba

Captions
Hello everyone, and welcome to another Absolute Minimum. Unlike all my previous videos up to this point, this one is not about a text editor; instead it's about a library called Numba, which, as its tagline reads, makes Python code fast. Numba is a just-in-time compiler, which means it can take your existing Python code, figure out what types exist inside it, and generate fast machine code from it. Looking through their website you can see a number of examples on the main page, and all of them show that you can keep your Python code looking very much like Python code and then add a decorator that simply says "do just-in-time compilation". This produces code with speeds very similar to what you can get from C, Fortran, or C++, but without having to write in any of those languages, without shipping compiled code, and without expecting the user to have a compiler installed. The website also demonstrates that Numba works well with NumPy arrays, which makes it really suitable for scientific computing, and that it makes it simple to parallelize your code across multiple threads. It also does a very good job of SIMD vectorization to get the most out of your CPU, and if you want to go to the next level, you can even run Numba code on a GPU.

While this video covers the basics, the things I think you'll need 90% of the time, there are other videos and tutorials out there that I recommend. One that I really like is from one of the core developers of Numba, called "Accelerating Scientific Workloads with Numba". That talk is particularly good because it shows examples of other people using Numba in open source projects, things you may have already used: for example, if you've ever used Datashader for visualization, you've already used code written with Numba.

Okay, without further ado, let's dive straight in and see what Numba is all about. One other thing that's different about this video is that instead of working in an editor I'm going to be working in a Jupyter notebook, an environment I find very useful for experimentation and for slowly exploring a problem or a new library. We'll make a new Python 3 notebook, and the first thing I usually do for exploration tasks, if I need to plot things or use NumPy, is the lazy thing: run %pylab inline, a nice Jupyter magic that imports a lot of useful functions, as we'll see in a second.

Let's first go through the example given on the Numba website. Normally, when you accelerate a function with Numba, you use a decorator, but since a decorator is just a function that takes a function as an argument and returns a function, I'm going to remove it here so I can keep the plain version around, run it, and get a baseline for how long it takes. For the number of samples, let's use a high value like 1,000. Okay, 1,000 isn't high enough; let's use 10,000. Now this is code that takes a reasonable amount of time: 7.41 milliseconds. Next, let's make a jitted version by applying jit as a plain function, and run the timing again.
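A minimal sketch of what those notebook cells might look like, assuming the landing-page example is the Monte Carlo pi estimator (the exact example and timings in the video may differ):

```python
import random
from numba import jit

def monte_carlo_pi(nsamples):
    """Estimate pi by sampling random points in the unit square."""
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if x ** 2 + y ** 2 < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

# Baseline, pure Python:
# %timeit monte_carlo_pi(10_000)

# The decorator is just a function, so apply it manually and keep
# the original around for comparison.
monte_carlo_pi_jit = jit(monte_carlo_pi)

# %time monte_carlo_pi_jit(10_000)    # first call: includes compilation
# %timeit monte_carlo_pi_jit(10_000)  # subsequent calls: fast
```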
Okay, what's going on? The whole premise of this video is that Numba makes your code really, really fast, and yet the jitted function takes roughly a hundred times longer than the pure Python version. Well, because Numba is a just-in-time compiler, the first time you call the function it spends time compiling it. Now that the function has run once, the compilation is done and the code has been cached, so run it a second time and, presto, it only takes 189 microseconds instead of 7.41 milliseconds. This is a good initial demonstration of what Numba can do, but it also shows a pitfall: the first invocation of a function includes the compilation time, which means there's going to be a delay. This comes up over and over again and a lot of people get tripped up by it, so I wanted to cover it.

Now let's move forward. Before getting into some serious optimization with Numba, I want to show you the various ways in which you can fail to do things right. I think it's important to see this, because it is fairly easy to fail with Numba, and if you don't know what's tripping you up, you can spend a lot of time just trying to figure out what's going on, which is in many ways antithetical to what Numba is trying to be. So let's import pretty much everything we'll need from Numba; in a moment I'll go through what each of these is.

I'm going to put together a rather dumb function so we can play around with it. All it does is take a list of numbers and create a new list that contains a 2 if the original number was even and a 1 if the original number was odd. Let's also give it a test array to use. You may have noticed I'm intentionally doing something weird inside this function: when the item is even I append the number 2, but when the item is odd I append the string "1". We'll see in a second why. If we run this function in pure Python, it takes about 17.5 milliseconds.

Now let's make a jitted version. When we run it, we get an angry red box yelling at us. It's a warning, not an error, so the cell runs and completes just fine. Ignoring the warning for now, it looks like the call took 337 milliseconds, but that's okay; we know the first run includes compilation. So let's run the cell again, and what gives? The jitted function is still slow. We should look at that warning again. What it's telling us is that something is wrong in the else branch where we append a string. Essentially, Numba thinks the output list is supposed to be a list of int64s, which makes sense: in the if branch we append a 2, and Numba assumes that 2 is an int64 because it doesn't know any better. Then it gets to the else clause, sees a string being appended, and that doesn't work: it can't generate efficient code when we mix types. That's not something Numba can handle, but it is something Python handles just fine. In an effort to be user-friendly, the plain @jit decorator is designed not to fail in these situations; instead it falls back to something called object mode.
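A sketch of that intentionally broken function; the test data and names here are my own, not taken from the video:

```python
from numba import jit

def twos_and_ones(items):
    """Return 2 for even items and, deliberately as a bug, the string "1" for odd ones."""
    out = []
    for item in items:
        if item % 2 == 0:
            out.append(2)      # Numba infers a list of int64 here...
        else:
            out.append("1")    # ...then hits a string, so type unification fails
    return out

test_list = list(range(100_000))  # hypothetical test input

twos_and_ones_jit = jit(twos_and_ones)

# With plain @jit, the type failure only produces a warning and a silent
# fallback to object mode, which is no faster than pure Python.
# twos_and_ones_jit(test_list)
```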
Object mode is not your friend. All it really does is generate code that looks very similar to Python, code that tries to figure out what the types are at runtime and to stay flexible about them, and that flexibility ends up hurting you. As we saw, running the function a second time actually takes longer than the pure Python implementation, because all we've done is create a function that is as slow as the Python version with Numba's overhead stacked on top. We don't want this.

The simple fix, of course, is to change the code so we're no longer appending a string, but before we do that I want to go over what I think you should be doing in your Numba workflows: never use the @jit decorator by itself. Either use the @njit decorator or pass the option nopython=True. Now if I run this function I don't get a warning, I get an error, and it won't let me proceed. That's good when you're developing high-performance code, because here I inserted an obvious bug, but sometimes these bugs are not obvious; especially if you're used to working in Python, it's easy to mix types or write code that won't generate efficient machine code. Making sure you always run in nopython mode means you always get an error instead of a warning, you can't proceed, and you have to go fix the problem. I find that Numba's error messages are usually good at telling you where to look, though not always good at telling you what's wrong, so it can get a little frustrating; but if you write your code as small, specialized functions, this comes up less and less. Essentially, nopython mode tells Numba that it cannot turn to the Python interpreter for any help whatsoever, and this ends up being the purest mode of Numba. The Numba developers have recognized that this is the correct way to do things, so they provide a decorator called njit which automatically sets nopython=True. So let's take the option out, change the decorator to @njit, and run the code; now we get an error.

Let's go fix our code. When I originally described this function I didn't say it would append the string "1"; I said it would append a 1, so this code is busted. Let's fix it. Re-running the original function in pure Python doesn't change anything; it takes the same amount of time. But if we run the jitted version once, we get a wall time that's way higher, which we know is fine: compilation is happening. There's also another warning, and this one is going to mess us up in just a few minutes. It tells us that we're passing a Python list into this jitted function, and Numba has to do extra work to figure out what's inside that list; for that it has to use reflection. The Numba developers have realized that reflected lists are not a great feature to support, so they're going to be deprecated. For the purposes of demonstration it's fine for now, but I'm going to move away from Python lists in a second, and you should avoid them anyway. If we run this again, okay, what gives? It's still slow; in fact it's taking even longer than before. That's because of the reflection problem: even in nopython mode, Numba has to rely on reflection to figure out whether the type of the input list is correct while the function is running, so this is even worse than before.
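A sketch of the fixed function using @njit (equivalent to jit(nopython=True)); passing a plain Python list still triggers the reflected-list deprecation warning described above:

```python
from numba import njit

@njit                       # same as @jit(nopython=True): fails loudly instead of
def twos_and_ones(items):   # silently falling back to object mode
    out = []
    for item in items:
        if item % 2 == 0:
            out.append(2)
        else:
            out.append(1)   # bug fixed: append the integer 1, not the string "1"
    return out

test_list = list(range(100_000))

# Works, but emits a pending-deprecation warning about reflected lists,
# and the reflection overhead makes it slower than you might expect.
# twos_and_ones(test_list)
```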
The moral of this little story is that you should avoid passing Python lists to Numba whenever you can, and that's fine, because in most cases everything you need to do will work with NumPy arrays. If you want to learn more, I recommend reading the list topic in the Numba documentation, linked in the description; it explains all of this, explains that this behavior is going away, and introduces a typed list, currently an experimental feature, that will replace passing Python lists, at which point this problem disappears.

So how do we fix this? Like I said, it's better to use NumPy arrays anyway, so let's redefine the test array not as a normal list but as a NumPy array. As a consequence this actually slows down the pure Python function; I assume it has something to do with iterating over NumPy arrays in Python, and I'm not qualified to comment on exactly why, it just happens. However, the jitted function now runs in about 5 milliseconds, compared to the roughly 20 milliseconds the pure Python implementation took with a normal Python list.

At this point I want to take a small detour to discuss another feature of Numba that is really nice, though we won't spend too much time on it, because once you know the rest of Numba you can include it wherever you need it: the vectorize decorator. What @vectorize does is let us rewrite the function as a scalar computation. We've kept the main part of the computation from the original function, but instead of operating on a list, the function operates on a single element. The vectorize decorator then creates a function that accepts both a plain number and our test array. If we time this, it's interesting: calling the vectorized implementation takes less than a millisecond, whereas what appears to be the same function was taking 4.83 milliseconds. This has to do with the structure of the original function, where we define an empty list and append to it: there's no way for the compiler to know what size the list is supposed to be, so it can't do any pre-allocation, and we lose a bunch of time. We can fix that by rewriting the function to pre-allocate its output, and then it takes about the same amount of time as the vectorized version. Ultimately, whether you use vectorize comes down to how you want to write your computations and whether you want to pass arrays to them explicitly or implicitly. Vectorize is really nice for simple computations, and the fact that it works on both scalars and arrays, whereas the fixed function only works on arrays, is a big boon.
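A sketch of both variants, reusing the same even/odd computation; the signature, sizes, and names are mine rather than from the video:

```python
import numpy as np
from numba import njit, vectorize

@vectorize(["int64(int64)"])      # compiles a NumPy-style ufunc
def twos_and_ones_scalar(item):
    # Written as a scalar computation; the ufunc machinery maps it over
    # arrays, and it still works on a single number.
    return 2 if item % 2 == 0 else 1

@njit
def twos_and_ones_prealloc(items):
    # Pre-allocating the output avoids the cost of growing a list.
    out = np.empty(len(items), dtype=np.int64)
    for i in range(len(items)):
        out[i] = 2 if items[i] % 2 == 0 else 1
    return out

test_array = np.arange(100_000, dtype=np.int64)
# twos_and_ones_scalar(7)             # works on a scalar
# twos_and_ones_scalar(test_array)    # and on an array
# twos_and_ones_prealloc(test_array)  # comparable speed once compiled
```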
Now that we've seen a number of different ways to fail at writing good Numba code, let's look at writing performant Numba code: taking a simulation and making it really, really fast. The motivating example for the rest of the video is a spring-mass-damper system. Normal spring-mass-damper systems, or harmonic oscillators, are described by second-order linear ordinary differential equations, and they have the nice property that the initial condition does not change the overall behavior of the system. I've prepared a similar dynamical system, except that where a normal spring-mass-damper system has what's called wet friction, friction that is a function of velocity, I've defined a friction term that combines wet friction and dry friction, where dry friction has a constant magnitude rather than depending on the velocity. You don't really need to understand what these functions are doing beyond the fact that they simulate this funky spring-mass-damper system for 10 seconds at a very fine time resolution of a tenth of a millisecond, given some initial position of the mass. If I call this function with some initial condition and time it, it takes about 291 milliseconds. If I plot it for different initial conditions, the trajectories don't converge to the same location, which is very different from a standard spring-mass system. On top of that, the system can no longer be modeled with linear ordinary differential equations, which means it can't be solved with the standard techniques we use for a spring-mass-damper system. So if I want to study its behavior for different initial conditions, probably my only option is simulation.

Now imagine for a second that I'm a PhD student, and my advisor comes in and says, "Hey, you need to simulate the system for initial conditions going from 0 to 10,000 in 0.1 increments and then do some statistics on it." So, as a grad student, I fire up my calculator: 10,000 divided by 0.1 is 100,000 simulations, times 0.3 seconds each is 30,000 seconds, divided by 3,600 is 8.3 hours of my life. I'll just start the simulation, come in tomorrow, look at it, hope everything worked out, and then hope my professor doesn't come back with a different range to run. That's a long time, and it's ultimately not a lot of computation. The reason it's so slow is that we've written something in pure Python and we're running it over and over again with lots of loops. Numba exists exactly for this, so let's go ahead and jit it.

Some things to note about how this function is written. I've made sure the arrays for time and for the position we're tracking are pre-allocated at the beginning. In addition, when writing Numba code you sometimes have to give up some normal Python conventions: here, instead of using enumerate, I'm using range over the length of the time array and holding on to the counter variable so I can use it both to read the time and to set the position. In normal Python this is very unpythonic, but in Numba you'll see yourself doing it again and again, because things like zip don't always work correctly (maybe they do now; last time I tried they didn't). Every once in a while you'll run into a Python feature that isn't supported and you'll have to rewrite your code; I like to say it starts to look like what C would look like if C looked like Python. I think that's an acceptable compromise for the speed boost you're about to see.

Okay, now that we have these functions, let's run them. The first call, with compilation, takes 250 milliseconds; the second run takes 1.63 milliseconds. So as a grad student I pull out my handy-dandy calculator again: a computation that was going to take 8.3 hours now takes 2.71 minutes. That's amazing. Well, let's make sure the data looks correct: if we recover plots that look just like the ones before, it's probably doing the job correctly, and the plots do look like that.
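A sketch of the kind of simulation being described, not the exact system from the video: a spring-mass-damper with both velocity-dependent ("wet") and constant-magnitude ("dry") friction, integrated with a simple Euler step. All constants and the friction model are my own assumptions:

```python
import numpy as np
from numba import njit

@njit
def friction(v, c_wet=0.5, c_dry=1.0):
    # Wet friction depends on velocity; dry friction has constant magnitude.
    return c_wet * v + c_dry * np.sign(v)

@njit
def simulate(x0, dt=1e-4, t_end=10.0, k=20.0, m=1.0):
    n = int(t_end / dt)
    times = np.empty(n)          # pre-allocate outputs instead of appending
    positions = np.empty(n)
    x, v = x0, 0.0
    for i in range(n):           # plain range + index; enumerate/zip avoided
        a = (-k * x - friction(v)) / m
        v += a * dt
        x += v * dt
        times[i] = i * dt
        positions[i] = x
    return times, positions

# %time simulate(1.0)     # first call: compilation dominates
# %timeit simulate(1.0)   # subsequent calls: a few milliseconds
```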
Life is great, right? Well, maybe, but maybe I also need to keep simulating these systems with slight perturbations, and if I do, 2.71 minutes might end up being quite a drag. My computer happens to have eight cores, and most modern computers have a lot of cores available, so as this grad student I think I can do better. From concurrent.futures I grab a ThreadPoolExecutor and say: with a thread pool of eight workers, map my function over my list of initial conditions. I'm actually not going to run it up to 10,000, only up to 1,000; that's a RAM consideration, just as an FYI, so we cut everything down by a factor of ten. I run that with a time magic at the top of the cell, and if I look at htop, the activity across my cores hasn't really gone up. Originally this operation was supposed to take 163 seconds; we cut the input size down by ten, so it should take about 16 seconds, and we're not seeing much of an improvement.

That actually makes sense, because by default Numba does not release the global interpreter lock, and using threads in Python can't give us any benefit here: each thread still has to acquire the GIL before it can do anything. One of the nice things about Numba is that it's very easy to instruct a function to release the GIL. This function does not need to access the Python interpreter while it's running, and in fact we've made sure of that, so we can tell it to release the GIL, and we can do the same for the function it calls. Now, if we come back and run this whole mess again, under htop you'll see all of my CPUs saturate for a second, and the whole thing runs in 2.77 seconds. We've taken something that was going to take 8.3 hours and turned it into 2.77 seconds (times ten for the full range); that's about a thousand-fold speed-up. So yeah, that's my mic-drop moment. If you've heard of Numba, if you're on the fence, if you're curious, if you have CPU-intensive workloads, if you're doing any kind of scientific computing or computer vision or AI, anything where you might find yourself turning to C or C++ or Cython, give Numba a shot, because as I've shown here it can give you about a 1000x speed-up with basically no additional work, and that's amazing.

Before I finish this video there are two additional things I want to talk about. One is that you don't actually have to use the ThreadPoolExecutor directly: Numba makes it fairly easy to parallelize your code itself, though we do have to define a wrapper function for that. Before we run this wrapper, a few things to note about it: in the njit decorator we set the keyword parallel=True, and we import something from Numba called prange, which stands for parallel range. It instructs Numba to check whether the loop can be parallelized and, if so, to schedule the iterations on different threads. In this case it's an embarrassingly parallel problem, as we showed with the thread pool executor. So let's run these simulations again: it takes about 2.5 seconds, a little less than before.
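A sketch of both parallelization approaches; nogil, parallel=True, and prange are real Numba options, while the simulation body, worker count, and ranges are my own assumptions carried over from the sketch above:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
from numba import njit, prange

# nogil=True lets the compiled code run without holding the GIL, so
# Python threads can actually execute it concurrently.
@njit(nogil=True)
def simulate(x0, dt=1e-4, t_end=10.0, k=20.0, m=1.0, c_wet=0.5, c_dry=1.0):
    n = int(t_end / dt)
    positions = np.empty(n)
    x, v = x0, 0.0
    for i in range(n):
        a = (-k * x - c_wet * v - c_dry * np.sign(v)) / m
        v += a * dt
        x += v * dt
        positions[i] = x
    return positions

initial_conditions = np.arange(0.0, 1000.0, 0.1)

# Approach 1: Python threads, useful only because the GIL is released.
# with ThreadPoolExecutor(max_workers=8) as ex:
#     results = list(ex.map(simulate, initial_conditions))

# Approach 2: Numba's own parallelization with parallel=True and prange.
@njit(parallel=True)
def final_positions(x0s):
    out = np.empty(len(x0s))
    for j in prange(len(x0s)):          # iterations scheduled across threads
        out[j] = simulate(x0s[j])[-1]   # keep only the final position to save RAM
    return out

# results = final_positions(initial_conditions)
```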
In some cases the advantage of using Numba's native parallelization will be very helpful; in other cases it won't. It just comes down to the particular problem you're trying to solve.

I think up to this point I've covered 90% of what you'd need to do great things with Numba yourself. For the remaining 10%, you can discover some deeper topics in the Numba documentation. There are some really cool things: defining convolution-style computations like you see in neural nets, an experimental feature that lets you define classes, the ability to have Numba functions called from C and C++, which can have some really cool uses, a fastmath option that trades exact floating-point math for speed, and a number of other things. All of these links will be in the description; I'll just quickly run through what else I think is worth studying on your own. It's definitely a good idea to look at which features of Python and NumPy are supported directly by Numba, as well as some important differences between writing normal Python code and writing Numba code. You can also declare types in the decorator as part of the function definition, which allows the function to be compiled when it is first loaded into memory rather than when it is first called. It's possible to debug Numba code using gdb, and you can read more about that in the linked article. If you're writing a library that you're going to distribute via PyPI or some other method, ahead-of-time compilation might be a good thing to have. And one of the standout features of Numba that I haven't covered in this video, but might cover in a future one, is that it makes it really easy to write kernels for CUDA or supported AMD GPUs, so you can massively parallelize your computations on a GPU.

Okay, that's all I've got for this Absolute Minimum about Numba. This was my first Python-related video, so please let me know in the comments what you thought of it. Give it a like if it taught you something new, and if you like my content or how I do things, please subscribe; I'll be making more videos about programming and text editors and all sorts of things. Thank you very much for watching. Bye.
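As a quick illustration of the type-declaration point above, an eager-compilation signature might look like this; the signature string and the function itself are hypothetical, not from the video:

```python
import numpy as np
from numba import njit

# Giving Numba an explicit signature makes it compile the function when
# the module is loaded ("eagerly") instead of on the first call.
@njit("float64[:](float64)")
def squares_up_to(limit):
    n = int(limit)
    out = np.empty(n)
    for i in range(n):
        out[i] = float(i) ** 2
    return out
```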
Info
Channel: Jack of Some
Views: 332,772
Rating: 4.9694877 out of 5
Keywords: how to make python faster, how to make python program faster, how to make python code faster, how to make my python code run faster, how to speed up python program, python optimization, python performance, speedup python with numba, jit python, jit compiler python, python performance tips, high performance python
Id: x58W9A2lnQc
Length: 20min 33sec (1233 seconds)
Published: Sat Dec 28 2019