PyTorch 2.0 Q&A: Optimizing Transformers for Inference

Video Statistics and Information

Captions
Hey everybody, welcome back to the PyTorch 2.0 Ask the Engineers series. If you've been following along, welcome back; if you're new, this is the series where you get to ask your burning questions about PyTorch 2.0, put them to our engineers, and get them answered in real time. If you've been following the series you know the main theme is performance, and today is no different: we'll be talking about optimizing inference performance for Transformers, and a lot more. Before we do that, a quick recap: this is an ongoing series, so if you head over to pytorch.org/events you'll see all the upcoming talks and can register. We also stream on LinkedIn and YouTube, so there are other places to catch them.

Before we dive in, let's do a quick round of introductions. I'm Shashank, I'm a developer advocate here. Who wants to go next?

I can go next. This is Hamid, I'm part of the Applied AI team here. I've mostly been working on model optimization for inference and on distributed training, and along with Mark I'm one of the contributors to TorchServe. Over to Mark.

Thanks, Hamid. Hi everyone, I'm Mark. Over the last year I've spent most of my time contributing to open source ML ops tools in PyTorch: I'm one of the main maintainers of TorchServe, and I've also contributed a bunch to TorchData and TorchX. Now that we're all excited about performance, I feel like compilers are such a critical part of it, so I've been spending a lot more time on the 2.0 stack recently.

Awesome, welcome. So, PyTorch 2.0: as I mentioned, this series is about performance. We've had a lot of talks recently about the compiler, TorchDynamo, TorchInductor, and a lot of it has been about training. Today we're going to talk about something slightly different, and I heard a few themes there: Transformers and language models are the hot thing right now, and TorchServe is the official serving framework for PyTorch, so I know we're going to discuss both. Let's start with language models and why it's important to speed them up. I know they're computationally intensive and you want to hit a target latency and throughput, so why is it important, and how do we go about speeding up language models? Hey Mark, do you want to take this one?

Sure. At a high level, Transformers have an interesting representational property: it's tensors in, tensors out, and you can solve a wide variety of problems with them, whether it's text to text like ChatGPT or text to image like DALL-E. It feels like a mainstream moment for machine learning, where this is stuff you can now talk about with your grandparents and they get excited about what you're doing. But the problem with these models is that they need to be really big to work really well, and if they're really big they tend to be really slow, and if they're very slow they're not particularly useful, especially for inference. Imagine if you wanted to do autocomplete in Google
and it took five seconds per character: the product would be unusable. For us, performance is very much tied to the usability of the tool. The faster you make it, the more places you can integrate it, and that's why it's so important to look at this problem. Unfortunately it tends to be a very complex problem to optimize these models, but our goal is that after this talk it's somewhat clear for everyone watching.

That's a good point: user experience is so tied to performance. Accuracy of the model is one thing, but think of accessibility features like text to speech or speech to text: if it takes several seconds, when it really has to be milliseconds, you don't get a good experience. So that makes sense.

I would add: think of what it costs to serve these models in production for a long time, the way companies are doing; that's going to be super costly. We need to cut that cost to help democratize these technologies a bit more and give everyone the opportunity to actually afford to run these models in their production workflows.

Yeah, we shouldn't forget cost. There's obviously user experience, but it also has to fit under your target budget, so you want a balance. So how have we been accelerating these models in PyTorch today, and what are the things we're going to discuss going forward?

Maybe I can take this one and give a little background about what we have been doing so far and what is new in PyTorch that can help. Let me share my screen; I added a few slides here. Starting from what the problem is: as Mark mentioned, Transformer language models, and Transformer models in general, even the Vision Transformers used for vision tasks, are very compute-heavy and memory-intensive. There is a component of the computation in these models, the attention, that grows quadratically as your sequence length grows, so if you double your sequence length you need much more compute and more memory as well. This is challenging for both training and inference; you run into this problem in both spaces.

The open source community and many of the hardware vendors have come up with different solutions to make Transformer models more performant, and most of them are on the inference side: think of NVIDIA FasterTransformer, ONNX Runtime, TensorRT, projects like CTranslate, or IPEX from Intel. All of these projects help you optimize Transformer models and make them run faster for inference, although if you have worked with them you know it's not exactly fun: converting your model is usually a painful process. So let's look at what they are doing and why it's painful.
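As an aside on the quadratic scaling Hamid describes, here is a back-of-the-envelope sketch (not from the talk; the batch size, head count, and dtype are invented for illustration) of how the attention score matrix alone grows with sequence length:

```python
# Rough sketch: the QK^T attention score matrix has
# batch * heads * seq_len * seq_len entries, so memory (and matmul FLOPs
# for that term) grow quadratically with sequence length.
def attention_score_bytes(batch: int, heads: int, seq_len: int, bytes_per_elem: int = 2) -> int:
    return batch * heads * seq_len * seq_len * bytes_per_elem

for seq_len in (512, 1024, 2048, 4096):
    mb = attention_score_bytes(batch=8, heads=16, seq_len=seq_len) / 1e6
    print(f"seq_len={seq_len:>4}: score matrices ≈ {mb:,.0f} MB")
# Each doubling of seq_len roughly quadruples this term.
```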
The trend I notice is that each of them is a dedicated, separate library for dedicated hardware; some of them are NVIDIA-specific, and so on. So a solution that lives at the framework level is more desirable from a user experience point of view than something lower level and vendor-specific.

A hundred percent agree, and that's exactly our motivation here: we want to cut out that painful process and give an easier user experience. All the libraries I mentioned, and you can see them on the right side of the slide, from our own PyTorch TorchScript to ONNX Runtime and TensorRT, share the same idea: they want to capture a graph. PyTorch has a dynamic graph that is built at runtime, so there is no ahead-of-time computation graph; these tools capture the computation graph and optimize it through their compiler stack, doing the normal things a compiler stack does: removing redundant ops, constant folding, and most importantly kernel fusion and some memory planning. Some of them provide lower precisions as well, like int8 and even lower these days. This is the general graph-capture idea, and Mark will talk more about graph capture, some of its problems, and how PyTorch 2.0 addresses them. But as I said, there is no unified solution: there are tons of different libraries, each with its own configurations you need to learn to really make it work, and the conversion is not easy.

So in PyTorch we thought that, as you said, it would be great to have a fast, optimized component at the framework level. Now PyTorch has these Accelerated PyTorch Transformers that give you out-of-the-box performance. They work in eager mode, so you don't have to do anything: you don't have to capture a graph, you don't have to go through the pain of converting your Transformer model through different libraries, optimizing it, and maintaining that code. All of that pain can go away. There might be some additional gains from those external libraries, but the main performance gain you get out of the box from PyTorch.

Just to clarify for our viewers: this is a new thing, right? Accelerated PyTorch Transformers is what gives you this out-of-the-box performance without having to go through all the compiler machinery you showed, which could make anyone's head spin if they're not a compiler expert. And I guess the obvious question people will have is: is this available today, and how do I get started?

Sure, I'm getting there very quickly. Part of it has been out since 1.12: the fast path for the Transformer encoder layer has been available from 1.12, and we did an integration with Hugging Face for it, which I'll go through. For the decoder we have new optimizations for scaled dot product attention that support both inference and training. So if you are using PyTorch 1.12 onward and you use the nn.TransformerEncoder modules, you are in good hands already: what you get is optimized performance, and you don't have to change anything.
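As a rough illustration of that encoder fast path (a minimal sketch, not code from the talk: the modules are standard PyTorch, but the shapes and padding mask are invented), the fast path engages automatically during inference when the layer configuration supports it:

```python
import torch
import torch.nn as nn

# Standard PyTorch modules; the fast path kicks in automatically during
# inference (eval mode, no grad) when the layer configuration supports it.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6).eval()

x = torch.randn(8, 128, 512)                      # (batch, seq, d_model)
pad_mask = torch.zeros(8, 128, dtype=torch.bool)  # True marks padded tokens
pad_mask[:, 100:] = True                          # pretend the last 28 tokens are padding

with torch.inference_mode():
    # The padding mask is what lets the nested-tensor path skip pad tokens.
    out = encoder(x, src_key_padding_mask=pad_mask)
print(out.shape)
```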
On the PyTorch nightlies right now we also have the optimized scaled dot product attention, which lives under nn.MultiheadAttention. So let's see what is better about these and why they perform better. For the Transformer encoder layer we wrote a big chunk of code in C++ and CUDA, and we introduced two optimizations there: smart padding and fused kernels. The smart padding works like this: when you do inference, especially batch inference, you pad your sequences to the same length so you can pass them as a batch. This happens in training as well, but the issue is probably biggest on the inference side. What we do is remove the padded tokens that we don't need for the actual computation and only compute on the real tokens, and that speeds things up a lot. The other thing we do is fuse kernels: we fuse a bunch of matmuls and the softmax, combining multiple kernels into one bigger kernel to address the memory-bound issue, which we'll get into later.

And as I mentioned, we have this optimized scaled dot product attention. If you look at the Transformer block, which should be familiar, there is a series of computations: two matmuls, one between your query and key and one with your values at the end, with a softmax in between. This is a big chunk of the computation in a Transformer and it's expensive, so we optimized it using the new kernels from FlashAttention and the xFormers memory-efficient attention. If you are using the PyTorch nightlies you get them now, and they will be released in PyTorch 2.0, which is going out in early March. Under the hood, if you use nn.MultiheadAttention you get FlashAttention or the xFormers memory-efficient kernel out of the box; you don't have to do anything, it's transparent to the user. And the good news is it works out of the box with torch.compile in PyTorch 2.0 as well.

Okay, just to make sure I understood correctly, let me say it back. When you say optimized kernels, these are a bunch of ops that are fused and written in a way that gives better performance on the hardware than you would get otherwise, whereas previously you would use separate ops that were not fused or not optimized, right?

Yes, basically. What kernel fusion addresses is the memory-bound issue: you have to transfer your data from the GPU's global memory to shared memory to do the computation. If you transfer your data each time and only do a little bit of computation with it, you end up spending most of your time moving data between those two memories. So when you fuse the kernels, instead of running a very small piece of computation at a time, you string them together into one bigger kernel.
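To make the scaled-dot-product piece concrete, here is a minimal sketch of calling that fused path directly through the standard functional API in PyTorch 2.0 / recent nightlies (the shapes and dtype are invented; as Hamid notes later, the recommended pattern is to let nn.MultiheadAttention call this for you):

```python
import torch
import torch.nn.functional as F

q = torch.randn(8, 16, 128, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
k = torch.randn(8, 16, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 128, 64, device="cuda", dtype=torch.float16)

# Dispatches to FlashAttention / memory-efficient kernels when the inputs
# qualify; nn.MultiheadAttention and nn.TransformerEncoderLayer call into
# this for you under the hood.
out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
print(out.shape)  # (8, 16, 128, 64)
```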
That way you transfer your data once, do a big chunk of computation, and save those transfers, and that's where the speedups from fused kernels come from.

How about I show you an example, because I think this is really relevant: fusions are central to compiler optimizations for ML. Do you mind if I share my screen? Sure, go ahead.

So this is the documentation page, or really the marketing page, from NVIDIA for the A100. Look at, say, the TF32 performance. Could you zoom in a bit? Of course. The most impressive numbers here are the BF16 and FP16 tensor core figures: you're seeing these insane numbers, 300-plus teraflops. That's tera floating point operations per second: how many floating point operations it can do per second. The flops metric is particularly relevant for matrix multiplications; you can almost think of it like big-O notation if you've done algorithms: it's basically the number of individual element multiplications you need to do to multiply a matrix. So it tells you that you can multiply absolutely, insanely large matrices in a single second.

But then you look at the memory bandwidth number and you go, huh, what's going on here: the compute number is orders of magnitude larger than this one, and that's a problem. Memory bandwidth is how fast you can get the data in. Exactly. Let me draw a quick picture. Say this is your GPU, and it's, you know, Sonic-level speed. Your memory bandwidth is basically this part here, the x.to(device="cuda") transfer, and that becomes the bottleneck, not the torch.matmul: this part is the transfer and that part is the actual computation.

Right, so the bottleneck is in the memory transfer, not in the computation, because GPUs are obviously very powerful throughput devices, but the bandwidth, as you highlighted, is the bottleneck.

So let me show you an example of what people mean by fusion in general. I'm using Copilot here, so I'm cheating a bit. Say we had a module with two operations that look like this.
Say you had x = torch.cos(x), then x = torch.sin(x), and then you return x. Let's say this is the problem we're trying to solve, and imagine that x is very, very big, so you're transferring it to the GPU each time. Now let's count: for the torch.cos you have one read from memory, you read x, and then you need to write the result back, so that's one write. Then you look at the sin and it's the same thing: one read, one write. So in total your program has two reads and two writes. Whereas with a fused kernel, imagine instead you had something like a function we could call fused_cos_sin, and you wrote x = fused_cos_sin(x); I'll show how this is implemented in a second. When you call that fused function, the two reads and two writes become one read and one write, and both reads and writes have a memory bandwidth implication, because when you're reading and writing you need to move the data over the GPU's memory bus. I think this is the simplest example of how to think about it.

Now let me quickly show how this is actually implemented. Here is a very similar example, just to show you how this works in PyTorch 2.0; here it's two sines. What torch.compile does under the hood is generate a single kernel for this, and the kernel happens to be written in Python with a DSL called Triton, which was invented by a team at OpenAI. Everything that shows up under a single kernel means it's fused, and you can see here that the sin and the sin are fused. So you can directly visualize what gets fused and what doesn't. Again, I don't want this to feel like black magic; this is, at a high level, why it matters, and obviously the bigger and more complex your operations are, the more relevant it is. But the simplest way to put it is: reduce the number of back-and-forth trips.

Right, GPUs are that powerful; you showed the number of flops and how much they can compute, but you don't want to keep going back and forth. I've heard analogies like school buses versus race cars: a race car is faster, but a school bus can move more people in one trip, whereas the race car has to make many trips. That was a good fundamental explainer in between. Yeah, exactly.
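A minimal sketch of that fusion example in code, assuming a PyTorch 2.0 or nightly build with torch.compile and a CUDA GPU (the debug environment variable in the comment is an assumption and may vary by version):

```python
import torch

def pointwise(x):
    # Two elementwise ops: eager mode launches two kernels (2 reads + 2 writes);
    # the inductor backend fuses them into a single generated Triton kernel
    # (1 read + 1 write), which is what matters on memory-bandwidth-bound GPUs.
    x = torch.cos(x)
    x = torch.sin(x)
    return x

compiled = torch.compile(pointwise)
x = torch.randn(1_000_000, device="cuda")
y = compiled(x)

# To inspect the generated Triton code, one option (an assumption, not from the
# talk) is to run with the environment variable TORCH_COMPILE_DEBUG=1, which
# dumps the generated kernels to a debug directory.
```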
Okay, can you pull my screen back up? Great. So, as Mark explained, one of the main things we did was this kernel fusion, which speeds things up a lot, and on top of that we did the smart padding to remove the extra padded tokens. As I mentioned, you don't have to do anything to use this; it's available in PyTorch and there is no conversion step. But we did integrate it into Hugging Face, and the reason that integration was needed is that Hugging Face has its own implementation of the attention layer, so we had to do some mapping through the layers to make Hugging Face models call into the nn.TransformerEncoder modules. The details are in the blog post, and there's a Colab notebook that comes with it; I'll use it to show some of the speedups and how it looks in code.

From a user's point of view, if I'm a Hugging Face user, it's transparent, right? I don't need to worry about any changes to my workflow?

That's exactly what I want to show. If you are a Hugging Face user, it's a one-liner API change, and that's about it. And if your model is already written using the nn.Transformer modules, you don't even need that: you get the speedups automatically. So this is the Colab that came with the blog post, and it's specifically about the encoder side. If you're using Hugging Face, you install the Optimum library, which is the optimization library Hugging Face offers, and here is the list of models the Transformer encoder fast path supports: there's a mix of modalities, from text to speech and even vision models. To use it, you load your model the normal way, with from_pretrained and your model name. Then, to convert it to the faster implementation, which we call BetterTransformer here but is really Accelerated PyTorch Transformers, you import BetterTransformer from the Optimum library and call BetterTransformer.transform on your model, keeping the original model as well. And that's it: you have your new optimized model. It's that simple for Hugging Face users. If you compare the two printouts, what it does is change the original layers into the converted layers.
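Condensing that one-liner flow into a sketch (the model name and input are illustrative; BetterTransformer is the Hugging Face Optimum API described above):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# One line: swap the Hugging Face attention layers for the accelerated
# PyTorch (BetterTransformer) fast path, keeping the original model around.
bt_model = BetterTransformer.transform(model, keep_original_model=True)

inputs = tokenizer("a short example sentence", return_tensors="pt")
with torch.inference_mode():
    out = bt_model(**inputs)
print(out.last_hidden_state.shape)
```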
Now let's look at some numbers and the kind of speedups we get. We loaded the RoBERTa base model from Hugging Face, and there are a couple of helper functions here, a benchmark and a device printout. We use a batch size of one and a sequence length of 128, and since this is Google Colab it's running on a T4 GPU. I ran it beforehand to save some time, and you can see you get about a 2.1x speedup even for batch size one, and batch size one is not even the case where the smart padding, the sparsity I mentioned, helps. So that's a good speedup for one line of change, and it's eager mode, there's no graph capture involved.

If you look at the same kind of workload on CPU, you again get some speedup. This is again batch size one, with a sequence length of 64, so it's not huge, and you get a smaller speedup here; the reason is that with a very small batch size and a very small sequence length, the main speedups from sparsity can't kick in. To show how you can use that, you make a bigger batch and pad it; the tokenizer pads to the longest sequence in the batch. If you run the benchmarks again, still on CPU, you'll see it's 2.1x faster even on CPU for the larger batch, which of course has longer sequences.

Let me show similar numbers on GPU. We generate random inputs with different amounts of padding, 20, 50, or 75 percent of the sequence length, because we want to show the impact of padding and sparsity on these optimizations, how much you can gain. What you see is interesting: the higher the padding percentage, meaning longer sequences where most of the tokens are padding, the bigger the speedup, up to about 2.4x faster compared to the normal model, and it's similar for the other batch and sequence configurations shown. That's for a RoBERTa model; we have seen similar or even better speedups with a DistilBERT model, where we got up to around 4x. And again, it's a very cheap optimization: you honestly don't have to do anything beyond one line of code, so it's definitely worth trying. As I mentioned, it works for vision and audio models as well; I won't go through those, but you can see similar speedups with different configurations in the blog post.
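For readers who want to reproduce the padded-batch comparison described above, here is a rough benchmarking sketch (the sentences, iteration counts, and model are placeholders; `model` and `bt_model` refer to the objects from the previous sketch):

```python
import time
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
sentences = ["a short one", "a much longer sentence " * 20, "medium length input here"]

# padding="longest" pads every sequence up to the longest one in the batch,
# which is exactly the case where the sparsity / nested-tensor path pays off.
batch = tokenizer(sentences, padding="longest", return_tensors="pt")

def benchmark(model, inputs, iters=50):
    with torch.inference_mode():
        for _ in range(5):                      # warmup
            model(**inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(**inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

# e.g. compare benchmark(model, batch) vs. benchmark(bt_model, batch)
# using the models from the previous sketch.
```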
So everything so far is about the encoder layer, the nn.TransformerEncoder, and Hugging Face models are just calling into it. What if you want to use the nn.MultiheadAttention module, which uses FlashAttention or the xFormers memory-efficient kernels? Again, what those kernels optimize is the scaled dot product, this computation here. If you have a model where you compute attention yourself in your own attention layer, just replace it with nn.MultiheadAttention; that's all you need to do and you get the speedups. And if you're already using it, you don't have to do anything. I'll show this on real code as well.

There are a couple of people asking where they can get access to the samples or notebooks, both the previous one and this one. Are these examples somewhere people can go?

The first one is definitely available on the blog post; there's a Colab notebook there that goes with it. And this one is from the new blog post we put out a few days ago on accelerating Stable Diffusion with PyTorch 2.0 and the accelerated Transformers; these code snippets mostly come from there.

Awesome, I just shared both links on the live stream. Thanks for your questions, and feel free to keep asking; this is really your show, and we're here to share information, but it's better if we focus on answering your questions about these new features.

So I just want to show this part: all the computation we've been calling the scaled dot product is what you see here: a matmul between your query and key, then adding the masks and doing the softmax, and finally another matmul with the values. You could avoid writing this yourself by calling scaled_dot_product_attention directly, but that's not the pattern we recommend, because then you have to handle the query, key, and value computations manually. You really don't have to: just use nn.MultiheadAttention; that's the best pattern for using this new feature.

And I guess the recommendation, correct me if I'm wrong, is to use the modules that are already available, because you and others have taken care to make them the most optimized versions, rather than re-implementing things yourself. The more you rely on the APIs we already provide, the higher the chance you're getting the best version of that functionality. Exactly: that's why we brought the FlashAttention and xFormers kernels into PyTorch core, to provide one unified, optimized path that everyone can use and that we maintain, and if improvements come along in the future we'll add them in the next releases.

I'll stop here shortly and let Mark talk about the other optimizations we have in TorchServe, but let me quickly show these two PyTorch profiler traces from when we ran a larger workload. Could you zoom in a bit? Is that better? Good. The idea is that when we pass larger workloads, bigger batch sizes and longer sequences, we get better speedups from these Transformer encoders, and I want to show the reason. Here is a comparison between batch size 64 with sequence length 256, and batch size 1 with sequence length 25, which is a very small workload.
If you look at the profiles, the first one, the heavier workload, is actually GPU-bound. The top row shows the CUDA kernels being launched from the CPU, and below that are the actual CUDA kernels running on the GPU, and that line is pretty much fully occupied: the GPU is busy crunching numbers. If you look at the second one, there are only tiny chunks of CUDA kernels running and most of the time is spent on the CPU, and you see this very long cudaDeviceSynchronize where the CPU is waiting for the GPU to finish and hand control back. This is the overhead we have from Python, and it's the issue we mentioned: it's CPU-bound, it takes time to launch these kernels from the CPU and move the data to the GPU, and that launch time dominates. CUDA graphs are something that could be super helpful here, and I think Mark is going to talk more about how we support that in PyTorch 2.0. I'll stop here; let me know if there are any questions.

I love this graph. At first glance it looks complicated, but now that you've explained it, it really shows the two different regimes. In the second one it's not just the data transfer: launching a kernel takes time too. Every time you launch a kernel, which is a function that computes on the GPU, there is a cost, and that's true from any language, and you showed how that case is bound by the CPU. The first one, which I assume is the ideal state you want to be in, is where the GPU is occupied doing its work and the CPU is the one waiting, because then you're fully utilizing the device. Yes, and the nested tensor, the smart padding, matters a lot for that GPU-bound work as well, whereas in the small-workload case you don't get much benefit from it.
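To produce that kind of trace yourself, here is a minimal sketch with the standard PyTorch profiler (the model and input shapes are placeholders chosen to mirror the "heavy" workload in the comparison):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).cuda().eval()
x = torch.randn(64, 256, 512, device="cuda")  # batch 64, seq 256: the GPU-bound case

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.inference_mode():
        model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Open the exported trace in chrome://tracing or Perfetto to compare
# CPU-side launch time against GPU occupancy, as in the traces above.
prof.export_chrome_trace("trace.json")
```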
So I have a question here that I'll read out: "We have a GPT-J with one-fifth, or 0.2, second inference time. I want to keep it at 0.2 seconds without decreasing beams, and at most one second for a batch of 50 users, on a Tesla T4 16 GB. How?" I'll take a stab, but really I'll defer to the experts; it's hard to make specific recommendations without knowing your specific workload. So, Mark and Hamid, what general recommendations can you give to stay within those bounds?

I can decompose this question a bit. You have a fixed model and you want to make it faster, but you've already hard-coded a lot of parameters into the problem. For example, "can we do it on a T4": to which I'd respond that it's often easier to just upgrade your hardware; you're likely to get a much more significant speedup, so maybe consider something like an A10G. The other parameter is the model itself: maybe it could be smaller, maybe you quantize it, maybe you distill it, maybe you pick a smaller model. Maybe you run it through something like torch.compile, or use the multi-head attention path, even though I think the decoder support isn't fully there yet. You can also decrease the batch size. So I wouldn't fix too many aspects of your problem, because it comes at a cost: you either pay money for better hardware, or you pay time if you're not willing to relax any of your constraints. It's a pick-your-poison kind of situation.

I'll add a little color there too. As Mark said, there are multiple variables here and it's hard to chase them without really knowing what is going on, so I have some general suggestions. Start by profiling your model and see where the bottleneck actually is: is it a CPU-bound issue where CUDA graphs might help, is it a GPU-bound issue, should you increase your batch size, and so on. So start with the basic principles: look at the profiles and see what's going on. Second, if you are dealing with GPT-style, generative models, the KV cache is something you need to consider: caching the previously generated tokens. One of the issues with these autoregressive models is that you generate one token at a time, and for each new token you take all the previously generated tokens and send them back through the model, and that, plus beam search on top of it, is what contributes to the high latency. If you can cache the keys and values for those previously processed tokens, you avoid recomputing them; Hugging Face has an implementation of that, and there are a bunch of resources out there, so that's something to look at too. But again, without more specifics it's hard to make concrete suggestions.
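As a pointer for the KV-cache suggestion, here is a hedged sketch using the Hugging Face generate API (the model choice and generation settings are illustrative; `use_cache=True` is the flag that reuses previously computed keys and values):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-j-6B"  # illustrative; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        use_cache=True,   # reuse cached key/value tensors instead of recomputing
                          # attention over the whole prefix for every new token
        num_beams=1,      # greedy decoding; beam search multiplies the work
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```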
Thanks, both. A follow-up from the same person: "We use Hugging Face Transformers for inference; we are looking at TorchServe." So is that a segue into the other topics we want to discuss today?

Sure, let me take a stab at it and share my screen. Actually, one other question first: does this profiling also work with other GPUs, such as Intel's XPU and IPEX? I think they mean the PyTorch profiler in this case. I'm not sure, and I honestly doubt it, but with IPEX, sure, you can also just use Intel's own profiling tools, and we have a couple of case studies in TorchServe of people doing that. NVIDIA GPUs tend to get more attention in the profiler than AMD and everyone else; other hardware vendors tend to come with their own profilers, traditionally that's what I've seen. Alright, sounds good; let's switch to TorchServe then. Okay, can everyone see my screen? Yes.

So, as we mentioned at the beginning, I'm one of the primary maintainers of TorchServe. The way you should think about TorchServe is that it does all of the things you need around model.forward when you're running inference: tracking metrics, logging, auto-scaling. Increasingly, and especially last year, the theme for us was performance; users kept coming to us for out-of-the-box suggestions on how to make models run really fast. Let me show you a quick picture from our internals guide, because it often answers the question of how TorchServe is different from, say, FastAPI. One, you can have a bunch of models deployed concurrently, and a TorchServe instance will automatically swap models in and out depending on the scale. You also have different RESTful APIs to manage models, so you can query their status, unregister them, or add more workers, and you have an inference API to get predictions from a specific model. And this is one of those things where, if people ask me how to make their TorchServe models faster, one answer is to really carefully configure the number of workers you have, which we cover in our performance guide.

But there's a lot more that goes into this. Let me draw a quick picture; I hinted at this when I was answering the GPT-J question. When you think about performance in general, there are several sets of optimizations. One is what I'd call the matmul-level optimizations: things like tile padding and FlashAttention go into this bucket. Then you have the model-level optimizations: quantization, distillation, making the model smaller, or just not picking a huge model in the first place. Another is the framework level, meaning your inference framework: configuring workers, the queue size, the batch delay; there are a couple of these configs. Then you have what I'd call the hardware upgrades, which on AWS at least means getting a G5 instance for GPU or an m6i for CPU; those are some really cost-effective options. And then there's the compiler level: torch.compile, fusions, memory planning, and all of that. I know this is a lot of information, but the reason I show this picture is that there are a lot of knobs you can turn, and without profiling it's really easy to optimize the wrong thing. In TorchServe, people come to us saying, "I want to make something fast, but I don't want to have to go read the theory of ML model performance; could you help me out a bit?" and that's what we've targeted to make a bit easier.
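To make the framework-level knobs concrete, here is a hedged sketch of configuring workers and batching through TorchServe's management API (the archive name and numbers are placeholders; the parameter names should be double-checked against the TorchServe docs for your version):

```python
import requests

MANAGEMENT = "http://localhost:8081"  # TorchServe management API default port

# Register a model with batching enabled: requests arriving within
# max_batch_delay milliseconds are grouped into batches of up to batch_size.
resp = requests.post(
    f"{MANAGEMENT}/models",
    params={
        "url": "my_model.mar",    # placeholder model archive
        "initial_workers": 4,     # worker processes serving this model
        "batch_size": 8,
        "max_batch_delay": 50,    # milliseconds
    },
)
print(resp.status_code, resp.text)

# Scale workers up later without re-registering.
requests.put(f"{MANAGEMENT}/models/my_model", params={"min_worker": 8})
```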
So I wanted to quickly show you one really important file in TorchServe called the base handler. I even say this in the code itself: get an A10G or an A100 if you're using torch.compile. This advice matters because, for older GPUs, the memory bandwidth isn't as big of a bottleneck, so you're less likely to see dramatic speedups from fusion. That's why, when people run things like accelerated Transformers or torch.compile on older GPUs and then ask, "hey, why isn't this all that much faster," the answer is that they're optimizing the wrong bottleneck. Specifically, the way this works is that we gate it: is CUDA available, and if so, is this an A10G or an A100, checking the architecture, and if so we set a flag that is shorthand for "enable tensor cores." We saw from the NVIDIA docs that the tensor core flops are substantially higher than FP64, which is often used for scientific code, or FP32; if you don't set this you'll be using plain FP32 by default and giving up something like 10x of your peak performance. So this is a substantially important config variable, because it's just one flag and you can get roughly a 10x difference.

Going a bit further: a lot of runtimes, for example TensorRT, have this pattern where you have a model and you need to convert it. You take the model, compile it with TensorRT, and provide an example input; you have to be very precise about the specific shape of your input data, then you specify the dtype and say, for example, run it with FP16. So TensorRT mixes compiler-level optimizations with numerical-accuracy optimizations like quantization, but the tricky part is that it makes you define your input shapes up front, which adds a bit more hassle. Whereas with something like torch.compile, you just pass in the model and a backend, which can also be TensorRT, but by default it's inductor, and there's this important mode called reduce-overhead, which is absolutely critical for inference; Hamid hinted at this earlier.

Let me quickly show what this mode is. If you go to the PyTorch repo, to torch/__init__.py and def compile, this is the actual compile function. You can see it takes a model, you can have dynamic shapes and different backends, and it has this parameter called mode, with three options: default, reduce-overhead, and max-autotune. Max-autotune essentially means: find the right shapes your matrices should be in, in terms of padding and memory layout on the GPU; it's very slow to compile but very fast once it's done. The one we care about here is reduce-overhead, and reduce-overhead in the context of torch.compile is equivalent to CUDA graphs. What CUDA graphs do is, instead of launching lots of small kernels on the GPU one at a time, where the launch overhead becomes your bottleneck, you record them and replay them together as one big unit of work. It used to be the case that using CUDA graphs in PyTorch was somewhat complicated: you needed to muck around with CUDA streams and do a whole bunch of stuff. Now it's somewhat trivial: you just pass in this extra flag.
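A minimal sketch combining the two flags just discussed, the tensor-core switch from the base handler and the reduce-overhead compile mode (the model and shapes are placeholders; speedups depend heavily on the GPU generation):

```python
import torch

# Shorthand for "enable tensor cores" for float32 matmuls on Ampere-class GPUs
# (A10G / A100); without it, matmuls run in full FP32 at a fraction of peak.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# Equivalent higher-level switch in recent PyTorch versions:
torch.set_float32_matmul_precision("high")

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).cuda().eval()

# "reduce-overhead" wraps execution in CUDA graphs, which matters most for
# small batch sizes where kernel launch overhead dominates.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 25, 512, device="cuda")
with torch.inference_mode():
    out = compiled(x)   # first call compiles; subsequent calls replay the graph
```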
Especially if you're running TorchServe with small batch sizes, this flag is absolutely critical. For larger batch sizes it's not as big a deal, because you'll be more compute-bound rather than launch-overhead bound, but the batch sizes need to be pretty large for that. So again, this is a really important flag, which is why we decided to make it a reasonable default. However, if you look at the release notes for torch.compile support within TorchServe, for now at least we actually discourage you from using it, because the inference performance isn't yet as competitive as TensorRT on GPUs or IPEX on CPUs. Our general philosophy when it comes to "what is the right runtime, what are the right compiler tricks" is: we don't know, you should benchmark and see, but we'll make it really easy for you to benchmark and see. IPEX works the exact same way; it's just a flag to enable IPEX. There's also ONNX support, and it works pretty much the same way: in the case of ONNX you set up an ORT session, you allocate the number of threads to be equal to the number of logical cores on your CPU, and once you have an ORT inference session you set up the model.

This is another important consideration: I know the name is Torch and TorchServe, but TorchServe can deploy a TensorFlow model, it can deploy an ONNX model, and we've found that makes it really easy for people who want to migrate, and ONNX can very often be the right choice if you want the performance. TensorRT is easier, because a TensorRT model just looks like a regular TorchScript model, and since we support TorchScript models we support TensorRT; most of the pain goes into preparing the model, which is basically the part I showed earlier. You wouldn't do that inside TorchServe; you'd do it right before handing the model to TorchServe. I'll stop here because I know this was a lot, but I just wanted you to be aware that there are a lot of options, and figuring out what's right is very much a benchmark-and-see kind of exercise.

Yeah, I think we were discussing this before the live stream too: inference is such an involved process. You have your latency budget, your throughput budget, different frameworks you're working with, different optimization paths, and different hardware choices, and ultimately you have to try these options and see what works well for you. It's really hard to give a blanket recommendation like "this is the best GPU and this is the best framework"; you have to go through the whole process and see what works for your case. That's the recommendation I always give.
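A hedged sketch of the ONNX Runtime session setup described above (the model path and input shape are placeholders; the session options shown are standard onnxruntime APIs):

```python
import os
import numpy as np
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Pin intra-op parallelism to the number of logical cores, as suggested above.
sess_options.intra_op_num_threads = os.cpu_count()

session = ort.InferenceSession(
    "model.onnx",                          # placeholder path to an exported model
    sess_options,
    providers=["CPUExecutionProvider"],
)

# Input/output names depend on how the model was exported.
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```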
There's a question here: with torch.compile, will there be a dedicated lightweight inference engine that can be used standalone? That's a great question, and yes, it already exists: it's torch.export. It's still very much a pre-beta feature, so expect BC-breaking changes, expect bugs, expect unparsable error messages, but if you're willing to go through the pain of it, please message us and we can work with you on making it useful. I'd say this is a very high-priority ask; a lot of people have asked us for it, it makes tons of sense, and I'm actually looking forward to integrating it into TorchServe myself, so I'm very much a customer here too.

And to reiterate, a lot of what we discussed today is available in the nightlies, not in PyTorch 1.x, so pip install the nightly or grab the latest nightly Docker container and you can try out a lot of these capabilities. If you have feedback, what's the best way to reach out, Mark and Hamid: the PyTorch forums, or issues on the GitHub repo? Either, really. Yell at us on Twitter, at-mention us on GitHub; that's totally fine. We both spend a significant chunk of our day on both of those platforms, so whatever works for you. That's nice, and it's good that you're so approachable, because we want to reduce the barrier to providing feedback; if you put too many layers between the user and the person taking the feedback, it doesn't work as well. As I mentioned, there are a few different forums: dev-discuss is pretty active, a lot of maintainers and contributors spend time there, and you can obviously reach out directly to Mark and Hamid via these channels.

We have about a minute left of the live stream, and there are no more questions, so any last things to reiterate before we close? Mark and Hamid, closing words if you will.

I have a quick one. I don't know if you all know the expanding brain meme, but I want to translate it to performance. Level one is you hear some advice about how to make models faster and you apply it. Level two is you figure it out yourself and write a really good blog post about it so other people can learn it. Level three is you make these things supported out of the box in your code, with various config flags or warnings or errors. And level four, which is why I'm personally so excited about torch.compile, is you just let the compiler think about all of this for you: the parameter space is too big, and humans aren't meant to deal with this kind of complexity. So I'm very excited about compilers for their potential to automate model performance improvements.

Yeah, and that's where the technology is heading: early technologies are always a lot of DIY, get your hands dirty, but once it matures it becomes a lot more straightforward, and hopefully you get out-of-the-box performance without having to do anything. Alright folks, we're coming to the end of today's live stream. I just want to reiterate that this is an ongoing series; we've had several of these, and you can find the list of links on the pytorch.org events page. All the previous live streams, including this one, will be available on the YouTube channel, so if you're watching this on YouTube you know where to find them, and be sure to come back and register for the upcoming live streams.
If you're on YouTube you can also subscribe and get notifications directly there, or if you want a calendar invite, go to the PyTorch website and click register. So we'll close here. Thanks again for spending your time with us; hopefully we were able to answer some of your questions and share some knowledge you didn't have before. With that, we'll close. That's bye from me, Shashank, and thanks everyone. Thanks for listening, everyone, see you in the next one. Bye.
Info
Channel: PyTorch
Views: 4,639
Id: ZOWjOxC80qw
Length: 61min 45sec (3705 seconds)
Published: Fri Feb 03 2023