How to export and optimize a YOLO-NAS object detection model for real-time inference with ONNX and TensorRT

Video Statistics and Information

Captions
Hey guys, welcome to a new video. In this video we're going to cover a ton: we're going to convert a YOLO-NAS model to TensorRT, and we're going to take every single element of going from a PyTorch model. We're going to load in a pre-trained model, I'll show you all of that, and then we're going to see how fast we can actually run these models when we optimize them to run on GPUs, on specific hardware. So we're going to cover each element of the whole pipeline for converting a YOLO-NAS model to TensorRT, and we're also going to run it with ONNX Runtime, so you can run it on both CPU and GPU. You're going to see some awesome results: these models can run very, very fast, up to 100 frames per second. So we should definitely optimize our models before we deploy them; it's actually quite hard to do, and I'm going to cover all of it in this video, so definitely stay tuned for that.

First of all, let's start inside the super-gradients GitHub repository from Deci AI. They are the creators behind YOLO-NAS, and this is the model that we're going to use. We're going to set it up, see some results, export it to ONNX, convert it to TensorRT, and then we're going to see some interesting results. This is basically just the GitHub repository; if we scroll a bit further down we can see the different computer vision tasks it can be used for, and also some benchmarks. We can see that they also have quantized models, which is actually pretty awesome; you can do that as well in the steps for converting to ONNX and TensorRT, so you can basically convert from float32 to int8 and it will significantly speed up your models, as you can see down here. We can see the latency down at the bottom, in milliseconds: if we actually optimize these models we can get down to around 5 milliseconds per image, which is pretty fast, up to 200 frames per second on specific hardware. So these models are really cool, and you should definitely optimize them, because the plain PyTorch versions are just very slow. Let's cover that in this video: first of all we're just going to get a pre-trained model, see some inference results on it, and then start on the whole pipeline.

Let's now go inside YOLO-NAS and open the documentation. If we go inside the sources we can see all the different examples that they have; let's scroll down to the export section. Here we see "model export", so let's go into that and basically just go through it step by step. We can see that we can enable both 16-bit floating point and 8-bit quantization, and that we can export the model for ONNX and also for TensorRT. If we scroll a bit further down, we can see how to pip install it, how to set it up and so on, and how to get a specific model. We can use pre-trained models, but you can also use your own custom-trained models, and I'm going to show you how to do that. Going a bit further down, we're basically just going to create a notebook with all of this; I've cleaned everything up for you and it will be available in the video description. Then we have two different output formats. Here we're going to focus on the flat format, because we just want to process one image: let's say we have a video stream, we load in one image from the video stream, we process it,
and then we just keep doing that for the video stream or our camera. We could also run in batch mode if we have multiple batches, multiple images that we want to process at the same time. So we're going to use this flat format, where we have a batch index, then our bounding box, the class score and also the class index. If we scroll a bit further down, we can see how to visualize the predictions; I'm going to cover all of that in the code, where we're actually going to see the results as well. The last thing that I want to show you guys is down here at the bottom: choosing the backend for exporting the model. You can see that the supported backends are ONNX Runtime and also TensorRT. I'm going to show you how we can export it in both formats, how we can set it up with TensorRT, run it with both ONNX Runtime and TensorRT, and then we're going to see how fast it really is and also look at some results. So after this video, if you follow along step by step, you will be able to convert your PyTorch models, both the YOLO-NAS pre-trained models and your own custom models, into ONNX and TensorRT, optimized for running on your hardware to get real-time performance.

So let's now jump into this Jupyter notebook. I've created a notebook where we have the examples cleaned up from their GitHub repository, and we're going to go through it block by block. But first of all, make sure that you have CUDA installed on your computer, and also TensorRT, because that is what we're going to use both as a backend and later on for converting the model. We're going to do some benchmarking with TensorRT; it has commands we can use directly in the terminal, and we can also run inference directly in there, so we need those two things. If you have CUDA installed already, then you only need to install TensorRT. They have a pretty good guide here: just choose the version that you want and you will also get the installation guide. Open it up and follow it step by step; this is the best guide out there for actually installing it. If you're on Windows, just scroll down to the zip file; you only need to download the zip file, and then you can either set it up with your environment variables, where you specify the path to the DLL files, or you can copy those DLL files into your CUDA directory. So let's now jump back into the code and see how we can set it up.
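Before going through the notebook, a quick sanity check from Python can save a lot of debugging later. This is a minimal sketch (not shown in the video), assuming the usual packages: torch, onnxruntime-gpu and the TensorRT Python bindings.

```python
# Quick environment check before exporting and optimizing the model.
import torch
import onnxruntime as ort

print("CUDA available:", torch.cuda.is_available())       # PyTorch can see the GPU
print("ORT providers:", ort.get_available_providers())    # should include CUDAExecutionProvider

try:
    import tensorrt as trt
    print("TensorRT version:", trt.__version__)           # should match your CUDA installation
except ImportError:
    print("TensorRT Python bindings not found; ONNX Runtime will still work on CPU/GPU")
```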
First of all, let's go in and get a model. With models.get we can get this YOLO-NAS large model; they have a large, a medium and a small model, and here we're just going to use pre-trained weights, the COCO pre-trained weights. Up here at the top we just have the different modules that we need to import, both for our object names and training, and then we also need these conversion enums from super-gradients: the detection output format mode, because we want our flat mode, and, if we also want to do quantization, the export quantization mode and the export target backend, so we can choose between ONNX and TensorRT for our export backend. This is how you can get a pre-trained model, but you can also go down and use a custom-trained model. I already have a video about that here on the channel, so definitely check that out: we take a whole dataset, create the whole training pipeline, set up the model, label the dataset, export the dataset, train the model and export the model. Then you can basically just specify the path here to the trained checkpoint and the base architecture that you used. So if you have trained a YOLO-NAS large model, you just need to specify the number of classes that you have trained on, which is in your dataset, and then specify the path to the model that you have downloaded from your Google Colab notebook. Everything is available here on my YouTube channel, so definitely go check that out.

So we have our pre-trained model, or you can use your own custom model; that's everything that we need. Then we can call export directly on our model. Here we basically just specify the name of the ONNX file that we want to export, so yolo_nas_pre_large.onnx. We can set the confidence score threshold, so let's just go with, for example, 30% confidence, and we can set the non-maximum suppression threshold, because it's also going to do non-maximum suppression. That is why it's really good to use this export function from super-gradients: it takes care of everything for you. You don't have to do the post-processing or the pre-processing; everything is built into the ONNX model and the ONNX graph, and then you can actually run all of that on the GPU as well. If you just have a standard, raw ONNX model without any of this, and you want to apply non-maximum suppression after the model, then you can't really run in real time, because we actually want to run the non-maximum suppression on the GPU. After a forward pass on our image we probably have hundreds, if not thousands, of detections, and they're overlapping all over each other, so we need non-maximum suppression to reduce the bounding boxes to the specific objects that we have in our scene; we're basically just choosing the bounding boxes that fit our objects the best. So we really want the non-maximum suppression built inside the ONNX file as well, and Deci, with super-gradients, takes care of all of that. We can also set the number of pre-NMS detections, basically how many predictions we want to limit it to, and also the maximum number of predictions per image. Say you only want to detect the single best object in the image, you can just set it to one; if you want an arbitrary number of predictions, you can set it to a very large number. Let's just go with 50 in this example, which is just an arbitrary number.
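As a reference, a minimal sketch of getting the model is shown below. It follows the super-gradients export documentation; the custom checkpoint path and class count are hypothetical placeholders.

```python
from super_gradients.common.object_names import Models
from super_gradients.training import models

# Pre-trained YOLO-NAS large model with COCO weights.
model = models.get(Models.YOLO_NAS_L, pretrained_weights="coco")

# Or a custom-trained model: specify your own class count and checkpoint path
# (both values here are placeholders).
# model = models.get(Models.YOLO_NAS_L,
#                    num_classes=2,
#                    checkpoint_path="path/to/ckpt_best.pth")
```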
Then we need to specify our output predictions format, and right now we can set it to the flat format: we just go inside the detection output format mode and set it to flat format, if you just want to process a single image with a batch size of one. We can't really use dynamic batches when we export with ONNX, and when we're going to use TensorRT we need to specify the batch size explicitly, so that is very important. If you're getting any errors, it is probably because you're not using an explicit batch size or a batch size of one, so definitely keep that in mind when you're going through your errors, if you run into any. We can also set the engine down here, but I'm going to wait with that for just a second. We could set it to TensorRT and then it's going to export with the TensorRT backend, but the reason why I don't want to do that to start with is that we can actually use the ONNX model together with TensorRT anyway. So we don't need to export for TensorRT; we can just use the ONNX model in TensorRT as well, but if we export it for the TensorRT backend we can't use it with ONNX Runtime. So unless you have a really good reason for exporting with the TensorRT backend, just go with the ONNX model and you can use it for both, and as you're going to see, with ONNX Runtime we also get very fast inference speed.

So let's just run this and see some results. It's just going to run here, and we might get some output: we can see that it is downloading the pre-trained weights, that it successfully loaded the pre-trained weights for the YOLO-NAS large architecture, and that it has started the export. Now we see that the export is done. Let's go over and see if we actually have this model file in our directory; we should have it a bit further down here, yolo_nas_pre_large.onnx. So now we have our exported ONNX file; let's go back into our notebook. There we go: the export returns these export results, so let's just take a look at them, because they give us a lot of information about how to use the model, the input shape and so on. Here we can see that the input image type is unsigned integer 8, so the pre-processing is baked into the model; we have a batch size of one, three channels for our RGB image, and then a 640 by 640 image. We can also go down and see the parameters that we set when we initialized the export, and then we can see how to run it with ONNX Runtime: how to load the model and how to do a forward pass with an inference session. If you want to run this you can just pip install onnxruntime, and if you want to use it on the GPU you install onnxruntime-gpu instead, and then it will take care of all of it. Here we set up the CUDA execution provider and also the CPU one, so it's going to choose whatever is available on your computer. Then we have our inputs and outputs that we can directly go in and extract. This is how we're going to get the detections: we have our predictions, which are the output from our model, and then we have our flat predictions, where we first have the batch index, which we don't care about, and then the bounding box, confidence and class ID, and we just loop through all the predictions that we get from our model. Then we can use that in our own applications and projects, and I'm going to show you how to do that in just a second. Here we also get a command for TensorRT: once you have TensorRT installed you can use trtexec to run TensorRT models, do benchmarking with average runs, and also convert models, so you can convert an ONNX model to a TensorRT engine, and I'm going to show you all of that as well.
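Here is a rough sketch of that export call with the values mentioned above. It follows the documented model.export API; the file name is approximate, and the NMS IoU threshold is an assumed value since it isn't stated explicitly in the video.

```python
from super_gradients.conversion import DetectionOutputFormatMode, ExportTargetBackend

# Export to ONNX with pre-processing, post-processing and NMS baked into the graph.
export_result = model.export(
    "yolo_nas_pre_large.onnx",                 # output file name (approximate)
    confidence_threshold=0.3,                  # drop detections below 30% confidence
    nms_threshold=0.7,                         # IoU threshold for the built-in NMS (assumed value)
    max_predictions_per_image=50,              # arbitrary upper bound, as in the video
    output_predictions_format=DetectionOutputFormatMode.FLAT_FORMAT,
    engine=ExportTargetBackend.ONNXRUNTIME,    # keep ONNX Runtime so the file works with both backends
    batch_size=1,                              # explicit batch size of one
)
print(export_result)  # prints the input shape/dtype and usage snippets for ONNX Runtime and trtexec
```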
But let's now just go down and run this model and see an example. We're just going to load in an image and resize it to the input size that our model takes. This is really important: all of these parameters are fixed and they need to be exactly the same every time. Now that we have our image, we can go down and create our ONNX Runtime inference session. We specify the path to our model, so let's just go in and take the correct one; we copy the relative path and throw it in here. Then we set up our CUDA provider and our CPU provider, we have our inputs and outputs, and then we're also going to time how long it takes to do inference with session.run, and we're going to print our inference speed and also the results. So we exported the model, we loaded in the image, we set up the model, and now we're going to do inference. Let's hit shift-enter and it's going to run. First of all it just needs to set up the runtime before it actually does the forward pass, so this might take some time; that is just initializing the model, so if you're running this in real time on a video stream you set this up outside the for loop or while loop. Now we can see that we get our results back, timed at around 12 milliseconds, so that's almost 100 frames per second directly with this large model. If you're running the PyTorch version you will probably only get a few frames per second with super-gradients directly, so this is already an increase of 5 to 10x, which is very awesome. I'm running this on a 3070 graphics card, so it's not the best out there in the world, but we can see we get really good performance. We also get all the predictions out here; those don't really make sense right now, so let's go down and visualize our predictions.

We have our results, we set them equal to our flat predictions and just extract the values as we saw in the export results variable. Let's run this and take a look at it: here we can see that we have class ID zero and also two, so this is probably a person and a car or something like that, but let's just take a look at that in a moment. After that we have the confidence score and our bounding box. Then we can go and use our predictions to visualize them directly with super-gradients; just run this block of code, don't spend too much time on it, and call the function show_predictions_from_flat_format. There we go, now we have our predictions: we have the persons walking here, here we have a bench, but with a very low confidence score, we have some cars in the background and we also have a person. So now we can see both the confidence score and the predicted label together with the bounding boxes. Again, this is just a single image, but it only took 12 milliseconds to process it. You can basically just extract this code directly if you want to use the ONNX model out of the box; it is really fast, and it will run very, very fast if you run it on a live webcam or a video stream. The only thing that you need to do, and I have a bunch of code for that, is set up a video streamer with OpenCV, initialize all the things outside the while loop, then have the while loop, the run command, the visualizations and the for loops for running through the results, and then you have the whole pipeline up and running; you can process videos in real time.
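A minimal sketch of that ONNX Runtime part is shown below. The file names are placeholders, and the flat output layout (batch index, box corners, confidence, class ID) is the one described above.

```python
import cv2
import numpy as np
import onnxruntime as ort

# Create the session once, outside any video/while loop.
session = ort.InferenceSession(
    "yolo_nas_pre_large.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # falls back to CPU if no GPU
)
input_name = session.get_inputs()[0].name
output_names = [o.name for o in session.get_outputs()]

# Load and resize the image to the fixed 640x640 input; the exported model expects uint8 NCHW.
image = cv2.cvtColor(cv2.imread("test_image.jpg"), cv2.COLOR_BGR2RGB)
image = cv2.resize(image, (640, 640))
blob = np.expand_dims(np.transpose(image, (2, 0, 1)), 0).astype(np.uint8)  # (1, 3, 640, 640)

result = session.run(output_names, {input_name: blob})
flat_predictions = result[0]  # each row: [batch_index, x1, y1, x2, y2, confidence, class_id]

# Draw the detections on the resized image.
for _, x1, y1, x2, y2, conf, class_id in flat_predictions:
    cv2.rectangle(image, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
    cv2.putText(image, f"{int(class_id)}: {conf:.2f}", (int(x1), int(y1) - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
cv2.imwrite("result.jpg", cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
```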
Now, I promised you guys we'd go back to this command again, so let's grab it and open an Anaconda prompt so we can basically just benchmark our model. First I need to go inside the correct directory, so let's just delete this part and cd into the directory where we are, the YOLO-NAS-from-super-gradients folder, and then we can call the command again. Now we need to specify the path to our model; we called it yolo_nas_pretrained_large, so let's just verify that this is correct, and then we want to run with 100 average runs and a fixed duration. It's now running; it will take some time, it will load in the model, and then it will do the benchmarking, and we'll see the results after that. Okay, it's now done running and we have these benchmarks. If we scroll a bit further up we can see the averages over 100 runs: it's basically just going through these runs 100 times, so it's loading the ONNX model, converting it to TensorRT, and then it's running everything in here and doing these benchmarks. We can see that we get around 12 milliseconds, sometimes 11 milliseconds, so probably an average of around 12 milliseconds. If we go down and look at the latency we see the minimum of around 10.5 milliseconds, the maximum, the mean and so on; the minimum is around 10 and the mean is around 12. So again, this is pretty fast as well, similar to what we got for the ONNX model, so it's more or less the same. But this is pretty cool: we can run it directly in here, we can benchmark it, and we can see a lot of different metrics and information about our GPU and the model. So that is how you can benchmark it.

Let's now see how we can convert an ONNX model to TensorRT. First we had the PyTorch model, we converted it and saw the results with ONNX Runtime, set up with GPU and also CPU, and now we have the benchmark. Let's now take the ONNX model, convert it to a TensorRT engine, and run some inference with that. I'm going to show you how you can do all of that: create the engine, take the engine, run inference with it and see the results. So now I have a command that I've just copy-pasted in directly; we have trtexec again. Now we need to set this explicit batch flag, as I mentioned at the start; this is very important, because often when you're getting errors it's because you don't have the correct dimensions or the correct values that you're converting your model with, and you also need to convert the model with the same versions of TensorRT, CUDA and so on that you're going to run inference with. We also need to set this explicit batch because it can't do dynamic batching with TensorRT here. Now we need to specify the path to our ONNX model again, and where we actually want to save our engine, so right now we just go in here with yolo_nas_pretrained_large again. There we go; we also need to specify this saveEngine argument, set to the TensorRT engine path. First I'm just going to copy-paste this, and then I just want to show you guys what we can actually do. Let's run this here; it's going to run for some time, convert the model and so on. Until then, let's just open up a new Anaconda prompt and call trtexec on its own: these are all the different arguments that you can throw at it, so you can set things like the average runs, percentiles, different export outputs and devices.
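For reference, the two trtexec invocations look roughly like this. The sketch drives them from Python's subprocess so it stays in one language; it assumes trtexec is on your PATH, the file names match the export above, and the benchmark duration is an assumption since the exact value used in the video is unclear.

```python
import subprocess

onnx_path = "yolo_nas_pre_large.onnx"
engine_path = "yolo_trt.engine"

# 1) Benchmark the ONNX model: trtexec builds a temporary TensorRT engine and times it.
subprocess.run(
    ["trtexec", f"--onnx={onnx_path}", "--avgRuns=100", "--duration=15"],
    check=True,
)

# 2) Convert the ONNX model to a serialized TensorRT engine and save it to disk.
subprocess.run(
    ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}", "--explicitBatch"],
    check=True,
)
# Optionally add "--fp16" to the second command to build a half-precision engine.
```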
So if you have multiple different CUDA devices you can also set that up. If we scroll a bit further up here we can see the warm-up iterations, which we've already been playing with; we can also set the batch if you're using the batch format, so you can set the batch equal to one, or use the explicit batch. If we scroll a bit further we can see that we have loadEngine and saveEngine, which we're using now. So we have all these different arguments; you can go and see here how to use them, and also, if you want to use floating point 16 or int8 and so on, you can see how, with a short description over to the right. I just wanted to cover that while this is running.

So now our model has been converted into a TensorRT engine, as we can see down here at the bottom. If we scroll a bit further up we can see that it does this benchmarking again, now on ten runs, and again we get around 12 milliseconds. If we go over here to the left, we can now see that we have this yolo_trt.engine file; this is the engine that we're going to use to actually run with TensorRT. So we now have both our ONNX model and our engine model; you can also have a .trt model, which is basically the same as an engine model. Now, to be able to run inference we actually need to go inside this GitHub repository and pull the code, so you can just clone this repository into your current repository, or just go in and take the individual files. If you go inside this YOLO-NAS TensorRT folder, we need to use this TensorRT inference script. It does a lot of different things in the backend as well; you need to go in and change the max size, the model weights and also the names for your classes if you're using a custom model, but I've already done that. We have the repository in here: if I scroll up to the top we have this YOLO TensorRT repo, and if we go inside of that we have YOLO-NAS, but we also have all these other YOLO versions; this is the YOLO-NAS from super-gradients and Deci AI. So we have this TensorRT backend and the TensorRT loader, which are the most important parts. We won't go into detail; this is very complex and you would have to sit with it for a very long time to understand all of it, but it's going to pull in these loaders, and then we can use this inference script to directly use the model that we have. If I scroll up here, the only things that I've changed are the max size and also the names, plus my local paths. Then it's going to do the detection; I've also removed part of it, so I'm just going to use OpenCV to load in the image. It doesn't really resize it, it doesn't rescale it; I don't want to pre-process my image, I just want to take the raw image, convert it to an RGB image, throw it through the model and get the results back. So this is everything; again, all this code will be available in the video description. Then we just have this for loop running through all our detections, where we're going to draw a rectangle and also put some text. This is just going to process a single image: we create an instance of our model, we specify the model weights, so this is the yolo_trt.engine that we have just converted from our ONNX model with trtexec, and then we just specify the path to a test image. We read an image with OpenCV and call the detection function on our image with our TensorRT model. If you take a look at this test image that we're going to run, I'll just scroll down, it is just a selfie of me; we're going to take the image and convert it into the correct input format.
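The repository's inference script does the heavy lifting here, but for orientation, a generic minimal sketch of running a serialized engine with the TensorRT Python API is shown below. This is not the repo's code; it assumes the TensorRT 8.x bindings plus pycuda, that binding 0 is the image input, and that the engine reports static binding shapes (with dynamic shapes you would set the input shape on the execution context first).

```python
import cv2
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401 - creates a CUDA context on import
import pycuda.driver as cuda

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

# Deserialize the engine that trtexec wrote to disk.
with open("yolo_trt.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate host and device buffers for every binding (TensorRT 8.x-style binding API).
host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    shape = engine.get_binding_shape(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = cuda.pagelocked_empty(trt.volume(shape), dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

# Prepare the input the same way as for ONNX Runtime: RGB, 640x640, NCHW.
image = cv2.cvtColor(cv2.imread("test_image.jpg"), cv2.COLOR_BGR2RGB)
blob = np.transpose(cv2.resize(image, (640, 640)), (2, 0, 1))[None]
np.copyto(host_bufs[0], blob.ravel().astype(host_bufs[0].dtype))

# Copy input to the GPU, run the engine, and copy every output back.
stream = cuda.Stream()
cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
for host, dev in zip(host_bufs[1:], dev_bufs[1:]):
    cuda.memcpy_dtoh_async(host, dev, stream)
stream.synchronize()
# host_bufs[1:] now hold the flat predictions described earlier.
```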
That's 640 by 640; we throw it through the model and then we're going to see the results. You can basically just wrap this code up so you can run it on live webcams, video streams or whatever you want to run inference on, with both ONNX Runtime and TensorRT. So let's now go in and open up a new Anaconda prompt, or we can just use this one. Let's go inside the YOLO TensorRT folder, and then we need to go inside YOLO-NAS again. There we go; if we look inside the directory we can see that somewhere we should have the TensorRT inference script, so let's just call it with python. It's going to create an instance of the model, read in the image and do the detection, and along the way it's also going to write some of the results out to the directory. So now we run the program and see the inference results with TensorRT. First of all, we can see that I'm just printing this TensorRT model object: we can see the input size, 3 by 640 by 640, the profile shapes divided into each individual binding, and then the binding shape and also the flat prediction binding shape. This is the output where we get the batch index, the bounding box, the confidence score and also the class, so this is for our explicit batch. Then we can see that we did a forward pass, and now our script is done running and we should be able to see our results. Again, I'm not displaying it directly, but if we go over here to the left we can see the results and also our test image. This is the image that we're going to throw into the model, this is the pre-processed image, the image that we throw directly into the TensorRT engine that we set up, and now we get the results. So this is the input and this is the result with the TensorRT model: we're detecting a person, we can see the class up here at the top, and we can also see the confidence score, 0.97, so 97% confident that this is actually a person in this frame, and that is correct. And this is how we can run inference with both ONNX Runtime and also TensorRT.

This is actually very complex, and you guys might have to digest it for a bit; this is a video covering a lot of stuff, so you probably also want to go back and watch it several times, try it out on your own, have it on the sideline, try to follow along step by step, and see if you can do it on your own models or use some pre-trained models as I did in this video. It's a really good thing to know when you're working as a machine learning engineer, and specifically a computer vision engineer, where you want your models to run as fast as possible. Let's say we want to create a real-time system: this can actually have a huge impact. Maybe we can go with a lower-budget GPU compared to just buying the most expensive GPU out there, because we just want 100 frames per second; we can go with a decent GPU, optimize our model for that specific hardware, and then we're both saving money, running faster, and creating some really cool projects and applications. So I hope you have both enjoyed this video and also learned a ton. This is very hard to get into, so if you guys are a bit confused, it is a really hard topic; you just need to put in the time. And then I'll just see you in one of the upcoming videos, guys. Until then, happy learning.
Info
Channel: Nicolai Nielsen
Views: 5,837
Keywords: object detection, deep learning, yolo, object detection deep learning, object detection tutorial, yolonas, yolo nas, yolo-nas, real time custom object detection, custom object detection yolo-nas, yolo-nas google colab, best object detection model, deci ai, tensorrt optimization, tensorrt, onnx, how to optimise model with tensorrt, how to optimize model with onnx, onnx model export, export tensorrt, tensorrt inference, onnx inference, how to run inference with tensorrt
Id: JVCtx7-4qxE
Length: 24min 2sec (1442 seconds)
Published: Thu Feb 01 2024