HOW DOES YOLO V3 WORK?

Captions
Hi! Today I'm going to talk about the YOLO algorithm, mainly version 3, a bit about version 4, and the tiny variants of both. I'm not going to show you how to train your network in five minutes and three lines of code; there are plenty of videos like that, but I haven't seen a good video explaining how YOLO actually works, what is inside it, and that's what we're going to focus on today. I assume you have some knowledge of neural networks: what a neuron is, what an activation function is, what a convolutional layer is. We're also not going to dwell on the parts of YOLO that are unoriginal, the things you can see elsewhere, but rather on how those little pieces come together to create this unique algorithm.

I will also give you the great tool I'm using right now, so you can explore on your own. Everything can be changed from the keyboard in real time: I can switch from YOLOv4-tiny to YOLOv4 in a split second, show where the center of each detection is and in which grid cell it was detected, show the anchor boxes, take a screenshot, change the confidence threshold. You can see it finds a banana, and probably a cell phone too; it sees cell phones everywhere.

It's very easy to install. I run it on GPU, but it also works on CPU, just a bit slower. To download it, go to the first link in the description below; the installation is described there and it's very, very simple. If you'd rather try it first without installing, there's a Google Colab link: you can do everything I do here, except that instead of the keyboard you set the values in the notebook. You can change the confidence, copy a different image link, paste it in, and run it. It takes a second or two per detection, because Google Colab has the worst possible CPUs (its GPUs are great, but this demo runs on CPU).

Installing it on your own machine is very easy because it's just a git clone and a pip install. You can obviously use a virtual environment, but remember to also run the ipykernel line, which registers that environment as a kernel for notebooks, because we'll run everything in Jupyter. I run it on OpenCV with GPU support, because on the GPU it's faster, but building OpenCV for GPU takes a few hours, so I don't advise doing it.
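A minimal sketch of loading YOLO through OpenCV's dnn module, presumably similar to what the tutorial's notebook does; the .cfg/.weights file names here are assumptions, use whichever pair you downloaded:

```python
import cv2

# Load a Darknet-format YOLO model (file names are placeholders).
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

# The three detection layers we'll read outputs from later.
print(net.getUnconnectedOutLayersNames())

# The default backend runs on CPU. If you compiled OpenCV with CUDA
# (the install that takes a few hours), switch to GPU like this:
# net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
# net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
```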
Okay, let's start the tutorial proper, and we'll start with the architecture. The paper splits it into a backbone (Darknet-53, with 53 layers) and a head, but I don't see that split in practice and I don't think we should look at it that way. I would split it into a feature extractor, which produces feature maps, and a head, which changes those feature maps into the actual detections.

So how do we extract features from the image? First we give the image to the network, and then it goes through a bunch of convolutional layers, 1x1 and 3x3, interleaved all the time, organized into something called residual blocks. These are the real backbone of this algorithm, because they are everywhere, as you can see here. (I'll probably upload this image to the GitHub repository too; it's just a Paint drawing, but all the visualization is there.)

What is a residual block? It's not really a complicated thing. In basic models you see something like a sandwich: every layer sits on top of the previous one. That's not the case here. Instead, we take the output of a layer, do a 1x1 convolution and then a 3x3 convolution, but then we add the feature map we had before to the output of those two layers. So we preserve something we already had, but we also add a new perspective on top: new feature maps. And we do this a bunch of times; in the diagram you can see residual blocks repeated 1, 2, 8, 8 and 4 times.
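A minimal PyTorch sketch of the Darknet-style residual block just described: a 1x1 convolution that halves the depth, a 3x3 convolution that restores it, and the input added back on top. The 64 to 32 to 64 channel pattern follows the architecture diagram; the BatchNorm/LeakyReLU details are the usual Darknet convention, not spelled out in the video:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        hidden = channels // 2                # e.g. 64 -> 32
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.LeakyReLU(0.1),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        # Preserve what we already had, and add a new perspective on top.
        return x + self.block(x)

x = torch.randn(1, 64, 52, 52)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 52, 52])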
So we have a lot of 3x3 and 1x1 convolutional layers. The 3x3 convolutions give spatial orientation within the image. What do I mean by that? YOLO gives practically every pixel knowledge of the whole image, and it does it through those 3x3 layers; there are something like forty 3x3 convolutions in the entire network, maybe a bit fewer, don't quote me on that. A pixel here has information from its neighbors, but each neighbor got information from its own neighbors in the previous 3x3 layer, and that neighbor from its neighbors in the layer before. So information flows from the outer parts of the image, through 3x3 convolution after 3x3 convolution, into this one pixel. That's why a single grid cell can detect that I'm standing here. Nowhere in YOLO is there a fully connected layer; it's all based on convolutions.

What's the point of the 1x1 convolutions, then? To many people they don't make sense, and indeed they don't if you only look at the resolution, the 2D space we usually see in pictures. They make sense if you look in depth. Look at the channel counts: 64, 32, 64, 32, 128, 64, 128, 64; the depth goes up and down all the time. We do it to condense the information we have and to reduce the computation we need, because this has to work in real time. Instead of keeping, say, 192 feature maps, we can reduce them to 16: at each spatial position we take the vector along the depth axis, extract the information that's important to us, and save it in one point; do that 16 times and we have 16 channels. And by the way, we should think in terms of feature maps rather than images: the input looks like an image in the first layers, but soon it looks nothing like one; it keeps getting smaller in resolution but bigger in depth.

What about max pooling? There is no separate max pooling layer; downsampling happens inside a convolutional layer. We simply do the convolution jumping two pixels instead of one, a stride of 2 rather than 1, so we skip every second position and come out at half the resolution. YOLO is also very flexible about the input resolution: the basic input is 416x416, and in this demo it's 320x320. Sticking with 416, YOLO makes detections at three different resolutions, 13x13, 26x26 and 52x52, which you can see in the diagram; it shrinks the input 32, 16 and 8 times. Here the feature maps are at resolution 52, here at 26, and here at 13.
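A sketch of both tricks, assuming PyTorch: a 1x1 convolution squeezes the depth without touching the resolution, and a 3x3 convolution with stride 2 halves the resolution in place of max pooling. Five stride-2 steps take 416 down to 13:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 192, 26, 26)           # 192 feature maps at 26x26

squeeze = nn.Conv2d(192, 16, kernel_size=1)
print(squeeze(x).shape)                    # [1, 16, 26, 26]  depth reduced

downsample = nn.Conv2d(192, 384, kernel_size=3, stride=2, padding=1)
print(downsample(x).shape)                 # [1, 384, 13, 13] resolution halved

# Five stride-2 convolutions shrink a 416x416 input to 13x13 (416 / 2**5):
size = 416
for _ in range(5):
    size //= 2
print(size)                                # 13
```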
But look at it this way: we have feature maps perfectly shaped to make a detection at 26x26, so we could simply take them and detect on them. It would be wasteful, though, not to use the new perspective the algorithm learned at the smaller 13x13 resolution, because there are a lot of convolutional layers down there and they might have learned something helpful to us. So instead of ignoring that information, we upscale the 13x13 feature maps to 26x26 and concatenate them with the untouched feature maps from the 61st layer, and then we run the detection on that. We don't waste any information; it may be the same information, but it's a different perspective on it. We do the same again for 52x52, and so we get detections at three different resolutions.

Now we come to something really unique to this algorithm: the head. How do we change feature maps into an actual detection? Look at the architecture: the last convolutional layer gives us 1024 feature maps at 13x13 resolution, and from those we get one vector per grid cell. (The diagram shows 1x1x18, but that's for a different class count; in our case the vector has length 255, because it contains three guesses from each grid cell.)

What's a grid cell? It's one cell of the grid laid over the image: at a 416x416 input we have 13 of these cells across and 13 down, one per position in the 13x13 feature map. And every single grid cell has to give you three predictions of what it sees. Every single one. It has an objectness score, so it can also tell you "my prediction is rubbish, don't even consider it", but it still has to output those three predictions. That's another way of looking at it: three boxes with predictions from every single grid cell. And each of those predictions is a vector of length 85.
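A numpy sketch of what that 13x13x255 head output contains: 255 is 3 anchors times (4 box values + 1 objectness + 80 class scores). The exact memory layout varies between frameworks, so take this as an illustration of the bookkeeping, not the literal buffer:

```python
import numpy as np

num_anchors, num_classes = 3, 80
raw = np.random.randn(13, 13, num_anchors * (5 + num_classes))  # (13, 13, 255)

preds = raw.reshape(13, 13, num_anchors, 5 + num_classes)       # (13, 13, 3, 85)
tx, ty, tw, th = preds[..., 0], preds[..., 1], preds[..., 2], preds[..., 3]
objectness     = preds[..., 4]
class_scores   = preds[..., 5:]
print(preds.shape, class_scores.shape)    # (13, 13, 3, 85) (13, 13, 3, 80)
```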
Those 85 values are the 5 that are always there, plus the number of classes: person, dog, cell phone and so on. We have 80 classes because this model was trained on the COCO dataset, which has 80 classes. Let's focus on the 5, and first on the 4 at the beginning. How much information do you need to define a rectangle? Four numbers. Usually they are the coordinates of the top-left and the bottom-right corner, but here it works differently: we need the coordinates of the center point, plus the width and the height.

This vector gives us exactly that. We take tx, the first value in the 85, and pass it through a function; these functions are pretty important, and there's a picture here that tells a bit of the story of what's happening. We put tx into a sigmoid, and that's very important, because the sigmoid squishes, normalizes, any value into the range from 0 to 1. In some tutorials I've heard this described as "the grid cell only detects an object if the object's center falls inside that cell", as if it were a design rule. But it's the only way it can happen: there is no mathematical possibility for this cell to place a detection outside of itself, for example over there, because the predicted offset is squeezed through a function normalized between 0 and 1. You can see it in the demo: the box follows me pretty smoothly, then slower, slower, slower near a cell boundary, and then the next cell catches the prediction, because no cell can predict outside of itself.
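For reference, these are the corresponding center equations from the YOLOv3 paper; sigma is the sigmoid just described, and c_x, c_y are the coordinates of the grid cell, which the next part explains:

```latex
\sigma(t) = \frac{1}{1 + e^{-t}} \in (0, 1),
\qquad b_x = \sigma(t_x) + c_x,
\qquad b_y = \sigma(t_y) + c_y
```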
So, from 0 here to 1 here. What is cx? It's the number of the grid cell; you could count which cell it is: first, second, third, fourth... this one is roughly the second from the left and the eighth from the top. The sigmoid gives us the normalized position inside the cell, but we also have to know where the cell itself is, so we add cx. We do the same with ty for the vertical axis, and then we have the center.

But the center is only part of the information; we also need the width and height, and this is where anchor boxes come in, those magic boxes that nobody seems to understand. The key point: anchor boxes are used nowhere on the image. This is not a sliding window like you may have seen in other algorithms; there is no sliding window here. I can show the anchors in the demo by pressing b, for box ('a' was taken): right now it detects me in one anchor box, but if I get smaller it catches me in the second one, because remember, every grid cell gives you three predictions, one per anchor box. But the network never looks at the image through these boxes; an anchor box is only a pair of values, a starting width pw and height ph, that gets rescaled. The value tw rescales the anchor width pw (the 192 you see on screen; it depends which anchor is active) through this function: the predicted width is pw multiplied by e to the power of tw. That can make the box bigger or smaller. Right now the width is practically perfect, so the network only has to say "make it a tiny bit smaller", but the height it has to make much taller.

Why anchors at all? Because we have to start the rescaling somewhere. We could rescale a grid cell, but we would always have to make it bigger; we could rescale the whole picture, but we would always have to make it smaller. So instead they took the training dataset, COCO, which is a big pile of pictures, and found the most common box shapes in it: if you have to group all the ground-truth boxes into three shapes per scale, these anchors are the best grouping. This anchor here has index zero, which means it belongs to the 13x13 resolution.
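A minimal sketch of the decoding step just described. The names mirror the paper: tx, ty, tw, th are raw network outputs, cx, cy the grid cell indices, and pw, ph the anchor size in pixels (156x198 is one of the standard yolov3.cfg anchors for the 13x13 scale). The center is computed in grid units and multiplied by the stride, 32 pixels for the 13x13 scale on a 416x416 input:

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride=32):
    bx = (sigmoid(tx) + cx) * stride   # center x in pixels; sigmoid keeps it inside the cell
    by = (sigmoid(ty) + cy) * stride   # center y in pixels
    bw = pw * math.exp(tw)             # anchor width rescaled (can grow or shrink)
    bh = ph * math.exp(th)             # anchor height rescaled
    return bx, by, bw, bh

# e.g. cell (2, 8) on the 13x13 grid, with the 156x198 anchor:
print(decode_box(0.2, -0.5, 0.1, 0.4, cx=2, cy=8, pw=156, ph=198))
```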
And if I move back, or take a smaller object, you can see that this one is detected at 26x26, because its grid cells are twice as small in each dimension. So that's how anchor boxes work: they're somewhere to start the rescaling from.

Okay, the objectness score. To understand it, we first have to understand intersection over union, IoU. It sounds scary, and the mathematical description probably wouldn't tell you anything, but look at the pictures and it's as clear as it can be. Say this red rectangle contains a car perfectly; we call it the true bounding box. The purple one is the predicted bounding box. The yellow part is the surface they share, the intersection, and the blue part is all the space they jointly contain, the union. The better the prediction, the higher the IoU: right now my predicted box describes me almost perfectly, so the true bounding box would be practically the same, the intersection would be practically equal to the union, and the ratio approaches 1. And IoU isn't only used to compare predictions against true bounding boxes; we use it a second way, in non-maximum suppression, which we'll come back to later.

For the objectness score, the point is this: it isn't just telling us whether there's an object in this grid cell. Yes, it should give you 0 when the cell thinks there's no object; this cell contains me, but the cells over there don't. But when there is an object, it shouldn't always give you 1; it should give you the IoU between the ground-truth box and the prediction this cell can make. So if this cell predicts that car and doesn't do a very good job, say 60% IoU, it should output 0.6. Not only "can I see an object", but "how precisely can I describe its bounding box". If it's very sure, it gives something close to 1; if it's more "I don't think I can place this precisely", it gives you something like 0.6. There's an important lesson about neural networks here: they have no idea what people call their outputs. They only know what they are punished for and what they are rewarded for. You could call this output objectness score, or IoU score, or you could call it Susan; no difference whatsoever, because the network only sees what it's punished for and rewarded for.

Then we have the class scores. There are different implementations of YOLO and some use softmax, but the original doesn't. Why? Because objects can have two classes: right now I am a person, but I could also be a man; you could have a vehicle that is also a car that is also a BMW. Softmax forces classes to compete, so the original instead applies an independent logistic regression to each class. But as I said, many implementations means many different activation functions at the end.
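A minimal IoU sketch for boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    # Corners of the intersection rectangle (the "yellow part").
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter          # the "blue part"
    return inter / union if union > 0 else 0.0

# A perfect prediction gives 1.0, a half-shifted one much less:
print(iou((0, 0, 100, 100), (0, 0, 100, 100)))    # 1.0
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))   # ~0.333
```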
Next, the confidence you usually see displayed: it's the objectness score multiplied by the highest class score. Say the objectness is 0.95 and the best class score is 0.95: together they give you about 90% confidence that it sees a person and that it can describe that person's box correctly. Those are two separate things.

Now let's talk about what we do with all those detections, because there are a bunch of them: every single grid cell gives you three, we have a bunch of cells, and we have three resolutions, of which this is the smallest. Let's count the detections per image, using the notebook I created for you. I built it to be easy to explore on your own, so you can change something in the code and see what happens; it was supposed to be very short, but I kept adding stuff, and I think it's still fairly readable. By the way, this is how the code is designed: there's a Detection class, and one instance of it shares a bunch of parameters between its methods. We run the detection, all the detections are saved on the instance, and then the drawing, network-configuration, and keyboard functions all have access to those values.

So let's go to the function that handles detection. We enumerate the layer outputs, and remember there are three of them, so we print the length of each. This image is shown with the aspect ratio on, so let's switch to a plain 416x416 input: we get 507 detections from the 13x13 output, 2028 from the 26x26 output, and 8112 from the 52x52 output.
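The arithmetic behind those numbers: every grid cell on every scale must emit 3 predictions, whether it sees anything or not.

```python
for grid in (13, 26, 52):
    print(f"{grid}x{grid}: {grid * grid * 3} detections")
# 13x13:  507
# 26x26: 2028
# 52x52: 8112
print("total:", sum(g * g * 3 for g in (13, 26, 52)))   # 10647
```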
So over ten thousand detections per image; how do we deal with them? With something called non-maximum suppression, NMS. It takes all the detections and keeps only the best ones, based on two variables: the confidence and the intersection over union. Both matter. Confidence is pretty self-explanatory: most of those boxes have very low confidence, so they're never shown. But I'd bet that every grid cell around me detects me, each in a slightly different way, and we can check it by changing the IoU threshold. Let's raise it: at 0.8, then 0.9, something interesting starts to happen; now we have a first person and a second person. It's natural that both cells see me, because my center sits roughly on the brink between them: the blue cell sees me and the green cell sees me too. So why, with a lower IoU threshold, does it show only the blue bounding box? Because the IoU of those two boxes is around 88%: the part they share is 88% of the area they jointly make. If I set the threshold right around there (it flips somewhere between 87 and 88), the second box appears or disappears. That's why the threshold can't be super high: push it up toward 100%, and if I move around a bit, you can see it gives me nine detections of the same object, in different grid cells and with different anchor boxes all rescaled to find me. That's a problem.

It also can't be too low. We have a banana here; if I lower the IoU threshold enough, the banana disappears, because apparently its box overlaps mine by around 10 to 11%, and below that threshold NMS suppresses it as a duplicate of me. So we have to find the sweet spot for the IoU threshold, and 50% is a good place to start.

Let me also show you a case where this detection scheme is not perfect. Instead of a cat, let's search for "hug": with two people hugging, it's very difficult for the algorithm to tell which person is which; they practically share one bounding box. If we enlarge it, we see a first person and a second person, but I can't really tell which is which. Remember, this is not instance segmentation, where you can say which pixels belong to which instance of the class person; we have to draw a bounding box, and that's not always a meaningful thing to do. The hug is a great example. And one important thing I haven't told you yet: when NMS has to choose among overlapping detections of an object, it always keeps the one with the highest confidence.
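A sketch of greedy NMS showing exactly that behavior: sort by confidence, keep the best box, suppress everything that overlaps it by more than the IoU threshold, repeat. It carries its own compact IoU helper (same formula as the earlier sketch); the detections and numbers are made up for illustration. (OpenCV also ships a ready-made version of this as cv2.dnn.NMSBoxes.)

```python
def box_iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def nms(detections, iou_threshold=0.5):
    """detections: list of (confidence, (x1, y1, x2, y2)) pairs."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while detections:
        best = detections.pop(0)              # highest confidence always wins
        kept.append(best)
        detections = [d for d in detections
                      if box_iou(best[1], d[1]) < iou_threshold]
    return kept

dets = [(0.9, (0, 0, 100, 100)),      # me, seen by the blue grid cell
        (0.8, (5, 0, 105, 100)),      # me again, seen by the green cell
        (0.7, (300, 300, 340, 340))]  # the banana
print(len(nms(dets, 0.5)))   # 2 -- the duplicate detection of me is suppressed
```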
The second thing I want to talk about is what size of image to choose when running a detection. Let me give you an example with the 4K cat: the original image is 4K, displayed here at 416x416 (or 739x416 with the aspect ratio on), and it works great. Would it work better at a higher resolution? Let's give it 832 instead of 416. It takes longer because it's a bigger input, we see the image more clearly... and it doesn't find the cat. How is that possible? Because the spatial orientation of one pixel has its limits. Look at the grid: at 416 a cell is comparatively large, but at the higher resolution each cell is tiny. Let's go down to 608, and it starts to see the cat, but look how small each grid cell is compared to it: for a cat this big, a cell simply doesn't have a large enough view of the image to see it. So when you pick a size for the blob, the thing you actually hand to the YOLO algorithm, make sure your biggest objects don't end up much bigger than roughly 600 pixels (it depends on the resolution the network was trained at). If I push it to around 1024, assuming my CPU can handle it... yes, it handles it, and it sees nothing at all.

But it cuts the other way too. Let's take the New York 4K image and run it at the normal 416: we can't detect much detail at this resolution; we only see about five things, because the objects are too small. An object a few pixels across can't be detected as a person. So what happens if we raise it to 1248? It takes some time, and it detects a whole bunch of things. Why? Because we shouldn't look at the resolution of the whole picture; we should look at the resolution of the objects we want to detect. That cat was around 1000 pixels wide, while the people and cars here are around 100 pixels, and if we feed in a 416 image, a person shrinks to about 4 pixels of width, which gives the network nothing to work with. So remember: adjust the size of the image you feed to the network to the objects you want to detect in it.
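A sketch of how that input size is chosen in the OpenCV pipeline: the blob size is just a parameter of cv2.dnn.blobFromImage, and YOLO accepts any multiple of 32. The file name is a placeholder; `net` would be the model loaded earlier:

```python
import cv2

img = cv2.imread("new_york_4k.jpg")     # hypothetical file name

for side in (416, 832, 1248):
    blob = cv2.dnn.blobFromImage(img, scalefactor=1 / 255.0,
                                 size=(side, side), swapRB=True, crop=False)
    print(blob.shape)                   # (1, 3, side, side)
    # net.setInput(blob)
    # outs = net.forward(net.getUnconnectedOutLayersNames())
```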
The other knob is the aspect ratio, which in the demo you can toggle from the keyboard. This is a big image, so on CPU it takes a while; let me switch manually to a panoramic photo and run it at a normal resolution, say 320 or 416. With the aspect ratio preserved, we see every person clearly and with very high confidence. But if I turn the ratio off, a lot of people simply disappear: this light isn't detected, that light isn't detected, many people are missed. Think about what's happening to the image: it gets squished, so those people become roughly three times narrower than they really are. If you have a normal image like this camera feed, more quadratic, closer to square, it's okay not to preserve the ratio; I don't preserve it here and that's fine. But if you have something extreme, super wide like a panorama or super tall like a giraffe, remember to keep the aspect ratio. Once again: squished, we see one person at 12% confidence and miss a lot of others; with the ratio preserved, we see everyone with very high accuracy.

I think those are the most important pointers I can give you; the rest, like how to train your own network, is much more involved, but these two are the most basic ones.
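A minimal letterboxing sketch of what "preserving the ratio" means in practice: resize keeping the aspect ratio, then pad to a square so a panoramic image isn't squished threefold. Assumes cv2 and numpy; 114 is an arbitrary grey padding value:

```python
import cv2
import numpy as np

def letterbox(img, side=416, pad_value=114):
    h, w = img.shape[:2]
    scale = side / max(h, w)                          # fit the longer edge
    resized = cv2.resize(img, (int(w * scale), int(h * scale)))
    canvas = np.full((side, side, 3), pad_value, dtype=np.uint8)
    top  = (side - resized.shape[0]) // 2             # center the image,
    left = (side - resized.shape[1]) // 2             # pad the rest
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas

# A 3840x1080 panorama becomes a 416x416 input with grey bars, not a squish:
print(letterbox(np.zeros((1080, 3840, 3), dtype=np.uint8)).shape)  # (416, 416, 3)
```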
Now, which implementation of YOLO should you use? I've used a few. First, the OpenCV one we've been running: it works on CPU and can work on GPU; the CPU install is easy, the GPU install is very difficult, and even then it's not super fast, around 10 frames per second on GPU, nothing fancy. It's very easy to install and very convenient for poking around, seeing what weights a certain layer has, but you cannot train anything with it.

The second option is YOLO in Darknet: its author built the implementation in C++, but you can run it from Python. It's just a terrible thing to install on Windows, with so many prerequisites built from source; on Linux it goes much more smoothly. It's well suited to training, and it runs at around 50 frames per second. The cool part is that you can do the training on Google Colab without worrying about any configuration; it just works. It has its limitations, you can only run a session for about 12 hours and you have to upload all your data there, but you get a powerful GPU, so I think Colab is the go-to for getting started.

Then there's a TensorFlow implementation (I don't remember the author's handle); it has to convert the weights from the Darknet format into one TensorFlow accepts. It works pretty well, around 37 frames per second, but the training support is only basic; even the author of that implementation suggests training on Darknet.

Then there's my favorite: the Ultralytics implementation in PyTorch. It's not trivial to install, but pretty easy, and it's very easy to use. They give you so many helpful tools, the author is very involved in the community, they fix bugs right away, and they run inference in FP16: normal implementations use 32-bit floats, but Ultralytics runs the model that actually does the detecting on 16-bit ones, so it's much faster and only a tiny bit less precise, because you lose some accuracy in a value when you downgrade it to FP16. It has implementations of YOLOv3 and YOLOv5. So that would be my pick: Ultralytics, or training on Google Colab.

Okay, I think that's it for this algorithm. I hope I cleared up some of the questions you had. Remember that you can explore the notebook to learn a bit more about YOLO on your own. I hope I helped you in some way; maybe give me a thumbs up or a subscription, and maybe see you soon. Bye!
Info
Channel: MaxML
Views: 2,501
Rating: 5 out of 5
Keywords: YOLOV4, YOLOV3, YOLO, Detection, AI, Artificial intelligence, Neural Networks, Pytorch, OpenCV, Tensorflow
Id: MKF1NHGgFfk
Length: 41min 45sec (2505 seconds)
Published: Mon May 24 2021