What is YOLO algorithm? | Deep Learning Tutorial 31 (Tensorflow, Keras & Python)

Captions

YOLO is a state-of-the-art object detection algorithm, and it is so fast that it has become almost a standard way of detecting objects in the field of computer vision. Previously people were using sliding-window object detection. Then faster versions were invented, such as R-CNN, Fast R-CNN and Faster R-CNN. But in 2015 YOLO was invented, which outperformed all the previous object detection algorithms, and that's what we are going to discuss today. We will go over the theory of how exactly YOLO works, and in a future video we will also do coding. So this video is just about the theory behind how YOLO works, and we will try to see why it is faster. The full form of YOLO is "you only look once".

Let's say you're working on an image classification problem where you want to decide if the image is of a dog or a person. In this case the output of the neural network is pretty simple: you will say dog is equal to 1, person is equal to 0. But when you talk about object localization, you're not only telling which class this is, you're also telling the bounding box, or the position of the object within the image. So here, in addition to dog is equal to 1 and person is equal to 0, you are also telling about the bounding box.

Now how exactly do you do that? In terms of neural network output, you can have a vector like this, where pc is the probability that there is an object of any class. So here, if there is a dog or a person, this number will be one; if there is no dog and no person, this number will be zero. Then comes the bounding box: bx and by are the coordinates of the center, which is indicated by the yellow circle here, and the next two values, bw and bh, are the width and height of this red box. c1 is class one, that is for dog, so here it will be one. c2 is for person, and it will be zero.

If you have a different image like this, there is a person here (this is my picture from high school). pc, the probability of any class, is 1 because there is some object, and these are the bounding box coordinates. c1 is 0 because it's not a dog, and c2 is 1 because it's a person.
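To make the seven-number output vector concrete, here is a minimal Python sketch. Only the layout [pc, bx, by, bw, bh, c1, c2] and the class order (c1 = dog, c2 = person) come from the explanation above; the function name make_label, the names bw and bh, and all the box numbers are assumptions made up for illustration.

```python
CLASSES = ["dog", "person"]  # c1 = dog, c2 = person, as in the explanation


def make_label(class_name=None, box=None):
    """Build [pc, bx, by, bw, bh, c1, c2] for an image with at most one object.

    box is (bx, by, bw, bh): center x, center y, width, height.
    With no object, pc is 0 and the other values are "don't care"
    (zeros are used here as placeholders).
    """
    label = [0.0] * (5 + len(CLASSES))
    if class_name is None:
        return label
    bx, by, bw, bh = box
    label[0] = 1.0                                   # pc: some object is present
    label[1:5] = [float(bx), float(by), float(bw), float(bh)]
    label[5 + CLASSES.index(class_name)] = 1.0       # one-hot class entry
    return label


# Hypothetical boxes, in pixels.
dog = make_label("dog", box=(50, 70, 60, 70))
person = make_label("person", box=(120, 90, 80, 200))
empty = make_label()

print(dog)
print(person)
print(empty)
```

Each training image would then be paired with one such vector as its target.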
When you have no object in the image, pc will be zero and the rest of the values don't matter.

So now you can train a neural network to classify the object as well as predict the bounding box. I am just showing three images here, but you can have, let's say, ten thousand such images. For each of these images, since it's a supervised learning problem, you need to give the bounding boxes. And about the way you give bounding boxes to a neural network: understand that a neural network only understands numbers, so you have to convert them into this kind of vector. You will have a vector of size 7 for each corresponding image. So the image is X_train, and y_train will be a vector of size 7. You can have 10,000 such images, and you can train a neural network in such a way that if you input a new image, it will give you that particular vector. This vector is telling you that this is a dog, because c1 is set to 1, and it is also telling you the bounding box. So it's essentially giving you the answer for your object detection, or object localization rather.

This only works for a single object. If you have multiple objects, what do you do? Here there is a person and a dog in the same image. In an image there could be n number of objects: there could be two dogs and three people, there could be five dogs and one person. You don't know how many objects there are in the picture, so it's hard to determine the dimension of your neural network output. If you have one object, it's pretty fixed, right? But if you have n objects and you don't know n, then determining the size of the output of the neural network is hard. You can say the upper max is 10, let's say there will be only 10 objects, and you can have 10 times 7, which is a vector of size 70. But what if there are 11 objects? See, that doesn't work, so you have to do something else.

All right, so let's say you have this image, and there are two bounding boxes that this image has. What the YOLO algorithm
will do is divide this image into this kind of grid cells. I'm using a four-by-four grid here; it could be three by three, it could be 19 by 19. There's no fixed rule that it has to be four by four. For each of the grid cells, for example this grid cell, you can come up with that vector that we saw previously, which is pc, bounding box, c1 and c2. There are no objects here, so pc will be zero and the rest of the values don't matter.

But now take this particular grid cell that I have highlighted here, where the dog is in the picture. See, when the dog is spanning multiple grid cells, you try to find the center of that dog, and the dog belongs to the grid cell that contains that center. So I'm in this particular cell here. When I look at the coordinates, you can think of this upper point as (0, 0) and this point as (1, 1). Now you can create this vector, where pc is one, which means you have some object, then c1 and c2: c1 is for dog, so it is 1; c2 is for person, so it is 0. There is the person's head here, but the person's center is over here, so the person object belongs to this other cell. Then 0.05: this particular distance is 0.05, and this is 0.3, because, see, this whole thing is 1. Your bounding rectangle can go out of your grid cell, that is fine; that's why the width and height values can be more than one. So 1.3 and 1,
oh sorry, 2 and 1.3. So 2 is the width, this width, and 1.3 is the height, this height.

Now talking about this particular grid cell: the person's center is here, so we can say the person is in this cell, and therefore c2 is 1, and c1 is 0 because there is no dog. These are the bounding box values: 0.32 is this much, 0.02 is this small vertical distance, and the width is 3 because the rectangle's width, this yellow line, is almost 3 times the width of this grid cell. If you compare, this is three times this; that's why I have 3 here. For all the remaining cells the vector will be this: pc will be zero, and the remaining values will be don't care.

So now you have a 4 by 4 by 7 volume. Why? Because you have four by four grid cells in total, 16 cells, and each cell is a vector of size seven. That's why I'm saying four by four by seven. So if you're talking about this top-left cell and you expand it in the z direction, that will be this vector of size 7. I hope you're getting the idea. If you don't, please pause the video and just think about what I just said.

So now you have the image and the bounding rectangles, and you can form your training data set. Your training data set will have many such images. I am showing only three as an example, but let's say you will have 10,000 such images. Each image will have bounding rectangles, and based on those rectangles you will first form this kind of grid (4x4, or 3x3, or 19x19; it varies, it doesn't have to be four by four) and then you will come up with y, the target: for each cell there will be one vector, so there will be 16 such vectors per training sample, or per training image. Using this you can train your neural network, and after you have trained it, it can do prediction. When you now give it this type of image, it can produce 16 such vectors (why 16? because this is a 4 by 4 grid), which will basically tell you the bounding
rectangle for each of these objects. So this is the YOLO algorithm. It is called "you only look once" because we are not repeating anything. See, we have 16 cells, but it's not like we are inputting the image 16 times and doing 16 iterations. In one forward pass you can make all your predictions; that is why it is called "you only look once".

Now, this is the basic algorithm. We need some tweaks, because there could be a few issues with this approach. The first issue is that the algorithm might detect multiple bounding rectangles for a given object. It is possible, so how do you tackle that? Let's think about this. Let's say for the person it detected these two yellow rectangles and this one white rectangle, and we know by visual observation that the white one is the most accurate one. The algorithm will also output a probability: you know, the pc. It will say this white one is 0.9 (that is, a 90 percent match with person), and the other rectangles have lower probability. So maybe we can look at all the probabilities for the person class and take the max, right? Well, we cannot do this. If you just take the max and there is another person in the image, what happens to that one? As a neural network, as a computer, you don't know where that person is, so you can't just take the max. You have to use a different approach.

So we use this concept of IoU. IoU is basically intersection over union. You take the rectangle with 0.9, that white rectangle, and then for that same class, which is person, you take all the other rectangles and try to find how much they overlap with it. To measure the overlap you use IoU. So here in this case, see, this is that yellow box and this is the white box. The area indicated in orange is the intersection area, and the area indicated in purple is the union area. You divide the intersection by the union, and the more the rectangles overlap, the larger this value will be. So let's say if the
value is more than 0.6 or 0.7, we can say these rectangles are overlapping. If they are completely overlapping the value will be 1; if they are not overlapping at all the value will be 0. So now we find that these two yellow boxes are overlapping with the white one, because their IoU is, let's say, greater than 0.65, and then you discard those rectangles. So I discarded all the rectangles which had an IoU greater than 0.65 and kept the rectangle which has the max class probability.

I do this for the person object, then I do the same thing for the dog object. For the dog I find that, okay, 0.81 is the max probability, and I look at all the other dog rectangles in this image. Again, there could be two more dogs here, and there will be rectangles for those also, so you will try to find the overlap. Let's say there is another dog over here: you will not find an overlap, so you will not discard that particular rectangle. But this rectangle you find to be overlapping, and since 0.81 is the max and 0.7 is less, you discard this one, and you get the final bounding boxes. This technique is called non-max suppression. So after the neural network has detected all the objects, you apply non-max suppression and you get these unique bounding boxes.

There could be another issue: what if a single cell contains the centers of two objects? In this case the dog and the person are both centered in the middle grid cell. Now, we use this vector to represent the grid cell, but see, this vector can represent only one class. So how do you represent two classes? Well, I have this value for the dog and I have this value for the person. So instead of having a seven-dimensional vector, how about we have a vector of size 14, where you're just concatenating these two vectors? This is said to have two anchor boxes: this is one anchor box, this is the second anchor box. You can actually have more than two anchor boxes. Let's say there are three objects which have the same center; then you can have three anchor boxes. You
can have five anchor boxes. But if your grid cells are small enough, then in real life it's hard to have many objects belonging to one grid cell. So a CNN with two anchor boxes will look something like this: the only change is that instead of a vector of size 7 you now have a vector of size 14. If you want to have three anchor boxes, you'll have a vector of size 21, that is 7 times 3, and that will give you your final output.

So that was all about the "you only look once", or YOLO, algorithm. It's a very, very fast algorithm. Even on a video clip which is, let's say, at 40 frames per second, it can detect objects really fast, and it is the most modern way of detecting objects. So if you are in the computer vision field and you want to do object detection, you have to use YOLO, because it is very fast and accurate. In the next video we will be looking at some code; we will do real object detection in an image and in a video using the YOLO framework. I hope you're liking this series so far. If you do, give it a thumbs up and share it with your friends. Thank you.
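The grid encoding described in the captions (each object assigned to the cell that contains its center, center offsets measured inside that cell, width and height measured in cell units) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the speaker's code: the function name encode_targets is made up, boxes are assumed to be given as fractions of the image size, and the cell positions and the person's height are hypothetical. The dog's values are chosen so that its cell vector comes out close to the 0.05, 0.3, 2 and 1.3 mentioned in the captions.

```python
GRID = 4                        # 4 x 4 grid, as in the example; 3 or 19 work too
CLASSES = ["dog", "person"]     # c1 = dog, c2 = person
VEC = 5 + len(CLASSES)          # [pc, bx, by, bw, bh, c1, c2] -> 7 numbers


def encode_targets(objects, grid=GRID):
    """Return a grid x grid x 7 nested list of target vectors.

    objects: list of (class_name, cx, cy, w, h), all as fractions of the
    image size (an assumption of this sketch). One object per cell: if two
    centers fall in the same cell, the later one overwrites the earlier one.
    """
    target = [[[0.0] * VEC for _ in range(grid)] for _ in range(grid)]
    for class_name, cx, cy, w, h in objects:
        col = min(int(cx * grid), grid - 1)   # cell that contains the center
        row = min(int(cy * grid), grid - 1)
        cell = [0.0] * VEC
        cell[0] = 1.0                         # pc
        cell[1] = cx * grid - col             # bx: offset inside the cell, 0..1
        cell[2] = cy * grid - row             # by: offset inside the cell, 0..1
        cell[3] = w * grid                    # bw in cell units, may exceed 1
        cell[4] = h * grid                    # bh in cell units, may exceed 1
        cell[5 + CLASSES.index(class_name)] = 1.0
        target[row][col] = cell
    return target


# Hypothetical objects; positions are made up for illustration.
objects = [
    ("dog", 0.2625, 0.575, 0.5, 0.325),
    ("person", 0.58, 0.505, 0.75, 0.5),
]
target = encode_targets(objects)
print(target[2][1])   # the dog's cell
print(target[2][2])   # the person's cell
print(target[0][0])   # an empty cell
```

With two anchor boxes, each cell would hold two such vectors concatenated (14 numbers) instead of one.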
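The IoU and non-max suppression steps from the captions can also be sketched in a few lines of Python. Again this is a hedged illustration, not the code from the video series: the function names, the corner-format boxes (x1, y1, x2, y2) and the pixel values are assumptions. The 0.65 threshold and the scores 0.9, 0.81 and 0.7 follow the numbers mentioned in the captions; the other scores are invented.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.65):
    """Keep the highest-scoring box per object, class by class.

    detections: list of (class_name, score, box). For each class, take the
    box with the max score, discard boxes of that class whose IoU with it
    exceeds the threshold, then repeat with what is left.
    """
    kept = []
    for class_name in sorted({d[0] for d in detections}):
        candidates = sorted(
            (d for d in detections if d[0] == class_name),
            key=lambda d: d[1],
            reverse=True,
        )
        while candidates:
            best = candidates.pop(0)
            kept.append(best)
            candidates = [
                d for d in candidates if iou(best[2], d[2]) <= iou_threshold
            ]
    return kept


# Hypothetical detections in pixels.
detections = [
    ("person", 0.90, (100, 50, 200, 300)),
    ("person", 0.70, (110, 60, 210, 310)),   # overlaps the 0.90 box
    ("person", 0.60, (95, 45, 195, 290)),    # overlaps the 0.90 box
    ("dog", 0.81, (20, 200, 120, 300)),
    ("dog", 0.70, (25, 205, 125, 305)),      # overlaps the 0.81 box
    ("dog", 0.75, (300, 200, 400, 300)),     # a second dog elsewhere, no overlap
]
for class_name, score, box in non_max_suppression(detections):
    print(class_name, score, box)
```

Running suppression per class is what lets a second dog elsewhere in the image survive, which simply taking the max probability for the class would not.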

Info

Channel: codebasics

Views: 61,345

Rating: 4.9343543 out of 5

Keywords: yt:cc=on, yolo algorithm for object detection, yolo algorithm explained, yolo computer vision, yolo algorithm, yolo algorithm deep learning, deep learning, yolo algorithm python, yolo deep learning python, yolo deep learning, yolo object detection, what is yolo, yolo algorithm implementation, yolo tutorial, yolo object detection tutorial, yolo custom object detection, object detection deep learning, yolo, yolo object detection python, yolo python, how yolo works

Id: ag3DLKsl2vk

Length: 16min 4sec (964 seconds)

Published: Fri Dec 25 2020