Introduction to YOLO v3

Captions
What is YOLO, and how does version 3 detect objects? We'll discuss these points now. This lecture is a brief introduction to YOLO version 3. Hi everyone, my name is Valentyn Sichkar. Let's get started. The lecture is organized as follows. First, we will describe YOLO in general. Next, we will look at the YOLO version 3 architecture. After that, we will discuss what the input to the network is. Then we will compare detections at different scales and define how the network actually produces its output. We will also describe how the network is trained, what anchor boxes are, and how predicted bounding boxes are calculated. Finally, we will explain what the objectness score is and draw a conclusion. Let's start with the first topic.

YOLO is short for "You Only Look Once", and it uses Convolutional Neural Networks for object detection. YOLO can detect multiple objects in a single image. This means that, apart from predicting the classes of the objects, YOLO also detects the locations of these objects in the image.

YOLO applies a single neural network to the whole image. This network divides the image into regions and produces probabilities for every region. After that, YOLO predicts a number of bounding boxes that cover regions of the image and chooses the best ones according to these probabilities.

To fully understand the principal idea of how YOLO version 3 works, the following terminology needs to be known: Convolutional Neural Networks, Residual Blocks, Skip Connections, Up-sampling, the Leaky ReLU activation function, Intersection over Union, and Non-Maximum Suppression. We will cover these topics in separate lectures. Let's now turn to the next topic and have a look at the architecture of YOLO version 3.

YOLO uses convolutional layers. The backbone of YOLO version 3 originally consists of 53 convolutional layers, also called Darknet-53. For detection tasks, the original architecture is stacked with 53 more layers, giving us a 106-layer architecture for YOLO version 3. That's why, when you run any command in the Darknet framework, you will see it load an architecture consisting of 106 layers. Detections are made at three layers: 82, 94 and 106. We will talk about detections in a few minutes.

Version 3 incorporates some essential elements: residual blocks, skip connections and up-sampling. Each convolutional layer is followed by a batch normalization layer and a Leaky ReLU activation function. There are no pooling layers; instead, additional convolutional layers with stride 2 are used to down-sample the feature maps. Why? Because using convolutional layers to down-sample the feature maps helps to preserve low-level features that a pooling layer would simply discard. As a result, capturing these low-level features improves the ability to detect small objects. A good example of this is shown on the images, where pooling discards some of the numbers, while the strided convolution takes all of the numbers into account. Let's now look at the input to the network.

What does the input to the network look like? The input is a batch of images with the shape (n, 416, 416, 3), where n is the number of images, the next two numbers are the width and height, and the last one is the number of channels: red, green and blue. The width and height can be changed, for example to 608, or to any other number that is divisible by 32 without leaving a remainder (832, 1024, and so on).
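As a small illustration of this shape constraint, here is a minimal sketch in Python. The batch size and the candidate sizes are just example values, not anything prescribed by YOLO itself:

```python
import numpy as np

# A hypothetical batch of n images, already resized to the network input size.
n = 8
batch = np.zeros((n, 416, 416, 3), dtype=np.float32)  # (images, width, height, RGB channels)

def is_valid_network_size(size: int) -> bool:
    """The network input width and height must be divisible by 32 without a remainder."""
    return size % 32 == 0

for size in (416, 608, 832, 1024, 500):
    print(size, is_valid_network_size(size))  # 500 -> False, the others -> True
```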
Why this number must be exactly 32, we will consider in a few moments. Increasing the input resolution might improve the model's accuracy after training. In the current lecture, we will assume that the input size is 416 by 416. These numbers are also called the network input size. The input images themselves can be of any size; there is no need to resize them before feeding them to the network. They will all be resized according to the network input size. There is also the possibility to experiment with keeping or not keeping the aspect ratio by adjusting parameters when training and testing, whether in the original Darknet framework, in TensorFlow, Keras or any other framework you want to use. Then you can compare and choose which approach best suits your custom model. Now we'll move on to the next topic and discuss detections at three scales.

How does the network detect objects? YOLO version 3 makes detections at three different scales, at three separate places in the network. These separate places for detections are layers 82, 94 and 106. The network downsamples the input image by factors of 32, 16 and 8 at those places respectively. These three numbers are called the strides of the network, and they show how much smaller the output at each of the three places is than the input to the network. For instance, if we consider stride 32 and a network input size of 416 by 416, it gives us an output of size 13 by 13. Consequently, for stride 16 the output is 26 by 26, and for stride 8 the output is 52 by 52. The 13 by 13 scale is responsible for detecting large objects, 26 by 26 for medium objects and 52 by 52 for small objects.

That is why, a few moments ago, we said that the input to the network must be divisible by 32 without leaving a remainder: if this is true for 32, then it is true for 16 and 8 as well. The next topic I'd like to focus on is detection kernels.

To produce its output, YOLO version 3 applies 1 by 1 detection kernels at these three separate places in the network. The 1 by 1 convolutions are applied to the downsampled feature maps: 13 by 13, 26 by 26 and 52 by 52. Consequently, the resulting feature maps have the same spatial dimensions. The detection kernel also has a depth, which is calculated by the equation depth = b × (5 + c). Here "b" represents the number of bounding boxes that each cell of the produced feature map can predict. YOLO version 3 predicts 3 bounding boxes for every cell of these feature maps, so "b" is equal to 3. Each bounding box has 5 + c attributes, which describe the following: the centre coordinates of the bounding box; its width and height; the objectness score; and a list of confidences for every class this bounding box might belong to. We will assume that YOLO version 3 is trained on the COCO dataset, which has 80 classes. Then "c" is equal to 80, and the total number of attributes for each bounding box is 85. The resulting depth is 3 multiplied by 85, which gives us 255 attributes.

Now we can say that each feature map produced by the detection kernels at the three separate places in the network has one more dimension, depth, which incorporates the 255 bounding box attributes for the COCO dataset. The shapes of these feature maps are as follows: 13 by 13 by 255, 26 by 26 by 255 and 52 by 52 by 255. Let's now move to the next topic of grid cells.
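A short sketch can tie these numbers together. It assumes a 416 by 416 input and the COCO dataset, and reproduces the grid sizes and the depth of the detection kernels; the variable names are my own, not from the original Darknet code:

```python
network_size = 416          # network input width and height
strides = (32, 16, 8)       # downsampling factors at layers 82, 94 and 106
num_anchors_per_scale = 3   # "b": bounding boxes predicted per grid cell
num_classes = 80            # "c": COCO classes

depth = num_anchors_per_scale * (5 + num_classes)  # 3 * 85 = 255

for stride in strides:
    grid = network_size // stride
    print(f"stride {stride:>2}: feature map {grid} x {grid} x {depth}")

# stride 32: feature map 13 x 13 x 255
# stride 16: feature map 26 x 26 x 255
# stride  8: feature map 52 x 52 x 255
```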
We already know that YOLO version 3 predicts 3 bounding boxes for every cell of the feature map. Each cell, in turn, predicts an object through one of its bounding boxes if the centre of the object falls within the receptive field of this cell. And this is the task of YOLO version 3 during training: to identify the cell into which the centre of the object falls. Again, this is one of the cells of the feature map produced by the detection kernels that we discussed before. When YOLO version 3 is training, it has one ground truth bounding box that is responsible for one object. That's why, first of all, we need to define which cells this bounding box belongs to. To do that, let's consider the first detection scale, where the stride of the network is 32. The input image of 416 by 416 is downsampled into a 13 by 13 grid of cells, as we calculated a few moments ago. This grid now represents the produced output feature map. When all the cells that the ground truth bounding box belongs to are identified, the centre cell is assigned by YOLO version 3 to be responsible for predicting this object, and the objectness score for this cell is equal to 1. Again, this is the corresponding cell of the feature map that is now responsible for detecting the lemon. But during training, all cells, including this one, predict 3 bounding boxes each. Which one should be chosen then? Which one should be assigned as the best predicted bounding box for the lemon? We will address these questions in the next topic.

To predict bounding boxes, YOLO version 3 uses pre-defined default bounding boxes that are called anchors or priors. These anchors are used later to calculate the real width and real height of the predicted bounding boxes. In total, 9 anchor boxes are used: three anchor boxes for each scale. It means that at each scale, every grid cell of the feature map can predict 3 bounding boxes by using these 3 anchors. To calculate the anchors, k-means clustering is applied in YOLO version 3. The widths and heights of the 9 anchors for the COCO dataset are shown on the slide, grouped according to the scale at the three separate places in the network. Let's consider a graphical example of how one of the 3 anchor boxes is chosen to later calculate the real width and real height of the predicted bounding box. We have an input image of shape 416 by 416 by 3. The image goes through the YOLO version 3 deep CNN architecture until the first separate place that we discussed earlier, which has stride 32. The input image is downsampled by this factor to a 13 by 13 feature map with a depth of 255, produced by the detection kernels as we calculated earlier. Since we have 3 anchor boxes, each cell encodes information about 3 predicted bounding boxes. Each predicted bounding box has the following attributes: centre coordinates; predicted width and predicted height; objectness score; and a list of confidences for every class this bounding box might belong to. As we are using the COCO dataset as an example, this list has 80 class confidences.

Now we need to extract probabilities among the 3 predicted bounding boxes of this cell to identify which class this box contains. To do that, we compute the elementwise product of the objectness score and the list of class confidences. Then we find the maximum probability, and we can say, for example, that this box detected the class lemon with probability 0.55. A small sketch of this computation is shown below.

These calculations are applied to all 13 by 13 cells, across the 3 predicted boxes and across the 80 classes.
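Here is a minimal sketch of that score computation for a single cell, assuming the raw outputs have already been passed through a sigmoid. The array values and the class index for "lemon" are made up purely for illustration:

```python
import numpy as np

num_boxes, num_classes = 3, 80

# Hypothetical outputs for one 13 x 13 grid cell, already squashed to the 0..1 range.
objectness = np.array([0.70, 0.10, 0.05])                  # one objectness score per box
class_conf = np.random.rand(num_boxes, num_classes) * 0.5  # 80 class confidences per box
lemon_idx = 47                                             # hypothetical class index for "lemon"
class_conf[0, lemon_idx] = 0.79

# Elementwise product of objectness score and class confidences gives the class scores.
scores = objectness[:, None] * class_conf                  # shape (3, 80)

best_box, best_class = np.unravel_index(scores.argmax(), scores.shape)
print(best_box, best_class, round(float(scores[best_box, best_class]), 2))
# box 0, class 47 ("lemon"), score 0.70 * 0.79 ≈ 0.55
```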
The number of predicted boxes at this first scale of the network is 507. Moreover, these calculations are also applied at the other scales of the network, giving us 2028 and 8112 predicted boxes. In total, YOLO version 3 predicts 10,647 boxes, which are then filtered with the non-maximum suppression technique. Let's move to the next topic and identify how the predicted bounding boxes are calculated.

We already know that the anchors are bounding box priors and that they were calculated by using k-means clustering; for the COCO dataset they are shown on the slide. To predict the real width and real height of the bounding boxes, YOLO version 3 calculates offsets to the predefined anchors. This offset is also called a log-space transform. To predict the centre coordinates of the bounding boxes, YOLO version 3 passes the outputs through a sigmoid function. The equations used to obtain the predicted bounding box's centre coordinates, width and height are: bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw · exp(tw), bh = ph · exp(th).

Here bx, by, bw and bh are the centre coordinates, width and height of the predicted bounding box, and tx, ty, tw and th are the outputs of the network after training. To better understand these outputs, let's again have a look at how YOLO version 3 is trained. It has one ground truth bounding box and one centre cell responsible for this object. The weights of the network are trained to predict this centre cell and the bounding box coordinates as accurately as possible. After training, and after a forward pass, the network outputs the coordinates tx, ty, tw and th. Next, cx and cy are the coordinates of the top left corner of the corresponding cell on the grid. Finally, pw and ph are the width and height of the anchor box.

YOLO version 3 doesn't predict absolute values of width and height. Instead, as we discussed above, it predicts offsets to the anchors. Why? Because this helps to eliminate unstable gradients during training. That's why the values cx, cy, pw and ph are normalized to the image width and image height, and the centre outputs tx and ty are passed through a sigmoid function, which gives values between 0 and 1.

Consequently, to get absolute values after prediction, we simply need to multiply these normalized values by the whole image width and height.
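As an illustration of these equations, here is a small decoding sketch for one predicted box. The grid cell and the raw outputs are made-up example values, the anchor (116, 90) is one of the standard COCO anchors in pixels, and the exact normalization details vary between implementations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

grid_size, image_size = 13, 416        # stride-32 scale with a 416 x 416 input
tx, ty, tw, th = 0.2, -0.1, 0.4, 0.3   # hypothetical raw network outputs for one box
cx, cy = 6, 6                          # top left corner of the responsible grid cell
pw, ph = 116, 90                       # anchor width and height in pixels

# Centre coordinates: the sigmoid keeps the offset inside the cell, then the cell corner is added.
bx = sigmoid(tx) + cx                  # in grid units
by = sigmoid(ty) + cy

# Width and height: log-space offsets applied to the anchor dimensions.
bw = pw * np.exp(tw)                   # in pixels
bh = ph * np.exp(th)

# Scale the centre from grid units to absolute pixel coordinates on the 416 x 416 input.
bx_abs = bx / grid_size * image_size
by_abs = by / grid_size * image_size
print(round(float(bx_abs), 1), round(float(by_abs), 1), round(float(bw), 1), round(float(bh), 1))
```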
Let's now move to the next topic and interpret what the objectness score is.

We already discussed that, for every cell, YOLO version 3 outputs bounding boxes with their attributes. These attributes are tx, ty, tw, th, p0 and the 80 confidences for every class the bounding box might belong to. These outputs are used later to choose among the predicted boxes by calculating scores, and to calculate the predicted bounding box's real width and height by using the chosen anchors. p0 here is the so-called objectness score. Do you remember that, during training, YOLO version 3 assigns the centre cell of the ground truth bounding box to be responsible for predicting the object? Consequently, this cell and its neighbours have an objectness score of nearly 1, whereas the corner cells have an objectness score of almost 0. In other words, the objectness score represents the probability that this cell is a centre cell responsible for predicting one particular object and that the corresponding bounding box contains an object inside.

The difference between the objectness score and the 80 class confidences is that the class confidences represent the probabilities that the detected object belongs to a particular class, such as person, car or cat, whereas the objectness score represents the probability that the bounding box contains an object at all. Mathematically, the objectness score can be represented as P(object) multiplied by IoU, where P(object) is the predicted probability that the bounding box contains an object and IoU is the intersection over union between the predicted bounding box and the ground truth bounding box. The result is passed through a sigmoid function, which gives values between 0 and 1. To summarise this lecture, let's move to the final topic.

We have discussed all the major points of how YOLO version 3 works. Now we can come back to the definition and update it with more details. YOLO version 3 applies convolutional neural networks to the input image. To predict bounding boxes, it downsamples the image at three separate places of the network, which are also called scales. It uses 1 by 1 detection kernels that are applied to the grid of cells at these three separate places of the network. The network is trained to assign only one cell to be responsible for detecting an object: the cell into which the centre of this object falls. Nine predefined bounding boxes are used to calculate the spatial dimensions and coordinates of the predicted bounding boxes. These predefined boxes are called anchors or priors, with 3 anchor boxes for each scale. In total, YOLO version 3 predicts 10,647 bounding boxes, which are filtered with the non-maximum suppression technique, leaving only the right ones.

That was the extended definition, according to what we covered during the lecture.

Are you interested in training your own detector based on YOLO version 3? Then join the course! You will create your custom dataset, build your model and train a detector to use on images, video and a live camera. Find the link in the description right below this video. Thank you very much for watching. I hope you found this lecture useful and motivating for further study. Please be sure to like this video and subscribe to the channel if you haven't already. See you soon with more great stuff.
Info
Channel: Valentyn Sichkar
Views: 41,191
Rating: 4.928205 out of 5
Id: vRqSO6RsptU
Length: 26min 56sec (1616 seconds)
Published: Sun Jun 21 2020