What is YOLO, and how does version 3 detect objects? We'll discuss these points now. This lecture is a brief introduction to YOLO version 3. Hi everyone. My name is Valentyn Sichkar. Let's get started. The lecture is organized as follows. Firstly, we will define YOLO in general. Next, we will look at the YOLO version 3 architecture. After that, we will discuss what the input to the network is. Next, we will compare detections at different scales. Then, we will define how the network actually produces its output. We will also describe how the network is trained, what anchor boxes are, and how predicted bounding boxes are calculated. Finally, we will explain what the objectness score is and make a conclusion. Let's start now with the first topic.
YOLO is a shortened form of “You Only Look Once”, and it uses Convolutional Neural Networks for Object Detection. YOLO can detect multiple objects in a single image. This means that, apart from predicting the classes of the objects, YOLO also detects the locations of these objects in the image. YOLO applies a single Neural Network to the whole image. This Neural Network divides the image into regions and produces probabilities for every region. After that, YOLO predicts a number of Bounding Boxes that cover some regions of the image and chooses the best ones according to the probabilities.
To fully understand the principal idea of how YOLO version 3 works, the following terminology needs to be known: Convolutional Neural Networks, Residual Blocks, Skip connections, Up-sampling, the Leaky ReLU activation function, Intersection over Union, and Non-maximum suppression. We will cover these topics in separate lectures. Let's turn now to the next topic and have a look at the architecture of YOLO version 3. YOLO uses convolutional layers.
YOLO version 3 originally consists of 53 convolutional layers, which are also called Darknet-53. But for detection tasks, the original architecture is stacked with 53 more layers, which gives us a 106-layer architecture for YOLO version 3. That's why, when you start any command in the Darknet framework, you will see the process of loading an architecture that consists of 106 layers. The detections are made at three layers: 82, 94 and 106. We will talk about detections in a few minutes.
This latest version 3 incorporates some of the most essential elements, which are Residual Blocks, Skip connections and Up-sampling. Each convolutional layer is followed by a batch normalization layer and a Leaky ReLU activation function. There are no pooling layers; instead, additional convolutional layers with stride 2 are used to down-sample feature maps. Why? Because using additional convolutional layers to down-sample feature maps prevents the loss of low-level features that pooling layers simply exclude. As a result, capturing low-level features helps to improve the ability to detect small objects. A good example of this is in the images, where we can see that pooling excludes the numbers, but convolution takes all the numbers into account.
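If you prefer to see this building block as code, here is a minimal Keras sketch of one such convolution, batch normalization and Leaky ReLU unit, applied once with stride 1 and once with stride 2 for down-sampling. The filter counts here are only illustrative values, not the full Darknet-53 configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_leaky(x, filters, kernel_size, strides=1):
    """One Darknet-style unit: convolution -> batch normalization -> Leaky ReLU.
    A stride of 2 down-samples the feature map instead of a pooling layer."""
    x = layers.Conv2D(filters, kernel_size, strides=strides,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU(0.1)(x)
    return x

inputs = tf.keras.Input(shape=(416, 416, 3))
x = conv_bn_leaky(inputs, filters=32, kernel_size=3)        # 416 x 416 x 32
x = conv_bn_leaky(x, filters=64, kernel_size=3, strides=2)  # 208 x 208 x 64, down-sampled
model = tf.keras.Model(inputs, x)
```

Stacking such units, together with residual connections and up-sampling, is essentially how the full backbone and detection head are built.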
Let's look now at the input to the network. What does the input to the Network look like? The input is a batch of images of the following shape: (n, 416, 416, 3), where n is the number of images. The next two numbers are the width and height. The last one is the number of channels: red, green and blue. The middle two numbers, width and height, can be changed and set to 608, or any other number that is divisible by 32 without leaving a remainder (832, 1024). Why this number must be divisible exactly by 32, we will consider in a few moments. Increasing the resolution of the input might improve the model's accuracy after training. In the current lecture, we will assume that we have an input of size 416 by 416. These numbers are also called the input network size. Input images themselves can be of any size; there is no need to resize them before feeding them to the network. They will all be resized according to the input network size. And there is a possibility to experiment with keeping or not keeping the aspect ratio by adjusting parameters when training and testing in the original Darknet framework, in TensorFlow, Keras or any other framework you want to use.
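As a rough illustration of these two resizing approaches, here is a small sketch using OpenCV and NumPy (tools chosen for this example, not required by the lecture): a plain resize to the input network size, and a letterbox resize that keeps the aspect ratio and pads the remaining area.

```python
import cv2
import numpy as np

def resize_simple(image, network_size=416):
    """Stretch the image to the input network size, ignoring the aspect ratio."""
    return cv2.resize(image, (network_size, network_size))

def resize_letterbox(image, network_size=416, pad_value=128):
    """Resize while keeping the aspect ratio, then pad to the input network size."""
    h, w = image.shape[:2]
    scale = network_size / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (new_w, new_h))
    canvas = np.full((network_size, network_size, 3), pad_value, dtype=resized.dtype)
    top = (network_size - new_h) // 2
    left = (network_size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas
```

Letterbox resizing avoids distorting object proportions at the cost of some padded pixels that carry no information, while the plain resize uses the whole input area but may stretch objects.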
Then you can compare and choose which approach best suits your custom model. Now we'll move on to the next topic and discuss detections at three scales. How does the Network detect objects? YOLO version 3 makes detections at three different scales and at three separate places in the Network. These separate places for detections are layers 82, 94 and 106. The Network downsamples the input image by the following factors: 32, 16 and 8 at those separate places of the Network, respectively. These three numbers are called the strides of the network, and they show how much smaller the output at the three separate places in the Network is than the input to the Network. For instance, if we consider stride 32 and an input network size of 416 by 416, this gives us an output of size 13 by 13. Consequently, for stride 16 the output will be 26 by 26, and for stride 8 the output will be 52 by 52. The 13 by 13 scale is responsible for detecting large objects; 26 by 26 is responsible for detecting medium objects; and 52 by 52 is responsible for detecting small objects.
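As a quick check of these numbers in code, here is a tiny sketch that derives the three output grid sizes from the input network size and the strides.

```python
network_size = 416
strides = (32, 16, 8)                  # down-sampling factors at layers 82, 94 and 106

for stride in strides:
    grid = network_size // stride
    print(f"stride {stride}: output grid {grid} x {grid}")
# stride 32: output grid 13 x 13  -> large objects
# stride 16: output grid 26 x 26  -> medium objects
# stride 8:  output grid 52 x 52  -> small objects
```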
That is why, a few moments ago, we said that the input to the Network must be divisible by 32 without leaving a remainder. Because if it is true for 32, then it is true for 16 and 8 as well. The next topic I'd like to focus on is detection kernels. To produce the output, YOLO version 3 applies 1 by 1
detection kernels at these three separate places in the Network. The 1 by 1 convolutions are applied to the downsampled feature maps of sizes 13 by 13, 26 by 26 and 52 by 52. Consequently, the resulting feature maps will have the same spatial dimensions. The detection kernel also has a depth, which is calculated by the following equation: depth = b × (5 + c). “b” here represents the number of bounding boxes that each cell of the produced feature map can predict. YOLO version 3 predicts 3 bounding boxes for every cell of these feature maps. That is why “b” is equal to 3. Each bounding box has 5 + c attributes that describe the following: the centre coordinates of the bounding box; the width and height, which are the dimensions of the bounding box; the objectness score; and a list of confidences for every class this bounding box might belong to. We will consider that YOLO version 3 is trained on the COCO dataset, which has 80 classes. Then “c” is equal to 80, and the total number of attributes for each bounding box is 85. The resulting equation is as follows: 3 multiplied by 85, which gives us 255 attributes.
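To make this concrete, here is a minimal Keras sketch of such a detection kernel: a 1 by 1 convolution with b × (5 + c) = 255 filters applied to a 13 by 13 feature map. The backbone channel count of 1024 is an assumption for illustration, not necessarily the exact value inside the network.

```python
import tensorflow as tf
from tensorflow.keras import layers

num_anchors, num_classes = 3, 80
depth = num_anchors * (5 + num_classes)            # 3 * 85 = 255

# A 13 by 13 feature map coming from the backbone; the channel count 1024
# is only an illustrative assumption.
feature_map = tf.keras.Input(shape=(13, 13, 1024))
detections = layers.Conv2D(depth, kernel_size=1, padding="same")(feature_map)
print(detections.shape)                            # (None, 13, 13, 255)
```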
detection kernels at three separate places   in the Network, has one more dimension depth 
that incorporates 255 attributes of bounding   boxes for COCO dataset. And the shapes of these 
feature maps are as following: 13 by 13 by 255;   26 by 26 by 255 and 52 by 52 by 255. Let’s 
move now to the next topic of grid cells. We already know that YOLO version 3 predicts 3 
bounding boxes for every cell of the feature map. Each cell, in turn, predicts an object through one of its bounding boxes if the centre of the object belongs to the receptive field of this cell. And this is the task of YOLO version 3 while training: identify the cell into which the centre of the object falls. Again, this is one of the feature map's cells produced by the detection kernels that we discussed before. When YOLO version 3 is training, it has one ground truth bounding box that is responsible for detecting one object. That's why, firstly, we need to define which cells this bounding box belongs to. And to do that, let's consider the first detection scale, where we have 32 as the stride of the Network. The input image of 416 by 416 is downsampled into a 13 by 13 grid of cells, as we calculated a few moments ago. This grid now represents the produced output feature map. When all the cells that the ground truth bounding box belongs to are identified, the centre cell is assigned by YOLO version 3 to be responsible for predicting this object. And the objectness score for this cell is equal to 1.
Again, this is the one corresponding feature map cell that is now responsible for detecting the lemon. But during training, all cells, including this one, predict 3 bounding boxes each. Which one should we choose then? Which one should be assigned as the best predicted bounding box for the lemon? We will address these questions in the next topic. To predict bounding boxes, YOLO version 3 uses
pre-defined default bounding boxes that are called anchors or priors. These anchors are used later to calculate the predicted bounding box's real width and real height. In total, 9 anchor boxes are used: three anchor boxes for each scale. This means that at each scale, every grid cell of the feature map can predict 3 bounding boxes by using 3 anchors. To calculate these anchors, k-means clustering is applied in YOLO version 3. The widths and heights of the 9 anchors for the COCO dataset are as follows. They are grouped according to the scale, at the three separate places in the Network.
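For reference, these are the anchor widths and heights (in pixels of the 416 by 416 network input) commonly quoted for YOLO version 3 on COCO, as listed in the original yolov3.cfg; the grouping by scale below follows that file's anchor masks.

```python
# (width, height) anchor pairs for COCO, as given in the original yolov3.cfg,
# grouped by detection scale.
anchors_by_scale = {
    "stride 32 (13 x 13, large objects)":  [(116, 90), (156, 198), (373, 326)],
    "stride 16 (26 x 26, medium objects)": [(30, 61), (62, 45), (59, 119)],
    "stride 8  (52 x 52, small objects)":  [(10, 13), (16, 30), (33, 23)],
}
```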
Let's consider a graphical example of how one of the 3 anchor boxes is chosen, to later calculate the real width and real height of the predicted bounding box. We have an input image of shape 416 by 416 by 3. The image goes through the YOLO version 3 deep CNN architecture up to the first separate place that we discussed earlier, which has stride 32. The input image is downsampled by this factor to a dimension of 13 by 13, with the 255-deep feature map produced by the detection kernels, as we calculated earlier. Since we have 3 anchor boxes, each cell encodes information about 3 predicted bounding boxes. Each predicted bounding box has the following attributes: centre coordinates; predicted width and predicted height; an objectness score; and a list of confidences for every class this bounding box might belong to. As we use the COCO dataset as an example, this list has 80 class confidences. And now we need to extract probabilities among the 3 predicted bounding boxes of this cell, to identify that a box contains a certain class. To do that, we compute the following element-wise product of the objectness score and the list of confidences. Then we find the maximum probability and can say that this box detected the class lemon with a probability of 0.55.
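Here is a hedged NumPy sketch of this scoring step for a single cell and a single predicted box; the objectness value, the confidence values and the choice of class index for "lemon" are all made up for illustration.

```python
import numpy as np

objectness = 0.62                               # made-up objectness score of one box
class_confidences = np.random.rand(80) * 0.5    # made-up confidences for 80 COCO classes
class_confidences[45] = 0.89                    # pretend index 45 stands for "lemon"

scores = objectness * class_confidences         # element-wise product, one score per class
best_class = int(np.argmax(scores))
print(best_class, round(float(scores[best_class]), 2))   # 45 and roughly 0.55
```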
These calculations are applied to all 13 by 13 cells, across the 3 predicted boxes and across the 80 classes. The number of predicted boxes at this first scale in the Network is 507. Moreover, these calculations are also applied to the other scales in the Network, giving us 2028 predicted boxes and 8112 predicted boxes. In total, YOLO version 3 predicts 10 647 boxes, which are filtered with the non-maximum suppression technique.
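These box counts follow directly from the grid sizes and the 3 anchors per cell; a quick check in code:

```python
grids = (13, 26, 52)
boxes_per_scale = [g * g * 3 for g in grids]   # 3 anchor boxes per grid cell
print(boxes_per_scale)                         # [507, 2028, 8112]
print(sum(boxes_per_scale))                    # 10647
```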
Let's move to the next topic and identify how predicted bounding boxes are calculated. We already know that anchors are bounding box priors, and that they were calculated by using k-means clustering. For the COCO dataset, they are the values we listed a few moments ago. To predict the real width and real height of the bounding boxes, YOLO version 3 calculates offsets to the predefined anchors. This offset is also called a log-space transform. To predict the centre coordinates of the bounding boxes, YOLO version 3 passes the outputs through a sigmoid function. Here are the equations that are used to obtain the predicted bounding box's width, height and centre coordinates: bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw × exp(tw) and bh = ph × exp(th). bx, by, bw and bh are the centre coordinates, width and height of the predicted bounding box. tx, ty, tw and th are the outputs of the Network after training. To better understand these outputs, let's again have a look at how YOLO version 3 is trained. It has one ground truth bounding box and one centre cell responsible for this object. The weights of the Network are trained to predict this centre cell and the bounding box's coordinates as accurately as possible. After training, and after a forward pass, the Network outputs the coordinates tx, ty, tw and th. Next, cx and cy are the coordinates of the top left corner of the cell on the grid for the appropriate anchor box. Finally, pw and ph are the anchor box's width and height.
YOLO version 3 doesn't predict the absolute values of width and height. Instead, as we discussed above, it predicts offsets to the anchors. Why? Because it helps to eliminate unstable gradients during training. That's why the values cx, cy, pw and ph are normalized to the real image width and real image height. And the centre coordinates tx, ty are passed through a sigmoid function, which gives values between 0 and 1. Consequently, to get absolute values after predicting, we simply need to multiply them by the real, whole image width and height.
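Putting these equations together, here is a minimal NumPy sketch of decoding a single predicted box at the 13 by 13 scale. The raw outputs tx, ty, tw, th, the chosen anchor and the responsible cell are made-up values, and scaling back by the stride is just one common convention; exact normalization details differ between implementations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

stride = 32                              # first detection scale, 13 x 13 grid
cx, cy = 6, 8                            # top left corner of the responsible grid cell
pw, ph = 116.0, 90.0                     # one anchor for this scale, in pixels
tx, ty, tw, th = 0.2, -0.1, 0.3, 0.1     # made-up raw network outputs

# Centre: sigmoid offset inside the cell, then scaled back by the stride.
bx = (sigmoid(tx) + cx) * stride
by = (sigmoid(ty) + cy) * stride
# Width and height: log-space offsets applied to the anchor dimensions.
bw = pw * np.exp(tw)
bh = ph * np.exp(th)
print(bx, by, bw, bh)                    # box centre, width and height in 416 x 416 pixels
```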
Let's move now to the next topic and interpret what the objectness score is. We already discussed that for every cell, YOLO version 3 outputs bounding boxes with their attributes. These attributes are tx, ty, tw, th, p0 and 80 confidences for every class this bounding box might belong to. And these outputs are used later to choose anchor boxes by calculating scores, and to calculate the predicted bounding box's real width and real height by using the chosen anchors. p0 here is the so-called objectness score. Do you remember that YOLO version 3, when training, assigns the centre cell of the ground truth bounding box to be responsible for predicting this object? Consequently, this cell and its neighbours have an objectness score of nearly 1, whereas corner cells have an objectness score of almost 0. In other words, the objectness score represents the probability that this cell is the centre cell responsible for predicting one particular object, and that the appropriate bounding box contains an object inside. The difference between the objectness score and the 80 class confidences is that class confidences represent the probabilities that the detected object belongs to a particular class, like person, car, cat, etcetera, whereas the objectness score represents the probability that the bounding box contains an object inside. Mathematically, the objectness score can be represented as the product of P(object) and the IoU, where P(object) is the predicted probability that the bounding box contains an object, and IoU is the intersection over union between the predicted bounding box and the ground truth bounding box. The result is passed through a sigmoid function that gives values between 0 and 1.
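As a small, hedged sketch of that product, here is one way to compute the IoU of two boxes given as (x1, y1, x2, y2) corners and combine it with P(object); the box coordinates are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) corners."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

p_object = 1.0                            # this cell is responsible for an object
predicted = (180, 230, 240, 300)          # made-up predicted box
ground_truth = (185, 235, 245, 305)       # made-up ground truth box
objectness = p_object * iou(predicted, ground_truth)
```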
To summarise this lecture, let's move to the final topic. We discussed all the major points of how YOLO version 3 works. Now we can come back to the definition and update it, giving more details. YOLO version 3 applies convolutional neural networks to the input image. To predict bounding boxes, it downsamples the image at three separate places in this Network, which are also called scales. While training, it uses 1 by 1 detection kernels that are applied to the grid of cells at these three separate places in the Network. The Network is trained to assign only one cell to be responsible for detecting one object, if the centre of this object falls into that cell. 9 predefined bounding boxes are used to calculate the spatial dimensions and coordinates of the predicted bounding boxes. These predefined boxes are called anchors or priors: 3 anchor boxes for each scale. In total, YOLO version 3 predicts 10 647 bounding boxes, which are filtered with the non-maximum suppression technique, leaving only the right ones. That was the extended definition according to what we covered during the lecture. Are you interested in training your own detector
based on YOLO version 3? Then join the course! You will create your custom dataset, build your model and train a detector to use on images, on video and with a camera. Find the link in the description right below this video. Thank you very much for watching. I hope you found this lecture useful and motivating to keep studying. Please be sure to like this video and subscribe to the channel if you haven't already. See you soon with more great stuff.