What is YOLO and how does version 3 detect objects?
We'll discuss these points now. This lecture is a brief introduction to YOLO version 3. Hi
everyone. My name is Valentyn Sichkar. Let's get started. The lecture is organized as follows.
First, we will define YOLO in general. Next, we will look at the YOLO version 3
architecture. After that, we will discuss what the input to the network is. Next, we will compare
detections at different scales. Then, we will define how the network actually produces its output.
We will also describe how the network is trained, what anchor boxes are, and how predicted
bounding boxes are calculated. Finally, we will explain what the objectness score is and make a
conclusion. Let's start now with the first topic. YOLO is a shortened form of “You Only Look
Once”. It uses Convolutional Neural Networks for object detection. YOLO can detect multiple
objects in a single image. It means that, apart from predicting the classes of the objects, YOLO also
detects the locations of these objects in the image. YOLO applies a single neural network
to the whole image. This neural network divides the image into regions and produces
probabilities for every region. After that, YOLO predicts a number of bounding
boxes that cover some regions of the image and chooses the best ones
according to the probabilities. To fully understand the principal
idea of how YOLO version 3 works, the following terminology needs to be known:
Convolutional Neural Networks, Residual Blocks, Skip connections, Up-sampling, Leaky ReLU
activation function, Intersection over Union, Non-maximum suppression. We will cover
these topics in separate lectures. Let's turn now to the next topic and have a
look at the architecture of YOLO version 3. YOLO uses convolutional layers.
Originally, YOLO version 3 consists of 53 convolutional layers, which are
also called Darknet-53. For detection tasks, however, the original architecture is stacked with 53 more
layers, which gives us a 106-layer architecture for YOLO version 3. That's why, when you
run any command in the Darknet framework, you will see the process of loading an architecture
that consists of 106 layers. The detections are made at three layers: 82, 94 and 106. We
will talk about detections in a few minutes. This latest version 3 incorporates
some of the most essential elements: residual blocks, skip connections
and up-sampling. Each convolutional layer is followed by a batch normalization layer and a Leaky
ReLU activation function. There are no pooling layers; instead, additional convolutional
layers with stride 2 are used to down-sample feature maps. Why? Because using additional
convolutional layers to down-sample feature maps prevents the loss of low-level features that
pooling layers simply discard. As a result, capturing low-level features improves the
ability to detect small objects. A good example of this is shown in the images, where
we can see that pooling excludes some of the numbers, while convolution takes all the numbers into account.
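As an illustration, here is a minimal sketch of such a building block in Keras. The filter counts are only an example, not the full Darknet-53 configuration; the 0.1 slope of Leaky ReLU follows the Darknet default.

```python
import tensorflow as tf
from tensorflow.keras import layers

def darknet_conv(x, filters, kernel_size, strides=1):
    """Convolution -> batch normalization -> Leaky ReLU,
    the repeating building block of YOLO version 3.
    A stride of 2 down-samples the feature map instead of pooling."""
    x = layers.Conv2D(filters, kernel_size, strides=strides,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.1)(x)

inputs = tf.keras.Input(shape=(416, 416, 3))
x = darknet_conv(inputs, 32, 3)         # 416 x 416 x 32
x = darknet_conv(x, 64, 3, strides=2)   # down-sampled to 208 x 208 x 64
```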
Let's look now at the input to the network. What does the input to the network look like?
The input is a batch of images of the following shape: (n, 416, 416, 3), where n is the number
of images. The next two numbers are the width and height. The last one is the number of channels:
red, green and blue. The middle two numbers, width and height, can be changed and set to
608, or any other number that is divisible by 32 without leaving a remainder (832, 1024). We
will see in a few moments why this number must be divisible by exactly 32. Increasing the input resolution
might improve the model's accuracy after training. In the current lecture, we will assume that we
have an input of size 416 by 416. These numbers are also called the input network size. The input images
themselves can be of any size; there is no need to resize them before feeding them to the network. They
will all be resized according to the input network size. There is also the possibility to experiment
with keeping or not keeping the aspect ratio by adjusting parameters when training and testing
in the original Darknet framework, in TensorFlow, Keras or any other framework you want to use.
Then you can compare and choose which approach best suits your custom model.
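For example, here is a minimal sketch of preparing such an input batch with OpenCV and NumPy, using a plain resize that does not keep the aspect ratio; the file names are hypothetical.

```python
import cv2
import numpy as np

def make_batch(image_paths, network_size=416):
    """Read images of arbitrary sizes and build a batch
    of shape (n, network_size, network_size, 3)."""
    batch = []
    for path in image_paths:
        image = cv2.imread(path)                        # BGR, any size
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # to RGB channels
        image = cv2.resize(image, (network_size, network_size))
        batch.append(image.astype(np.float32) / 255.0)  # scale to [0, 1]
    return np.stack(batch)                              # (n, 416, 416, 3)

batch = make_batch(["image_1.jpg", "image_2.jpg"])      # hypothetical files
print(batch.shape)  # (2, 416, 416, 3)
```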
Now we'll move on to the next topic and discuss detections at three scales. How does the network detect objects? YOLO version 3
makes detections at three different scales and at three separate places in the network. These
separate places for detections are layers 82, 94 and 106. The network downsamples the input image by
the following factors: 32, 16 and 8 at those separate places of the network, accordingly. These three
numbers are called the strides of the network, and they show how much smaller the output at the three separate places in
the network is than the input to the network. For instance, if we consider stride 32 and an input
network size of 416 by 416, then it gives us an output of size 13 by 13. Consequently, for
stride 16 the output will be 26 by 26, and for stride 8 the output will be 52 by 52. The 13 by 13 output is
responsible for detecting large objects; 26 by 26 is responsible for detecting medium objects; and 52
by 52 is responsible for detecting small objects. That is why, a few moments ago, we
discussed that the input to the network must be divisible by 32 without leaving
a remainder: if it is true for 32, then it is true for 16 and 8 as well.
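As a quick worked example, these output grid sizes follow directly from the input network size and the three strides:

```python
network_size = 416
for stride in (32, 16, 8):
    grid = network_size // stride
    print(f"stride {stride}: output grid {grid} x {grid}")
# stride 32: output grid 13 x 13
# stride 16: output grid 26 x 26
# stride 8:  output grid 52 x 52
```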
The next topic I'd like to focus on is detection kernels. To produce the output, YOLO version 3 applies 1 by 1
detection kernels at these three separate places in the network. The 1 by 1 convolutions are applied to
the downsampled feature maps: 13 by 13, 26 by 26 and 52 by 52. Consequently, the resulting feature maps
will have the same spatial dimensions. The detection kernel also has a depth, which
is calculated by the following equation: depth = b × (5 + c). Here, b represents the number of bounding boxes that each
cell of the produced feature map can predict. YOLO version 3 predicts 3 bounding boxes for
every cell of these feature maps; that is why b is equal to 3. Each bounding box has
5 + c attributes that describe the following: the centre coordinates of the bounding box; the width and
height, which are the dimensions of the bounding box; the objectness score; and a list of confidences for
every class this bounding box might belong to. We will consider that YOLO version 3 is trained
on the COCO dataset, which has 80 classes. Then c is equal to 80, and the total number
of attributes for each bounding box is 85. The resulting equation is as follows: 3
multiplied by 85, which gives us a depth of 255. Now we can say that each feature map produced by
the detection kernels at the three separate places in the network has one more dimension, depth,
that incorporates the 255 attributes of the bounding boxes for the COCO dataset. The shapes of these
feature maps are as follows: 13 by 13 by 255, 26 by 26 by 255 and 52 by 52 by 255.
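These shapes are easy to verify with a few lines of arithmetic:

```python
b = 3                # bounding boxes predicted per cell
c = 80               # classes in the COCO dataset
depth = b * (5 + c)  # 5 = centre x, centre y, width, height, objectness
print(depth)         # 255
for grid in (13, 26, 52):
    print((grid, grid, depth))  # (13, 13, 255), (26, 26, 255), (52, 52, 255)
```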
Let's move now to the next topic: grid cells. We already know that YOLO version 3 predicts 3
bounding boxes for every cell of the feature map. Each cell, in turn, predicts an object through one
of its bounding boxes if the centre of the object falls into the receptive field of this cell. And
this is the task of YOLO version 3 while training: to identify the cell that the centre
of the object falls into. Again, this is one of the feature map's cells produced by the detection kernels that we
discussed before. When YOLO version 3 is training, it has one ground truth bounding box that is
responsible for detecting one object. That's why, firstly, we need to define which
cells this bounding box belongs to. To do that, let's consider the first detection scale,
where the stride of the network is 32. The input image of 416 by 416 is downsampled
into a 13 by 13 grid of cells, as we calculated a few moments ago. This grid now represents
the produced output feature map. When all cells that the ground truth bounding box belongs to are
identified, the centre cell is assigned by YOLO version 3 to be responsible for predicting
this object, and the objectness score for this cell is set equal to 1. Again, this is
the corresponding feature map's cell that is now responsible for detecting the lemon.
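A minimal sketch of this assignment, assuming we know the ground truth centre coordinates in pixels; the lemon's coordinates below are made up for illustration.

```python
stride = 32                        # first detection scale, 13 x 13 grid
cx_truth, cy_truth = 250.0, 180.0  # hypothetical ground truth centre, pixels

# The responsible cell is the one whose receptive field
# contains the centre of the object.
cell_x = int(cx_truth // stride)   # 7
cell_y = int(cy_truth // stride)   # 5
print(f"cell ({cell_x}, {cell_y}) is responsible; its objectness target is 1")
```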
But during training, all cells, including this one, predict 3 bounding boxes each. Which one should we
choose then? Which one should be assigned as the best predicted bounding box for the lemon? We will
answer these questions in the next topic. To predict bounding boxes, YOLO version 3 uses
pre-defined default bounding boxes that are called anchors or priors. These anchors are used
later to calculate the predicted bounding box's real width and real height. In total, 9 anchor boxes
are used: three anchor boxes for each scale. It means that at each scale, every grid cell of the
feature map can predict 3 bounding boxes by using the 3 anchors. To calculate these anchors, k-means
clustering is applied in YOLO version 3. The widths and heights of the 9 anchors for the COCO dataset are
as follows: (10, 13), (16, 30) and (33, 23) for the 52 by 52 scale; (30, 61), (62, 45) and (59, 119) for the
26 by 26 scale; and (116, 90), (156, 198) and (373, 326) for the 13 by 13 scale. They are grouped according to the
scale at the three separate places in the network.
Let's consider a graphical example of how one of the 3 anchor boxes is chosen, to be used later to calculate
the real width and real height of the predicted bounding box. We have an input image of shape 416 by 416
by 3. The image goes through the YOLO version 3 deep CNN architecture up to the first separate place
that we discussed earlier, which has stride 32. The input image is downsampled by this factor
to the dimensions 13 by 13, with the depth of 255 in the feature map produced by the detection kernels, as we
calculated earlier. Since we have 3 anchor boxes, each cell encodes information about
3 predicted bounding boxes. Each predicted bounding box has the following attributes: centre
coordinates; predicted width and predicted height; objectness score; and a list of confidences for
every class this bounding box might belong to. As we use the COCO dataset as an example,
this list has 80 class confidences. Now we need to extract the probabilities
among the 3 predicted bounding boxes of this cell to identify which box
contains a certain class. To do that, we compute the elementwise product of the
objectness score and the list of class confidences. Then we find the maximum probability and can say that this
box detected the class lemon with probability 0.55. These calculations are applied to all 13 by 13
cells, across the 3 predicted boxes and across the 80 classes. The number of predicted boxes at this
first scale of the network is 507. Moreover, these calculations are also applied at the
other scales of the network, giving us 2028 and 8112 more
predicted boxes. In total, YOLO version 3 predicts 10 647 boxes, which are
filtered with the non-maximum suppression technique.
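A small sketch of this scoring and counting, assuming the raw outputs have already been passed through sigmoid where needed; the cell values below are random placeholders, and the counts just reproduce the arithmetic from this topic.

```python
import numpy as np

# Hypothetical outputs for one cell: 3 boxes x (5 + 80) attributes.
cell = np.random.rand(3, 85)

objectness = cell[:, 4]           # one score per box
class_confidences = cell[:, 5:]   # 80 confidences per box
scores = objectness[:, np.newaxis] * class_confidences  # elementwise product

best_box, best_class = np.unravel_index(scores.argmax(), scores.shape)
print(best_box, best_class, scores.max())

# Total number of predicted boxes across all three scales:
total = sum(grid * grid * 3 for grid in (13, 26, 52))
print(total)  # 507 + 2028 + 8112 = 10647
```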
Let's move to the next topic and identify how the predicted bounding boxes are calculated. We already know that the anchors are bounding
box priors and that they were calculated using k-means clustering; for the COCO dataset they
are as listed above. To predict the real width and real height of the bounding boxes, YOLO version
3 calculates offsets to the predefined anchors. This offset is also called a log-space transform. To
predict the centre coordinates of the bounding boxes, YOLO version 3 passes the outputs through
a sigmoid function. Here are the equations that are used to obtain the predicted bounding
box's width, height and centre coordinates.
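In the notation of the YOLO version 3 paper, they are:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^(tw)
bh = ph · e^(th)

where σ is the sigmoid function.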
Here, bx, by, bw and bh are the centre coordinates, width and height of the predicted bounding box. tx, ty, tw and th are the outputs of the network after
training. To better understand these outputs, let's again have a look at how YOLO version 3
is trained. It has one ground truth bounding box and one centre cell responsible
for this object. The weights of the network are trained to predict this centre cell and
the bounding box's coordinates as accurately as possible. After training and after a forward pass, the network
outputs the coordinates tx, ty, tw and th. Next, cx and cy are the coordinates of the
top left corner of the cell on the grid of the appropriate anchor box. Finally, pw and
ph are the anchor box's width and height. YOLO version 3 doesn't predict absolute
values of width and height. Instead, as we discussed above, it predicts
offsets to the anchors. Why? Because it helps to eliminate unstable gradients
during training. That's why the values cx, cy, pw and ph are normalized to the real image width
and real image height, and the centre coordinates tx, ty are passed through the sigmoid function
that gives values between 0 and 1. Consequently, to get absolute values
after prediction, we simply need to multiply them by the
whole real image width and height.
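Putting these equations together, here is a minimal sketch of decoding one box in Python. The normalization conventions differ slightly between implementations, so treat this as an illustration under the assumptions stated in the comments; the raw outputs in the example are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, grid, image_w, image_h):
    """Decode raw network outputs into absolute box values.
    cx, cy: top left corner of the responsible cell, in grid units;
    pw, ph: chosen anchor's width and height, normalized to the image size;
    grid: grid size at this scale (13, 26 or 52)."""
    bx = (sigmoid(tx) + cx) / grid * image_w  # absolute centre x, pixels
    by = (sigmoid(ty) + cy) / grid * image_h  # absolute centre y, pixels
    bw = pw * np.exp(tw) * image_w            # absolute width, pixels
    bh = ph * np.exp(th) * image_h            # absolute height, pixels
    return bx, by, bw, bh

# Hypothetical raw outputs for the cell (7, 5) at the 13 by 13 scale:
print(decode_box(0.2, -0.1, 0.3, 0.1, 7, 5, 0.28, 0.22, 13, 416, 416))
```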
Let's move now to the next topic and interpret what the objectness score is. We already discussed that for every cell,
YOLO version 3 outputs bounding boxes with their attributes. These attributes are tx,
ty, tw, th, p0 and 80 confidences for every class this bounding box might belong to. These
outputs are used later to choose anchor boxes by calculating scores, and to calculate the
predicted bounding box's real width and real height by using the chosen anchors. p0 here is
the so-called objectness score. Do you remember that, when training, YOLO version 3 assigns
the centre cell of the ground truth bounding box to be responsible for predicting this object?
Consequently, this cell and its neighbours have an objectness score of nearly 1, whereas corner cells
have an objectness score of almost 0. In other words, the objectness score represents the probability that this
cell is the centre cell responsible for predicting one particular object and that the appropriate
bounding box contains an object inside. The difference between the objectness score and the 80
class confidences is that the class confidences represent the probabilities that the detected
object belongs to a particular class, like person, car, cat, etcetera, whereas the objectness score
represents the probability that the bounding box contains an object inside. Mathematically, the
objectness score can be represented as P(object) multiplied by the IoU, where P(object) is the predicted
probability that the bounding box contains an object, and IoU is the intersection over union
between the predicted bounding box and the ground truth bounding box.
The result is passed through a sigmoid function that gives values between 0 and 1.
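For completeness, here is a minimal sketch of the intersection over union calculation for two boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.143
```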
To summarise this lecture, let's move to the final topic. We discussed all the major points of how YOLO version
3 works. Now we can come back to the definition and update it with more details. YOLO version 3
applies convolutional neural networks to the input image. To predict bounding boxes, it downsamples the
image at three separate places of the network, which are also called scales. While training, it
uses 1 by 1 detection kernels that are applied to the grid of cells at these three separate
places of the network. The network is trained to assign only one cell to be responsible for
detecting one object: the cell that the centre of this object falls into. 9 predefined bounding
boxes are used to calculate the spatial dimensions and coordinates of the predicted bounding boxes. These
predefined boxes are called anchors or priors: 3 anchor boxes for each scale. In total, YOLO
version 3 predicts 10 647 bounding boxes, which are filtered with the non-maximum suppression
technique, leaving only the right ones. That was the extended definition, according
to what we covered during the lecture. Are you interested in training your own detector
based on YOLO version 3? Then join the course! You will create your custom dataset, build your
model and train a detector to use on images, videos and with a camera. Find the link
in the description right below this video. Thank you very much for watching.
I hope you found this lecture useful and motivating to keep studying. Please be
sure to like this video and subscribe to the channel if you haven't already.
See you soon with more great stuff.