So in this video, I'm going to give you a clear and simple explanation of how DeepSORT works and why it's so impressive compared to other models like Tracktor++, TrackR-CNN, and JDE. But to understand how DeepSORT works, we first have to go back,
waaaaaay back, to understand the fundamentals of object tracking and the key innovations that had to happen along the way for DeepSORT to emerge. Before we get started, if you are interested
in developing object tracking apps, then check out my course in the link down below, where I show you how you can fuse
the popular YOLOv4 with DeepSORT for robust, real-time applications. Okay, so back to object tracking. Now let's imagine that you are working for SpaceX, and Mr. Musk has tasked you with ensuring that, on launch, the ground camera is always pointing at the Falcon 9
as it thrusts into the atmosphere. As excited as you are to be personally chosen by Elon to work on this task, you ask yourself, "How will I go about this?" Well, given that you have a PTZ, or
pan-tilt-zoom, camera aimed at the rocket, you will need to implement a way to track the rocket and keep it at the center of the image. So far so good…? Just note that if you do not track it properly, your PTZ motion will stray off the target and you'll end up with a really disappointed Musk. And you cannot screw this up, because this is your first job
and you really want Elon Musk to be impressed. I mean, who wouldn't, right? Soo, question… how will you track the rocket? Well, you might say, "Well Ritz, you did a whole tutorial series on object detection, why don't we just track by detection? You know, umm, using something
like YOLOv4 or Detectron2?" Hahaha... okay, okay,
let's see what happens if we use this method. So the Falcon 9 launches on a day with clear blue skies, and you are armed with state-of-the-art detection models
for centering the camera on the rocket. Everything is going well~ until, all of a sudden, a stray pigeon swoops in front of the camera, occluding the rocket from view,
and just like that, the rocket is out of sight… The boss is not happy. Deep down inside, you feel your heart sink
and your soul crushed by the disappointment. But you light up some greens, he chills out, and after a smoke or two he decides to give you another chance. The high has also given you a chance
to reflect on why this did not work. You conclude that while detection works great for single frames, there needs to be a correlation of
tracked features between sequential frames of the video. Otherwise, with any sort of occlusion,
you will lose the detection and your target may slip out of the frame. So you dig a little deeper, in an attempt
not to disappoint Mr. Musk again, and go back to traditional methods
such as mean shift and optical flow. Starting with mean shift, you find out that it works by taking our object of interest, which you can visualize as a blob of pixels, so not just location but also size. So in this case, the Falcon 9 rocket
that we are detecting is our blob. Then you go to the next frame and search within a larger region of interest, known as the neighborhood, for the same blob. You want to find the blob of pixels or features in the next frame that best represents our rocket,
by maximizing a similarity function. This strategy makes a lot of sense: if your dog goes missing, you won't just drive to the countryside, but will instead start by searching your
immediate neighborhood for your best friend. Unless, of course, you have a dog like Lassie.
In that case, she'll find you.
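If you want to try mean shift yourself, here is a minimal sketch using OpenCV's built-in cv2.meanShift; the video path and initial box are placeholder assumptions, and histogram back-projection stands in for the similarity function:

```python
# A minimal mean-shift tracking sketch with OpenCV.
import cv2

cap = cv2.VideoCapture("launch.mp4")          # hypothetical launch footage
ok, first_frame = cap.read()
x, y, w, h = 300, 200, 100, 50                # initial blob location (example values)
track_window = (x, y, w, h)

# Build a colour histogram of the blob so we can score "similarity"
# in later frames via histogram back-projection.
roi = first_frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Stop after 10 iterations or when the window moves less than 1 pixel.
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # Shift the window to the nearby region that best matches the blob.
    _, track_window = cv2.meanShift(back_proj, track_window, criteria)
```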
The other tool you look into is optical flow, which looks at the apparent motion of features across frames due to the relative motion between the scene and the camera. So say, for example, you have your rocket in your image,
and it moves up in the image; you will be able to estimate the motion vectors in frame 2 relative to frame 1. Now, if your object is moving at a certain velocity, you will be able to use these motion vectors
to track and even predict the trajectory of the object in the next frame. A popular optical flow method that you could use for this is Lucas Kanaader... Kanada? Kanade.
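For a quick taste of sparse optical flow, here is a minimal sketch using OpenCV's pyramidal Lucas-Kanade implementation (the video path is a placeholder assumption):

```python
# A minimal sparse optical-flow sketch with Lucas-Kanade in OpenCV.
import cv2

cap = cv2.VideoCapture("launch.mp4")      # hypothetical launch footage
ok, prev_frame = cap.read()
ok, next_frame = cap.read()

prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)

# Pick strong corner features (ideally on the rocket) to follow.
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                   qualityLevel=0.3, minDistance=7)

# Estimate where each feature moved to in the next frame.
next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                 prev_pts, None)

# Motion vectors for the successfully tracked features.
motion = next_pts[status == 1] - prev_pts[status == 1]
```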
Cool, so now you've got another shot at impressing Mr. Musk. He was only a little annoyed... that's right, only a little annoyed... that you lost his rocket. So, to save Elon a buck or two,
you decide to model this in simulation and test the viability of optical flow and mean shift. You find out some interesting things from this experiment. After running your simulations,
you discover that while the traditional methods have good tracking performance,
they are computationally complex and, in the case of optical flow, prone to noise. And mean shift is unreliable if the object happens
to go beyond the neighborhood region of interest. So: move too fast,
lose the track. And that's not even considering any type of significant occlusion. So as much as you want to show this off to Mr. Musk, you have a gut feeling telling you
that you can do better... way better!! You go to your shrine, meditate for a bit, spend some time crunching the numbers,
and ponder reasons why you'd be better off working somewhere else. But then you stumble across an amazing technique,
used almost everywhere, known as the Kalman filter. Now, I have a whole video on what the Kalman filter is
and how you can use it to catch Pokémon. But essentially, its premise is this:
say you are tracking a ball rolling in one dimension. You can easily detect it within each frame. That detection is your input signal, which you can rely on as long as there is a
clear line of sight to the ball, with very low noise. Now, during detection,
you decide to simulate cloudy conditions using that fog machine
you used at the last office party. You can still see the ball, but now
your vision sensor has noise in it, decreasing your confidence
in where the ball is. Now let's make it a bit more complex
and throw in another scenario where the ball travels
behind a box, which occludes the ball. How do you track something
that you can't see? Well, this is where the Kalman filter comes in. Assuming a constant-velocity model
and a Gaussian distribution, you can guesstimate where
the ball is based on the model of its motion. When the ball can be seen, you rely more on the
sensor data, and thus put more weight on it. When it is partially occluded, you can place weight
on both the motion model and the sensor measurements. And if it's fully occluded,
you shift most of the weight onto the motion data. And the best part of the Kalman filter
is that it is recursive: we take the previous state
to predict the current state, then use the measurements
to update that prediction.
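To make that predict/update cycle concrete, here is a minimal one-dimensional constant-velocity Kalman filter sketch in NumPy; the noise values are illustrative assumptions, not tuned numbers:

```python
# A minimal 1-D constant-velocity Kalman filter, tracking the rolling
# ball's position from noisy (and sometimes missing) detections.
import numpy as np

dt = 1.0                                  # time step between frames
F = np.array([[1, dt], [0, 1]])           # constant-velocity motion model
H = np.array([[1, 0]])                    # we only measure position
Q = np.eye(2) * 1e-3                      # process (model) noise
R = np.array([[0.5]])                     # measurement (sensor) noise

x = np.array([[0.0], [0.0]])              # state: [position, velocity]
P = np.eye(2) * 10.0                      # large initial uncertainty

def step(x, P, z=None):
    # Predict: roll the state forward with the motion model.
    x = F @ x
    P = F @ P @ F.T + Q
    if z is not None:                     # update only when the ball is visible
        y = z - H @ x                     # innovation (measurement residual)
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)    # Kalman gain: the "weighting"
        x = x + K @ y                     # blend prediction and measurement
        P = (np.eye(2) - K @ H) @ P
    return x, P

# Visible frames pass a measurement; occluded frames pass None,
# so the filter coasts on the motion model alone.
for z in [np.array([[1.1]]), np.array([[2.0]]), None, None, np.array([[4.9]])]:
    x, P = step(x, P, z)
```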
Now, of course, there is a lot more to the Kalman filter than we can cover in just one video. But by now you're probably wondering, "Ritz, the title of this video is
about DeepSORT… what are you going on about Kalman filters
and traditional tracking algorithms from the good ol' days?! What's going on here, man!?" Hold up, hold up, hold up, we're getting there, just bear with me. The Kalman filter is a crucial component
of DeepSORT. Let's explore why. The next launch is coming up
soon, where multiple projectiles may need to be tracked, so you are required to find a way
for your camera to track your designated rocket. The Kalman filter looks promising,
but the Kalman filter alone may not be enough. Enter SORT: Simple Online and Realtime Tracking. You learn that SORT comprises
four core components: 1. detection, 2. estimation, 3. association, and 4. track identity creation and destruction. Hmmm, this is where it all starts to come together. You start with detection. As you learned earlier,
detection by itself is not enough for tracking. However, the quality of detections
has a significant impact on tracking performance. Bewley et al. used Faster R-CNN (VGG16) back in 2016;
now you can even use YOLOv4. Accordingly, in our implementation we use YOLOv4, which you can check out in the link down below.
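As a rough idea of what that detection step can look like, here is a sketch using OpenCV's DNN module with YOLOv4 weights; the file names and thresholds are placeholder assumptions:

```python
# A minimal YOLOv4 detection sketch via OpenCV's DNN module.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("frame.jpg")           # one frame of the launch footage
class_ids, scores, boxes = model.detect(frame, confThreshold=0.5,
                                        nmsThreshold=0.4)
# `boxes` are [x, y, w, h] detections that the tracker will consume.
```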
Estimation. So we've got detections;
now what the f*** do we do with them? We need to propagate the detections
from the current frame to the next using a linear constant-velocity model. Remember the homework you
did earlier on the Kalman filter? Yes, that time was not wasted. When a detection is associated with a target, the detected bounding box
is used to update the target state, where the velocity components are optimally solved
via the Kalman filter framework. However, if no detection is
associated with the target, its state is simply predicted, without correction, using the linear velocity model.
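Here is a sketch of that estimation step, following the SORT paper's state vector [u, v, s, r, u̇, v̇, ṡ] (box centre, area, aspect ratio, plus their velocities, with aspect ratio assumed constant). It uses filterpy, which the reference SORT code also builds on; the boxes are made-up example values:

```python
# A sketch of SORT-style Kalman state estimation per track.
import numpy as np
from filterpy.kalman import KalmanFilter

kf = KalmanFilter(dim_x=7, dim_z=4)
kf.F = np.eye(7)                             # constant-velocity transition:
kf.F[0, 4] = kf.F[1, 5] = kf.F[2, 6] = 1.0   # u += u_dot, v += v_dot, s += s_dot
kf.H = np.zeros((4, 7))
kf.H[:4, :4] = np.eye(4)                     # we measure [u, v, s, r] only

def to_z(box):
    """Convert [x1, y1, x2, y2] to the measurement [u, v, s, r]."""
    w, h = box[2] - box[0], box[3] - box[1]
    return np.array([box[0] + w / 2, box[1] + h / 2, w * h, w / h])

kf.x[:4] = to_z([100, 80, 150, 180]).reshape(4, 1)  # init from first detection

kf.predict()                          # propagate the state to the next frame
detection = [104, 78, 155, 182]       # associated detection, if any
kf.update(to_z(detection))            # correct; skip this step if unmatched
```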
Target association. In assigning detections to existing targets, each target's bounding box geometry
is estimated by predicting its new location in the latest frame. The assignment cost matrix is then computed as the intersection-over-union (IOU)
distance between each detection and all predicted bounding boxes
from the existing targets. The assignment is solved optimally
using the Hungarian algorithm. This works particularly well
when one target occludes another. In your face, Swooping Pigeon!!
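A minimal sketch of that association step might look like this, with negative IoU as the cost and SciPy's Hungarian solver; the boxes and the IoU_min threshold are example values:

```python
# IoU cost matrix between detections and predicted track boxes,
# solved with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

detections = np.array([[100, 80, 150, 180], [300, 40, 340, 90]])
predictions = np.array([[102, 79, 152, 181], [290, 45, 335, 95]])

# The Hungarian algorithm minimizes cost, so use negative IoU.
cost = np.array([[-iou(d, p) for p in predictions] for d in detections])
det_idx, trk_idx = linear_sum_assignment(cost)
matches = [(d, t) for d, t in zip(det_idx, trk_idx)
           if -cost[d, t] >= 0.3]        # reject matches below IoU_min
```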
Track identity life cycle. When objects enter and leave the image, unique identities need to be created or destroyed accordingly. For creating trackers, we consider any detection
with an overlap less than IOUmin to signify the existence
of an untracked object. The tracker is initialized using the geometry
of the bounding box, with the velocity set to zero. Since the velocity is
unobserved at this point, the covariance of the velocity component
is initialized with large values, reflecting this uncertainty. Additionally, the new tracker then
undergoes a probationary period where the target needs
to be associated with detections to accumulate enough evidence
in order to prevent the tracking of false positives. Tracks are terminated
if they are not detected for T_Lost frames (you can specify the number of frames for T_Lost). Should an object reappear,
tracking will implicitly resume under a new identity.
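Stripped of the Kalman machinery, that life-cycle bookkeeping can be sketched like this; T_LOST and the probation length are illustrative values, and Track is a hypothetical helper class:

```python
T_LOST = 3       # frames a track may go undetected before termination
PROBATION = 3    # hits needed before a new track counts as confirmed

class Track:
    """Hypothetical minimal track record; real SORT wraps a Kalman filter."""
    next_id = 0
    def __init__(self, box):
        self.id = Track.next_id
        Track.next_id += 1
        # In SORT, this box seeds the Kalman state with velocity zero and
        # a large velocity covariance, reflecting the initial uncertainty.
        self.box, self.hits, self.misses = box, 1, 0

def update_tracks(tracks, matched, unmatched_tracks, unmatched_dets):
    for track, box in matched:          # matched pairs from the association step
        track.box, track.hits, track.misses = box, track.hits + 1, 0
    for track in unmatched_tracks:      # no detection this frame: count a miss
        track.misses += 1
    tracks = [t for t in tracks if t.misses <= T_LOST]   # destroy stale identities
    tracks += [Track(box) for box in unmatched_dets]     # create new identities
    # Report only tracks that have survived their probationary period.
    confirmed = [t for t in tracks if t.hits >= PROBATION]
    return tracks, confirmed
```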
Wow, you are absolutely on fire now! With all this SORT power consuming you, you power up even more, your power level surging over 9000, screaming until you transform from SORT into your ultimate form: DeepSORT. Super Saiyans, be proud. Now, you're almost there. So you explore your newfound powers
and learn what separates SORT from the upgraded DeepSORT. So, in SORT we learned that
we use a CNN for detection, but what makes DeepSORT so different? Let's analyze the full title: Simple Online and Realtime Tracking, or SORT,
with a deep association metric. "Hmm, okay Ritz… I really hope you are going to explain
what a deep association metric is." We'll discuss this in the next video… hahahh,
just kidding, I can't leave you hanging like that. Especially when we are so close to
completing the project for the Falcon 9 launch. Okay, so where is the
deep learning in all of this? Well, we have an object detector
that provides us detections, the almighty Kalman filter tracking targets
and filling in missing tracks, and the Hungarian algorithm
associating detections with tracked objects. You ask:
"So, is deep learning really required here?" Well, while SORT achieves overall good performance
in terms of tracking precision and accuracy, and despite the effectiveness
of the Kalman filter, it returns a relatively
high number of identity switches, and it struggles to track
through occlusions and across different viewpoints. So, to improve this, the authors of DeepSORT introduced
another distance metric based on the "appearance" of the object: the appearance feature vector. So a classifier is built on our dataset and trained meticulously
until it achieves reasonably good accuracy. Then we take this network and
strip the final classification layer, leaving behind a dense layer that produces
a single feature vector. This feature vector is known
as the appearance descriptor.
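The "strip the classifier" idea can be sketched with a torchvision ResNet as a stand-in; the DeepSORT authors actually train their own small re-identification CNN that outputs a 128-dimensional vector, but the principle is the same:

```python
# Turning a trained classifier into an appearance-descriptor extractor.
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()        # drop the final classification layer
model.eval()

crop = torch.rand(1, 3, 128, 64)      # a detected-object crop (dummy tensor)
with torch.no_grad():
    descriptor = model(crop)          # appearance descriptor (512-d here)
descriptor = descriptor / descriptor.norm()   # normalize for cosine distance
```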
Now, how this works is that once the appearance descriptor is obtained, the authors use nearest-neighbor queries
in visual appearance space to establish
the measurement-to-track association. "Measurement-to-track association," or MTA, is the process of determining the relation
between a measurement and an existing track. For MTA, DeepSORT also uses the
Mahalanobis distance, as opposed to the Euclidean distance, since it accounts for the uncertainty of the track's state estimate.
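Here is a sketch of the two distances at play; the DeepSORT paper blends the Mahalanobis (motion) term and a cosine (appearance) term into one association cost with a weight lambda, and all the numbers below are dummy values:

```python
# Mahalanobis gating plus cosine appearance distance, blended as in DeepSORT.
import numpy as np

def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance of measurement z from a track's
    predicted measurement distribution (mean, cov)."""
    d = z - mean
    return float(d @ np.linalg.inv(cov) @ d)

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dummy values standing in for a track's Kalman prediction, a new
# measurement, and two appearance descriptors.
z, mean, cov = np.array([1.0, 2.0]), np.array([0.8, 2.1]), np.eye(2) * 0.5
desc_track, desc_det = np.random.rand(128), np.random.rand(128)

lam = 0.5   # illustrative weighting between motion and appearance terms
cost = (lam * mahalanobis_sq(z, mean, cov)
        + (1 - lam) * cosine_distance(desc_track, desc_det))
```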
So, while tensions are mounting on the dawn of launch day, you quickly run your simulation
and find that the deep extension to the SORT algorithm reduces the number
of identity switches by 45% while achieving overall competitive performance
at high frame rates. Just like that, you find yourself
standing alongside Elon in the bunker moments before
the commencement of the launch. You clench your fists; you feel the sweat on your brow, your heart beating, as you say: "This is it…
this is the moment of truth." Elon raises the same question
that you have on your mind: "So… will it work?" You stammer a little, but you answer with a confident
"I'm sure it will." Elon looks forward as the countdown begins: 3... 2... 1...
We have liftoff!! Your PTZ camera is set on the target
as the rocket lifts off from the ground… So far so good, we have a track. However, the rocket is passing through some clouds that partially occlude the target. The camera is still on target; the DeepSORT model is holding up quite well. Actually, very well, as you notice when the swooping pigeon occludes the camera
on multiple occasions without hindrance to the tracker. YES!!!
Mission accomplished! Elon looks at you, extends
his hand to shake yours, and says: "Well done, that was quite impressive." You can now relax and pop
some champagne with the team. Job well done! That was quite an adventure,
through which you have learned about object tracking, particularly the DeepSORT model. Just out of curiosity, you search the net for
DeepSORT alternatives and create a quick comparison. You find three: 1. Tracktor++, which is pretty accurate, but one big drawback is that
it is not viable for real-time tracking; results show an average execution speed of 3 FPS. If real-time execution is not a concern,
this is a great contender. 2. TrackR-CNN, which is nice
because it provides segmentation as a bonus. But as with Tracktor++,
it is hard to use for real-time tracking, with an average execution speed of 1.6 FPS. 3. JDE, which displayed decent performance
of 12 FPS on average. It is important to note that
the input size for this model is 1088x608, so accordingly we should expect JDE to reach
a lower FPS if the model is trained on Full HD input. Nevertheless, it has great accuracy
and would be a good selection. DeepSORT is the fastest of the bunch,
thanks to its simplicity. It produced 16 FPS on average while maintaining good accuracy, definitely making it
a solid choice for multiple object detection and tracking. If you guys enjoyed this video,
please like, share, and subscribe. Comment on whether you would use
DeepSORT for your own object tracking applications. And if you want to learn how to implement
DeepSORT with the robust YOLOv4 model, then click the link below
to enroll in our YOLOv4 PRO course. Thank you for watching, and we'll see you in the next video.