DETR: End-to-End Object Detection with Transformers | Paper Explained

Video Statistics and Information

Captions
What's cracking guys! In this video I'm covering "End-to-End Object Detection with Transformers", or DETR for short, by Nicolas Carion, Francisco Massa and others from Facebook AI Research. So obviously it's about object detection, and they're using transformers. They mention that the main ingredients of the new framework, called DETR, are a set-based global loss that forces unique predictions via bipartite matching (a mouthful, we'll break it down in a minute) and a transformer encoder-decoder architecture. The basic premise is that they train this detection pipeline using end-to-end learning, with a transformer as the differentiable blob. Those are the two main ideas.

Compare that to previous object detection pipelines: if you're familiar with those, most of them consist of multiple stages. As you can see on the left side, stage one outputs a bunch of region proposals. All of these boxes are claiming that they see a car inside, and the idea is to filter them and keep just the best one. In this case the non-max suppression (NMS) heuristic works perfectly: after stage one proposes the regions, NMS filters out everything but the best box. The way it works is super simple. You find the box with the highest confidence that this is a car, then you set some threshold, compute the IoU (intersection over union) with the other boxes, and discard the boxes whose IoU crosses that threshold.

I went ahead and pasted the IoU definition so I can better explain how this NMS algorithm works. As I said, we usually have two stages: the first stage proposes a bunch of regions, and all of those boxes claim they see a car inside, but we want to keep only the single box with the highest confidence and the best fit. Out of all the boxes we find the one with the highest confidence score, set a threshold, for example 0.7, and remove every box whose IoU with it is bigger than 0.7. In human language: boxes that overlap too much with the highest-confidence box get removed. IoU itself is defined like this: you find the intersection of the two boxes, you find their union, and you divide the two. As the overlap grows, the IoU tends to one; as the overlap shrinks, the IoU drops to zero. That's the basic idea.
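To make the heuristic concrete, here is a minimal NumPy sketch of greedy NMS as described above. The (x1, y1, x2, y2) box format and the 0.7 threshold are illustrative assumptions, not something fixed by the paper.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the most confident box, drop boxes that overlap it too much, repeat."""
    order = list(np.argsort(scores)[::-1])  # indices sorted by descending confidence
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # survive only if the overlap with the current best box is below the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```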
The whole idea of the DETR pipeline is to remove this heuristic and train the whole detection pipeline end to end, without multiple stages and without heuristics such as NMS. This is not the only multi-stage method out there, but it's the one I took as an example. With the motivation out of the way, let's see how DETR works.

Jumping to the high-level explanation of the pipeline: we have an image at the input and we pass it through a CNN, which downsamples the spatial extent and increases the number of channels. We end up with feature maps of shape h x w x c, where c is the number of channels, whereas initially we had something like h0 x w0 x 3 in the case of an RGB image. Once we have these feature maps, we flatten them out and pass them as tokens into the transformer (I'll unpack how this works in the next diagram, so stick with me for now). Basically we process all of these tokens similarly to a Vision Transformer; you can treat it that way.

At the output of the decoder we get a bunch of region proposals. There is a fixed number of them, and they were using 100. Each proposal carries two pieces of information, so you can treat each box as a pair of tuples. The first tuple describes the actual box: the center coordinates plus the height and width, so four numbers: cx, the center coordinate on the x-axis; cy, the center coordinate on the y-axis; the height of the box; and the width of the box. In the case of this red one, this would be the center, then we have the height and the width. The second, equally vital piece of information is which kind of object this is, and that's where the second tuple comes in: it contains C + 1 entries, where C is the cardinality of the class set. For example, if we have 10 classes (dog, cat, whatever), we'd have 10 entries, plus one more for the "no object" class, because sometimes a box, like this green one, is just spurious and doesn't contain any object. All 100 outputs contain exactly these two tuples, so that's the output information.
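As a shape-level sketch of what one forward pass produces per image: the 100 queries follow the paper, while the 91 classes (COCO-style) and the (cx, cy, h, w) ordering used in the video are assumptions for illustration.

```python
import torch

num_queries = 100   # N, the fixed number of proposals
num_classes = 91    # e.g. COCO; index num_classes is reserved for "no object"

# Stand-ins for the decoder outputs of a DETR-style model (shapes only):
class_logits = torch.randn(num_queries, num_classes + 1)  # per-proposal scores over C+1 classes
boxes = torch.rand(num_queries, 4)                        # (cx, cy, h, w), normalized to [0, 1]

probs = class_logits.softmax(dim=-1)
# proposals whose most likely class is the last index predict "no object"
is_no_object = probs.argmax(dim=-1) == num_classes
```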
Now the interesting part: how do we actually match these proposals to the ground truth so we can compare them, and how do we backprop through the decoder, the encoder, and the CNN to train the DETR pipeline? The first thing we do is bipartite matching. I'll explain it in more detail in a bit, but the rough idea is to find a permutation. If we focus on the ground truth, it looks something like this: let me draw it. We have a list of entries; say this red box is the first ground-truth entry, this yellow one is the second, and then we have a bunch of "no object" entries. The goal is to learn to find a permutation that associates this prediction with this ground-truth box and that prediction with that box. Basically, we want to sort the predictions so that the red box lands in first place, the yellow box in second place, and all of the other "no object" proposals come after them. Once we have that, we can compare the matched pairs; without the matching, the comparison just wouldn't make any sense. So, on a high level, we first do the sorting using the Hungarian matching algorithm, and second, we compare the matched pairs.

For the comparison, you can think of it like this: to compare the boxes, the red box here and the red box there, we'll include the IoU in the loss and we'll include all four coordinates in the loss, as well as the class information. For example, the red class for this ground-truth entry equals one, i.e. the probability is 100%, while our prediction may not be that good yet; maybe it predicts 0.7 for the red class, and we want to push that 0.7 towards one. Okay, that was it in a nutshell; now let me start digging into the details so this becomes much clearer than it currently is.

I'll explain all of these equations in a second; first let's focus on the expanded diagram of that same pipeline. Again we have the CNN, we have the feature maps, and we have positional encodings, which is standard stuff for transformers, except for a small detail: they actually add these positional encodings in every single layer of the encoder and the decoder. But that's a minor detail. If you're not familiar with how transformers work, please check out my video on the transformer (I'll link it somewhere here), but I'll try to explain it quickly. Every one of these tokens is mapped to a query, a key, and a value. Let me denote queries with red arrows (they're vectors), and let's focus on one specific query. We take the dot product between that query and all of the keys; that's the way to measure the similarity between the query and each key, and there is a key for every token. As you can see, this red arrow and this blue arrow are super similar, pointing in the same direction, so the score between this blue key and this red query will be really high. That means the corresponding value vector (I'll denote values with green) gets a high coefficient when we sum up all of the value vectors. So the basic idea is that every token attends to every other token; we have global attention. We compute the scores via query-key dot products and use them to take a weighted average of the value vectors, and that's how we form a new representation. That's a super high-level overview of how it works; you can think of it as information routing.
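Here is a minimal sketch of that query/key/value routing: single-head scaled dot-product attention, leaving out the learned projection matrices and the multi-head machinery a real transformer layer would have.

```python
import torch
import torch.nn.functional as F

def attention(queries, keys, values):
    """Scaled dot-product attention. queries: (n, d); keys, values: (m, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / d ** 0.5  # similarity of every query with every key
    weights = F.softmax(scores, dim=-1)   # similar query/key pairs get high coefficients
    return weights @ values               # weighted average of the value vectors

# toy self-attention over 5 tokens with 64-dim representations (sizes are arbitrary)
tokens = torch.randn(5, 64)
new_tokens = attention(tokens, tokens, tokens)  # every token attends to every token
```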
You stack a bunch of those layers and propagate through the encoder, and you get the encoder representations. Now the second part is interesting: the decoder receives these object queries, which are learned; they're just randomly initialized vectors at the start. You can think of it like this: this red vector here is looking for a particular thing in the image, so it will attend over all of the encoder information and focus on particular things. They have a nice visualization a bit later in the paper; let me find it for a second. As I said, there are 100 output proposals, and here they depict 20 of them. Each of these dots is the center of the box that the given proposal outputs, collected across the validation set. That may be a mouthful, but in a nutshell it means that one particular proposal, say the leftmost output of the decoder, ends up handling all of the small bounding boxes (the green dots denote small boxes) in the bottom-left part of the image. Different output tokens attend to different parts of the image: this one, as you can see, attends to the center-most part of the image and also to small boxes there, etc. So each of these slots focuses on particular, let's call it, semantics of the image.

The only difference compared to the original transformer is that here we output everything in parallel; we don't decode autoregressively, we just produce all of the outputs at once. The final step is that once we have the output tokens, we pass them through feed-forward networks that output the class and box information: as already mentioned, the 4-tuple for the box and the C + 1 class distribution. That's how the pipeline is structured in a nutshell. Now I'm going to explain the most important part: how we match the proposals and what the loss looks like. Let's focus on that now.

Okay, so this is the main part. The whole idea is to find a permutation. Let me go here; I've just copy-pasted the diagram from above. You have your ground truth: in the case of this image, one red box, as you can see here, one yellow box, and all of the other entries are just "no object" entries. The idea is to find the permutation of the output such that the loss is minimized, which means this red one should be in first place, this yellow one in second place, and all of the other ones after them; that's the permutation that minimizes the loss. As for the notation: y_i is just a tuple of the class information and the box information, and σ(i) is the current permutation. We want to minimize the matching loss between those two, where N stands for the number of proposals; as I said, they use 100, and that's hard-coded. None of the images in the dataset will ever contain more than 100 objects, so they just took an upper bound and set it to 100.
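Written out in the paper's notation, the optimal assignment searches over all permutations of N elements, and the matching cost combines a class term and a box term (reconstructed from the paper; both terms are broken down next):

```latex
\hat{\sigma} = \operatorname*{arg\,min}_{\sigma \in \mathfrak{S}_N}
  \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr),
\qquad
\mathcal{L}_{\mathrm{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr) =
  -\,\mathbb{1}_{\{c_i \neq \varnothing\}}\,\hat{p}_{\sigma(i)}(c_i)
  \;+\; \mathbb{1}_{\{c_i \neq \varnothing\}}\,
        \mathcal{L}_{\mathrm{box}}\bigl(b_i, \hat{b}_{\sigma(i)}\bigr)
```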
So we sum the matching losses over all of the proposals and find the permutation that minimizes that sum; that's the permutation we end up using. That may still be abstract, so let me break it down. Here is how the matching loss looks, split into two components: the left term with the class probability and the right term with the box loss.

First things first: focus on this proposal, one of the outputs. As I said, it has the 4-tuple that tells you where the box is, and it has a distribution over classes; because it's red, the red component is probably the strongest, maybe 0.6. Treat this as the current permutation of the algorithm, so we're comparing the red prediction against the yellow ground truth. Because the predicted probability of the yellow class is 0.2, the class term contributes -0.2, and that's not good enough, because we want to minimize this matching loss as much as possible; the minimum would be reached if we predicted probability 1 for the yellow class, so that instead of -0.2 we'd have -1.

Let me dig into the notation. The "1" is an indicator function: it equals 1 when the ground-truth class c_i is different from the "no object" class, i.e. when the slot contains an actual class. Here we have the red class, there the yellow class, and the indices go from 1 to 6. c_1 is red, which is different from "no object", so the indicator is 1. c_2 is yellow, still different from "no object", so the indicator is again 1, and we get -1 times the predicted probability. Now for the fun part: the predicted probability of the yellow class at the second index is 0.2, so we end up with -0.2 there, and the whole sum comes out to some value; call it x. The goal is to find the permutation that minimizes this, and if you imagine the permutation where the red prediction is matched first and the yellow one second, we'd get a much lower matching cost: since the red prediction assigns 0.6 to the red class, that term becomes -0.6. In a nutshell, that's what the Hungarian matching does: conceptually, it searches over the permutations, evaluates this expression, and returns the permutation with the minimal matching cost. It turns out that corresponds to having red as the first proposal, then the yellow one, and then all of the remaining ones in arbitrary order; their order doesn't matter because the indicator function is zero for proposals whose ground truth is the "no object" class. Hopefully that was clear.
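In practice the Hungarian algorithm solves this assignment exactly in polynomial time instead of enumerating all N! permutations, and DETR's reference implementation relies on SciPy for it. Below is a simplified sketch: it assumes the cost is built only against the M real (non-"no object") ground-truth objects, it omits the weighting coefficients the real implementation applies to the class and box terms, and box_cost_fn is a hypothetical pairwise box-cost helper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_classes, gt_boxes, box_cost_fn):
    """Minimum-cost assignment of N proposals to M ground-truth objects (M <= N).

    pred_probs:  (N, C+1) class probabilities per proposal
    pred_boxes:  (N, 4) predicted boxes
    gt_classes:  (M,) integer class ids of the real objects
    gt_boxes:    (M, 4) ground-truth boxes
    box_cost_fn: returns an (N, M) matrix of pairwise box costs
    """
    class_cost = -pred_probs[:, gt_classes]  # -p_hat(c_i): confident and correct -> low cost
    cost = class_cost + box_cost_fn(pred_boxes, gt_boxes)
    pred_idx, gt_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return pred_idx, gt_idx  # proposal pred_idx[k] is matched to object gt_idx[k]
```

Unmatched proposals are then simply supervised towards the "no object" class.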
Now let's break down the box loss. Here is how it's defined: as a weighted combination of an IoU-based loss and an L1 loss. b_i, as I mentioned, is the 4-tuple: the center coordinates plus the height and width of the box. We take the ground-truth box and the box of the currently matched proposal, compute the IoU term between them, and the second term is just the L1 distance between the two bounding-box vectors. There's one thing that was confusing me here, because the IoU term enters with a plus sign and the box loss is something we want to minimize. The L1 part makes sense: when the bounding-box vectors coincide, it goes down to zero. But a raw IoU goes up the better the boxes align, so I'd expect a minus sign there, or maybe the specific choice of hyperparameters makes it work out; I might be misinterpreting this part. Okay, basically that's the L_box.
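For what it's worth, the paper's L_iou is the generalized IoU loss, i.e. 1 - GIoU, so it already decreases as the boxes align better, which likely resolves the plus-sign question above. A minimal PyTorch sketch, assuming (x1, y1, x2, y2) boxes (the real implementation converts from the normalized center/size parametrization first) and the paper's reported weights λ_iou = 2, λ_L1 = 5:

```python
import torch

def generalized_iou(a, b):
    """GIoU in [-1, 1] for aligned pairs of (x1, y1, x2, y2) boxes."""
    lt = torch.max(a[..., :2], b[..., :2])   # intersection top-left
    rb = torch.min(a[..., 2:], b[..., 2:])   # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area_a + area_b - inter
    # smallest box enclosing both; penalizes non-overlapping, far-apart boxes
    wh_c = (torch.max(a[..., 2:], b[..., 2:]) - torch.min(a[..., :2], b[..., :2])).clamp(min=0)
    area_c = wh_c[..., 0] * wh_c[..., 1]
    return inter / union - (area_c - union) / area_c

def box_loss(pred, gt, lambda_iou=2.0, lambda_l1=5.0):
    """L_box = lambda_iou * (1 - GIoU) + lambda_l1 * ||b - b_hat||_1; minimized at perfect overlap."""
    return lambda_iou * (1.0 - generalized_iou(pred, gt)) + lambda_l1 * (pred - gt).abs().sum(dim=-1)
```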
As I said, after we find the correct matching, the correct permutation, we do the following, and this is the loss the DETR model is actually trained with. We go over all N proposals (100 in their case), and for each one we take whatever the ground-truth class is, look up the predicted probability of that class, and push it towards one; that's when the negative log becomes zero and the term is minimized. The second part: whenever the class is not "no object", i.e. the slot contains something meaningful, we also include the box loss, the same L_box I just explained for the matching procedure. Let me step out and explain this once more on a high level. We are looking for a permutation σ that minimizes the sum, over all 100 proposals, of the matching loss L_match; that sorts the output so it's perfectly matched with the ground truth. Once we have the alignment between outputs and ground truth, we apply this simple loss: we penalize classes for not being confident enough, and we penalize boxes for being too different, making sure both the class and the bounding box are correct. That's all the information you need about how the model works.

There is one more detail though: "we add prediction FFNs and Hungarian loss after each decoder layer". It's a minor detail, but it will matter later. It means we apply these feed-forward heads not only at the output of the final decoder layer but also after every intermediate one. Conceptually it's not necessary, but they found it improves performance, so they included it, and they have ablations on it later; that's why I wanted to mention it. So the same procedure, the matching plus the Hungarian loss, is applied across all of the decoder layers.

Okay, that's it; now on to the experiments. They compare against Faster R-CNN object detection baselines and show some really nice results. Those architectures are super hand-crafted, with many years of iteration behind them, and yet, using this end-to-end learned pipeline based on transformers, they manage to outperform the Faster R-CNN baselines. The first thing I really like here is that they report the FLOPs, the FPS, and the number of parameters, because only when you even those out can you claim that one model performs better than another. As you can see on AP_L, the average precision for large objects, DETR outperforms Faster R-CNN by quite a lot, as it does on medium-sized objects, while on small objects it underperforms. All in all, looking at the overall AP metric, the DETR model is better than Faster R-CNN. They mention it in the paper; here: "DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer. It obtains, however, lower performances on small objects." The idea is that certain objects span a huge portion of the image, and the transformer, with its global attention at every layer, has an advantage over CNNs, which need a number of layers before the receptive field grows enough to perceive such a large object in the image. That's why they hypothesize the transformer performs better on large objects. I'd guess the reason it struggles with small objects is the spatial reduction in the CNN backbone; if we didn't reduce the spatial extent, we'd have to pay with more computation and a bigger memory footprint. This can probably be improved further with some of the novel efficient transformer architectures; I guess that's left for future work.

Back to the results. We saw the comparison with Faster R-CNN; now let's see some ablations. They show, and this is pretty straightforward, that the encoder part of the transformer is necessary: the more layers you add, the better it performs, but the biggest gap is between not using the encoder at all and using it. You can see AP increases significantly, along with (obviously) the number of parameters and computation, while the FPS drops. After that big jump the gains start saturating as more layers are added; they ended up using six encoder layers, I think. Let's continue. Here is a nice visualization of the encoder's output: they take a query vector from the output of the encoder and attend with it over all of the key vectors of the image. By doing that, as you can see here, this query attends to this cow, so it does some sort of instance segmentation.
Focusing on this other query vector, it attends to the small cow, so again instance segmentation, and it's similar for all of the other query tokens.

Continuing with other results, this one is super interesting. They construct a synthetic image with six by four, so 24, giraffes, and they mention that no training image contains more than 13 giraffes. Despite that, DETR, as you can see, successfully detects all 24 giraffes with high confidence scores: 100%, 100%, most of them above 90%, as far as I can see. That's nice out-of-distribution generalization.

For the second chart: I mentioned that the Hungarian loss is applied at every layer of the decoder, and here they show what happens when you take the output from an intermediate layer instead of the final one. There's a clear progression: taking the output from the first layer is much worse than taking it from the sixth, final layer. Additionally, focusing on the AP metric and ignoring AP50, they show that the non-max suppression heuristic I mentioned at the beginning of the video helps if you take outputs from the first couple of layers, but later on it stops helping. And that's by design: the Hungarian loss and the matching were meant to make the proposals unique, so we don't have to filter out near-duplicate proposals; that's handled by the loss implicitly.

Okay, next they have an interesting visualization of the decoder's attention mechanism. We take one of the proposals, one of the tokens at the output of the decoder, and attend with it over the outputs of the encoder, and this is what we get. Focusing on the orange token, you can see that it attends to the extremities of the elephant. This kind of explains why the encoder is necessary: the encoder does the initial instance segmentation, and the decoder then learns to trace the contours of the object, which is what it needs in order to find the convex hull, or in this case the bounding box, of that object. They work in symbiosis and help each other. For the blue token, you can again see it attending to the extremities, and those parts are really important for the detector to figure out that this is an elephant. Similarly, in the image with zebras, the green token attends to this zebra, the blue one to that zebra, and the orange one to the third zebra.

Okay, that's it. They did a couple more ablations. I mentioned the positional encodings; the scheme is pretty intricate, so let me go to the appendix and show you. As you can see here, the spatial positional encodings are added to the queries and keys in every single layer of the encoder, and also to the keys of the decoder, while the object queries, which are also positional encodings, just learned rather than fixed in this particular case, are again added here and there.
So it's pretty convoluted. You could also do it the way the original transformer does, adding positional encodings only at the input, but they show in the ablation table I was just showing you that the per-layer scheme improves performance. Let me get back here: adding the positional encodings at the attention module of every layer increases the AP metric compared to all of the other combinations. I don't think there's anything super intuitive to extract here; the experiments simply showed that it's better to include them in all of the layers.

They also run an ablation of the L_box component I showed you: they try using only the L1 loss, or only the IoU loss, and it turns out that if you drop the IoU term you get a significant decrease in AP. So you have to include the IoU term; L1 alone is just extra, and using both boosts all of the metrics.

I already mentioned this figure: it shows how the different outputs, which the paper calls prediction slots and I just call decoder tokens, behave. You can see that each of them attends to different regions of the image and to different types of boxes. You can also notice a trend: the red dots are present in roughly equal measure across all 20 of the 100 prediction slots shown, and the reason is that red corresponds to large horizontal boxes. It seems MS COCO simply contains a bunch of objects that span the whole image horizontally, so all of the prediction slots learn to attend to those.

That's pretty much it, but let me briefly mention one more thing: they also show that by simply extending the DETR pipeline with an additional segmentation head, it does really well on the panoptic segmentation task. The idea is that they first do the regular object-detection training of DETR, which stays exactly the same, then append the head, freeze DETR, and train only the head. They show some nice results; you can see the segmentation of this giraffe image, the bus, etc., and in this table they report nice numbers compared to other baselines such as Panoptic FPN.

So that's pretty much it. It's exciting. The main two components, again, are using transformers and learning end to end with this matching-plus-Hungarian-loss logic. That's it: if you liked this video, share it out, subscribe to the channel, and until next time, bye bye!
Info
Channel: The AI Epiphany
Views: 2,080
Keywords: detr, object detection, segmentation, object detection with transformers, transformers, resnets, cnns, hungarian loss, bipartite matching, bounding box, arxiv, paper explained, the ai epiphany, ai, deep learning, machine learning, aleksa gordic, facebook ai, fair
Id: BNx-wno-0-g
Length: 31min 19sec (1879 seconds)
Published: Sat Jul 24 2021