Applying the Cutting Edge of Object Detection to Medical Imaging

Captions
So nice to meet you, everybody. My name is Dan Basak and I'm the head of AI at Aidoc, where we use object detection and image segmentation to detect urgent medical abnormalities in different medical imaging modalities, such as CT scans. It's nice to see some familiar faces here in the crowd; thank you all for coming. Today I want to share with you three of the most interesting papers in object detection of the last year or so, which have done really amazing things that are really relevant to the challenges of medical imaging data.

First of all, a little bit about object detection. Object detection is progressing really quickly: just between 2016 and 2017, in the leading benchmark, the COCO competition, accuracy rose by about 20% relative between the years. The bottom row is the best submission of 2016, and these four are the submissions from the 2017 challenge, which ended at the end of 2017. Not only is it a mature and quickly advancing technology, I think it's also a very transformational technology. It has the potential, and I really believe it will, to transform every industry: medical imaging, defense, business intelligence, robotics, autonomous vehicles, even augmented reality, and many more.

The reason I really like these meetups and talks is that object detection, and deep learning in general, has a lot of potential, but we won't be able to fulfill that potential fast enough if we don't have a lot of people who really understand this field. The problem is that we are building a mountain of papers that are really hard to read and hard to get into; it takes hours to really understand them, especially if you want to dive into the little details that matter when you actually implement something, especially on new data. I was really inspired by an article on Distill called "Research Debt", which talks exactly about that, and I really recommend you read it. It was written by Chris Olah and Shan Carter, two research scientists from Google, and what they say, and I really agree with them, is that we can keep making the mountain bigger, as long as we also build staircases and elevators that enable everyone to climb it together with us. Because if we don't have enough engineers who understand the state of the art, we won't be able to create applied solutions fast enough.

I've personally invested hundreds of hours in learning this field, and I still have a lot more to learn. After investing all that time, my conclusion is: deep learning is advanced, it's mind-blowing, it's creative, and you need to dive in and learn it seriously in order to understand it, but it's not rocket science. (By the way, I think even rocket science is not really rocket science.) My question is: what can we do to reduce the time required for the next people to join this field by a factor of 10 compared to the time it took me? I think it should be a community effort, and we should spend more time on making these things more explainable and easier to understand, using the right explanations and the right visualizations.

So now let's dive in. The structure of this talk: it's going to be about an hour and a half.
First I will talk about the challenges in medical imaging data, and after that I will dive into, depending on how much time we have, two or three of the most advanced papers: Deformable Convolutional Networks, Feature Pyramid Networks, and Focal Loss. I guess most of you have heard these names if you're in the field. These three papers address most of the challenges that I'm going to present now, and they do it in a really nice way. I'm going to explain them in a way that is relevant to the medical imaging domain, covering the unique details needed to apply them there, but it will also be relevant for anyone who wants to understand these concepts and take them to other fields.

So let's start with the challenges of medical imaging data. (By the way, can everyone hear me well? I don't think I asked. If someone can't hear me, just say so; I'm not sure I can solve it, but you're welcome to come closer.)

The first challenge is extreme class imbalance: objects are very small and rare compared to the number of images and the image sizes. What I mean by that: you can see here, this arrow is a detection by one of our algorithms in a brain CT scan, a relatively medium-size, very urgent finding in the brain. This one is relatively large compared to many of the findings we are required to detect, and it's relatively obvious. I put it here because even this would not be considered large in terms of classic object detection, and I wanted you to really see what I'm talking about. We're talking about findings that are sometimes smaller than 10 by 10 pixels, and they are found in images that, for us, are actually 3D, not 2D: a finding can be 10 by 10 pixels over a few slices, inside a scan of 100 slices, 100 2D images, or even more. So the finding is a very small part of the brain, and most scans are of healthy brains or healthy spines, so the interesting data, the data we want to detect, is very rare. That's a very big challenge.

The second challenge is that the objects, and also the background, which is anatomical structures, are in my opinion much less well structured, much more deformable and less rigid than in natural images. Of course you can find hard examples in the classical datasets too, but if you look at a wheel, it is bounded pretty well by a square bounding box, which is the classic use case of object detection. Here, though, is a part of a brain, and the pixels I highlighted in yellow (not their original color) are a single finding. If I put the tightest bounding box I can around this finding, it will still contain a lot of uninteresting pixels, and when I extract the features for this bounding box, most of the signal will come from background rather than from the interesting pixels. That's a consequence of the object shapes being deformable and less rigid.

Another challenge is that the images are 3D and large. There is a big difference between images of cats and CT scans: the inputs can be something like 30 times larger, and the objects nominally ten times smaller. This is a challenge in terms of the computational power required, the time it takes to converge, the memory footprint of your networks, and how you find a good compromise in the design of your model and what input to use.
The last challenge I will talk about: when a radiologist analyzes a CT scan, they don't just look at the current scan. They look at many more types of data: the patient's demographics, their age, the referral letter of the doctor who sent them to this scan, their past scans, and the reports that were written on those scans. Radiologists don't just do this, they are obligated to do it by regulation, and for a reason: not all the information is in the image, and if you don't look at the past, you can't really diagnose some of the cases. So how do you combine all the different types of data: visual, text, and structured?

Okay, so now I want to dive into, as I said, two or three papers. The first paper I want to start with is Deformable Convolutional Networks. I chose it because, in my experience, a lot of people are kind of afraid of this paper and find it very hard to get into, and I think it's not that difficult if you explain it correctly. By the way, this paper is by the Microsoft Research Asia group, and for the last two years, both 2016 and 2017, it has been a significant component of one of the top three entries in the COCO object detection competition. So it's a very significant boost to performance, and it comes from Microsoft Research Asia, which is, in my opinion, one of the top object detection groups in the world and has given some of the top contributions of recent years.

The motivation for this paper: neural networks, and the popular mechanisms we use with convolutional neural networks such as data augmentation, deal pretty well with simple transformations such as translation and rotation, but non-rigid transformations, like a change of pose or viewpoint, or the object simply appearing in a less regular form, are much more challenging for neural networks to deal with. How can we answer that challenge?

The solution is to give the network a dynamic way to control the receptive field of the convolution. Instead of using the traditional convolution that samples the image with a square grid, why not sample the image in any shape we want, and why not have that shape adapt to the input, to our images and objects? We haven't yet said why this should solve the problem we're talking about, but we will see it in a minute.

And we don't only want to implement this solution; for it to be applicable in industry, we want something that is easy to train, preferably end-to-end. We don't want to train several different components and then combine them; that creates a very cumbersome research process. We don't want to increase the model complexity, its run time and training time, or the number of parameters too much. And we don't want to increase the code complexity: if it's a convolution, then when I define the model architecture I want to write a DeformableConv2D layer and put the parameters inside it, just like I write Conv2D today. And I want it to be proven on challenging tasks rather than just toy datasets. Earlier works that tried to deal with similar
problems, such as Spatial Transformer Networks, gave major scientific contributions, but in the paper they only proved them on toy datasets, and many people who tried to apply them to real-world datasets found it very hard to make them converge at all. So these are our requirements from the solution.

Now let's talk about the components of the solution. There are actually two components, and they can be combined with any object detection meta-architecture, like Faster R-CNN and R-FCN, which are the two architectures they demonstrate in the paper.

The first component is the deformable convolution. The concept: we keep the convolution the same, except that we make the sampling locations a function of the image. The sampling locations are not a fixed grid; they are a function of the input. A few examples they give in the paper show the receptive field, after several convolutions, of a neuron in different areas of the image. The value we get from this: you can see that when the neuron is on the sky, or on the border between the sky and the mountains, with traditional convolutions, after three convolutions we would get a square, or effectively a circular or Gaussian shape around that point; but with deformable convolutions we are able to sample a very large part of the sky, the mountains, and the objects in the image. Intuitively this looks like a desired property, because if I'm just seeing blue pixels, how can I know whether it's sky, or water, or a wall in that color? I can't really know for sure unless I have larger context. When they use the same network but look at a different part of the image, where there is a far-away motorcycle, the same deformable convolution mechanism creates a much denser receptive field, which covers a much smaller area and samples the object very tightly, plus a bit of the object's background, because intuitively we want to sample not only the object but the background as well. On a closer and larger object, the receptive field is larger and a bit less dense, but again covers the entire object, instead of just a rectangular or circular part of it, and the background as well. So this is an intuition for the value we can get from deformable convolutions. We will dive into the implementation of this component in a few slides, but first I want to talk about the second component, which is deformable RoI pooling.

First, a short reminder of Faster R-CNN (I'm assuming you know Faster R-CNN or similar models, and am just giving a really short reminder). In Faster R-CNN, we get an input image, we put it through a feature extractor, and we get a feature map. Then, using this feature map, we predict roughly 2,000 bounding box proposals. A few of them really cover the objects we are interested in, such as the cars, but some of them are just false positives of our bounding box proposal mechanism and lie on the background. Then we take each of these bounding boxes and crop it from the feature map, one by one, so we have about 2,000 cropped feature maps, one per proposal, and we put each of them through a second feature extractor. The first part is called the RPN, the region proposal network, and the second part is called the second stage, or Fast R-CNN; at the end of this second feature extractor we classify each bounding box and refine its coordinates with a regression head.
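Since this two-stage flow underlies everything that follows, here is a compact, runnable sketch of it. Everything here is an illustrative stand-in, not the paper's code: the backbone is a truncated ResNet-18, the proposals are hard-coded instead of coming from a real RPN, and torchvision's roi_align stands in for the cropping step.

```python
import torch
import torch.nn as nn
import torchvision

# Stand-in feature extractor: ResNet-18 without its avgpool/fc layers.
backbone = torchvision.models.resnet18(weights=None)
backbone = nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 512, 512)
feature_map = backbone(image)                  # (1, 512, 16, 16)

# Pretend the RPN produced a few proposals in feature-map coordinates
# (x1, y1, x2, y2); a real RPN predicts ~2000 of them per image.
proposals = torch.tensor([[0., 0., 8., 8.], [4., 4., 15., 15.]])

# Crop ("RoI-pool") each proposal from the shared feature map.
crops = torchvision.ops.roi_align(feature_map, [proposals], output_size=7)
print(crops.shape)   # (2, 512, 7, 7): one fixed-size crop per proposal,
                     # each then classified/refined by the second stage
```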
So deformable RoI pooling changes the implementation of how we crop each of these proposals from the feature map. What is deformable RoI pooling? Instead of cropping a single rectangular bounding box, we pool nine separate bounding boxes (we'll see how it actually works in a few minutes). That way, as you can see in this example from the paper, they are able to cover the object of interest much more tightly, so the features cropped from the feature map are much more relevant for classifying the object and are not wasted on background, which, for a deformable object, takes up a large part of the box and is less interesting and valuable for us.

So that was the description of the two components, and the method shows a very strong improvement in both of the significant metrics in the world of object detection. We have two metrics, the COCO metric and the Pascal VOC metric. The COCO metric gives more weight to accurate localization: how tight are the bounding boxes I put around the objects. This tight-localization metric gets about five to ten percent relative improvement from this solution. The second metric gives less weight to tight localization, so its value is in giving us more insight into how many objects we are missing or detecting at all. This metric, by the way, is much more important in my opinion for medical imaging applications in most cases, because tight localization often matters less, but if we miss a critical medical finding, that's something the doctors will really be mad at us about. This metric is also improved, by about 5%. (The "ours" in this table is their implementation, for example Faster R-CNN with deformable convolutional networks.)

(Question: what exactly is this number?) It's a Pascal-VOC-style mean average precision metric. I don't want to dive into it too much; just take it as a score for how good your detector is. It's not really important to understand it right now. (Question: what is the percentage of undetected objects?) You can't read that off this number; you can only see that it has improved by a relatively significant amount. You don't know the exact number of undetected objects, because this metric averages over a lot of different working points of recall and precision that you could choose for your algorithm.
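Since both metrics keep coming up, here is a small, self-contained illustration of the intersection-over-union score they are built on, and of why COCO rewards tight boxes more. The boxes and numbers are made up for the example.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection that is roughly right but loose against the ground truth:
print(iou((10, 10, 50, 50), (15, 15, 60, 60)))   # ~0.51

# VOC-style AP counts a detection as correct at a single IoU >= 0.5
# threshold, so this loose box passes; the COCO metric averages AP over
# IoU thresholds 0.50, 0.55, ..., 0.95, so loose boxes are punished and
# tight localization is rewarded much more.
```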
Okay, so now let's talk about the implementation. (By the way, after this part you can ask questions freely, so please keep your questions until the end of this part unless they are really, really important.) First, the implementation of deformable convolutions. This is the diagram they have in the paper, and I think it's a little confusing: it's a good diagram, but it contains too many levels of abstraction, and it's hard to wrap your mind around what's going on. So I invested some time in decomposing this diagram into several parts, to make it easier to understand.

This is the essence of the layer called deformable convolution. We have an input feature map; it doesn't have to be the image, and most of the time it's actually not used directly on the image but on deeper layers, on deeper feature maps. We run a convolution all over this input, but the convolution is not the old square 3x3 convolution: it has a different sampling grid for each location. Say I'm talking about this location: I have nine points that I sample, at the locations of the blue squares, and those nine points are transformed into one point, just like in a regular convolution the 3x3 square is converted into one point, one vector in the feature map, one spatial location. This is the essence; now the implementation.

You start by doing a regular, square 3x3 convolution (ignore the blue squares for now). The output of this convolution is a feature map whose spatial size matches the input feature map, but whose depth is 18, that is, 9 times 2. Each vector of length 18 can be seen as two squares of size 3 by 3 (this is just a visualization to aid understanding; it's not a separate stage, it's the last computation that happens here). The top-left elements in these two squares give us the offsets that tell us where to place the top-left sampling point of our sampling grid, the center elements tell us where to place the center blue square of the new sampling grid, and because we have nine elements in each square, we get the offset for each of our new blue squares. One square holds the horizontal offsets, telling us how far to move each sampling point along the left-right axis, and the second square holds the vertical offsets. Then we take these offsets, sample the input feature map at the resulting locations, multiply the samples with the weights of our convolutional kernel, and get the output vector.

There is a little problem with what I just described: the offset-producing layer is a convolution, so it outputs continuous, real-valued numbers, not integers. But in order to sample the image we need integers, because the image is discrete; it contains discrete pixels. And we can't simply round these numbers, because rounding isn't differentiable, so we wouldn't be able to backpropagate through it, or it would require a much heavier and more cumbersome solution. So what we do is something this group mentions a lot in their papers. Imagine we have two coordinates, the x coordinate is 2.3 and the y coordinate is 7.2, and we want to sample that point from the image: we can use bilinear interpolation to interpolate what the value at that point should be. Fortunately, bilinear interpolation can be implemented very efficiently using matrix operations and matrix multiplication, which is why we can do it for many points of the sampling grid in real time, and on the GPU of course.
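To make the two-convolution structure concrete, here is a minimal sketch in PyTorch; channel sizes are arbitrary, and it uses torchvision's later reimplementation of the operator rather than the authors' original CUDA code.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

x = torch.randn(1, 64, 32, 32)                 # input feature map

# The "yellow" convolution: a plain 3x3 conv whose 18 = 2 * 3 * 3 output
# channels are read as an offset pair for each of the 9 sampling points,
# at every spatial location.
offset_conv = nn.Conv2d(64, 2 * 3 * 3, kernel_size=3, padding=1)
offsets = offset_conv(x)                       # (1, 18, 32, 32)

# The "blue" deformable convolution: a separate, independently learned
# 3x3 kernel applied at the offset sampling points; the operator handles
# the fractional coordinates with bilinear interpolation internally.
weight = torch.randn(128, 64, 3, 3)
y = deform_conv2d(x, offsets, weight, padding=1)
print(y.shape)                                 # (1, 128, 32, 32)
```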
Explaining how the matrix implementation of bilinear interpolation works is not very complex, but it is outside the scope of this talk; if you're interested, come talk to me about it later.

Okay, so that was the first component. By the way, if anyone has a question about this component, ask now, because this is probably the best time.

(Question.) Right, I'll repeat the question so everyone hears. I said that first, before I know the offsets for where to place my blue sampling grid of the deformable convolution, I do the convolution with the yellow square, the regular convolution; and then, once I know the sampling points, I take them and multiply them with a convolution kernel. And what's your name? Lisa. Lisa asked whether the same kernel is used for both of these convolutions, or a different one. It's a different kernel: the yellow convolution has one kernel and the blue convolution has a different kernel, and they are learned separately. Okay, any other questions?

(Question: doesn't this maybe create some discontinuity, because of the irregular sampling strategy?) Probably, but empirically it improves the results. I guess it has some drawbacks, and maybe this solution can be improved, but even the convolution we are using today has disadvantages; I think the only real question is which mechanism has more disadvantages relative to its advantages.

(Question: how do you backpropagate through the sampling?) When you get the loss, you backpropagate through the bilinear interpolation operator I talked about: you have these offset numbers, you multiply them with the bilinear interpolation matrix, and you get the sampled values. It's not that you do something special to sample them; you have a bilinear interpolation kernel expressed as a matrix, you multiply it with these numbers after some vector operations, you get the values sampled at each of these points, and then you multiply them with another matrix. So the gradient is backpropagated through the bilinear interpolation operator.

(Question: is there a different sampling pattern for each pixel in the image?) I hope I understood your question; I'll go back to the example images I showed here. You can see that for this pixel the sampling has much wider coverage, and for this pixel, or activation, the coverage is much smaller; the receptive field is a function of the local input. For each location in the image, the offsets are a function of the 3x3 pixels of the input feature map under the kernel, so of course you will get different offsets if you place your convolution here or there. Does that answer your question?
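Going back to the backpropagation answer, here is a tiny scalar sketch of sampling a feature map at the fractional point (y = 2.3, x = 7.2) from the example. It is not the paper's batched matrix implementation, just an illustration that gradients really do flow back to the offsets through the interpolation weights (int() stands in for floor, so it assumes non-negative coordinates).

```python
import torch

def bilinear_sample(fmap, y, x):
    """Differentiably read fmap (an H x W tensor) at fractional coords (y, x)."""
    y0, x0 = int(y), int(x)                  # top-left integer neighbor
    wy, wx = y - y0, x - x0                  # fractional parts carry the gradient
    return ((1 - wy) * (1 - wx) * fmap[y0, x0]
            + (1 - wy) * wx * fmap[y0, x0 + 1]
            + wy * (1 - wx) * fmap[y0 + 1, x0]
            + wy * wx * fmap[y0 + 1, x0 + 1])

fmap = torch.arange(100.0).reshape(10, 10)
offset = torch.tensor([2.3, 7.2], requires_grad=True)   # (y, x), as if predicted
value = bilinear_sample(fmap, offset[0], offset[1])
value.backward()
print(value.item(), offset.grad)   # a gradient w.r.t. the offsets exists
```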
Yes, it depends on the yellow convolution: the yellow, regular 3x3 convolution determines the offsets, and of course the output of that convolution is different for each part of the image, because its input is different. That's the mechanism that enables the offsets to differ between different parts of the image. The output of the original square convolution tells us where to sample the image for the real convolution, the deformable one, and that convolution is the one that actually creates the next feature map of our feature extractor.

Wait, please, one at a time. I would love to answer questions, and I think it's better that we cover fewer papers but understand them better, but please keep it to gaps in understanding what I just explained. And don't be shy to ask, because I'm sure you're not the only one who didn't understand.

(Question, asked quietly.) Can you speak louder? Yes: for each spatial location in the original feature map we have a different set of 18 values, and they determine the new sampling grid, the deformable sampling grid. This sampling grid differs between spatial locations.

(Question.) I'd love it if you could keep this question for after we finish covering this paper; I also have an example of why I think it's interesting. Thanks.

(Question: how deep is the offset-generating branch?) The layer that generates the offsets is only one layer, and it's even a linear layer; it doesn't have a non-linearity. I think it could be an interesting paper to try it with more convolutions and see what happens.

(Question.) If I understand your question correctly: if the yellow convolution is applied at different locations in the image, but the values at those locations are equal, then the offsets will also be equal. Yes.

Okay, can we move on? (Question: are the offsets bounded?) No, the offsets are not bounded; mathematically nothing bounds them, and usually they are larger than the 3x3 square. Even in traditional object detection we know that we can use a small convolution kernel to predict bounding boxes that are much larger than the receptive field of that kernel. It's like this: if you look at my torso, you have enough information to know that my head is up to here and my feet are down there. You can infer the desired sampling points even if you are looking at just a part of an object; a limited spatial context is enough to infer something that is outside of that context.

(Question; guys, can you be quiet so people can hear? Thanks.) She asked: after we do these deformable convolutions, maybe there are pixels in the image that are not covered by any of our blue sampling points? Yes, it can happen; nothing ensures that it doesn't, and it's okay that it happens: maybe the information there is less relevant. Is it possible that we also miss something important? Yes, but if that happens, it will harm our classification results, and then, in backpropagation, the weights that generate the offsets will adapt to predict better offsets.
(Question: can two sampling points converge to the same location?) I'm pretty sure there is nothing in the formula that prevents it, but I guess it's something that just doesn't really happen, because it isn't beneficial in any way for two points to converge to the same sample, so naturally the network learns to sample different points.

Okay, so let's move on; I just want to check how much time we have. Okay, we're good. Now let's move on to the second component, which is deformable RoI pooling, and I just want to give a quick reminder of regular RoI pooling. There are many ways to perform RoI pooling (it's also called RoI warping, among other names); it doesn't really matter right now, this method can work with all of them, and I'm going to demonstrate it with the original RoI pooling, which, by the way, is not differentiable. The solution I spoke about earlier, the bilinear interpolation operation, is also one of the solutions for making RoI pooling differentiable.

So how does RoI pooling work? From the RPN, the first stage of the Faster R-CNN, we get a bounding box proposal, and we split this proposal into several bins, for example 2x2 bins; in reality it's 7x7 or 14x14 in most cases, but for the simplicity of the example let's assume 2x2. Then, for each of these bins separately, we perform max pooling over the entire bin, whatever its size. For example, for this bin we get 0.74, for this bin we get 0.39, and so on. It can be max pooling; the paper talks about average pooling; it doesn't really matter. This is the original RoI pooling.
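Here is a small sketch of that plain RoI max pooling; the coordinates and sizes are made up, and real implementations handle rounding and sub-pixel bin boundaries more carefully.

```python
import torch

def roi_max_pool(fmap, box, bins=2):
    """Plain RoI max pooling, as in the 2x2 example above: split the box
    (x1, y1, x2, y2, integer feature-map coords) into bins x bins cells
    and take the max over each cell. Assumes the box is at least `bins`
    pixels wide and tall."""
    x1, y1, x2, y2 = box
    out = torch.empty(bins, bins)
    for i in range(bins):        # bin rows
        for j in range(bins):    # bin columns
            ya, yb = y1 + (y2 - y1) * i // bins, y1 + (y2 - y1) * (i + 1) // bins
            xa, xb = x1 + (x2 - x1) * j // bins, x1 + (x2 - x1) * (j + 1) // bins
            out[i, j] = fmap[ya:yb, xa:xb].max()
    return out

fmap = torch.rand(16, 16)
print(roi_max_pool(fmap, (3, 2, 11, 10)))   # one pooled value per bin
```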
In deformable RoI pooling, the idea is that we keep the same bins, with the same sizes, but we take each bin, keep its size, and give it an offset. So we take the top-right bin and place it somewhere over here, take the top-left bin and place it somewhere over there, and so on. In reality we have something like 7x7 bins and we predict offsets for all of them; that's basically how it works.

The implementation is very similar to the previous one, so it will be easy to understand now. This is the input feature map that we perform the RoI pooling on. Let's say we have an RoI; in the paper all the diagrams use 3x3 bins (I just used 2x2, they use 3x3), so when we split the RoI into 3x3 bins, that's the yellow square grid here. We do the regular RoI pooling on this RoI and get the downsampled RoI; then we put this pooled RoI through a fully connected layer, and again we get a vector, this time a single vector of size 18 for the whole RoI. We can look at this vector as two squares of size 3 by 3, which are the horizontal and the vertical offsets for each bin. The value in the top-left part of these squares is the offset for the top-left bin, the two values in the top-right parts of the two squares are the offsets for the top-right bin, and that way we get an offset for each bin, place the bins accordingly, and sample different parts of the feature map with them. Again, we need to sample the bins and do max pooling over areas whose coordinates are not integers, and this is also solved with the same bilinear interpolation and matrix multiplication I mentioned earlier.

Some really cool examples from their paper of how deformable RoI pooling works. The yellow box is the original proposal; we pool the original proposal, put it through a fully connected layer, and predict offsets, which give us nine different bounding boxes, the ones in red. You can see how nicely they cover the cat, and we don't waste any capacity on the less relevant information here. I think this is really elegant. And another example, of the pose problem: the woman is reaching her hand forward, so the bounding box that covers her has a lot of wasted space, and we would waste our pooled features on this background. I think it speaks for itself.

(Question: can you explain again how you get nine bounding boxes from the 18 numbers?) Sure, it's great that you asked, because this is one of the most important things to learn here, so I'll explain again how the 18 numbers that are the output of the fully connected layer give you nine different bounding boxes. (Should I explain it again for the deformable convolutions too, or just for deformable RoI pooling? Okay, just this one.) The yellow 3x3 structure here is the original grid: each of the nine sub-squares is one of the original 3x3 bins of the original proposal, and we know the coordinates of the center of each bin; they can be calculated easily. So now I have each bin's center, plus two additional numbers. For the top-left bin I have its horizontal offset: if that offset is, say, -2.5, then I know the new center of the top-left bin is placed at an offset of -2.5 along the horizontal axis compared to the original center of that bin. And I take the value from the other square, which represents the vertical offset, and if that value is -3.1, then I know the center will be located at -2.5 on the horizontal axis and -3.1 on the vertical axis relative to the original center. This process is actually performed as one vector operation, but you can imagine it repeated nine times, once for each bin, and that way we get the new sampling locations of our grid. Do you think it was more understandable this time? Okay, great.
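As a concrete companion to that walkthrough, here is a sketch of just the offset step. Sizes are made up, and the paper additionally normalizes the predicted offsets by the RoI's width and height and scales them by a constant, which is omitted here.

```python
import torch
import torch.nn as nn

bins, channels = 3, 64
# One fully connected layer emitting 18 numbers = one (dx, dy) per bin.
offset_fc = nn.Linear(channels * bins * bins, 2 * bins * bins)

pooled_roi = torch.randn(channels, bins, bins)      # regular RoI-pooled proposal
offsets = offset_fc(pooled_roi.flatten()).view(bins * bins, 2)

bin_centers = torch.tensor([[float(j), float(i)]    # (x, y) center of each bin
                            for i in range(bins)    # in feature-map coordinates
                            for j in range(bins)])
shifted = bin_centers + offsets   # e.g. an offset of (-2.5, -3.1) moves a bin
                                  # 2.5 px left and 3.1 px up before re-pooling
print(shifted.shape)              # (9, 2): where each bin is now sampled from
```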
So if we look at the medical case, I'll come back to the same example I showed earlier. I think it demonstrates pretty well a finding that can be detected quite nicely, but is harder to classify: if I just took this RoI, pooled it, and put it through the second stage of my detector, then most of the pooled features would come from healthy brain pixels, which increases the chance that the second-stage classifier will misclassify this example as healthy. If I use deformable RoI pooling, it naturally covers the interesting object much more tightly, and then the features in the pooled RoI are much more relevant to the non-healthy pixels in the image. Do you think this answers your question from before? Okay, great.

(Question: what is the final predicted region?) Do you mean the final prediction of the model? The final prediction will still be this yellow rectangle, a single rectangle. It won't be exactly the original proposal; it will probably be refined by the second stage, but it will be something like the original proposal. These nine bounding boxes just help us classify and refine the coordinates of that box much better.

(Question.) No, not at the end of the first stage. At the beginning you have a normal region proposal layer, you get the proposal, then you take this proposal, do regular RoI pooling on it, put the pooled original proposal through a fully connected layer, and get the offsets that let you place these red rectangles on the image. The features under these rectangles are passed to the second stage of the detector, and the second stage uses those features to classify the original yellow rectangle. It doesn't matter where the red rectangles end up: the final output of the entire detector will still be something like a single yellow rectangle around this proposal. Okay, great.

Some best practices. First of all, they applied it to the last layers only, meaning they only use it to model deformation in high-level features. You can think about why this is intuitive: when you look at something like a change of pose, you have a feature for a hand and a feature for a torso, and that way you can model their relative locations in an irregular way. They tried using it on more than the last three layers and got diminishing returns. They use ResNet-101 in this paper, and this is the end of ResNet-101: these are the last layers, this is the last ResNet block and this is the one before it. A large portion of the convolutions in ResNet are 1x1 convolutions, and of course it's probably less interesting to deform the sampling locations of those. So you have three blocks like this, and the optimal configuration was to put the deformable solution on each of these 3x3 convolutions; when they tried it on only some of them, it didn't give much additional value.

In addition, what's really amazing about this solution is that it meets our requirement of not adding a lot of complexity to the model: the number of parameters in the network barely increases, and the inference time for a single image doesn't increase significantly either, which is very nice. (Question about 3D.) Right, their work is on 2D images; adapting it to 3D is more involved, just as 3D convolutions in general require a bit more explanation of how to work with them. This benefit of very efficient inference is, I think, due to the layers being implemented in CUDA. They implemented them in CUDA, and it's important to know that their original CUDA implementation is open source. It was originally intended to be used with MXNet, but several repositories around the internet have already adapted it to other frameworks such as TensorFlow, and if you work with Keras, this will work for you as well.
(Question: how can it be that it adds so few parameters?) For example, because they only do it on three convolutions; that's part of the reason, I guess. We can sit on it later and you can easily count the number of parameters it adds. Let me finish just two more slides and then we can have more questions.

As we expected, the receptive field is affected by the object size, just as we saw in the intuitive, cherry-picked examples at the beginning. They analyzed many of the objects in their dataset and checked what the receptive field of the deformable convolutions is for small objects, medium objects, large objects, and when the convolution sits on top of background. They saw that for the large objects and for the background, the offsets, and therefore the receptive fields, were largest, just as we would intuitively expect from this mechanism.

But what if the deformable part, sampling the image with an irregular grid, is not actually what matters? Maybe the only important thing here is the dilation, just sampling further context. Using dilated convolutions in the last layers of the feature extractors used for detection is already standard practice in detection networks, and it does improve performance. Sorry, I'll explain what dilated convolutions are for a second: this is the standard convolution, and a dilated convolution keeps the kernel square but puts holes between the sampling points. Here the hole size is 1 and here it is 2; it's also called the dilation rate. Usually people use a dilation of 2, and the authors show that for their COCO experiments, and in some segmentation applications, it's even better to use a dilation of 4 or 6. But the optimal dilation depends on your architecture, on your specific image, and even on your specific object, because, as we saw, for large objects and small objects the effective dilation is different. So even if the only important thing here were dilation, a solution like this would still be desirable, because the dilation is learned and adapts to the local parts of the image and to the object you are looking at; it's a generalization of convolutions in general and of dilated convolutions specifically. ("Dilated convolutions on acid"? That will be the title of my next talk; I love it.) But they showed that their method improves even on the most optimal configuration they could reach with just dilated convolutions, so it improved the results further, and it didn't require any manual tuning or hyperparameter sweep of the dilation hyperparameter.
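For anyone who hasn't used them, this is what the dilation rate looks like in code; a generic PyTorch example, not the authors' setup.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# dilation=1 is the ordinary 3x3 convolution; dilation=2 puts one-pixel
# holes between the 9 sampling points, so the same 9 weights cover a 5x5
# area. padding=dilation keeps the output the same spatial size.
conv_plain   = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv_plain(x).shape, conv_dilated(x).shape)   # both (1, 64, 32, 32)
# Both layers have the same number of parameters: dilation changes where
# the kernel samples, not how many weights it has.
```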
Okay, so these were deformable convolutions. Now we have two choices: one is to move on to the next paper, which can be Feature Pyramid Networks or Focal Loss, and the other is to answer more questions. Either you decide or we can have a vote. Okay. Anyone who has more questions can of course come to me later and ask them, or send me an email, whatever you want.

So, Feature Pyramid Networks. This is a paper by the Facebook AI Research group, one of the best object detection and deep learning teams in the world. As I told you, the previous paper tried to answer the challenge that the anatomy and the medical pathologies we are looking for have an irregular, deformable shape; this paper tries to answer the problem of the objects being small. This is a spine fracture, a fracture in the spine; you can see it here.

What is the problem with small objects? Why are they difficult for neural networks? Just as an intuition: suppose we perform max pooling over several neurons that sit next to each other. This neuron's receptive field covers mainly this area, this neuron's receptive field covers the fracture, and this neuron's covers this area of the bone. Even if we have really good features, so that this neuron indicates there is bone under it, this one indicates there is a fracture under it, and this one indicates bone again, after we perform max pooling over the three of them, we lose the spatial order between them. We know there are bones there and something like a gap there, but a gap is not necessarily a fracture, and we need to understand the spatial structure between things in order to really classify fractures. This is one intuition for why convolutional neural networks have problems with small objects; by the way, it's not the only reason. Another reason is class imbalance: small objects are more underrepresented in our data. But this paper deals with the problem of not having good enough features that preserve the spatial information.
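A toy, one-dimensional illustration of that order-loss intuition; the numbers stand in for a "bone-likeness" feature response.

```python
import torch
import torch.nn.functional as F

# bone / gap / bone (a fracture-like pattern) vs. a different spatial order:
fracture_pattern = torch.tensor([[[1.0, 0.0, 1.0]]])   # (batch, channel, width)
other_pattern    = torch.tensor([[[0.0, 1.0, 1.0]]])

print(F.max_pool1d(fracture_pattern, kernel_size=3))   # tensor([[[1.]]])
print(F.max_pool1d(other_pattern,    kernel_size=3))   # tensor([[[1.]]])
# Same pooled value for both: the spatial order is gone after max pooling.
```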
The motivation for this paper: a lot of papers before it did similar things. Maybe, instead of predicting only from the deepest feature map, we can somehow combine feature maps from several depths. A lot of papers did that before; DenseNet, for example, also does something like it as part of its architecture, and you could say ResNet does too, because ResNet has skip connections. We can assume that features from shallower feature maps can be important for classifying small objects, because they were computed before too much max pooling, so they lost less spatial information. It would be desirable to use them when we classify small objects, but we miss them if we only use the last layer. We could hope the network will be smart enough to develop good enough features to identify the fracture before it does max pooling, while it still has the lower-level features, and we can hope that will work; but as we know with neural networks, a lot of the time, if you don't force the network, if you don't encode your prior knowledge of the problem into the architecture's design, the network doesn't behave optimally. Although it could fit many types of functions, it tends not to fit the optimal ones unless you encode your prior information into the design.

There are a lot of ways to combine the shallower feature maps, but this is currently the most popular implementation, and the reason I feel confident saying that is that in the last COCO object detection competition, at the end of 2017, all four top submissions used feature pyramid networks as a major component, and it improves object detection accuracy by about ten percent.

What I really love about this paper is that it makes simplicity and elegance a major part of the work. The first element: we already know that in convolutional neural networks, if we use image pyramids (some of you may know it as test-time multi-scale), it really improves how we deal with small objects. What is an image pyramid? We take the original image and scale it up and down to many sizes; then small objects appear larger and are less affected by the pooling operations, and it has several other advantages. Then, if we use ResNet-101 for example, we pass each of these scaled images, nominally something like 10 sizes, separately through the 101 layers and get separate predictions for each of them. This works quite well; the problem is that it's not really feasible for most applications, because it requires a lot of time to make so many forward passes of a large network. So they take their intuition from image pyramids, which have already proven themselves, and in many of the design choices in the paper, instead of inventing something from scratch, they imitate something that already exists and is known to work. You can see that even in the diagram it looks quite similar to an image pyramid, and we'll talk more about that in a few minutes.

You can use feature pyramid networks as part of the RPN, the region proposal network, the first stage of Faster R-CNN, and as part of the second stage of the detection network; it is combined differently into these two parts, and we'll talk about each of them separately. So, combining it into the RPN, and now you'll understand what feature pyramid networks actually do: we take the image and put it through our feature extractor, for example ResNet-101. Then we have something they call a lateral connection, which is a 1x1 convolution that keeps the feature map the same spatial size but transforms it to have 256 features. Then we take this new feature map and upsample it using nearest-neighbor upsampling with a scale factor of 2 in each dimension, so we enlarge it; and since the pooling in the original feature extractor was also by a factor of 2, the scaled-up feature map now has the same size as the feature maps of the previous stage in the feature extractor. So now we can choose a single feature map from the previous pooling stage and combine the two, using a 1x1 convolution on that feature map and then a summation of the feature maps: not concatenation, summation. We repeat this process several times, five or six times depending on the implementation, and we get a pyramid of feature maps.
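Here is a minimal sketch of that top-down pathway. The channel counts match ResNet-101's last three stages but are otherwise illustrative, and the paper also applies a 3x3 smoothing convolution to each summed map, which is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownPyramid(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # Lateral connections: 1x1 convs mapping every backbone stage to
        # the same 256-feature space so the maps can be summed.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, c3, c4, c5):   # backbone feature maps, coarsest last
        p5 = self.lateral[2](c5)
        # Upsample 2x (nearest neighbor) and SUM with the lateral map:
        # summation, not concatenation.
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return p3, p4, p5            # predictions are made from every level

fpn = TopDownPyramid()
c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in
              [(512, 64), (1024, 32), (2048, 16)])
print([p.shape[-1] for p in fpn(c3, c4, c5)])   # [64, 32, 16]
```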
The shallowest feature map of the pyramid contains features that were transformed from all, or almost all, of the pooling levels of the original feature extractor. Then, for each of these feature maps, we predict bounding boxes separately: from this feature map separately, from this one separately, and from this one separately. There are only three feature maps drawn here, but in practice they use five or six, depending on the implementation.

So which layers do we choose for the lateral connections? Does anyone have a guess? We don't take only shallow layers; we take both shallow layers and deep layers, one from each pooling level. There are five pooling operations in the network, and before each of these pooling operations we take the output of the last convolutional layer. Why? Because, as we said before, the pooling operation is the component that causes the trouble with small objects: after we perform the pooling, we will lose information. So from each pooling level we want to take some features, and we choose the last layer before the pooling because, intuitively, it has the most developed features for that level of spatial information.

(Question: isn't this similar to other architectures?) Yes, it's similar, but here you predict from each level. As I said, it's very similar to many other architectures; this group didn't invent the concept of using features from shallower levels, it was mentioned in dozens of papers. But this group, and maybe a few other groups at around the same time, were the first to propose this mechanism of predicting from several stages. By the way, the SSD detector also predicts from several stages, but it doesn't use the shallower features: it starts from the top and creates new layers, and it never combines the shallower features back in, and that's where SSD misses.

(Question about the diagram.) Yes, this one, and I will show the results in a few minutes. (Question about the summation.) I'll explain the process again so that everyone can hear. For each pooling operation, you take the output of the convolutional layer before it, you put it through a 1x1 convolution to reduce its number of features to 256, and then you sum it with the upsampled map coming from the top-down connection. So this feature map is the result of a summation that includes all of the outputs above it, but the one above it doesn't include the one below it: each map is only a sum over the maps above it. Okay?

(Question: what about more lateral connections?) I'd love to take this question after we finish covering this paper; it's less of an understanding question and more of an intuition question, and if anyone has understanding questions about the concept, it's important to ask them now. (Question: is the training end to end?) Yes, the training is end to end. (Question: how do you train it?) We'll get to it, actually, right now.
So now that we have all of these pyramid levels, five of them, three shown here, we use a predict head on top of each of them, which predicts the proposal bounding boxes. How do these heads work? They work just like the head in Faster R-CNN. This is, by the way, part of what I said about the simplicity of their design: they tried not to change too much, to keep all the mechanisms the same, and to add as little as possible. This figure is taken from the original Faster R-CNN paper, and this is how you predict in Faster R-CNN: there you didn't have this pyramid, you just had the deepest feature map, and you put a 3x3 convolution on top of it; each location of the 3x3 convolution is turned into a vector of 256 dimensions, and on these vectors you use one 1x1 convolution to predict the coordinates of the bounding boxes for that spatial location, and one 1x1 convolution to predict the probability of being an object and the probability of not being an object. Here you use the exact same predict head on each of these levels separately; and it's not only the same architecture for all of them, you also use the same weights. They share weights: if the predict head on this level predicts a false-positive box and is then punished by the backpropagation mechanism, the weights of the predict head change for all of the levels together. They tested this empirically and saw that using different predict heads, without weight sharing, doesn't really improve performance.

More about how they train: to each level of the network (we have three here and two more) they assign a different size of bounding boxes. The bottom level, the one with the highest resolution, is trained only on the smallest objects, and the top level only on the largest objects. This is similar to the concept of anchors in Faster R-CNN, for those of you who are familiar with it, only here the anchors of each scale are trained on completely separate levels. The reason is that we expect this level to contain the most relevant information for small objects, so we want it to specialize on small objects; it's a really difficult task, and we want that level to be the best it can be on small objects. That's the intuition behind it.
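A sketch of that shared predict head, with illustrative sizes: the same 3x3 convolution and the same two 1x1 convolutions (2k objectness scores and 4k box coordinates, where k is the number of anchors per location) are reused, weights and all, on every pyramid level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRPNHead(nn.Module):
    def __init__(self, channels=256, k=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(channels, 2 * k, kernel_size=1)   # object / not object
        self.reg = nn.Conv2d(channels, 4 * k, kernel_size=1)   # box coordinates

    def forward(self, pyramid):
        # One head, reused on every level: a gradient from any level
        # updates the same shared weights for all of them.
        outs = []
        for p in pyramid:
            h = F.relu(self.conv(p))
            outs.append((self.cls(h), self.reg(h)))
        return outs

head = SharedRPNHead()
pyramid = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
for cls, reg in head(pyramid):
    print(cls.shape, reg.shape)   # (1, 6, s, s) and (1, 12, s, s) per level
```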
Okay, great. So that was how to combine it with the RPN; now we will talk about how to combine it with the second stage of the network, Fast R-CNN. A short reminder of how Fast R-CNN works: with the RPN we get the proposals on the image, say about two thousand proposals, and then we use RoI pooling (or deformable RoI pooling) to pool each one of these proposals from the feature map. In Faster R-CNN we didn't have these layers, we only had this one layer, and that's where we pooled the bounding boxes from. Now that we have many layers of different spatial resolutions, maybe we can also pool the bounding boxes from those layers, and not only from the deepest layer. That's the concept, but how do you decide which layer to pool the RoI from? Here again they use the concept of simplicity: since we're trying to imitate image pyramids, we can just use the decision rule that is already known to work with image pyramids. There is a very clear formula: k = floor(4 + log2(sqrt(w*h) / 224)), where w and h are the width and height of the proposal in pixels (a small helper implementing this rule is sketched after this discussion).

Now some examples of how it works. Say the proposal is of size 224 by 224: we get 4 + log2(1), which is 4 + 0, so the result is 4, meaning we RoI-pool from the fourth level; we'll talk about the intuition behind this in a minute. When we take a proposal of size 112 by 112, the result of the log operation is -1, so we get 3; and for a larger bounding box, say 448 by 448, it will be 5. So you can see that this formula lets us very efficiently RoI-pool larger bounding boxes from coarser layers, with less spatial resolution, and smaller objects from the layers that contain the highest-resolution information. By the way, 3 is the index of this layer, 4 is this one, 5 is this one, and so on; or actually these are 2, 3, 4 and so on.

[Audience: 100 by 100 is still a fairly big object.] Yes, 100 by 100 is still a relatively big object, and we have just one feature map below that level to take care of objects that are smaller. So we pool it from a relatively fine layer, and maybe we're wasting the high-resolution information there on relatively large objects. I'm sure this can be optimized further, but it works quite well.

So in this formula there are what look like magic numbers. [Audience asks about the 4.] Maybe it doesn't matter that much; it's just the way they decided to build the formula, to make it friendly, as we'll see. The 224 also looks like a magic number, but it's the size of the images in ImageNet, and that's the reason it's used here. Layer number four is the layer that RoI pooling is performed on in the original Faster R-CNN with ResNet-101, and since ResNet was pre-trained on ImageNet, where the images are pretty much objects of this scale, if we get an object of this scale we would like to pool it from the default layer, the one that has proven itself so far. That's the intuition for the 224, and that's also why they have the 4: when you have an object of size 224 by 224, the log goes to zero and you remain with the default layer index.

[Audience: can you explain the 2k and 4k in the prediction outputs?] Yes. k is the number of bounding boxes predicted for each spatial location, also called the number of anchors. For each spatial location I don't predict only a single bounding box, I predict something like nine or fifteen bounding boxes. For each of these k bounding boxes I want four coordinates and two probabilities, and that's the 4k and the 2k. It's not two thousand or four thousand; it's just four times and two times the number of bounding boxes I'm predicting at that spatial location.
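Here is that decision rule as a small Python helper, a sketch rather than the paper's actual code; the clamp to levels 2 through 5 is my assumption about which pyramid levels exist in a given model.

    import math

    def roi_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
        # k = floor(k0 + log2(sqrt(w*h) / canonical)); 224 is the ImageNet
        # scale and k0 = 4 is the default layer RoI pooling was done from
        # in the original ResNet-based Faster R-CNN.
        k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
        return max(k_min, min(k_max, k))  # clamp to existing levels (assumed range)

    # Worked examples from the talk:
    print(roi_level(224, 224))  # 4: log2(1) = 0, stay on the default layer
    print(roi_level(112, 112))  # 3: log2(0.5) = -1
    print(roi_level(448, 448))  # 5: log2(2) = 1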
Okay, let's move on; we're close to finishing this. Regarding their experiment results, which are relevant to the question asked before: they tried not to use the top-down connections and only use the lateral connections, which means predicting from each level of features but without combining features from deep and shallow layers, and they saw that this didn't even improve on top of Faster R-CNN. They also tried to use only the top-down connections without the lateral connections, and that didn't improve things either. The only other thing they tried that did improve was building the whole pyramid but not using all of these predict heads, only a single predict head at the bottom; since the bottom contains a combination of all of the features, intuitively maybe it's enough to predict just from it, and it's more efficient. In practice they saw that it does improve, but it lags far behind the full feature pyramid network solution.

Regarding Fast R-CNN, they asked themselves: if we use feature pyramid networks in the RPN, maybe that's enough, and putting it in Fast R-CNN as well doesn't improve anything. So they trained the RPN separately using feature pyramid networks, recorded for each image the best proposals they got for that image, and then trained Fast R-CNN separately using those proposals; from the beginning of its training, Fast R-CNN only got the best proposals and was not trained end to end. Even so, Fast R-CNN with the feature pyramid improved the results by another five to ten percent, so this component is important as well. And about the decision rule we discussed: it's nice, but I saw that even if you don't use it in the second stage and only pool from the deepest layer, you don't get a large difference in results. Fast R-CNN is less sensitive to which layer you pool from; you just need to pool from the last layer, the one with the most information.

Some other neat things I saw: it especially improves the results for small objects, and the test time per image on a single GPU is lower than that of a single-scale, non-feature-pyramid network of the same architecture. So although the feature pyramid network is a more complex architecture, it's actually faster. It's out of scope to explain exactly why, but Ross Girshick, one of the authors of this paper and a very well-known researcher, wrote a comment on GitHub explaining it. Since the ablations above dissect the lateral and top-down connections, a minimal sketch of the full merge step follows for reference.
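This is a PyTorch-style sketch of one full merge step following the design in the FPN paper: a 1x1 lateral projection, a 2x nearest-neighbor upsample of the top-down map, elementwise addition, and a 3x3 smoothing convolution. The class name and the 256-channel width are illustrative assumptions.

    import torch.nn as nn
    import torch.nn.functional as F

    class MergeStep(nn.Module):
        # Combines one bottom-up (lateral) map with the coarser top-down map
        # above it. The lateral-only ablation drops the top-down input; the
        # top-down-only ablation drops the lateral input; the full model uses both.
        def __init__(self, lateral_channels, out_channels=256):
            super().__init__()
            self.lateral = nn.Conv2d(lateral_channels, out_channels, kernel_size=1)
            # 3x3 conv after the sum reduces upsampling artifacts
            self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

        def forward(self, top_down, bottom_up):
            up = F.interpolate(top_down, scale_factor=2, mode="nearest")
            return self.smooth(up + self.lateral(bottom_up))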
So this is feature pyramid networks, and before we summarize: if someone has a really, really important question, ask now; otherwise you can come ask me, I'll stay here. Okay, great. We won't have time to cover focal loss, but I will publish my slides online; hopefully they will be self-contained enough for you to look at, and maybe we will see each other some other time.

To summarize object detection: I really believe it's a revolutionary technology that is progressing really fast and has the potential to change every industry. The latest major advances in object detection, besides being really creative, mind-blowing, and fascinating to learn about, also address some of the biggest problems in many data domains, and specifically in medical imaging. The problems and solutions we talked about (except for focal loss, which we didn't have time to cover) were: small objects, for which feature pyramid networks give a very elegant and efficient solution; deformable shapes, for which deformable convolutions and deformable RoI pooling give a very elegant solution; and extreme class imbalance, for which focal loss is a very nice and innovative solution. As usual in deep learning, like we said in the beginning, it's not rocket science, and I think many engineers can understand the concepts and implementations. The question is what we can do better to make this information more friendly and reduce the time required to understand it by a factor of 10. Thank you very much. [Applause]
Info
Channel: Aidoc
Views: 2,638
Keywords: artificial intelligence, ai, deep learning, object detection, telaviv
Id: N3WBsnkOY3Y
Length: 89min 39sec (5379 seconds)
Published: Mon May 07 2018