Mask Region based Convolution Neural Networks - EXPLAINED!

Captions
Some of the most intriguing advancements brought by deep learning and neural networks are in the field of computer vision. We consider any problem that takes an image or camera input to fall within computer vision. Self-driving cars, fMRI analysis, Mars exploration rovers, facial recognition systems, object detection, and augmented reality are just a few breakthroughs in the field. In this video, we will take a look at a new type of neural network architecture called Mask Region-based Convolutional Neural Networks, Mask R-CNN for short, and in the process highlight some key subproblems in computer vision as well.

Mask R-CNN works towards the problem of instance segmentation: the process of detecting and delineating each distinct object of interest in an image. Instance segmentation is a combination of two subproblems. The first is object detection, the problem of finding and classifying a variable number of objects in an image. They are variable in number because the number of objects detected can vary from image to image. The second part of instance segmentation is semantic segmentation. Semantic segmentation is the understanding of an image at the pixel level; that is, we want to assign an object class to each pixel in the image. In this figure with the motorcyclist, apart from recognizing the bike and the person riding it, we also have to delineate the boundaries of each object. Using object detection and semantic segmentation together, we get instance segmentation. In these images, the bounding boxes are created from object detection and the shaded masks are the output of semantic segmentation.

Now that we have a high-level intuition of instance segmentation, we'll take a look at the architecture behind Mask R-CNN. Since there are two phases, we have two parts: for object detection, it uses an architecture similar to Faster R-CNN; for semantic segmentation, it uses fully convolutional networks. So first off, what is an R-CNN? It is an approach to bounding
box object detection that creates a number of object regions, or regions of interest (RoIs). The next version, Faster R-CNN, does a better job by incorporating an attention mechanism using a region proposal network (RPN). Faster R-CNN performs object detection in two stages. First, determine the bounding boxes and hence the regions of interest; this is done using the RPN, as I just said. Second, for each RoI, determine the class label of the object; this is done with RoI pooling.

Mask R-CNN does incorporate these tasks, but there is a problem of data loss in RoI pooling. RoI pooling involves applying pooling, usually max pooling, on a region of interest, the bounding box computed during object detection; hence the name RoIPool. In this method, the stride is quantized. Now, pooling is used for downsampling of features and to introduce invariance to minor distortions in the input. These minor distortions could be something as simple as a rotation of the image. Consider this five: even if it is rotated, our model should still consider this image a five, that is, the same input. So pooling enables a model to become invariant to such rotation. Stride, in this case, is the number of cells by which we move our sliding window during pooling or convolution. If you want more information about pooling and the intuition on stride, check out my video on convolutional neural networks.

Now, coming back to RoI pooling: when I say that the stride is quantized, what do I mean? Consider a region of interest of, say, 17 × 17 that we need to map to a space of 7 × 7. The required stride is 17 divided by 7, which is about 2.43. Since a stride of 2.43 is meaningless, RoI pooling will quantize this value by rounding it down to 2, so it will use a stride of 2 along the width and the height. However, in doing so, it only considers the top 14 × 14 pixels in the 17 × 17 region; the remaining points are lost. Not only is there a loss of data, but this can
also lead to misalignment. If we use an 18 × 18 input and map it to a 7 × 7 output, the required stride becomes 2.57. This rounds up to 3 in RoI pooling, so you can see that there is a misalignment when we perform pooling here.

To address this problem, RoIAlign is used. No quantization takes place, so in the case of the 17 × 17 input region, we keep the stride of 2.43 as it is. However, since this fractional value does not land on whole pixels, each cell is divided into 2 × 2 bins. That creates four regions: the top left, the top right, the bottom left, and the bottom right. Each of these sub-cells is pooled through bilinear interpolation, leading to four values per cell, and the final cell value is then computed as either the average or the maximum of the four sub-values. By addressing the loss and misalignment problems of RoI pooling, the new RoIAlign leads to improved results. RoIAlign is thus better than RoIPool, as it allows us to preserve spatial pixel-to-pixel alignment for every region of interest, and no information is lost since there is no quantization.

Conceptually, Mask R-CNN is similar to Faster R-CNN. Mask R-CNN additionally outputs the object mask using pixel-to-pixel alignment. This mask is a binary mask output for each region of interest. Only a small overhead is incurred when computing this mask, as it is done in parallel with the bounding box creation and classification. Consider a region of interest of m × m pixels. Let's assume that there are K possible object classes; for example, if we were trying to categorize humans, dogs, and cats in an image, then K would be equal to 3. For each class k, an m × m binary mask is constructed, analogous to a one-versus-rest approach. Hence, while computing the mask, an output of size Km² is produced for each region of interest. This is different from the typical approach of constructing a single mask from K classes, where the classes would compete in the mask. This lack of competition is the key to good performance in instance segmentation.

For each region of
interest (RoI) determined in the object detection phase, let's take a look at the semantic segmentation with FCNs, fully convolutional networks. FCNs are used to predict the mask from each RoI. So why are we using convolutional layers? This is because convolutional layers retain spatial orientation. Such information is crucial for location-specific tasks like creating an object mask. So you can see why the traditional use of fully connected layers won't work here: in fully connected layers, the spatial orientation of pixels with respect to each other is lost, as they are squished together to form a feature vector.

At Facebook AI Research, the COCO dataset is used. It's a large-scale dataset for object detection, segmentation, and captioning. There are over 200,000 labelled images containing 1.5 million object instances. Mask R-CNN takes about one to two days to train on this dataset using an 8-GPU machine. It achieves good results even for challenging images.

Here's a comparison with the previous state of the art, the fully convolutional instance segmentation system, FCIS. FCIS is an alternative framework that also uses semantic segmentation and object detection to categorize, box, and mask objects in an image, and it does it fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges, showing that it is challenged by the fundamental difficulties of segmenting instances.

Here are some key things to remember. Instance segmentation is object detection with semantic segmentation. Mask R-CNN is an architecture to achieve instance segmentation; it combines Faster R-CNN with fully convolutional networks (FCNs). Mask R-CNN uses RoIAlign, which preserves the spatial orientation of features and leads to no loss of information.

And there it is: the new Mask R-CNN for instance segmentation. I'll leave a link to the main paper, their code, and links to other cool blog posts and papers in the description down below, so check that out. Leave a like and comment down below on your
thoughts on this new technology, subscribe to the channel for more super-duper content, and I will see you in the next one. See you!
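
The stride arithmetic and the 2 × 2 sub-cell sampling described in the captions can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code from the paper or from any library: the function names (`quantized_stride`, `covered_extent`, `bilinear_sample`, `roi_align`) are hypothetical, the region is assumed to sit at the top-left corner of the feature map, and sample positions are assumed to be expressed directly in array-index coordinates.

```python
import math


def quantized_stride(roi_size, out_size, rounding=math.floor):
    """Stride after quantizing roi_size / out_size, as RoI pooling is described in the video."""
    return int(rounding(roi_size / out_size))


def covered_extent(roi_size, out_size, rounding=math.floor):
    """Number of pixels along one axis spanned by out_size windows of the quantized stride."""
    return quantized_stride(roi_size, out_size, rounding) * out_size


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D list of numbers at a fractional (y, x) position."""
    height, width = len(feature), len(feature[0])
    # Clamp to the valid index range so border samples stay inside the map (an assumption).
    y = min(max(y, 0.0), height - 1)
    x = min(max(x, 0.0), width - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def mean(values):
    return sum(values) / len(values)


def roi_align(feature, roi_size, out_size, reduce=mean):
    """RoIAlign-style pooling of the top-left roi_size x roi_size region, with no quantization."""
    stride = roi_size / out_size  # kept fractional, e.g. 17 / 7
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # One sample at the centre of each of the 2 x 2 sub-cells of output cell (i, j).
            samples = [
                bilinear_sample(feature, (i + oy) * stride, (j + ox) * stride)
                for oy in (0.25, 0.75)
                for ox in (0.25, 0.75)
            ]
            row.append(reduce(samples))  # average by default; pass max for the maximum
        output.append(row)
    return output


if __name__ == "__main__":
    print(quantized_stride(17, 7), covered_extent(17, 7))                # 2 14
    print(quantized_stride(18, 7, round), covered_extent(18, 7, round))  # 3 21
    feature_map = [[x + 2 * y for x in range(17)] for y in range(17)]
    pooled = roi_align(feature_map, 17, 7)
    print(len(pooled), len(pooled[0]))                                   # 7 7
```

With a stride of 2, seven windows cover only 14 of the 17 pixels along each axis, and with a stride of 3 they would need 21 pixels from an 18-pixel region, which is the loss and misalignment the captions describe. Keeping the stride fractional and interpolating, as in the `roi_align` sketch, avoids both.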
Info
Channel: CodeEmporium
Views: 96,429
Rating: 4.8989515 out of 5
Keywords: Machine Learning, Data Science, Deep Learning, computer vision, convolution neural networks, artificial intelligence, artificial neural networks, computer vision research, fully convolution networks, faster rcnn, region based convolution neural networks, facebook research, facebook ai research, FAIR, google research, google ai research, new neural network technology
Id: 4tkgOzQ9yyo
Length: 9min 34sec (574 seconds)
Published: Mon Feb 26 2018