Lecture 25 - Semantic Segmentation and Lane Detection [PoM-CPS]

Video Statistics and Information

Captions
All right, so welcome back. I hope everybody had some time to unwind or catch up on their work over the Thanksgiving break. We are on the final stretch of our course and our third module, and since I skipped two lectures while traveling to conferences, this module will extend into the last lecture as well. Typically I reserve the last lecture as a full recap, but this time we'll have to do a bit of both: wrap up the module and recap the course.

So far in this module we began with the motivation for using data-driven algorithms for autonomous CPS. That is a very high-level task, so we made it more specific and tangible by focusing on a certain class of data-driven models that do not require any physical domain knowledge, and for which we have not yet worried about correctness and things like that; we just want to map raw data to some model output. To do so we stuck with the example of convolutional neural networks, which build upon fully connected neural networks, and we had a little detour, as always, to go deeper into the modeling framework and the principles of modeling. Now we come back to the domain and look at how what we have learned applies to self-driving cars, and in particular I want to focus on deep learning for autonomous driving. That is a big topic on its own, and in the final lecture I will give you a preview of all the different applications of what you have learned, and of how most of what we see in self-driving car prototypes uses AlexNet- and VGG-16-like networks to do a lot of the scene understanding. But to keep us focused on mapping what we have learned onto the task of autonomous driving, I will begin with the specific application of lane detection. As always, a shout-out to everybody who prepared many of these slides; I am reusing most of them, and it saves me a lot of time.

So far we have done what you can see in this picture, more or less: architectures like AlexNet, trained on the ImageNet database and similar databases, take an input image, pass it through a series of conv and fully connected layers, plus some other layer types we haven't discussed yet, and produce an output class label, the most likely label, by going through a softmax. I have shown you this video several times where, if you think about it, every vehicle that gets detected is going through the same pipeline as in the previous picture. What is not clear is how you search for where the vehicle is in the scene; that is a slightly more complicated task, and I think someone actually brought up this question earlier, so maybe you will find your answer by the end of today's lecture.

To get us started with applications of deep neural networks for self-driving, I want to look at the task of lane detection. It is the 101 of autonomous driving: you have to be able to detect which lane you are driving in. Humans don't even think about it; we can very easily figure out our lane as long as there are lane markings and there is no snow on the road, subject to all that fine print.
But for autonomous lane keep assist and other technologies, even if the car is not controlling itself in the lane, many cars have lane departure warning, so they are also detecting where the lane is, especially the lane of the car itself. The car is sometimes referred to as the ego vehicle, so we refer to this as detecting the ego lanes, and it does not always work.

Here is a self-driving car prototype, a level-2 or level-3 prototype running Tesla Autopilot. I don't have anything against Tesla, but this YouTube video conveys my point, so let's watch it. There is no audio, but even without audio it should be interesting. What you see is that the car is driving, and all of a sudden the left barrier cuts in, there is no lane, and the car doesn't figure it out; the driver has to correct it himself. Actually, I'm kind of glad there is no audio, because this person is cussing all over the place. So even though these cars are in commercial production and are being tested on public roads, they do face these sorts of problems. I actually want to play this video again to convey a different point altogether, this time in favor of Tesla. If you watch it again, here is your first clue: a sign that says merge, and the person still relies on Autopilot to figure it out. Driving forward, there is another sign which says two lanes ahead instead of three, and another sign which says merge again, and this person is ignoring all those actual traffic signs and over-trusting the lane keep assist. The meta point I want to convey is that these systems still require driver oversight; you should pay attention and treat them like regular cruise control, where you have to be ready to take over at any time.

We have seen a similar example before, when the car was entering a tunnel. Here is the same sort of view, and what we have overlaid on top of it is the actual image from the front camera of the car. You can already see that the way the front camera processes images is different: it is much darker, partly because it wants to reject direct sunlight. But if you look at this image, the sunlight reflecting on the top of the barrier can be interpreted as a lane marking, and that can explain why the car still thinks there are lane markings on both sides. The point is that we have come a long way in lane keep assist and lane detection, but there can still be mistakes, and the point of today's lecture is how we can use the machinery we have learned to attack this problem.

But first, you have to experience the pain of not using deep learning to appreciate why it is better. We have made this point at a high level before: I have said many times that the human doesn't have to design the features themselves, because the filters of the conv layers are learnable, so just from data and labels you can learn features to do classification. I actually want to first show you the entire pipeline of how you would do lane detection without any neural network, or any data-driven algorithm for that matter, so that you experience all the steps and have a much better appreciation for why deep learning is the preferred method, or at least a comparison to decide
whether or not you think it is useful at all. So here are roughly a dozen steps you would want to take, and I will go over them quickly; this is not a computer vision lecture, but I will walk through most of these steps with examples to give you intuition for how you would accomplish this task without any convolutional neural networks.

The task, roughly, is that you have the input image on the left and you want to build an algorithm, not a network, that generates something resembling the output on the right, where you have figured out where all the lane markings are. That is actually an even bigger task than what we started with; at the very least you have to figure out where the ego lanes are, the left and right boundaries of the lane you are driving in.

The first thing that comes to mind is that you don't actually want to work in the RGB color space. Red, green, and blue channels are what a camera typically outputs by default, but you can transform that information into a different color space. You can think of the left image as the typical state space of the camera output in RGB, where every pixel is a point in this 3D RGB space. Another color space is HSV, for hue, saturation, and value. Hue tells you the dominant wavelength, or color, of an individual pixel; you can visualize the cylinder on the right, where the colors are arranged around the periphery of the cylinder. Saturation tells you how intense that particular color is, and value tells you the darkness or brightness of the pixel. So you can go from RGB to HSV. Another, very similar color space is HSL, where hue and saturation are the same, but the L channel, lightness, tells you how much white content is present in the pixel rather than just its brightness.

Why do we care about these color spaces? We can take the image on the far left and convert it into the equivalent HSV and HSL maps. The rightmost image is the one that appeals to us, because in the hue-saturation-lightness view of the original RGB image the lanes are very prominent; they pop out, in other words. So you would convert everything into HSL before doing the rest of the image processing to detect the lanes.

So how does the lane detection work? Say we have converted everything to HSV or HSL. Lanes in the United States are either yellow or white, so you want to isolate all the pixels that are either yellow or white. The center mask shows which pixels in the original image were yellow; we don't have any yellow lanes in this image, but the traffic signs on the horizon are yellow and they get picked up. The rightmost image is the same thing with a white filter: it shows everything that was white in the original image, so we pick up the lanes, the rear of the sedan in the adjacent lane, and so forth. We are still just processing our input, and you can merge both of these into a single composite image with the white and yellow layers together, which gives you the output on the right-hand side.
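As a concrete illustration of this color-isolation step, here is a minimal sketch using OpenCV. The HLS threshold ranges for "white" and "yellow", and the function name, are my own illustrative choices rather than anything prescribed in the lecture, and they would need tuning for a particular camera and lighting.

```python
import cv2
import numpy as np

# Rough sketch of the color-isolation step, assuming a BGR frame from cv2.imread.
# The threshold values are illustrative and would need tuning per camera/lighting.
def isolate_lane_colors(bgr_frame):
    hls = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HLS)   # channels: hue, lightness, saturation

    # White-ish pixels: any hue, high lightness.
    white_mask = cv2.inRange(hls, np.array([0, 200, 0]), np.array([180, 255, 255]))

    # Yellow-ish pixels: hue roughly 15-35 (OpenCV hue spans 0-180), reasonably saturated.
    yellow_mask = cv2.inRange(hls, np.array([15, 30, 100]), np.array([35, 204, 255]))

    combined = cv2.bitwise_or(white_mask, yellow_mask)
    # Keep only the original pixels that passed either mask (the "composite" image).
    return cv2.bitwise_and(bgr_frame, bgr_frame, mask=combined)
```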
Then another trick is to apply a Gaussian blur. Again, I won't go into the math behind it, but the effect is that on the left-hand side you have your combined white-and-yellow image derived from the HSL view, and you can see there is noise, all these stray pixels and small line segments. A Gaussian filter is, roughly speaking, a way to get rid of this noise, and the output is shown on the right with noticeably less of it. Then you apply another step where you go from the yellow-and-white pixel space to a grayscale image. As we have seen before, grayscale images can be represented with pixel values between negative one and one, with gray at zero, which is just easier to work with numerically, and it also removes some of the noise. So after all these steps you finally arrive at the image on the right-hand side, which is a grayscale image of the filtered white and yellow channels of the HSL version of your RGB input. That is the pipeline so far.

Then you run your first real algorithm, Canny edge detection, one of the most popular edge detection algorithms. The way edge detection works is simple: it looks at all the pixels of your image and keeps track of the gradient, meaning it finds the pixels where the value changes abruptly by a large magnitude. In our case the background is black and the lane pixels are near one, so we are scanning the image for pixels where the difference between neighbors is large. That is the output shown on the right-hand side, and it is akin to saying you have detected the edges in your image.

Once you have the edge-detected image, it is still not clear where the lanes are. We can tell as humans, but for a computer we have to get rid of all the other points that also get detected as edges. So another trick is to select a region of interest in your image: you preserve whatever pixels are within the region of interest and neglect everything else. I'll show you how this works in RGB space: here is your resulting image, and on the right is my region-of-interest mask. Essentially I am manually selecting a region based on where the camera was mounted on the car at the time, and I ignore everything outside of it. This takes you from the left-hand edge-detection image to the right-hand side. That was a good question: in this case you could have applied the region of interest earlier, but in general for lane detection you don't always want a fixed region of interest, and it also depends on whether your lanes are curved; for this example the order doesn't matter. Another caveat is that the region of interest only works if your lanes always stay inside it, or the camera is always mounted in the same place from one data set to the next.
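Continuing the sketch, the blur, grayscale, Canny, and region-of-interest steps might look roughly like this in OpenCV. The kernel size, Canny thresholds, and trapezoidal region are hand-picked guesses on my part, which is exactly the kind of manual tuning the lecture is criticizing.

```python
import cv2
import numpy as np

# Sketch of the smoothing, edge-detection, and region-of-interest steps.
def edges_in_roi(color_masked_frame):
    gray = cv2.cvtColor(color_masked_frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # suppress speckle noise
    edges = cv2.Canny(blurred, 50, 150)           # keep only large-gradient pixels

    # Trapezoidal region of interest in front of the (assumed) camera position.
    h, w = edges.shape
    roi = np.array([[(0, h), (int(0.45 * w), int(0.6 * h)),
                     (int(0.55 * w), int(0.6 * h)), (w, h)]], dtype=np.int32)
    mask = np.zeros_like(edges)
    cv2.fillPoly(mask, roi, 255)
    return cv2.bitwise_and(edges, mask)
```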
One more trick: for a flat, straight road it is not a big problem, but in the original field of view your lanes are not parallel, and it is easier to fit straight lines to the data if they are; that is the intuition. So you take your region of interest, or maybe an initial guess of where the lanes are, and you transform this red polygon on the left-hand side so that its left and right boundaries become parallel. When you do this perspective transformation you are actually distorting your original image, and in the world of self-driving the right-hand image is often called a bird's-eye view, because it looks as if someone is observing the lanes from above the car, but it is just a perspective transform.

Why do this? Because you can take the output of the Canny edge detection, or of the region of interest, apply the perspective transform, and get something similar to what is shown on the left-hand side. Then you can do a simple search using a histogram. The histogram on the right plots, for each of the roughly 1200 columns of the image, the summed intensity of the white pixels in that column. This tells you where along the x-axis of the image the lane information is; we care about the x-axis because this is a perspective-transformed image, so the lanes run roughly straight ahead, parallel to each other, and the two peaks of the histogram are already spaced about the width of a lane apart. The final thing you do, and again I am going over this at a high level, is take the suspected lane positions from the histogram, place a sliding window starting from the bottom of the image at those x values, and keep sliding the window upward, following the white band in the image. I don't know if I did a good job of explaining it, but I hope the picture makes clear what the sliding is doing. Finally, once you have all these sliding windows, which may look like the left-hand side here, you can fit a smooth line through them, which is related to what is called a Hough transform, and once you have the left and right smooth lines you can project them back, using the inverse perspective transformation, onto the original image, and that gives you an idea of where the lanes are.
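Here is a rough sketch of the bird's-eye-view transform and the histogram plus sliding-window search just described. The source and destination points, window count, and margin are placeholders tied to one hypothetical camera setup, not values from the lecture.

```python
import cv2
import numpy as np

# Sketch: warp the edge image to a bird's-eye view, locate the lane bases from a
# column histogram, then slide windows upward and fit a quadratic to each lane.
def fit_lane_lines(edge_img):
    h, w = edge_img.shape
    src = np.float32([[w * 0.45, h * 0.63], [w * 0.55, h * 0.63], [w * 0.9, h], [w * 0.1, h]])
    dst = np.float32([[w * 0.2, 0], [w * 0.8, 0], [w * 0.8, h], [w * 0.2, h]])
    M = cv2.getPerspectiveTransform(src, dst)
    birds_eye = cv2.warpPerspective(edge_img, M, (w, h))

    # Column histogram of the lower half: the two peaks suggest where the lanes start.
    histogram = birds_eye[h // 2:, :].sum(axis=0)
    left_x = int(np.argmax(histogram[: w // 2]))
    right_x = int(np.argmax(histogram[w // 2:])) + w // 2

    nz_y, nz_x = birds_eye.nonzero()

    def sliding_window(x_base, n_windows=9, margin=80):
        ys, xs = [], []
        win_h = h // n_windows
        x = x_base
        for i in range(n_windows):
            y_lo, y_hi = h - (i + 1) * win_h, h - i * win_h
            inside = (nz_y >= y_lo) & (nz_y < y_hi) & (np.abs(nz_x - x) < margin)
            if inside.sum() > 50:                 # re-center on the white band
                x = int(nz_x[inside].mean())
            ys.append(nz_y[inside]); xs.append(nz_x[inside])
        ys, xs = np.concatenate(ys), np.concatenate(xs)
        return np.polyfit(ys, xs, 2)              # x = a*y^2 + b*y + c

    # Return both lane fits plus the inverse warp, for projecting back onto the frame.
    return sliding_window(left_x), sliding_window(right_x), np.linalg.inv(M)
```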
So let's look at how well this works. On relatively straight lanes it does a good job; there is some jitter, but now as we approach a very slight left turn, while we do a good job of figuring out where the lanes are in the near vicinity of the vehicle, you can see that on occasion we completely lose track. Overall I would say it is not that bad, and you can actually see the output of each of the individual steps at the top of this video.

There are, of course, limitations to this approach. The perspective transformation is very specific to the type of camera, its focal length, and its settings. All the thresholds you are using, the region of interest, the sliding-window mechanism, all of it is individually hand-tuned or hand-picked; there is no way of learning it automatically. The algorithm also takes a lot of computation: you are doing so many steps, and in real time you would have to process something like 60 to 100 images per second, which we cannot really do with this kind of computer-vision pipeline. Most importantly, I want to emphasize that however well these steps work, and you can make a case for improving the computation time and the accuracy, it is all very manual. This is what I mean by saying that traditionally, computer vision tasks were solved by stitching these hand-crafted pieces together. It still solves the problem, we saw the output, and it is decent, not very accurate but not that bad either. Another limitation is that there is no way of saying how well you will perform on other data sets; the choices of parameters depend on lighting conditions and so on, so one choice might work in daylight and not at all in fog, or at night when there is glare.

In fact, even on this fairly simple turn, you have the true lane, but then look at what I have highlighted here: the true white line is here, but this dark strip of tarmac running along the line can be interpreted as a lane marking, the shadow can be interpreted as a straight line, and the top of the barrier can be interpreted as a line as well. So it really depends on your region of interest and all the choices you have to make to get it right, and this is a straightforward scenario: there are no other vehicles and nothing is occluded. If we are struggling even here, well, this is reality: the camera shakes, images are not all high definition, they get blurred, and you lose information; this one is difficult even for a human to figure out what is going to happen. You can try perspective transforms, but they don't work that well on that previous image. I don't even know what is going on in this one, but a self-driving car has to deal with these situations as well. This one is interesting too: the line just curves around, and if you have only used a sliding window to detect straight lines, you will miss the curve.

So there are problems. It is doable, but when it is doable it takes a lot of expertise. What CNNs do, on the other hand, is let the data be the expert: give me good labeled data with proper lane markings and let the network figure out what the best features are. In some sense the layers of the network are doing their own internal transformations, filtering, and edge detection, unknown to us; we can probe and visualize the layers, of course, but we cannot control which layer does what. That was a brief detour to convince you of the pain of solving even the most straightforward self-driving problem using domain expertise alone. So now let's jump ship and look at where deep learning can help. This has been an area of a lot of attention and it is always improving. I am going to describe something called LaneNet; not surprisingly, anytime someone builds a network, you look at what it does and append "net" to it. So think about what the label for an image in which you want to detect lanes would look like.
Here is one idea: for every RGB image, my label, the output I want to generate, is a mask where the only non-black pixels are my drivable lane. In other words, I want to train a network that takes the RGB image and learns this mask, which is essentially the same as saying it has learned where the lane is, and then you can project the mask back onto the RGB image. Let me show you the output first and then we'll see how to do it. I think I showed you this before, but now you know what I mean: this is the output of a network which has learned to project this green mask onto every frame of this video. So this is not a classification problem anymore; we are not saying cross or circle, we are actually generating which pixels in my image belong to the class "lane". It is very interesting, and for the remainder of this lecture I want to show you some of the ideas behind it. You can see it performs reasonably well. This is not the only solution, by the way.

Another thing you may want to do, and this overlaps with the vehicle detection video I showed you, is to think about what the ground truth, the data set we can use, would look like for learning bounding boxes around objects of interest, or for learning where the lanes are. Here is one idea: you have the plain image on the left-hand side, and someone manually goes in and labels where the vehicles are, so the yellow boxes are vehicles, then the traffic light, then the pedestrian. This is a manual labeling task, and you can do the same for lanes: in this picture someone has labeled where the vehicle is, and you can draw a polyline or line segments on top of the lane markings, and all this metadata added on top of the image is stored as the ground-truth labels. In this picture we combine everything together: vehicles, a label for the left lane, a label for the right lane, traffic lights, and pedestrians, each with its corresponding color. So you have to painfully draw these ground truths onto real images, and then you can train a network to predict them for an unseen image, the test data.

In the case of lanes, one idea was the segmentation mask, which again someone has to paint over the original image to generate the ground truth. Another idea is that if you are fitting polynomials to the lanes, you can just store their coefficients. In the previous image the left lane has three coefficients and the right lane has three coefficients, and we are simply learning, for every image, what those coefficients are. So if you have a CNN, you can modify it so that instead of predicting the class of an object it predicts the values of these coefficients, and that becomes the lane prediction. These are two different ideas, one is to represent the lane markings as polynomials, the other is to use a mask on the image, but they are closely related.
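For the polynomial-style label, here is a minimal sketch of how annotated lane points could be reduced to regression targets; the point format and helper name are hypothetical, not from any particular data set.

```python
import numpy as np

# The lecture mentions two ways to encode a lane label: a per-pixel mask, or the
# coefficients of a low-order polynomial fitted through the annotated lane points.
# A minimal sketch of the second encoding, assuming each lane is annotated as a
# list of (x, y) pixel coordinates.
def lane_points_to_coeffs(points, degree=2):
    points = np.asarray(points, dtype=float)
    x, y = points[:, 0], points[:, 1]
    # Fit x as a function of y (lanes are closer to vertical in image space),
    # giving degree + 1 numbers that a CNN regression head could be trained to predict.
    return np.polyfit(y, x, degree)

left_coeffs = lane_points_to_coeffs([(310, 700), (360, 600), (420, 500)])  # 3 coefficients
```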
Questions so far? We haven't really looked at how to solve this yet, but is the setting clear, what we want to achieve? We want to figure out what sort of networks to train to solve this problem, whether the output is a bounding box, a polynomial, or a segmentation mask.

So far, what you have done, and what you are also doing in your homework with the crosses and the circles, is similar to what AlexNet did on ImageNet: you have some data, you have your convolutional network, you have fully connected layers, and then you predict your class scores, which softmax can convert into probabilities. But the kind of lane detection tasks we want to achieve are of a different nature; they are not just classification tasks of whether an object is present in the image or not. To begin with, we have semantic segmentation, where the idea is that I want to assign every pixel of my image its own class. It is still related to classification, but the output is not a single label; you are producing a label for every pixel of your image, and the output would look something like this. A slightly different task is: yes, I do care what object is present in my image, here it is a cat, though we care about vehicles, but I also want to draw a bounding box around where that object is in the image. We have seen that to do this you need data where someone has labeled not just which object is present but also its bounding box. This is what is called classification plus localization, where localization means finding the bounding box; here you assume there is a single object: I know there is one object I am looking for, I need to tell you which object it is and where it is located. A more difficult variation is generic object detection, where I don't know how many objects exist in my image, but I need to find and localize all of them. And the most complicated task is what is called instance segmentation, where first of all, if there are multiple objects of the same class, I have to differentiate between them, so I care that this red one is dog one and the green one is dog two, and instead of bounding boxes I actually want to paint the pixels belonging to each object. These are progressively more difficult tasks; we will understand a few of them today and then map them to the lane detection problem.

We begin with semantic segmentation where, like I said, instead of one label for the entire image, which is cross or circle in your homework, or in general the ImageNet-style label with the highest score, I want a label for every pixel of the image. This is different from instance segmentation, by the way: look at the image on the right, we don't care which cow is cow one and which is cow two, everything painted brown is "cow". What we do care about is where the sky is and where the grass is, because every pixel gets its own label. Think about how we can use what we have learned so far to attempt this problem. First of all, it is very desirable; it goes without saying that this is very useful for scene understanding in a self-driving car. If I can do semantic segmentation in real time on my camera stream, I instantly know where the vehicles are, I can even do instance segmentation and differentiate vehicle one from vehicle two, I know where I can drive, say the purple region, I know where the lane markings are, in yellow, where the pedestrians are, where the buildings are, and so on. Here again is the output of semantic segmentation; it is not detecting everything, but in this case I think it is detecting vehicles, pedestrians, and buildings.
So here is one idea, and let's see what you think about it. I have an image and I want to label every pixel in it. What I can propose is this: let me begin with some window or patch size, overlay that patch onto some part of my image, I have given you three examples here, and that patch becomes the input to a regular classifier which just has to produce one output. This is business-as-usual CNN object classification, and whatever label it produces, I assign that label to the center pixel of my patch. So I do all this work to get one label for one pixel every time I pass a patch through the network, and then I have to move the patch around and keep relabeling until I cover the entire image. It is brute force. It will work, but can someone tell me a reason why we shouldn't use this in practice? Yes, it is obviously very computationally inefficient and takes a lot of time, but I would even say that when you move the patch ever so slightly, say with a stride of one or two, the CNN has already broken that patch down into features toward the end of the network, and by treating every patch independently you are not reusing any of that overlapping information; each patch produces just one label for its center pixel. So not only do you need a huge number of forward passes, which is where the computational inefficiency comes from, you are also throwing away useful information. This is a really bad idea and nobody uses it in practice to do segmentation, but it is feasible in theory; this is how you would brute-force it. Did you have a question? Yes, it is possible that if the animal is green and the grass is green you would struggle, so it depends on whether you can learn not just color information but actual features of what the animal looks like, and that is what the CNN is doing.
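A small sketch of this brute-force, patch-at-a-time labeling, mostly to make its cost obvious. `classify_patch` is a placeholder for any ordinary CNN classifier that returns one class index per patch; it is not a real API.

```python
import numpy as np

# Brute-force segmentation: one full classifier forward pass per pixel.
def segment_by_patches(image, classify_patch, patch=32, stride=1):
    h, w = image.shape[:2]                      # assumes an H x W x 3 image
    half = patch // 2
    labels = np.zeros((h, w), dtype=np.int64)
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    for r in range(0, h, stride):
        for c in range(0, w, stride):
            window = padded[r:r + patch, c:c + patch]
            labels[r, c] = classify_patch(window)   # label assigned to the center pixel
    return labels
```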
Here is another idea, so let's wrap our heads around this. Instead of patch-wise classification, you can train a network that is just convolution after convolution. The purpose of a convolution layer is to learn filters automatically, so I can take an input image and apply multiple convolutions, and I will get a big volume as my output. The way I am going to train this is that on one forward pass I train on, let's say, the grass channel, on another pass on the cow channel, and so on for all possible labels. It is not a fully connected network, it is just convolutions after convolutions, learning these internal weights, and at the end I will have a volume with the same resolution as my input image, because there is no max pooling or any subsampling. Since I am doing this for all my channels, say C channels, I can simply look at the highest score at each pixel to produce the pixel labels.

I know that may not have been clear, so let me break this down into a much simpler example and then I'll take some questions. Let's say this is my input; it is obviously a toy representation of the image, and my goal is to allocate every pixel a class label, where the class labels are these five possibilities. I want to produce this semantic segmentation output, basically a map where every pixel has some class label. The way I do this is that I treat every class label as its own ground truth, a separate mask, and I train the convolution layers so that the loss on one pass is computed only on the person channel, the loss on another pass only on the post channel, and so on. In other words, this picture might help: in one forward pass, say I want to predict the second channel of my segmentation layers; this is the filtered image produced at the end of all those convolutions, but the ground truth is just this mask, so I can compute a pixel-wise loss and do backprop through those convolutions. Is the intuition clear? If you do this for all the layers of the segmentation map, you can train a fully convolutional network to produce this output.

And what I meant earlier is this: at the end of training, for a single image, look at this topmost pixel. You will get some prediction of this pixel for the person class, some prediction for the post class, some for grass, some for sidewalk, some for building. So through these convolutions you are generating an entire vector of predictions for every pixel, and if you normalize it and pass it through a softmax, you can simply allocate to that pixel the class with the highest score. That is what I meant by doing convolutions after convolutions so that you never perturb the input image size: you want to keep the same resolution because you want a score for every pixel, and then you just read out the maximum score and assign that class. This output image is really a stack of different layers. Questions on this idea of segmentation, is it clear what we are trying to do and how we do it? It is clever because you are still just doing label prediction, but the problem is that it is very computationally intensive: you have to train all these channels, and by never doing max pooling you preserve the same resolution throughout the network, so your filters and feature maps just stack up and there is no way to reduce them. And no, the ground truth comes from someone who has painted over the actual image, so you won't have two possible labels for a single pixel. Any other questions? This is the intuition; we could obviously spend more time breaking it down, but I want to tell you how semantic segmentation really works, because this still has the problem that we are always working at the original image resolution. If you have, say, 250 by 250 pixels, that is a lot of weights, a lot of forward and backward passes, and a lot of backpropagation for the different channels as well.
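A minimal PyTorch sketch of this "convolutions all the way through" idea, assuming an arbitrary five-class toy problem; the layer widths are my own illustrative choices. The point is only that the loss is an ordinary cross-entropy applied at every pixel, and the prediction is a per-pixel argmax over the class channels.

```python
import torch
import torch.nn as nn

# Every layer keeps the input resolution (3x3 conv, stride 1, pad 1); the last
# layer has one channel per class.
num_classes = 5
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(16, num_classes, kernel_size=3, stride=1, padding=1),
)

images = torch.randn(2, 3, 64, 64)                    # toy batch of RGB images
labels = torch.randint(0, num_classes, (2, 64, 64))   # one class id per pixel (the painted mask)

scores = net(images)                                  # (batch, num_classes, H, W)
loss = nn.CrossEntropyLoss()(scores, labels)          # pixel-wise cross-entropy
loss.backward()

prediction = scores.argmax(dim=1)                     # per-pixel class map, (batch, H, W)
```

The catch, as noted above, is that every feature map here stays at full resolution, which is exactly what makes this version so expensive.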
So finally, here is what we can do. We still want the property that the output, the final feature map, has the same resolution as the input, but we don't want to do every convolution at full resolution because it is computationally prohibitive. Instead, we first run a regular CNN, regular conv and pooling layers, to subsample and reduce the image, so we are learning the features of the image, and at the end, when we have this low-resolution representation of the original image, we do another operation that we haven't learned so far, called upsampling rather than downsampling, where we want to get back to the original resolution of the input image. We know how the downsampling works: it is just pooling, strides, and convolution, and it reduces the input from the original image resolution to something much easier to manage, low resolution but still preserving the patterns of the original image. What we don't know is: what on earth is this upsampling? That is the final piece I will explain, and then all of this will make sense.

Here is one idea, called unpooling. Unpooling is the opposite of max pooling, or pooling in general: you have an input of low dimension and you want to generate an output of higher dimension. Some options are the following. One is called nearest neighbors: for every pixel, I unpool it by a factor of two, so I just create four pixels with the same value. All the nearest neighbors get the same value as the input; it is naive, but it is one way to increase the resolution. Another option is what is called bed of nails, where you take the input value, keep it in one fixed location of the unpooled 2x2 block, and fill everything else with zeros. In this case I am registering the input value at the top-left corner of each 2x2 block. These seem like heuristics, and they kind of are, and it is completely reasonable to ask why top-left and not top-right or anywhere else. By the way, it is called bed of nails because most of the pixels are zero, the output is sparse, and you just have some spikes, or nails; that is loosely where the term comes from.

So can we do something better and smarter? Here is another idea. Remember max pooling: if you have forgotten, you have, say, a 2x2 pooling window, you simply take the maximum value, which is 5 here, and register it as the output, so we went from a 4x4 to a 2x2. What you do in this segmentation architecture is that for every pooling layer in your encoder you have a corresponding, symmetric unpooling layer, and instead of doing bed of nails or nearest neighbors, the idea is to keep track of which location the maximum value came from when the corresponding max pooling layer was applied. So when the image was being downsampled, we remember that the 5 was in this bottom-right location of its quadrant, and when the symmetric unpooling occurs, we place the value back in that same location. In other words, you keep track of where the maximum values were in the input to the pooling layer, and those are the locations you use when you put the pixel values back. This is what is called in-network upsampling: you are simply keeping a record of the locations where the maxima occurred, and that is the unpooling process. Any questions about unpooling? So far, unpooling is just a fixed function where we make a choice, nearest neighbors, bed of nails, or this fancier max unpooling, but we are not learning anything when we upscale.
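A small PyTorch sketch of the three unpooling choices on toy tensors; the numbers are made up, and the max-unpooling part relies on the pooling layer returning the indices of its maxima so the symmetric unpooling layer can put the values back in place.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[[[1.0, 2.0],
                    [3.0, 4.0]]]])            # toy 2x2 input, shape (1, 1, 2, 2)

# 1) Nearest neighbors: each value is simply copied into a 2x2 block.
nearest = F.interpolate(x, scale_factor=2, mode="nearest")

# 2) "Bed of nails": keep the value in one fixed corner, zeros everywhere else.
bed_of_nails = torch.zeros(1, 1, 4, 4)
bed_of_nails[:, :, ::2, ::2] = x

# 3) Max unpooling: remember *where* each max came from during pooling and put
#    the value back in exactly that location on the way up.
pool = nn.MaxPool2d(2, return_indices=True)
unpool = nn.MaxUnpool2d(2)
big = torch.tensor([[[[1.0, 2.0, 6.0, 3.0],
                      [3.0, 5.0, 2.0, 1.0],
                      [1.0, 2.0, 2.0, 1.0],
                      [7.0, 3.0, 4.0, 8.0]]]])
pooled, indices = pool(big)         # pooled = [[5, 6], [7, 8]]
restored = unpool(pooled, indices)  # 5, 6, 7, 8 return to their original positions, zeros elsewhere
```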
So far this is just a fixed choice. The learning actually happens with something called transpose convolution; some people call it deconvolution, which is not the correct term, so I think transpose convolution conveys it better. Let me quickly explain what it is. First, say you have a 4x4 input and a 3x3 filter with stride 1 and pad 1; then the output size never changes. By the way, I hope you have noticed by now that the output size is (N - F + 2P) / S + 1, the image size minus the filter size plus twice the padding, divided by the stride, plus one. With pad 1 you effectively increase the image size by two, because there is a pad of one on either side; with a filter of size three you get N + 2 - 3 = N - 1; with stride 1 you divide by one and then add one, so everything cancels and you are always left with the same output size as your input. I hope you figured this out on your own during the assignment; that is why 3x3 with stride 1 and pad 1 is used so much, and thanks to VGG-16, that is how we came across it.

To recap, what was convolution? It was this idea that you slide the filter over all possible positions and take a dot product, and here is an animation showing how this is done in RGB space. By the way, a quick detour: once you have these per-channel filters, one idea is to sum their responses into a single output as well. Some architectures do this; it keeps the network from stacking up too many maps at each layer, and when you sum them you can also add a bias term. That was a quick aside since we revisited convolution. So convolution is this idea that we slide a filter over the image, take the element-wise product and sum it, a dot product, and we get the output feature map. Notice that if we change the stride of the convolution, we also automatically subsample: with stride 2 we are not going to cover all the pixels in the same manner, and we are dividing by 2 in the denominator, so the output dimension is reduced. This is just a recap of regular convolution; nothing new here.

Transpose convolution is the following idea; let me explain it and then I'll show you another example. You want to go from a 2x2 image to a 4x4 one. One way is to use a fixed mapping, unpooling, applied in the mirror order of the pooling you used during downsampling. But what do you use in place of the convolutions from the downsampling path? You use transpose convolution. The idea is that I still have a filter with its own weights, but now I read a value from my low-resolution input and simply multiply the whole filter by that value. Say the input value is this red pixel: I have a 3x3 filter with its own weights, I multiply all of those weights by the input value and write the result into the output; because I have pad 1, I am also writing some values I don't really care about. Then I move over, multiply the whole filter by the value of the blue pixel, and write those results in the next location, and wherever the outputs overlap, we simply sum the new value with the previous value.
The trick is that the weights of these filters are learnable, just like in a regular convolution. Let me give you another example to make this clear; a 1D example makes it very obvious. Say this is a very simple, non-symmetric input and this is my filter. For the first part, I multiply this input value by my entire filter and store that as the output; for the second part, I multiply the same filter by the next value and store that; and where the two overlap, I sum them up. So what is learnable in this entire picture? It is the x, y, z of the filter, which can be tuned using backprop. So not only are we learning filters that extract features from the original high-resolution image down to the low-level representation; during upsampling as well, we can learn filters like x, y, z that perform the upsampling in a way that leads to the minimum loss at the output. Let me pause for questions, because this is new. Is it clear that this is still convolution? It is not an element-wise dot product, that is the difference; we are just scaling the filter by the value of the input, in the hope that during backpropagation the values of the filter adjust accordingly.
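A tiny NumPy sketch of that 1D picture: each input value scales the whole filter, the scaled copies are placed a stride apart, and overlaps are summed. The concrete numbers are made up; the filter entries stand in for the learnable x, y, z.

```python
import numpy as np

# 1D transpose convolution by hand: place a scaled copy of the kernel for every
# input value, `stride` positions apart, summing wherever the copies overlap.
def transpose_conv1d(inputs, kernel, stride=2):
    out = np.zeros(stride * (len(inputs) - 1) + len(kernel))
    for i, value in enumerate(inputs):
        start = i * stride
        out[start:start + len(kernel)] += value * kernel
    return out

a, b = 2.0, 3.0
x, y, z = 1.0, 4.0, 1.0                     # stand-ins for the learnable weights
print(transpose_conv1d(np.array([a, b]), np.array([x, y, z])))
# -> [a*x, a*y, a*z + b*x, b*y, b*z] = [2, 8, 5, 12, 3]
```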
Finally, there is another picture which will hopefully make this clear. You have your input image, say 4x4; the actual values don't matter, the block diagram matters more, and you have your 3x3 convolution filter. Here we have no zero padding, so you can already see there are only four opportunities to slide the filter, which is why the output is 2x2. If you look at this convolution, these are the four possible operations: the top row of the filter gets multiplied by the top row of my image, and I don't care about the fourth value; the second row of the filter gets multiplied by the first three elements of the second row of the image, and again I don't care about the fourth; same for the third row, and I don't care about the fourth row of the image at all.

So here is another way of looking at convolution. This operation, which has four distinct parts, I can represent in one single calculation in the following manner: I take my filter and expand it into a large matrix. Look at how many elements are in each row of this matrix: sixteen, from zero to fifteen, which is the same number of elements as in the original 4x4 image, so every row of this matrix gets multiplied by my entire flattened image to give me one output value directly. Let me convince you why that works. Look at the top row of the matrix: it starts with the first row of the filter, one, four, one, and then a zero. Why a zero? Because one-four-one only gets multiplied by the first three elements of the image row, not the fourth. The next part of the top row is the second row of the filter followed by a zero, because I again want to ignore the fourth element, then the third row of the filter followed by a zero, and then four zeros. So the top row reads: first filter row, zero, second filter row, zero, third filter row, zero, then four zeros. Now, this is a 4x16 matrix, and my image has 16 pixels, so I can interpret the image as a 16x1 vector; 4x16 multiplied by 16x1 gives me a 4x1 output, and this top row multiplied by this vector gives me this particular value. So instead of doing the striding and everything, I can build this convolution matrix once, and convolution becomes a matrix multiplication. That is what convolution really is: a big convolution matrix, derived from the filter and the padding and so on, multiplied by the flattened image; you get an output which is 4x1, and you can reshape it into 2x2.

So what is transpose convolution? It is going backwards. If someone gives me the weights of the transpose convolution, I can multiply them by my small image: the weights are now a 16x4 matrix, I multiply that by my 4x1 flattened image to get a 16x1 flat image, and I can reshape that into 4x4. So eventually I have gone from 2x2 to 4x4, and the question of which oracle gives you these weights is answered by backpropagation. This is another way to look at the deconvolution, or rather the transpose convolution, layer.
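To make the matrix view concrete, here is a NumPy sketch that builds the 4x16 convolution matrix for a 3x3 filter on a 4x4 image (stride 1, no padding) and then uses its transpose to map a 2x2 map back up to 4x4. The filter and image values are arbitrary stand-ins for the ones on the slide.

```python
import numpy as np

# Convolution as a single matrix multiply: a 3x3 filter over a 4x4 image (stride 1,
# no padding) has exactly 4 placements, so it becomes a 4x16 matrix acting on the
# flattened image. The transpose of that matrix maps 4 values back up to 16.
def conv_matrix(kernel, in_size=4):
    k = kernel.shape[0]
    out_size = in_size - k + 1                        # (N - F)/1 + 1
    rows = []
    for r in range(out_size):
        for c in range(out_size):
            m = np.zeros((in_size, in_size))
            m[r:r + k, c:c + k] = kernel              # filter placed at one position
            rows.append(m.ravel())                    # one 16-element row per placement
    return np.stack(rows)                             # shape (4, 16)

kernel = np.array([[1., 4., 1.],
                   [1., 4., 3.],
                   [3., 3., 1.]])
image = np.arange(16, dtype=float).reshape(4, 4)

C = conv_matrix(kernel)                               # 4 x 16 convolution matrix
feature = (C @ image.ravel()).reshape(2, 2)           # ordinary convolution output

upsampled = (C.T @ feature.ravel()).reshape(4, 4)     # 16 x 4 times 4 x 1: back to 4 x 4
```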
So let's go back to where we started. The goal was to generate scores at the final layer for every pixel. The problem was that if every layer is a full-resolution convolution, it is very expensive, so we use subsampling to extract the important features, but since we care about the original resolution, we upsample back to it, and not only are you learning the filters of the downsampling layers, you are also learning the filters of the upsampling, transpose convolution, layers. At the end you do the same thing as before: you take the maximum score across all the segmentation channels. All of that to give you the intuition; it is still all CNNs, by the way, nothing really new has happened. Yes, we introduced the transpose convolution, which in matrix form is very similar to regular convolution, but other than that we haven't done anything drastically different, and this machinery is enough to do semantic segmentation. Questions on this, or concerns about whether it is going to work? The output here is indeed, for every pixel, a prediction of which class that pixel belongs to.

Okay, so very quickly, how do we solve the classification-plus-localization problem? It is actually more straightforward than you would expect. The idea is that not only do we want to detect what object is in the image, which we already know how to do with a regular CNN, we also want to draw a bounding box. So suppose you had labeled data with bounding boxes; what does the label of a bounding box look like? It looks like this: you give me the x and y coordinates of the top-left pixel where the bounding box starts, that is two parameters I have to predict, and you give me the width and the height of the box. That is all I need to generate a rectangle: one corner plus a width and a height, so I can describe the box with four parameters. What you do is run the regular AlexNet-style CNN, and you have the fully connected layers that give you the prediction of what object it is, but then you also have a clone of these fully connected layers, a separate one, often called a different head of the network, and this head is just a regression: its job is to predict those four parameters for the image, and it is trained on examples where the four parameters were labeled. So with the same network you are accomplishing two tasks: you are predicting the object type and you are estimating its location. You have a label loss, the softmax loss, and you have a regression loss, your prediction minus the ground truth, and you can backpropagate on the combination of the two. One thing I want to cover in the next lecture is that you don't have to retrain the entire network every time; you can exploit already existing networks.

With that, let me jump back to where I started this lecture and close the loop. Going back to lane detection, here is a data set showing what the ground truth may look like. This data set is called TuSimple; it is not actually that simple to solve, so the name is a bit ironic. It has thousands of images, recorded at 20 frames per second, split into, I think, roughly a 0.7 fraction for training and the rest for testing. The annotations here are polylines, not image masks, and just as with the ImageNet competition, there are now lane detection competitions as well. One of the networks that did really well and essentially started this field is called LaneNet, and by now there are other networks that slightly outperform it. Here are some examples of the input images and the ground truth: even though the ground truth is given as polynomial lines, you can easily generate masks from it, so this becomes a segmentation problem. I have a bunch of images, each with a mask as the output, and I want to learn a CNN, with the downsampling and upsampling, that produces that mask. Over here is a slightly different image with different masks; there are different types of labels as well. TuSimple uses this kind of label, and there are other data sets that also capture the width of the lanes, if that is what you care about; a minor detail, but such data sets exist.

So you take a test image from TuSimple and produce your output of where you think the lanes are, this mask or segmented image. Why is it a binary bitmap? Because you only have one class: lane or not lane. In fact, I haven't told you how, but you can also do instance segmentation here, where you differentiate between the lanes, for example ordering them from left to right. It does pretty well, actually better than the non-deep-learning approach I showed you before, on both straights and curves, so that is promising. The burden here is not on the features, that part is easy; the burden is labeling the data to train the network, and that is where all the cost has gone. At the end of the day, you have a network that takes the input image and either produces a mask, or produces waypoints or polynomial coefficients as regression outputs. In the previous case we produced image masks as the output, but you can also train a CNN so that it produces waypoints or coordinates for the lane markings; you have the ground truth and your prediction, and you can train the entire network on some Euclidean loss between the two. The final thing I will touch on today is that when it comes to predicting lanes or bounding boxes, which is no different from predicting these lane curves since you are just predicting a handful of regression parameters, one very good question is: what counts as a good prediction?
The green is the ground truth and the red is your prediction. Would you say this is a good prediction, or do we really need exact bounding boxes? It is a very good question. Maybe your response is: I only care about figuring out whether it is a stop sign or not, I don't really care exactly where in space the stop sign is, I am not flying a drone here, I am just driving. And I would say, fine, that is reasonable. But then what about this? Here is something you care about a lot: the green is the ground truth of where the car is. On the top left you are neglecting the front of the car, here you are neglecting the rear of this truck, here you are over-approximating so that you don't make an error, and this one looks pretty good. So the question arises: how do we even evaluate these outputs of localization or segmentation? What is a good prediction, given that we will never get the exact mask or the exact bounding box?

One popular idea is what is called intersection over union. Say one of these boxes is the ground truth and one is your prediction: you compute the area of the overlap between the two and divide it by the total area of their union. The larger the union relative to the overlap, the smaller your IoU, and the higher the overlap, the higher your IoU. For an IoU of 0.5 it may look something like this, and at 0.9 it already looks, by visual inspection, like a very good match, so it is a very good metric, and you can even train on it as a loss. I don't want to spend too much time here, but this is what is happening for the lane detection problem too: the blue is the true lane, the red is the prediction, and the green is their intersection. So you can use IoU when you are generating the segmentation map as the output; in the other case, when you are generating waypoints, you can use a Euclidean loss between the distances of the waypoints. Either one works. We can obviously do this over many, many images and compare against other networks as well, and LaneNet does very well.
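A minimal sketch of intersection over union for both representations mentioned here: boxes encoded as (x, y, width, height), matching the four-parameter box label described earlier, and binary masks. The function names are my own.

```python
import numpy as np

# IoU for two axis-aligned boxes given as (x, y, width, height), (x, y) = top-left corner.
def box_iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # width of the overlap
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # height of the overlap
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# The same idea applied pixel-wise to segmentation masks (e.g. lane masks).
def mask_iou(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union > 0 else 0.0
```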
One problem with LaneNet is that it treats every image as its own separate instance; there is no notion of hysteresis, no time aspect. You can actually exploit the fact that from one image to the next, especially at 60 frames per second, the lane marking is not going to bend abruptly; it has to be consistent with the previous prediction. So every image shouldn't be treated independently to predict its own lane markings; we should use some history, some autoregression or lag from previous predictions, to inform the next prediction. That is the idea behind this modified LaneNet, which is from Nvidia. On the left you see the raw LaneNet predictions, which are accurate but jittery, or at least not high resolution, and on the right you see an output that uses this idea of letting history, the previous frames, inform the prediction for the next frame. You can see that the prediction on the right-hand side, the one using historical information, is much more accurate, even in the presence of shadows on the lane markings. Here we have a lot of significant shadows and LaneNet is struggling, it actually overlaps the lanes with some cars; this is a case of an extreme bend, and LaneNet is struggling at the horizon, but the version with history is again very accurate through an extreme right-hander; and this is in low visibility, where the right-hand network is just the modified LaneNet with the temporal aspect. So this is pretty remarkable: even under very low visibility, foggy conditions, the prediction is accurate over a long horizon, a long distance ahead. This one is at night, with some glare, some water on the lens, and some reflections from the road, and we are still able to predict with very high accuracy.

Okay, I will stop here because I don't want to jump into the last topic of this module. In the next lecture I will wrap up this discussion of how we can train our own network to do lane detection, and then I also want to cover some other topics about data-driven and deep learning methods for autonomous driving; I will present an assortment of the different networks that people have tried and that are interesting, and then we will wrap up the course. I can't believe it has already been three months; maybe you feel it, but I don't. I will see you then.
Info
Channel: Madhur Behl
Views: 9,604
Rating: 4.9767442 out of 5
Keywords: Automotive cyber physical systems, Self driving cars, Data driven modeling, Madhur Behl, uva, Link lab, Cps, Perception, Planning, Control, Edge cases, Neural network, End to end driving, Cnn, Convolution neural network, semantic segmentation, unpooling, tusimple, lane detection
Id: Gs5HlHKqAYQ
Length: 68min 59sec (4139 seconds)
Published: Tue Dec 03 2019