Lecture 11 | Detection and Segmentation

Captions
- Hello, hi, I want to get started. Welcome to CS 231N Lecture 11. Today we're going to talk about detection, segmentation, and a whole bunch of other really exciting topics around core computer vision tasks. But as usual, a couple of administrative notes. Last time you took the midterm, so we didn't have lecture; hopefully that went okay for all of you. We're going to work on grading the midterm this week, but as a reminder, please don't have any public discussions about the midterm questions or answers until at least tomorrow, because some people are still taking makeup midterms today and throughout the rest of the week, so we ask that you refrain from talking publicly about the midterm questions. Why don't you wait until Monday? [laughing] Okay, great. We'll get the midterms back to you as soon as we can. We're also starting to grade assignment two, so there's a lot of grading being done this week and the TAs are pretty busy. Also a reminder: hopefully you've been working hard on your projects now that most of you are done with the midterm. Your project milestones are due on Tuesday. I know some people decided to switch projects after the proposal and some teams reshuffled a little bit; that's fine, but your milestone should reflect the project that you're actually doing for the rest of the quarter. I know there's been a lot of worry and stress on Piazza about assignment three. We're working on that as hard as we can, but it's a bit of a new assignment, it's changing a bit from last year, so it will be out as soon as possible, hopefully today or tomorrow. We promise that whenever it comes out you'll have two weeks to finish it, so try not to stress about that too much. I'm pretty excited about assignment three; I think it will cover a lot of really cool material. Another thing: last time in lecture we mentioned the Train Game, which is a really cool thing we've been working on as a side project. This is an interactive tool that you can use to explore the process of tuning hyperparameters in practice. It's totally not required for the course, totally optional, but if you do participate we will offer a small amount of extra credit for those who do well. We'll send out more details later this afternoon on Piazza. But just a bit of a demo of what this thing is. We've changed the name from Train Game to HyperQuest, because you're questing to find the best hyperparameters for your model. It's an interactive tool that you can use to explore the tuning of hyperparameters interactively in your browser. You'll log in with your student ID and name, fill out a little survey about your experience with deep learning, and then read some instructions. In this game you'll be shown a random data set on every trial. This data set might be images or it might be vectors, and your goal is to train a model by picking the right hyperparameters interactively to perform as well as you can on the validation set of this random data set.
It will keep track of your performance over time and there will be a leaderboard, which should be really cool. Every time you play the game, you'll get some statistics about your data set. In this case we're doing a classification problem with 10 classes. You can see down at the bottom the statistics for this random data set: we have 10 classes, the input data size is 3 by 32 by 32, so this is some image data set, and in this case we have 8,500 examples in the training set and 1,500 examples in the validation set. These are all random and will change a little bit every time. Based on these data set statistics you'll make some choices for your initial learning rate, your initial network size, and your initial dropout rate. Then you'll see a screen where it runs one epoch with those chosen hyperparameters and shows you two plots on the right: one is your training and validation loss for that first epoch, and the other is your training and validation accuracy for that first epoch. Based on the gaps you see in these two graphs you can make choices interactively to change the learning rates and hyperparameters for the next epoch. You can choose to continue training with the current or changed hyperparameters, you can stop training, or you can revert to the previous checkpoint in case things got really messed up. So here we decide to continue training, and you could go and set new learning rates and new hyperparameters for the next epoch of training. You can also, and this is kind of interesting, grow the network interactively during training in this demo. There's a cool trick from a couple of recent papers where you can either take existing layers and make them wider or add new layers in the middle of training while still maintaining the same function in the network, so you can use that to increase the size of your network mid-training, which is kind of cool. You'll make choices over several epochs, and eventually your final validation accuracy will be recorded, and there will be a leaderboard that compares your score on that data set to some simple baseline models. Depending on how well you do on this leaderboard we'll offer small amounts of extra credit for those who choose to participate. Again, this is totally optional, but I think it can be a really cool learning experience to play around with and explore how hyperparameters affect the learning process. It's also really useful for us: you'll help science out by participating in this experiment. We're pretty interested in seeing how people behave when they train neural networks, so you'll be helping us out as well if you decide to play. But again, totally optional, up to you. Any questions on that? So the question was whether this will eventually become a paper. Hopefully, but it's really early stages of this project so I can't make any promises. I think it'll be really cool though. [laughing] Yeah, so the question is how can you add layers during training? I don't really want to get into that right now, but the paper to read is Net2Net, where Ian Goodfellow is one of the authors, and there's another paper from Microsoft called Network Morphism. If you read those two papers you can see how this works.
Okay, so a bit of a reminder: last time, before the midterm, we talked about recurrent neural networks. We saw that recurrent neural networks can be used for different types of problems: in addition to one-to-one we can do one-to-many, many-to-one, and many-to-many. We saw how this applies to language modeling, and we saw some cool examples of applying recurrent networks to model different sorts of text at the character level, where we sampled artificial math, Shakespeare, and C source code. We also saw how similar ideas can be applied to image captioning by connecting a CNN feature extractor with an RNN language model, and we saw some really cool examples of that. We also talked about different types of RNNs. We talked about the Vanilla RNN, which is sometimes called a Simple RNN or an Elman RNN, so you'll see all of these terms in the literature. We also talked about the Long Short Term Memory, or LSTM. The LSTM has this crazy set of equations, but it makes sense because it improves gradient flow during back propagation and helps the model capture longer-term dependencies in our sequences. So today we're going to switch gears and talk about a whole bunch of different exciting tasks. So far we've mostly been talking about the image classification problem; today we're going to talk about other computer vision tasks where you actually want to say things about the spatial layout of pixels inside your images. We'll see segmentation, localization, detection, and a couple of other computer vision tasks, and how you can approach them with convolutional neural networks. As a bit of a refresher, the main thing we've talked about in this class so far is image classification. Here some input image comes in, that image goes through some deep convolutional network, and that network gives us some feature vector of maybe 4096 dimensions in the case of AlexNet. Then from that final feature vector we have some final fully-connected layer that gives us 1000 numbers for the class scores we care about, where 1000 is the number of classes in ImageNet in this example. At the end of the day, we input an image and output a single category label saying what the content of the entire image is as a whole. But this is maybe the most basic possible task in computer vision, and there's a whole bunch of other interesting tasks we might want to solve using deep learning. So today we'll step through several of these different tasks and see how they all work with deep learning. We'll talk about each problem in more detail as we get to it, but as a summary: we'll talk first about semantic segmentation, then classification and localization, then object detection, and finally a couple of brief words about instance segmentation. So first is the problem of semantic segmentation. In semantic segmentation, we want to input an image and output a category decision for every pixel in that image. The input image in this example is a cat walking through a field; he's very cute.
In the output we want to say, for every pixel, is that pixel a cat, or grass, or sky, or trees, or background, or some other category. So we have some set of categories, just like in the image classification case, but now rather than assigning a single category label to the entire image, we want to produce a category label for each pixel of the input image. This is called semantic segmentation. One interesting thing about semantic segmentation is that it does not differentiate instances. In this example on the right we have an image with two cows standing right next to each other, and in semantic segmentation we're just labeling all the pixels independently with the category of that pixel. So in a case like this, where we have two cows right next to each other, the output does not distinguish between the two cows; instead we just get a whole mass of pixels that are all labeled as cow. This is a bit of a shortcoming of semantic segmentation, and we'll see how we can fix it later when we move to instance segmentation. But for now we'll just talk about semantic segmentation. One potential approach for attacking semantic segmentation is through classification, using a sliding window. You might imagine taking the input image and breaking it up into many small, tiny local crops; in this example we've taken maybe three crops from around the head of this cow. Then you could take each of those crops and treat this as a classification problem: for this crop, what is the category of the central pixel? We could use all the same machinery that we've developed for classifying entire images, but now apply it to crops rather than to the whole image. This would probably work to some extent, but it's not a very good idea. It would be super computationally expensive: because we want to label every pixel in the image, we would need a separate crop for every pixel, and running forward and backward passes through all of those would be extremely expensive. Moreover, if you think about it, we can actually share computation between different patches: if you're trying to classify two patches that are right next to each other and overlap, the convolutional features of those patches go through the same convolutional layers, so a lot of the computation could be shared when applying this type of approach to separate patches of the image. So this is actually a terrible idea, nobody does this, and you should probably not do it either, but it's at least the first thing you might think of for semantic segmentation. The next idea, which works a bit better, is the idea of a fully convolutional network.
So rather than extracting individual patches from the image and classifying them independently, we can imagine our network being one giant stack of convolutional layers with no fully connected layers at all. In this case we just have a bunch of convolutional layers, maybe all three by three with zero padding, so that each convolutional layer preserves the spatial size of the input. If we pass our image through a whole stack of these convolutional layers, the final convolutional layer can output a tensor of size C by H by W, where C is the number of categories we care about, and you can interpret this tensor as giving classification scores for every pixel of the input image at every location. We can compute this all at once with one giant stack of convolutional layers, and then train this thing by putting a classification loss at every pixel of the output, taking an average over those pixels in space, and training the network through normal back propagation. Question? Oh, the question is how do you get training data for this? It's very expensive: the training data requires labeling every pixel of the input images. There are tools online where you can draw contours around objects and fill in regions, but in general this kind of training data is very expensive to get. Yeah, the question is what is the loss function? Since we're making a classification decision per pixel, we put a cross-entropy loss on every pixel of the output: we have the ground truth category label for every pixel, we compute a cross-entropy loss between every output pixel and its ground truth label, and then take either a sum or an average over space, and then a sum or average over the mini-batch. Question? Yeah, the question is do we assume that we know the categories? Yes, we assume we know the set of categories up front, just like in the image classification case: based on our data set we know at the start of training that there are maybe 10 or 20 or 100 classes we care about, and we're fixed to that set of classes. So this model is relatively simple, and you can imagine it working reasonably well assuming you tune all the hyperparameters right, but there's a problem. In this setup, since all the convolutions keep the same spatial size as the input image, this would be super expensive. If you wanted convolutions with maybe 64 or 128 or 256 channels, which is pretty common in these networks, then running those convolutions on a high resolution input image over a sequence of layers would be extremely computationally expensive and would take a ton of memory. So in practice you don't usually see networks with this architecture. Instead you tend to see networks that look something like this, where we have some downsampling and then some upsampling of the feature map inside the network.
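Before getting to that downsample/upsample design, here is a minimal sketch of the plain fully convolutional idea just described, written in PyTorch; the layer widths, spatial size, and number of classes are made-up placeholders, not values from the lecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a fully convolutional network for semantic segmentation:
# every layer is a 3x3 conv with padding 1, so spatial size is preserved,
# and the last layer outputs C score maps (one per category).
num_classes = 21  # hypothetical number of categories

fcn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, num_classes, kernel_size=3, padding=1),  # C x H x W scores
)

images = torch.randn(2, 3, 64, 64)                   # a fake mini-batch
labels = torch.randint(0, num_classes, (2, 64, 64))  # per-pixel ground truth

scores = fcn(images)               # shape (2, C, 64, 64)
# Cross-entropy applied at every pixel, averaged over space and the batch.
loss = F.cross_entropy(scores, labels)
loss.backward()
```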
So rather than doing all the convolutions at the full spatial resolution of the image, we might go through a small number of convolutional layers at the original resolution, then downsample that feature map using something like max pooling or strided convolution, and keep alternating convolutions and downsampling, which looks a lot like the classification networks you've seen. But the difference is that rather than transitioning to a fully connected layer like you would in an image classification setup, we instead increase the spatial resolution of our predictions in the second half of the network, so that the output can be the same size as the input image. This ends up being much more computationally efficient, because you can make the network very deep while working at a lower spatial resolution for many of the layers in the middle of the network. We've already seen examples of downsampling in convolutional networks: strided convolutions or various types of pooling reduce the spatial size inside a network. But we haven't really talked about upsampling, so the question you might be wondering is: what do these upsampling layers actually look like, and what are our strategies for increasing the size of a feature map inside the network? Sorry, was there a question in the back? Yeah, the question is how do we upsample, and the answer is that's the topic of the next couple of slides. [laughing] One strategy for upsampling is something like unpooling. We have the notion of pooling to downsample, like average pooling or max pooling, where we take a spatial average or maximum within the receptive field of each pooling region. One analog for upsampling is the idea of nearest neighbor unpooling. Here on the left we see an example of nearest neighbor unpooling where our input is a two by two grid and our output is a four by four grid: we've done a two by two, stride two nearest neighbor unpooling, where we just duplicate each element over every position of its two by two unpooling region. Another thing you might see is bed of nails unpooling, where again we have a two by two receptive field for each unpooling region, but now we make everything zero except for one element: in this case we've taken each input value and always put it in the upper left hand corner of its unpooling region, and everything else is zeros. This is kind of like a bed of nails because the zeros are very flat and then you've got these values poking up at the non-zero positions. Another thing you sometimes see, which was alluded to by the question a minute ago, is the idea of max unpooling. A lot of these networks tend to be symmetrical, with a downsampling portion and an upsampling portion and a correspondence between the two.
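Here is a tiny NumPy sketch of the two fixed unpooling schemes just described, nearest neighbor and bed of nails, applied to a made-up two by two input; max unpooling is discussed next:

```python
import numpy as np

x = np.array([[1., 2.],
              [3., 4.]])  # a made-up 2x2 feature map

# Nearest neighbor unpooling (2x2, stride 2): duplicate each value
# over its entire 2x2 output region.
nearest = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]

# "Bed of nails" unpooling: put each value in the upper-left corner of
# its 2x2 output region and fill the rest with zeros.
nails = np.zeros((4, 4))
nails[::2, ::2] = x
# [[1. 0. 2. 0.]
#  [0. 0. 0. 0.]
#  [3. 0. 4. 0.]
#  [0. 0. 0. 0.]]
```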
So sometimes you'll see this idea of max unpooling, where each upsampling layer is associated with one of the pooling layers in the first half of the network. During the downsampling half, when we do max pooling, we remember which element of each receptive field was selected as the max. Then, in the upsampling half, we do something like the bed of nails unpooling, except rather than always putting each element in the same position, we stick it into the position that was used in the corresponding max pooling step earlier in the network. I'm not sure if that explanation was clear, but hopefully the picture makes sense. Yeah, so then you fill the rest with zeros, and you stick the elements from the low resolution map into the high resolution map at the positions where the max pooling took place in the corresponding pooling layer. Okay, so that's kind of an interesting idea. Sorry, question? Oh yeah, the question is why is this a good idea, why might this matter? The idea is that when we're doing semantic segmentation we want our predictions to be pixel perfect; we want to get sharp boundaries and tiny details in our predicted segmentation. With max pooling you're losing spatial information in some sense, because after pooling you don't know where in the local receptive field that feature value came from. So if you unpool by putting the value back into the same slot, that might help us handle fine details a little bit better and preserve some of the spatial information that was lost during max pooling. Question? The question is does this make things easier for back prop? I don't think it changes the back prop dynamics too much, because storing these indices is not a huge overhead; they're pretty small in comparison to everything else. Another thing that you'll see sometimes is the idea of transpose convolution. The various types of unpooling we just talked about, bed of nails, nearest neighbor, max unpooling, are all fixed functions; they're not learning how to do the upsampling. If you think about something like strided convolution, that's a learnable layer that learns the way the network wants to perform downsampling. By analogy, there's a type of layer called transpose convolution that lets us do learnable upsampling: it both upsamples the feature map and learns weights for how it wants to do that upsampling. And this is really just another type of convolution, so to see how it works, remember how a normal three by three, stride one, pad one convolution works.
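Before the convolution recap, here is a minimal PyTorch sketch of the max pooling / max unpooling pairing just described; the shapes are made up, and the point is only that MaxPool2d can return the argmax indices, which MaxUnpool2d then uses to place values back in their original positions:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 8, 16, 16)        # made-up feature map

# Downsampling half: remember which position in each 2x2 window was the max.
pooled, indices = pool(x)            # pooled: (1, 8, 8, 8)

# ... convolutions at the lower resolution would go here ...

# Upsampling half: place each value back at its remembered position,
# filling the rest of each 2x2 window with zeros.
upsampled = unpool(pooled, indices)  # (1, 8, 16, 16)
```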
For the kind of normal convolution that we've seen many times now in this class, our input might be four by four and our output might be four by four. We have a three by three kernel, we plop that kernel down at the corner of the image, take an inner product, and that inner product gives us the value of the activation at the upper left hand corner of the output. We repeat this for every receptive field in the input. Strided convolution looks pretty similar, except our input is four by four and our output is two by two. We still have a three by three filter that we plop down in the corner of the image and take an inner product to compute one value of the output. But now, rather than plopping down the filter at every possible point of the input, we move the filter by two pixels in the input every time we move by one pixel in the output. So the stride of two gives us a ratio between how much we move in the input versus how much we move in the output, and a strided convolution with stride two downsamples the feature map by a factor of two in a learnable way. A transpose convolution is sort of the opposite: here our input is two by two and our output is four by four, but the operation we perform is a little bit different. Rather than taking an inner product, we take the value of our input feature map at the upper left hand corner, which is some scalar, multiply the filter by that scalar, and copy those weighted values into the corresponding three by three region of the output. So rather than taking an inner product between the filter and the input, the input gives weights that we use to weight the filter, and the output is built out of weighted copies of the filter. And we do the same ratio trick to upsample: when we move one pixel in the input, we plop our filter down two pixels away in the output. The blue pixel in the input is some scalar value; we multiply the filter by that scalar and copy the weighted filter values into the new region of the output. The tricky part is that these receptive fields in the output can overlap, and where they overlap we just sum the results. You can imagine repeating this process everywhere, and it ends up doing a learnable upsampling, where we use learned convolutional filter weights to upsample the feature map and increase the spatial size. By the way, you'll see this operation go by a lot of different names in the literature.
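In PyTorch this learnable upsampling is available as ConvTranspose2d; here is a quick shape check of the two-by-two to four-by-four case just described. The padding choices below are one assumption that makes the sizes work out; other conventions exist:

```python
import torch
import torch.nn as nn

# A 3x3, stride 2 transpose convolution that maps a 2x2 input to a 4x4 output.
# Output size follows (H_in - 1)*stride - 2*padding + kernel_size + output_padding.
upsample = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                              kernel_size=3, stride=2,
                              padding=1, output_padding=1)

x = torch.randn(1, 1, 2, 2)
y = upsample(x)
print(y.shape)  # torch.Size([1, 1, 4, 4]) -- spatial size doubled

# Compare with a stride 2 (downsampling) convolution going the other way:
downsample = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
print(downsample(y).shape)  # torch.Size([1, 1, 2, 2])
```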
Sometimes this gets called deconvolution, which I think is kind of a bad name, but you'll see it out there in papers. From a signal processing perspective, deconvolution means the inverse operation to convolution, which this is not; however, you'll frequently see this type of layer called a deconvolution layer in some deep learning papers, so watch out for that terminology. You'll also sometimes see this called upconvolution, which is kind of a cute name. Sometimes it gets called fractionally strided convolution, because if we think of the stride as the ratio of steps between the input and the output, then this is something like a stride one half convolution, due to the one-to-two ratio between steps in the input and steps in the output. It also sometimes gets called backwards strided convolution, because if you work through the math, the forward pass of a transpose convolution is the same mathematical operation as the backward pass of a normal convolution. You might have to take my word for that; it may not be obvious at first glance, but it's a neat fact, and you'll sometimes see that name as well. As a more concrete example, I think it's easier to see in one dimension. Here we're doing a three by one transpose convolution in one dimension: our filter is just three numbers and our input is two numbers. You can see that in the output we've taken the values of the input, used them to weight copies of the filter, and plopped those weighted filters down in the output with a stride of two; where the receptive fields overlap in the output, we sum. So you might be wondering where the name transpose convolution comes from, and why that's actually my preferred name for this operation. It comes from a neat interpretation of convolution: any time you do convolution, you can always write it as a matrix multiplication. Again, this is easier to see with a one-dimensional example. Here we're doing a one-dimensional convolution between a kernel x, which has three elements, and an input vector a, which has four elements: a, b, c, d. This is a three by one convolution with stride one, and we can frame the whole operation as a matrix multiplication, where we take our convolutional kernel x and turn it into a matrix capital X that contains copies of the kernel offset to different positions. Then we multiply this weight matrix X by our input vector a, and this produces the same result as the convolution. Transpose convolution means that we take this same weight matrix, but multiply by its transpose instead. Here you can see the same example for a stride one convolution on the left, and the corresponding stride one transpose convolution on the right.
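Here is a small NumPy sketch of that matrix view, with made-up numbers: it builds the matrix X for a 3-tap, stride 1, zero-padded 1D convolution, takes every other row to get the stride 2 case, and shows that multiplying by the transpose maps two values back up to four, matching the weighted-copies picture from the lecture:

```python
import numpy as np

k = np.array([1., 2., 3.])      # a made-up 3-element kernel (x0, x1, x2)
a = np.array([4., 5., 6., 7.])  # input vector (a, b, c, d)

# Stride 1 convolution with zero padding 1, written as a 4x4 matrix of
# shifted copies of the kernel.
X1 = np.array([[k[1], k[2], 0.,   0.  ],
               [k[0], k[1], k[2], 0.  ],
               [0.,   k[0], k[1], k[2]],
               [0.,   0.,   k[0], k[1]]])
print(X1 @ a)    # same result as sliding the kernel over the padded input

# Stride 2 convolution keeps only every other output position: a 2x4 matrix.
X2 = X1[::2]
y = X2 @ a       # downsampled output, 2 values
print(y)

# Transpose convolution: multiply by X2.T, mapping 2 values back up to 4.
# This is exactly the "weighted copies of the filter, summed where they
# overlap" picture (with the padded border cropped off).
print(X2.T @ y)
```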
If you work through the details you'll see that for stride one, a stride one transpose convolution ends up being a stride one normal convolution as well; there are some details in how the border and padding are handled, but it's fundamentally the same operation. Things look different with a stride of two. Again, on the left we take a stride two convolution and write it out as a matrix multiplication, and now the corresponding transpose operation is no longer a convolution: if you look at this weight matrix and think about how convolutions are represented this way, the transposed matrix for the stride two case is something fundamentally different from a normal convolution. So that's the reasoning behind the name, and that's why I think it's the nicest name for this operation. Sorry, was there a question? Sorry? It's very possible there's a typo in the slide, so please point it out on Piazza and I'll fix it, but I hope the idea is clear. Is there another question? Okay, thank you. [laughing] Yeah, so, oh no, lots of questions. The question is why do we sum and not average? The reason we sum is that it falls out of this transpose convolution formulation, but you're right that this is kind of a problem: the magnitudes in the output will vary depending on how many receptive fields overlap at each output position. In practice this is something people started to point out fairly recently, and they've somewhat moved away from this setup: using three by three, stride two transpose convolutions for upsampling can produce checkerboard artifacts in the output exactly because of that problem. What I've seen in a couple of more recent papers is to use four by four stride two or two by two stride two transpose convolutions for upsampling, and that helps alleviate the problem a little bit. Yeah, so the question is what is a stride one half convolution and where does that terminology come from? I think that was from my paper. At the time I was writing that paper I was kind of into the name fractionally strided convolution, but after thinking about it a bit more I think transpose convolution is probably the right name. So this approach to semantic segmentation ends up being pretty natural: you have a giant convolutional network with downsampling and upsampling inside it, where the downsampling is by strided convolution or pooling and the upsampling is by transpose convolution or various types of unpooling, and you train the whole thing end to end with back propagation using a cross-entropy loss over every pixel. It's actually pretty cool that we can take a lot of the same machinery we already learned for image classification and extend it quite easily to new types of problems. So the next task I want to talk about is classification plus localization. We've talked about image classification a lot, where we want to assign a category label to the input image, but sometimes you might want to know a little bit more. In addition to predicting what the category is, in this case the cat, you might also want to know where that object is in the image.
So in addition to predicting the category label cat, you might also want to draw a bounding box around the region of the cat in that image. The distinction between classification plus localization and object detection is that in the localization scenario you assume ahead of time that there's exactly one object in the image that you're looking for (or maybe more than one, but a number you know ahead of time): we're going to make one classification decision about the image and produce exactly one bounding box saying where that object is located. We sometimes call that task classification plus localization. Again, we can reuse a lot of the machinery we've already learned from image classification to tackle this problem. A basic architecture for this problem looks something like this: we have our input image, we feed it through some giant convolutional network, AlexNet for example, which gives us some final vector summarizing the content of the image. Then, just like before, we have a fully connected layer that goes from that final vector to our class scores. But now we also have another fully connected layer that goes from that vector to four numbers, something like the height, the width, and the x and y positions of the bounding box. So the network produces two different outputs: the set of class scores, and four numbers giving the coordinates of the bounding box in the input image. During training we have two losses. In this scenario we're assuming a fully supervised setting, so each of our training images is annotated with both a category label and a ground truth bounding box for that category. We have our favorite softmax loss, computed using the ground truth category label and the predicted class scores, and we also have some loss measuring the dissimilarity between our predicted bounding box coordinates and the ground truth coordinates. One very simple choice is an L2 loss between them; that's the simplest thing you'll see in practice, although sometimes people use L1 or smooth L1, or parametrize the bounding box a little differently, but the idea is always the same: you have some regression loss between the predicted and ground truth bounding box coordinates. Question? Sorry, go ahead. The question is whether it's a good idea to do both at the same time: what happens if you misclassify, should you even look at the box coordinates? In general it works okay; it's not a big problem, and you can train a network to do both of these things at once and it will figure it out. But sometimes things can get tricky with misclassification, so what you'll sometimes see, for example, is that rather than predicting a single box you make a separate box prediction for each category and then only apply the loss to the predicted box corresponding to the ground truth category. People do get a little bit fancy with these things, and that sometimes helps a bit in practice.
But at least this basic setup, it might not be perfect or optimal, but it will work and do something. Was there a question in the back? Yeah, the question is don't these losses have different units, won't one dominate the gradient? This is what we call a multi-task loss. Whenever we take derivatives, we want to take the derivative of a scalar with respect to our network parameters and use that to take gradient steps, but now we've got two scalars we want to minimize. What you tend to do in practice is have an additional hyperparameter giving some weighting between these two losses, take a weighted sum of the two losses to get a final scalar loss, and then take gradients with respect to that weighted sum. This ends up being really tricky, because this weighting parameter is a hyperparameter you need to set, but it's different from the other hyperparameters we've seen so far, because it changes the value of the loss function itself. Normally when setting hyperparameters you might make different choices and compare what happens to the loss, but here, because the hyperparameter affects the absolute value of the loss, making those comparisons becomes tricky. So setting that hyperparameter is somewhat difficult; in practice you need to take it on a case by case basis for the problem you're solving, but my general strategy is to have some other performance metric that you care about besides the raw loss value, and use that final performance metric to make your cross-validation choices rather than looking at the loss itself. Question? The question is why do we do this all at once, why not fix the big network and only learn separate fully connected layers for the two tasks? People do that sometimes, and in fact that's probably the first thing you should try in a situation like this, but in general with transfer learning you get better performance if you fine tune the whole system jointly; there's probably some mismatch between the features if you train on ImageNet and then apply that network to your own data set, and you'll do better on your data set if you can also update the network. One trick you might see in practice is to freeze the network, train the two heads separately until convergence, and then go back and jointly fine tune the whole system. And as I've alluded to, this big network is often a pre-trained network taken from ImageNet, for example. A bit of an aside: this idea of predicting some fixed number of positions in the image can be applied to a lot of different problems beyond classification plus localization. One cool example is human pose estimation, where the input image is a picture of a person.
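Before the pose example, here is a rough sketch of the two-headed classification-plus-localization setup and the weighted multi-task loss just described; the backbone, feature size, class count, and loss weight are all made-up placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifyAndLocalize(nn.Module):
    """One backbone, two heads: class scores and 4 box coordinates."""
    def __init__(self, backbone, feat_dim=4096, num_classes=1000):
        super().__init__()
        self.backbone = backbone                          # e.g. a pre-trained CNN trunk
        self.cls_head = nn.Linear(feat_dim, num_classes)  # class scores
        self.box_head = nn.Linear(feat_dim, 4)            # (x, y, w, h)

    def forward(self, images):
        feats = self.backbone(images)                     # (N, feat_dim)
        return self.cls_head(feats), self.box_head(feats)

def loss_fn(class_scores, pred_boxes, labels, gt_boxes, box_weight=1.0):
    cls_loss = F.cross_entropy(class_scores, labels)  # softmax loss
    box_loss = F.mse_loss(pred_boxes, gt_boxes)       # simple L2 regression
    # box_weight is the tricky hyperparameter that balances the two terms.
    return cls_loss + box_weight * box_loss

# Quick shape check with a made-up backbone that just flattens and projects.
dummy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4096))
model = ClassifyAndLocalize(dummy_backbone, feat_dim=4096, num_classes=10)
scores, boxes = model(torch.rand(4, 3, 32, 32))
print(scores.shape, boxes.shape)   # (4, 10) and (4, 4)
```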
We want to output the positions of the joints for that person, which lets the network predict the pose of the human: where are his arms, where are his legs, and so on. Generally most people have the same number of joints; that's a bit of a simplifying assumption that might not always hold, but it works for the network. For example, one parameterization you'll see in some data sets defines a person's pose by 14 joint positions: their feet, their knees, their hips, and so on. When we train the network, we input this image of a person and output 14 pairs of numbers giving the x and y coordinates of each joint. Then you apply some kind of regression loss to each of those 14 predicted points and train the network with back propagation again. You might see an L2 loss here, but people play around with other regression losses as well. Question? The question is what do I mean by regression loss? I mean something other than cross entropy or softmax: usually an L2 Euclidean loss, or an L1 loss, or sometimes a smooth L1 loss. In general, classification versus regression is about whether your output is categorical or continuous. If you expect a categorical output, where you ultimately want to make a classification decision over some fixed set of categories, you'll use a cross entropy or softmax loss, or the SVM margin type losses we've talked about already. But if your expected output is a continuous value, in this case the positions of these points, you tend to use different losses, typically L2 or L1 or variants. Sorry for not clarifying that earlier. The bigger point is that this applies any time you know you want some fixed number of outputs from your network. If, for example, you knew your images always contain one cat and one dog and you want to predict both the cat's bounding box and the dog's bounding box, then you have a fixed number of outputs for each input, and you could hook up this kind of regression plus classification framework for that problem as well. So this idea of a fixed number of regression outputs can be applied to a lot of different problems, including pose estimation. The next task I want to talk about is object detection, and this is a really meaty topic. It's a core problem in computer vision, and you could probably teach a whole seminar class on just the history of object detection and the techniques applied there. So I'll be relatively brief and try to go over the main ideas of object detection with deep learning that have been used in the last couple of years. The idea in object detection is that we again start with some fixed set of categories we care about, maybe cats and dogs and fish, and now our task is: given an input image, every time one of those categories appears in the image, we want to draw a box around it and predict the category of that box. This is different from classification plus localization because there might be a varying number of outputs for every input image.
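Going back to the pose example for a moment, it follows the same fixed-output pattern; here is a tiny sketch of a 14-joint regression head with a smooth L1 loss, where the feature size and joint count follow the lecture's example but the rest is made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_JOINTS = 14

# A regression head that maps a backbone feature vector to 14 (x, y) pairs.
pose_head = nn.Linear(4096, NUM_JOINTS * 2)

feats = torch.randn(8, 4096)                     # fake backbone features
pred = pose_head(feats).view(8, NUM_JOINTS, 2)   # predicted joint coordinates
gt = torch.randn(8, NUM_JOINTS, 2)               # fake ground truth joints

loss = F.smooth_l1_loss(pred, gt)                # regression loss on each joint
loss.backward()
```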
You don't know ahead of time how many objects you expect to find in each image, so this ends up being a pretty challenging problem. This is kind of interesting: we've seen the graph of ImageNet classification performance as a function of year many times, and we saw that it just got better every year. There's been a similar trend with object detection, which has been one of these core problems in computer vision that people have cared about for a very long time. This slide is due to Ross Girshick, who has worked on this problem a lot, and it shows the progression of object detection performance on one particular data set called PASCAL VOC, which has been widely used in the object detection community for a long time. You can see that leading up to 2012, performance on object detection had started to stagnate and slow down a little bit, and then in 2013 some of the first deep learning approaches to object detection came around and performance shot up very quickly, getting better year over year. One thing you might notice is that this plot ends in 2015; it has actually continued to go up since then. The current state of the art on this data set is well over 80, and in fact a lot of recent papers don't even report results on this data set anymore because it's considered too easy. I'm not actually sure what the state of the art number is on this data set, but it's off the top of this plot. Sorry, did you have a question? Never mind. Okay, so as I already said, this is different from localization because there might be differing numbers of objects in each image. For example, for this cat on the upper left there's only one object, so we only need to predict four numbers. But for the image in the middle there are three animals, so the network needs to predict 12 numbers, four coordinates for each bounding box. And in this example with many ducks, you want the network to predict a whole bunch of numbers, four for each duck. So object detection is quite different from localization: in detection you might have varying numbers of objects in the image and you don't know ahead of time how many to expect. As a result, it's tricky to think of object detection as a pure regression problem, so people tend to use a different paradigm. One approach that's very common and has been used for a long time in computer vision is the sliding window approach to object detection. This is similar to the idea of taking small patches for semantic segmentation, and we can apply a similar idea for object detection. The idea is that we take different crops from the input image; in this case we've got a crop in the lower left hand corner of the image, and we feed that crop through our convolutional network, which makes a classification decision on that input crop.
It would say that there's no dog here and no cat here, and in addition to the categories we care about we add an extra category called background, so the network can predict background when it doesn't see any of the categories we care about. So when we take this crop from the lower left hand corner, the network would hopefully predict background and say there's no object here. If we take a different crop, the network would predict dog yes, cat no, background no. Take another crop and we get dog yes, cat no, background no. Or a different crop: dog no, cat yes, background no. Does anyone see a problem here? Yeah, the question is how do you choose the crops, and this is a huge problem. There could be any number of objects in this image, these objects could appear at any location, at any size, and at any aspect ratio, so if you want to do a brute force sliding window approach you'd end up having to test thousands or tens of thousands of different crops. And when every one of those crops is fed through a giant convolutional network, this is completely computationally intractable. So in practice people don't use this brute force sliding window approach for object detection with convolutional networks. Instead there's this cool line of work on region proposals. These typically don't use deep learning; they're slightly more traditional computer vision techniques that use signal processing and image processing ideas to produce a list of candidate regions: given an input image, a region proposal method will give you something like a thousand boxes where an object might be present. You can imagine, for example, looking for edges in the image and trying to draw boxes around closed contours, or other image processing approaches; these region proposal methods basically look for blobby regions in the input image and give a set of candidate proposal regions where objects might potentially be found. These are relatively fast to run. One common example of a region proposal method is Selective Search, which actually gives you 2000 region proposals, not the 1000 it says on the slide. You run this thing, and after about two seconds of churning on your CPU it spits out 2000 region proposals where objects are likely to be found. There will be a lot of noise in those; most of them will not be true objects, but the recall is pretty high: if there is an object in the image, it does tend to get covered by these region proposals from Selective Search. So now, rather than applying our classification network to every possible location and scale in the image, we can first apply one of these region proposal methods to get a set of proposal regions where objects are likely to be, and then apply a convolutional network for classification to each of those proposal regions. This ends up being much more computationally tractable than trying all possible locations and scales.
And this idea all came together in a paper called R-CNN from a few years ago, which does exactly that. Given our input image, we run some region proposal method to get our proposals, also sometimes called regions of interest or ROIs; again, Selective Search gives you something like 2000 regions of interest. One problem is that these regions in the input image can have different sizes, but if we're going to run them all through a convolutional network for classification, the network typically wants inputs of the same fixed size, due to the fully connected layers and so on. So we need to take each region proposal and warp it to the fixed square size expected by the downstream network. We crop out the regions corresponding to the region proposals, warp them to that fixed size, and run each of them through a convolutional network, which then uses, in this case, an SVM to make a classification decision, predicting categories for each of those crops. And, it's not shown on the slide right now, but in addition R-CNN also predicts a regression, a correction to the bounding box, for each of these input region proposals. The problem is that the input region proposals are generally in roughly the right position for an object but might not be perfect, so in addition to category labels for each proposal, R-CNN also predicts four numbers that are an offset or correction to the box produced at the region proposal stage. So again this is a multi-task loss and you train the whole thing. Sorry, was there a question? The question is how much the change in aspect ratio from warping impacts accuracy. It's a little hard to say; I think there are some controlled experiments in some of these papers, but I'm not sure I can give a generic answer. Question? The question is whether regions of interest need to be rectangles. They typically are, because it's tough to warp non-rectangular regions, but once you move to something like instance segmentation you sometimes get proposals that are not rectangles, if you actually care about predicting non-rectangular shapes. Is there another question? Yeah, the question is whether the region proposals are learned. In R-CNN they're not: it's a fixed, traditional algorithm that someone wrote down. But we'll see in a couple of minutes that this has changed in the last couple of years. Is there another question? The question is whether the offset is always inside the region of interest, and the answer is no, it doesn't have to be. You might imagine the region of interest puts a box around a person but misses the head; the network could infer that this is a person, people usually have heads, so the box should be a little bit higher. So sometimes the final predicted boxes will be outside the region of interest. Question? Yeah, the question is about all the ROIs that don't correspond to true objects. Like we said, in addition to the classes you actually care about you add an extra background class, so the class scores can also predict background to say there was no object here.
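Here is a rough, simplified sketch of the R-CNN pipeline just described. `propose_regions` is a hypothetical stand-in for an external method like Selective Search, and the 20-classes-plus-background count is just an assumption; crops are warped to a fixed size and each is classified with its own forward pass, which is exactly why the approach is slow:

```python
import torch
import torch.nn.functional as F
import torchvision

# Hypothetical stand-in for an external proposal method like Selective Search.
def propose_regions(image):
    """Return a list of (x0, y0, x1, y1) boxes; ~2000 in the real pipeline."""
    h, w = image.shape[1:]
    return [(0, 0, w // 2, h // 2), (w // 4, h // 4, w, h)]  # dummy boxes

cnn = torchvision.models.alexnet(num_classes=21)  # e.g. 20 classes + background

def rcnn_forward(image):
    scores = []
    for (x0, y0, x1, y1) in propose_regions(image):
        crop = image[:, y0:y1, x0:x1]                        # crop the pixels
        warped = F.interpolate(crop[None], size=(224, 224),  # warp to fixed size
                               mode='bilinear', align_corners=False)
        scores.append(cnn(warped))    # one forward pass per proposal (slow!)
    return torch.stack(scores)        # class scores per region
    # The real R-CNN also regresses a box correction per proposal and used
    # SVMs on top of the CNN features rather than a softmax classifier.

image = torch.rand(3, 480, 640)
print(rcnn_forward(image).shape)
```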
Question? Yeah, the question is what kind of data we need, and yes, this is fully supervised in the sense that our training data consists of images where each image has all the object categories marked with bounding boxes for each instance of that category. There are definitely papers that ask what happens if you don't have that data, or only have it for some images, or the data is noisy, but in the generic case you assume full supervision of all objects in the images at training time. Okay, so I think we've alluded to this, but there are a lot of problems with this R-CNN framework. If you look at the figure here on the right you can see that additional bounding box head, so I'll put it back. This is still computationally pretty expensive: if we've got 2000 region proposals and we're running each of them independently, that can be pretty expensive. There's also the issue of relying on fixed region proposals that we're not learning, which is kind of a problem. And in practice it ends up being pretty slow: in the original implementation, R-CNN would actually dump all the features to disk, so it took hundreds of gigabytes of disk space to store them, and training was super slow since you have to make all these forward and backward passes; something like 84 hours is one training time they've reported. At test time it's also super slow, something like 30 seconds to a minute per image, because you need to run thousands of forward passes through the convolutional network, one for each region proposal. Thankfully, fast R-CNN fixed a lot of these problems. Fast R-CNN looks kind of the same: we start with our input image, but now, rather than processing each region of interest separately, we run the entire image through some convolutional layers all at once to get a high resolution convolutional feature map corresponding to the entire image. We still use region proposals from a fixed method like Selective Search, but rather than cropping out the pixels of the image corresponding to the proposals, we project those proposals onto the convolutional feature map and take crops from the feature map corresponding to each proposal, rather than taking crops directly from the image. This allows us to reuse a lot of the expensive convolutional computation across the entire image when we have many crops per image. But again, if we have fully connected layers downstream, those layers expect a fixed-size input, so we need to reshape those crops from the convolutional feature map, and fast R-CNN does that in a differentiable way using something called an ROI pooling layer. Once you have these warped crops from the convolutional feature map, you run them through some fully connected layers to predict your classification scores and your regression offsets to the bounding boxes. When we train this thing we again have a multi-task loss that trades off between these two objectives, and during back propagation we can back prop through the entire thing and learn it all jointly.
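Here is a small sketch of the key fast R-CNN idea of cropping from the shared feature map rather than from the pixels. It uses torchvision's roi_pool operator for the ROI pooling step (which comes up briefly below); the trunk, image size, and box coordinates are made up:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

# A made-up convolutional trunk that downsamples the image by a factor of 8.
trunk = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

image = torch.rand(1, 3, 480, 640)
feats = trunk(image)               # one expensive pass for the whole image

# Region proposals in *image* coordinates: (batch_index, x0, y0, x1, y1).
rois = torch.tensor([[0,  50.,  60., 250., 300.],
                     [0, 300., 100., 600., 420.]])

# Project each ROI onto the feature map (spatial_scale = 1/8) and pool it
# to a fixed 7x7 grid so the fully connected heads get a fixed-size input.
pooled = roi_pool(feats, rois, output_size=(7, 7), spatial_scale=1.0 / 8)
print(pooled.shape)                # (num_rois, 256, 7, 7)
```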
This RoI pooling looks kind of like max pooling; I don't really want to get into the details of that right now. In terms of speed, if we look at R-CNN versus fast R-CNN versus this other model called SPP-net, which is kind of in between the two, you can see that at training time fast R-CNN is something like ten times faster to train, because we're sharing all of that convolutional computation across the region proposals in each image. And at test time fast R-CNN is super fast; in fact it is so fast at test time that its computation is actually dominated by computing the region proposals. We said that computing those 2,000 region proposals with Selective Search takes something like two seconds, and now, because we're processing all of the proposals in a shared way by sharing these expensive convolutions across the entire image, we can process all of them in less than a second altogether. So fast R-CNN ends up being bottlenecked just by the computation of the region proposals. Thankfully we've solved that problem with faster R-CNN. The idea in faster R-CNN is this: since computing the region proposals with a fixed function was the bottleneck, we'll just make the network predict its own region proposals. The way this works is that, again, we take our input image and run the entire image through some convolutional layers to get a convolutional feature map representing the whole high-resolution image, and now there's a separate region proposal network which operates on top of those convolutional features and predicts its own region proposals inside the network. Once we have those predicted region proposals, it looks just like fast R-CNN: we take crops corresponding to those proposals from the convolutional features and pass them up to the rest of the network. Now, we talked about multi-task losses and training networks to do multiple things at once; well, here we're telling the network to do four things all at once, so balancing this four-way multi-task loss is kind of tricky. The region proposal network needs to do two things: for each potential proposal it needs to say whether it is an object or not an object, and it needs to regress the bounding box coordinates for each of those proposals. And then the final network at the end needs to do these two things again: make the final classification decision for the class scores of each proposal, and do a second round of bounding box regression to correct any errors that may have come from the region proposal stage. Question? So the question is that multi-task learning is sometimes seen as a form of regularization; are we getting that effect here? I'm not sure there have been super controlled studies on that, but in the original version of the faster R-CNN paper they did a little bit of experimentation: what if we share the convolutional features with the region proposal network, what if we don't, what if we learn separate convolutional networks for the region proposal network versus the classification network? I think there were minor differences, but it wasn't a dramatic difference either way. In practice it's nicer to learn only one because it's computationally cheaper. Sorry, question?
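Here is a minimal sketch of what a region proposal head can look like on top of the shared feature map; the channel count, the nine anchor boxes per position, and the single objectness logit per anchor are illustrative assumptions rather than the exact faster R-CNN architecture.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal sketch of a region proposal network head (assumed shapes).

    For each spatial position of the shared feature map and each of A anchor
    (base) boxes, it outputs an object/not-object score and 4 box deltas.
    """
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, 1)      # object vs. background logit per anchor
        self.box_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)  # (tx, ty, tw, th) per anchor

    def forward(self, features):
        h = torch.relu(self.conv(features))
        return self.objectness(h), self.box_deltas(h)

features = torch.randn(1, 256, 38, 50)          # shared backbone feature map
scores, deltas = RPNHead()(features)
print(scores.shape, deltas.shape)               # (1, 9, 38, 50) and (1, 36, 38, 50)
```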
Yeah, the question is how do you train this region proposal network, because you don't have ground truth region proposals for it. That's a little bit hairy and I don't want to get too much into the details, but the idea is that any candidate region which has more than some threshold of overlap with any of the ground truth objects is treated as a positive, and the network should predict it as a proposal, and any candidate which has very low overlap with every ground truth object should be predicted as a negative. But there are a lot of dark-magic hyperparameters in that process, and it's a little bit hairy. Question? Yeah, so the question is what is the classification loss on the region proposal network, and the answer is that it's making binary decisions. I didn't want to get into too much detail on that architecture because it's a little bit hairy, but it has some set of potential regions that it's considering and it makes a binary decision for each one: is this an object or not an object? So it's a binary classification loss. Once you train this thing, faster R-CNN ends up being pretty darn fast: because we've eliminated the overhead of computing region proposals outside the network, faster R-CNN ends up being very, very fast compared to these other alternatives. Also, one interesting thing is that because we're learning the region proposals, you might imagine there could be some mismatch between a fixed region proposal algorithm and your data; once you're learning your own region proposals, you can overcome that mismatch if the regions in your data set are somewhat weird or different from other data sets. So this whole family of R-CNN methods, where the R stands for region, are region-based methods: there's some kind of region proposal and then some independent processing for each of those candidate regions. But there's another family of methods you sometimes see for object detection which is all feed-forward in a single pass. One of these is YOLO, for You Only Look Once, and another is SSD, for Single Shot Detection, and these two came out around the same time. The idea is that rather than doing independent processing for each potential region, we treat detection like a regression problem and make all of these predictions at once with one big convolutional network. Given our input image, you imagine dividing it into some coarse grid, in this case a seven by seven grid, and within each grid cell you imagine some set of base bounding boxes. Here I've drawn three base bounding boxes, a tall one, a wide one, and a square one, but in practice you would use more than three. Now for each grid cell and each base bounding box you want to predict several things: one, an offset off the base bounding box, to predict the true location of the object relative to that base box, and two, classification scores for each base bounding box, saying how likely it is that an object of each category appears in that box.
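As a rough sketch of the output such a single-shot head produces, here is a tiny example with assumed sizes (a seven by seven grid, three base boxes, twenty categories); it only shows the shape of the prediction, not the matching or the loss.

```python
import torch
import torch.nn as nn

# Minimal sketch of a single-shot detection head under assumed sizes:
# a 7x7 grid (S=7), B=3 base boxes per cell, and C=20 categories.
S, B, C = 7, 3, 20

# Pretend the backbone features are already at the grid resolution.
features = torch.randn(1, 512, S, S)

# One conv layer maps each grid cell to 5B + C numbers:
# for each base box, 4 offsets plus 1 confidence, plus C class scores per cell.
head = nn.Conv2d(512, 5 * B + C, kernel_size=1)
output = head(features)
print(output.shape)   # torch.Size([1, 35, 7, 7]) -> the S x S x (5B + C) tensor
```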
So then at the end, from our input image we end up predicting this giant tensor, which is seven by seven (for the grid) by 5B + C, where we have B base bounding boxes, five numbers for each giving the offset and a confidence for that base box, and C classification scores for our C categories. So we view object detection as taking an image as input and producing this three-dimensional tensor as output, and you can imagine training the whole thing with one giant convolutional network. That's essentially what these single-shot methods do; again, matching the ground truth objects to these potential base boxes becomes a little bit hairy, but that's what these methods do. And by the way, the region proposal network that gets used in faster R-CNN ends up looking quite similar to this: it also has some set of base bounding boxes over a gridded image and does some regression plus some classification for each of them, so there are overlapping ideas here. In faster R-CNN we're treating the region proposal step as its own end-to-end regression problem and then doing separate per-region processing, whereas these single-shot methods only do that first step and do all of their object detection in a single forward pass. Object detection has a ton of different variables: there could be different base networks like VGG or ResNet; we've seen different meta-strategies for detection, including the faster R-CNN style region-based family of methods and the single-shot family of methods; there's also a hybrid I didn't talk about called R-FCN which sits somewhere in between; and there are a lot of other hyperparameters like the image size and the number of region proposals. There's actually a really cool paper that will appear at CVPR this summer that does very controlled experimentation across a lot of these different variables and tries to tell you how these methods all perform as you vary them. If you're interested I'd encourage you to check it out, but one of the key takeaways is that the faster R-CNN style of region-based methods tends to give higher accuracy but ends up being much slower than the single-shot methods, because the single-shot methods don't require this per-region processing. But I encourage you to check out that paper if you want more details. Also, as a bit of an aside, I had this fun paper with Andrej a couple years ago that combined object detection with image captioning into a problem called dense captioning: the idea is that rather than predicting a fixed category label for each region, we instead want to write a caption for each region. Again, we had a data set of this sort, with regions together with captions, and then we trained a giant end-to-end model that predicts these captions all jointly. This ends up looking somewhat like faster R-CNN, where you have a region proposal stage and then some per-region processing, but rather than an SVM or a softmax loss, that per-region processing includes a whole RNN language model that predicts a caption for each region. So it ends up looking quite a bit like faster R-CNN. There's a video here, but I think we're running out of time so I'll skip it.
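As for the matching step that was called a little bit hairy above, both the single-shot methods and the region proposal network typically decide which base boxes or anchors count as positives by thresholding their intersection-over-union against the ground truth boxes; here is a minimal sketch with commonly used (but here simply assumed) thresholds of 0.7 and 0.3.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_candidates(candidates, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return +1 (object), 0 (background) or -1 (ignored) for each candidate box."""
    labels = []
    for c in candidates:
        best = max(iou(c, g) for g in gt_boxes)
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)   # in-between overlaps are often left out of the loss
    return np.array(labels)

gt = [(50, 50, 150, 150)]
candidates = [(45, 55, 160, 145), (300, 300, 380, 400), (40, 40, 200, 220)]
print(label_candidates(candidates, gt))   # [ 1  0 -1 ]
```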
Coming back to the bigger picture: once you have this, you can tie together a lot of these ideas, and if you have some new problem that you're interested in tackling, like dense captioning, you can recycle a lot of the components that you've learned from other problems, like object detection and image captioning, and stitch together one end-to-end network that produces the outputs you care about. So the last task I want to talk about is instance segmentation. Instance segmentation is in some ways the full problem: we're given an input image and we want to predict the locations and identities of the objects in that image, similar to object detection, but rather than just predicting a bounding box for each of those objects, we want to predict a whole segmentation mask for each of them, saying which pixels in the input image correspond to each object instance. This is kind of a hybrid between semantic segmentation and object detection: like object detection we can handle multiple objects and we differentiate the identities of different instances, so in this example, since there are two dogs in the image, an instance segmentation method actually distinguishes between the two dog instances in the output, and like semantic segmentation we have pixel-wise accuracy, where for each object we want to say which pixels belong to it. There have been a lot of different methods for instance segmentation as well, but the current state of the art is this new paper called Mask R-CNN that just came out on arXiv about a month ago, so this is not yet published; it's super fresh stuff. And it ends up looking a lot like faster R-CNN. It has the same multi-stage processing approach: we take our whole input image, that image goes through some convolutional network and a learned region proposal network that's exactly the same as in faster R-CNN, and once we have our learned region proposals we project them onto the convolutional feature map, just like we did in fast and faster R-CNN. But now, rather than just making a classification and a bounding box regression decision for each of those region proposals, we additionally want to predict a segmentation mask for each of them. So it kind of looks like a mini semantic segmentation problem inside each region proposal coming from the region proposal network. After we do RoI Align to warp the features corresponding to each region proposal into the right shape, we have two different branches. The first branch at the top looks just like faster R-CNN: it predicts classification scores telling us the category corresponding to that region proposal, or alternatively whether it's background, and it also predicts bounding box coordinates regressed off the region proposal coordinates. And in addition we have this branch at the bottom, which looks basically like a mini semantic segmentation network and classifies, for each pixel in that region proposal, whether or not it belongs to the object. So this Mask R-CNN architecture kind of unifies all of the different problems we've been talking about today into one nice, jointly end-to-end trainable model.
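Here is a rough sketch of what that extra mask branch can look like on top of RoI-aligned features; the crop size, channel counts, and 80-class output are assumptions for illustration, not the exact Mask R-CNN configuration.

```python
import torch
import torch.nn as nn
import torchvision.ops as ops

# Sketch of the extra mask branch (assumed sizes). Each region proposal is
# RoI-aligned to a small fixed grid, then a tiny per-pixel network predicts a
# binary foreground mask for every class.
features = torch.randn(1, 256, 50, 68)                       # shared feature map
proposals = [torch.tensor([[ 30.,  40., 300., 420.],
                           [100.,  80., 260., 240.]])]
num_classes = 80

# RoI Align: like RoI pooling, but with bilinear sampling so the crop is
# differentiable and better aligned to the proposal.
crops = ops.roi_align(features, proposals, output_size=(14, 14),
                      spatial_scale=1.0 / 16, sampling_ratio=2)

mask_head = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),   # upsample 14x14 -> 28x28
    nn.Conv2d(256, num_classes, 1),                          # one mask per class
)
masks = mask_head(crops)
print(masks.shape)    # (2, 80, 28, 28): a 28x28 mask per class for each proposal
```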
And it's really cool and it actually works really, really well; when you look at the examples in the paper they're kind of amazing, they look nearly indistinguishable from ground truth. So in this example on the left you can see there are these two people standing in front of motorcycles: it's drawn the boxes around those people and it's also gone in and labeled all of their pixels, and, it's really small, but in the background of that image on the left there's also a whole crowd of people, and it's drawn boxes around each of them and grabbed the pixels of each of those people too. So this ends up working really, really well, and it's a relatively simple addition on top of the existing faster R-CNN framework. I told you that Mask R-CNN unifies everything we talked about today, and by the way, it also does pose estimation. We talked about how you can do pose estimation by predicting the joint coordinates for each of the joints of a person, so you can use Mask R-CNN to do joint object detection, pose estimation, and instance segmentation. The only addition we need to make is that for each of these region proposals we add an additional little branch that predicts the coordinates of the joints for the person instance in the current region proposal. So this is just another loss, another layer, another head coming out of the network, and an additional term in our multi-task loss. But once we add this one little branch you can do all of these different problems jointly, and you get results looking something like this, where a single feed-forward network is deciding how many people are in the image, detecting where those people are, figuring out the pixels corresponding to each of those people, and also drawing a skeleton estimating the pose of each person, and this works really well even in crowded scenes like this classroom, where there's a ton of people sitting and they all overlap each other, and it just seems to work incredibly well. And because it's built on the faster R-CNN framework it also runs relatively close to real time, something like five frames per second on a GPU, because this is all done in a single forward pass of the network. So this is again a super new paper, but I think it will probably get a lot of attention in the coming months. So just to recap, we've talked... sorry, question? The question is how much training data do you need. All of these instance segmentation results were trained on the Microsoft COCO data set, and Microsoft COCO is roughly 200,000 training images. It has 80 categories that it cares about, so in each of those 200,000 training images it has all the instances of those 80 categories labeled, and there's something like an average of five or six instances per image, so it actually is quite a lot of data. And for all the people in Microsoft COCO they also have all the joints annotated, so this does have quite a lot of supervision at training time, you're right, and it is trained with quite a lot of data.
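As a sketch of that extra pose branch, one common way to set it up (the details here are assumed rather than taken from the paper) is to predict one heatmap per joint for each RoI-aligned crop and read each joint's location off as the peak of its heatmap.

```python
import torch
import torch.nn as nn

# Sketch of an additional pose (keypoint) branch under assumed sizes: for
# each RoI-aligned crop it predicts one heatmap per joint, and the peak of
# each heatmap gives that joint's location inside the region.
num_keypoints = 17                                  # e.g. the COCO person joints
crops = torch.randn(2, 256, 14, 14)                 # RoI-aligned features for 2 proposals

keypoint_head = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, num_keypoints, 4, stride=2, padding=1),  # 14x14 -> 28x28
)
heatmaps = keypoint_head(crops)
print(heatmaps.shape)                               # (2, 17, 28, 28)

# Each joint location is read off as the argmax of its heatmap, and the
# heatmaps contribute one more term to the multi-task loss.
```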
So I think one really interesting topic to study moving forward is this: at this point we're relatively confident that if you have a lot of data for some problem, you can probably stitch together a convolutional network that will do a reasonable job at it, but figuring out how to get performance like this with less training data is a super interesting and active area of research, and I think that's something people will be spending a lot of effort on in the next few years. So just to recap, today we had a whirlwind tour of a whole bunch of different computer vision topics, and we saw how a lot of the machinery we built up for image classification can be applied relatively easily to tackle these other computer vision tasks. And next time we'll have a really fun lecture on visualizing CNN features; we'll also talk about DeepDream and neural style transfer.
Info
Channel: Stanford University School of Engineering
Views: 471,649
Id: nDPWywWRIRo
Length: 74min 26sec (4466 seconds)
Published: Fri Aug 11 2017