Lecture 12 | Visualizing and Understanding

Captions
- Good morning. It's 12:03, so I want to get started. Welcome to Lecture 12 of CS231n. Today we're going to talk about visualizing and understanding convolutional networks. This is always a super fun lecture to give because we get to look at a lot of pretty pictures, so it's one of my favorites. As usual, a couple of administrative things. Hopefully your projects are all going well, because as a reminder your milestones are due on Canvas tonight. It is Canvas, right? Okay, I just wanted to double check. Due on Canvas tonight. We're working furiously on grading your midterms, so we hope to have those midterm grades back to you on Gradescope this week. I know there was a little confusion: you all got registration emails for Gradescope probably in the last week, and we saw a couple of questions on Piazza. We've decided to use Gradescope to grade the midterms, so don't be confused if you get some emails about that. Another reminder is that assignment three was released last week on Friday. It will be due a week from this Friday, on the 26th. Assignment three is almost entirely brand new this year, so we apologize for taking a little bit longer than expected to get it out, but I think it's super cool. A lot of the stuff we'll talk about in today's lecture you'll actually be implementing on your assignment, and for the assignment you'll get the choice of either PyTorch or TensorFlow to work through these different examples. We hope that's a really useful experience for you. We also saw a lot of activity on HyperQuest over the weekend, so that's really awesome. The leaderboard went up yesterday, and it seems like you guys are really trying to battle it out to show off your deep learning network training skills, which is super cool. Due to the high interest in HyperQuest and the conflict with the milestone submission time, we decided to extend the deadline for extra credit through Sunday. So anyone who does at least 12 runs on HyperQuest by Sunday will get a little bit of extra credit in the class, and those of you at the top of the leaderboard doing really well may get a little extra extra credit. So thanks for participating; we got a lot of interest, and that was really cool. The final reminder is about the poster session. The poster session will be on June 6th. That date is finalized; I don't remember the exact time, but it is June 6th. We had some questions about exactly when that poster session is, for those of you who are traveling at the end of the quarter or starting internships or something like that, so: it will be June 6th. Any questions on the admin notes? No? Totally clear. So last time we had a pretty jam-packed lecture where we talked about a lot of different computer vision tasks. As a reminder, we talked about semantic segmentation, which is the problem where you want to assign labels to every pixel in the input image, but which does not differentiate object instances in those images. We talked about classification plus localization, where in addition to a class label you also want to draw a box, or perhaps several boxes, in the image. The distinction here is that in a classification plus localization setup, you have some fixed number of objects that you're looking for. We also saw that this type of paradigm can be applied to things like pose recognition,
where you want to regress to the positions of the different joints of the human body. We also talked about object detection, where you start with some fixed set of category labels that you're interested in, like dogs and cats, and then the task is to draw boxes around every instance of those objects that appears in the input image. Object detection is really distinct from classification plus localization, because with object detection we don't know ahead of time how many object instances we're looking for in the image. We saw that there's this whole family of methods based on R-CNN, Fast R-CNN, and Faster R-CNN, as well as the single-shot detection methods, for addressing this problem of object detection. Then finally we talked pretty briefly about instance segmentation, which kind of combines aspects of semantic segmentation and object detection: the goal is to detect all the instances of the categories we care about, as well as label the pixels belonging to each instance. So in that example we detected two dogs and one cat, and for each of those instances we wanted to label all the pixels. We covered a lot last lecture, but those are really interesting and exciting problems that you might consider using in parts of your projects. Today we're going to shift gears a little bit and ask another question: what's really going on inside convolutional networks? We've seen by this point in the class how to train convolutional networks and how to stitch up different types of architectures to attack different problems. But one question that you might have had in your mind is: what exactly is going on inside these networks? How do they do the things that they do? What kinds of features are they looking for? And all those sorts of related questions. So far we've sort of treated ConvNets as a bit of a black box, where some input image of raw pixels comes in on one side, goes through many layers of convolution and pooling and different sorts of transformations, and on the other side we end up with some understandable, interpretable output, such as class scores or bounding box positions or labeled pixels or something like that. But the question is: what are all these other layers in the middle doing? What kinds of things in the input image are they looking for? Can we gain intuition for how ConvNets are working, what types of things in the image they're looking for, and what kinds of techniques we have for analyzing the internals of the network? One relatively simple thing to look at is the first layer. We've talked about this before, but recall that the first convolutional layer consists of a number of convolutional filters. In AlexNet, for example, each first-layer convolutional filter has shape 3 by 11 by 11, and these filters get slid over the input image: we take inner products between some chunk of the image and the weights of the convolutional filter, and that gives us our output after that first convolutional layer. So in AlexNet we have 64 of these filters. Now, in the first layer, because we are taking a direct inner product between the weights of the convolutional filter and the pixels of the image, we can get some sense for what these filters are looking for by simply visualizing the learned weights of these filters as images themselves.
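(As a concrete illustration of this kind of first-layer visualization, here is a minimal sketch in PyTorch; it is not the course's exact code, and it assumes a pretrained AlexNet from the torchvision model zoo. Each filter is rescaled to [0, 1] on its own, since the raw weights are unbounded.)

import torch
import torchvision

model = torchvision.models.alexnet(pretrained=True)
w = model.features[0].weight.data.clone()        # first conv layer weights: (64, 3, 11, 11)

# Normalize each filter independently so it can be displayed as an RGB image.
w_min = w.amin(dim=(1, 2, 3), keepdim=True)
w_max = w.amax(dim=(1, 2, 3), keepdim=True)
w = (w - w_min) / (w_max - w_min + 1e-8)

grid = torchvision.utils.make_grid(w, nrow=8)    # 8 x 8 grid of little 11 x 11 color patches
torchvision.utils.save_image(grid, 'alexnet_conv1_filters.png')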
So for each of those 11 by 11 by 3 filters in AlexNet, we can just visualize that filter as a little 11 by 11 image with three channels giving the red, green, and blue values, and because there are 64 of these filters, we visualize 64 little 11 by 11 images. These are filters taken from the pretrained models in the PyTorch model zoo, and we're looking at the weights of the convolutional filters at the first layer of AlexNet, ResNet-18, ResNet-101, and DenseNet-121. You can see kind of what these filters are looking for: a lot of them are looking for oriented edges, like bars of light and dark at various angles and various positions in the input, and we can see opposing colors, like this green and pink pair or this orange and blue pair. This connects back to what we talked about with Hubel and Wiesel all the way back in the first lecture: remember that the human visual system is known to detect things like oriented edges at its very early layers, and it turns out that convolutional networks tend to do something somewhat similar at their first convolutional layers as well. What's kind of interesting is that, pretty much no matter what type of architecture you hook up or what type of training data you train on, the first convolutional weights of pretty much any convolutional network looking at images end up looking something like this, with oriented edges and opposing colors. Sorry, what was that question? Yes, these are showing the learned weights of the first convolutional layer. Oh, so the question is: why does visualizing the weights of the filters tell you what the filter is looking for? The intuition comes from template matching and inner products. If you imagine you have some template vector, and you compute a scalar output by taking the inner product between that template vector and some arbitrary piece of data, then the input which maximizes that activation, under a norm constraint on the input, is exactly a copy of the template. So in that sense, whenever you're taking inner products, the thing that causes an inner product to be maximal is a copy of the thing you're taking the inner product with. That's why we can visualize these weights, and why that shows us what the first layer is looking for. For these networks the first layer was always a convolutional layer; generally, whenever you're working with image data and training convolutional networks, you put a convolutional layer at the first stop. Yeah, so the question is: can we do this same type of procedure in the middle of the network? That's actually the next slide, so good anticipation. If we draw this exact same visualization for the intermediate convolutional layers, it's actually a lot less interpretable. This is performing the exact same visualization, but for the tiny ConvNet demo network that's running on the course website. For that network, the first layer is a 7 by 7 convolution with 16 filters.
So at the top we're visualizing the first-layer weights for this network, just like we saw on the previous slide. But now consider the second-layer weights. After the first convolution there's a ReLU or some other non-linearity, and then the second convolutional layer receives that 16-channel input and does a 7 by 7 convolution with 20 convolutional filters. The problem is that you can't really visualize these directly as images. Here the input has 16 channels in depth, and each convolutional filter is 7 by 7 and extends along the full depth, so it has 16 planes; then we have 20 such convolutional filters producing the output planes of the next layer. Looking directly at the weights of these filters doesn't really tell us much. So what's been done here is that, for a single 16 by 7 by 7 convolutional filter, we spread out those 16 planes of the filter into 16 separate 7 by 7 grayscale images. That's what's shown up here: these little tiny grayscale images show the weights in one of the convolutional filters of the second layer. And because there are 20 outputs from this layer, the second convolutional layer has 20 such 16 by 7 by 7 filters. If we visualize the weights of those filters as images, you can see that there's some kind of spatial structure here, but it doesn't really give you good intuition for what they're looking at, because these filters are not connected directly to the input image. Instead, recall that the second-layer convolutional filters are connected to the output of the first layer. So this is a visualization of what type of activation pattern after the first convolution would cause the second-layer convolution to maximally activate, and that's not very interpretable, because we don't have a good sense for what those first-layer activations look like in terms of image pixels. So we'll need to develop some slightly fancier techniques to get a sense for what's going on in the intermediate layers. Question in the back? Yeah. So the question is about scaling: for all the visualizations on this and the previous slide, we've had to scale the weights to the zero to 255 range. In practice those weights could be unbounded; they could have any range. But to get nice visualizations we need to rescale them. These visualizations also do not take into account the biases in these layers, so you should keep that in mind and not take these visualizations too literally. Now, at the last layer, remember that at the end of a convolutional network we have these maybe 1,000 class scores telling us the predicted scores for each of the classes in our training data set, and immediately before the last layer we often have some fully connected layer. In the case of AlexNet we have a 4096-dimensional feature representation of our image that then gets fed into that final layer to predict our final class scores. So another route for tackling this problem of visualizing and understanding ConvNets is to try to understand what's happening at the last layer of the convolutional network.
So what we can do is take some data set of images, run a bunch of images through our trained convolutional network, and record that 4096-dimensional vector for each of those images, and then go through and try to figure out and visualize that last hidden layer, rather than the first convolutional layer. One thing you might imagine is trying a nearest neighbor approach. Remember, way back in the second lecture we saw this graphic on the left, where we had a nearest neighbor classifier and we were looking at nearest neighbors in pixel space between CIFAR-10 images. When you look at nearest neighbors in pixel space, you pull up images that look quite similar to the query image. So the left column here is some image from the CIFAR-10 data set, and the next five columns show the nearest neighbors in pixel space to those test-set images. For example, for this white dog, its nearest neighbors in pixel space are these kinds of white blobby things that may or may not be dogs, but at least the raw pixels of the images are quite similar. Now we can do the same type of visualization, computing and visualizing nearest neighbor images, but rather than computing the nearest neighbors in pixel space, we instead compute nearest neighbors in that 4096-dimensional feature space computed by the convolutional network. On the right we see some examples: the first column shows examples of images from the test set of the ImageNet classification data set, and the subsequent columns show nearest neighbors to those test-set images in the 4096-dimensional feature space computed by AlexNet. You can see that this is quite different from the pixel-space nearest neighbors, because the pixels are often quite different between an image and its nearest neighbors in feature space; however, the semantic content of those images tends to be similar. For example, if you look at the second row, the query image is an elephant standing on the left side of the image with green grass behind it, and one of its nearest neighbors, the third one over, is actually an elephant standing on the right side of the image. This is really interesting, because between the elephant standing on the left and the elephant standing on the right, the pixels of those two images are almost entirely different; however, in the feature space learned by the network, those two images end up being very close to each other. That means this last layer of features is somehow capturing some of the semantic content of these images, which is really cool and really exciting, and in general, looking at these kinds of nearest neighbor visualizations is a really quick and easy way to see something about what's going on. Yes? So the question is that, through the standard supervised learning procedure for training a classification network, there's nothing in the loss encouraging these features to be close together. That's true. It's kind of a happy accident that they end up being close to each other, because we didn't tell the network during training that these features should be close.
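(As an implementation aside, here is a rough sketch of this last-layer nearest-neighbor lookup, assuming a pretrained AlexNet and a preprocessed batch of ImageNet-normalized images called `images`; these names are placeholders rather than the course's code.)

import torch
import torchvision

model = torchvision.models.alexnet(pretrained=True).eval()

# Chop off the final classifier layer so the forward pass returns the
# 4096-dimensional hidden features instead of the 1000 class scores.
feature_extractor = torch.nn.Sequential(
    model.features,
    model.avgpool,
    torch.nn.Flatten(),
    *list(model.classifier.children())[:-1],
)

with torch.no_grad():
    feats = feature_extractor(images)            # (N, 4096)

query = feats[0:1]                               # features of the first image
dists = torch.cdist(query, feats)                # (1, N) L2 distances in feature space
nearest = dists.argsort(dim=1)[0, 1:6]           # 5 nearest neighbors, skipping the image itself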
That said, people do sometimes train networks using things like a contrastive loss or a triplet loss, which explicitly put constraints on the network so that those last-layer features end up having some metric space interpretation. But AlexNet, at least, was not trained specifically for that. The question is: what does this nearest neighbor procedure have to do with the last layer? So we're taking an image, running it through the network, and the last hidden layer of the network is a 4096-dimensional vector, because there are these fully connected layers at the end of the network. What we're doing is writing down that 4096-dimensional vector for each of the images and then computing nearest neighbors according to that 4096-dimensional vector computed by the network. Maybe we can chat offline. Another angle we might take for visualizing what's going on in this last layer is some form of dimensionality reduction. Those of you who have taken CS229, for example, have seen something like PCA, which lets you take some high-dimensional representation, like these 4096-dimensional features, and compress it down to two dimensions so you can visualize the feature space more directly. Principal Component Analysis, or PCA, is one way to do that, but there's another really powerful algorithm called t-SNE, standing for t-distributed stochastic neighbor embedding, which is a slightly more powerful, non-linear dimensionality reduction method that people in deep learning often use for visualizing features. As an example of what t-SNE can do, this visualization shows a t-SNE dimensionality reduction on the MNIST data set. MNIST, remember, is this data set of handwritten digits between zero and nine, where each image is a 28 by 28 grayscale image. Here we've used t-SNE to take that 28 times 28 dimensional feature space of the raw MNIST pixels, compress it down to two dimensions, and then visualize each of the MNIST digits in this compressed two-dimensional representation. When you run t-SNE on the raw pixels of MNIST, you can see these natural clusters appearing, which correspond to the digits of the MNIST data set. Now we can do a similar type of visualization where we apply this t-SNE dimensionality reduction technique to the features from the last layer of our trained ImageNet classifier. To be a little bit more concrete, what we've done is take a large set of images, run them through the convolutional network, and record that final 4096-dimensional feature vector from the last layer for each of those images, which gives us a large collection of 4096-dimensional vectors. Then we apply t-SNE to compress that 4096-dimensional feature space down into a two-dimensional feature space, lay out a grid in that compressed two-dimensional space, and visualize what types of images appear at each location in the grid. By doing this you get some very rough sense of what the geometry of the learned feature space looks like. These images are a little bit hard to see.
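(For anyone who wants to reproduce this kind of figure, here is a minimal sketch of the t-SNE plus image-grid step, assuming the last-layer features have already been extracted into a NumPy array `feats` with a matching list of file paths `paths`; both names are made up for illustration, and overlapping thumbnails are simply drawn on top of each other rather than snapped to a grid.)

import numpy as np
from sklearn.manifold import TSNE
from PIL import Image

# Reduce the 4096-dimensional features to 2-D coordinates.
coords = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(feats)   # (N, 2)

# Rescale coordinates to [0, 1] so each image can be pasted onto a canvas
# at its embedded location.
coords = (coords - coords.min(0)) / (coords.max(0) - coords.min(0))

canvas = Image.new('RGB', (2000, 2000), 'white')
for (x, y), path in zip(coords, paths):
    thumb = Image.open(path).resize((64, 64))
    canvas.paste(thumb, (int(x * (2000 - 64)), int(y * (2000 - 64))))
canvas.save('tsne_feature_space.png')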
I'd encourage you to check out the high-resolution versions of these figures online. But at least on the left you can maybe see that there's one cluster at the bottom of green things, which are different kinds of flowers, and there are other clusters for different dog breeds, other types of animals, and different locations. So there's this sort of semantic notion of similarity in the feature space, which we can explore by looking at this t-SNE dimensionality-reduced version of the features. Is there a question? Yeah. So the basic idea is that we end up with three different pieces of information about each image: we have the pixels of the image, we have the 4096-dimensional vector, and then we use t-SNE to convert the 4096-dimensional vector into a two-dimensional coordinate, and we take the original pixels of the image and place them at the two-dimensional coordinate corresponding to the dimensionality-reduced version of the 4096-dimensional feature. Yeah, it's a little bit involved. Question in the front? The question is roughly how much variance these two dimensions explain. I'm not sure of the exact number, and it gets a little bit muddy when you're talking about t-SNE, because it's a non-linear dimensionality reduction technique, so I'd have to look offline; I'm not sure exactly how much it explains. Question? The question is: can you do the same analysis on upper layers of the network? Yes, you can, but no, I don't have those visualizations here, sorry. Question? The question is: shouldn't we have overlaps of images once we do this dimensionality reduction? Yes, of course, you would. So this is just taking a regular grid and, for each grid point, picking an image whose embedding is close to that grid point. So this is not showing you the density in different parts of the feature space; that's another thing to look at, and again, at the link there are a couple more visualizations of this nature that address that a little bit. Okay. Another thing you can do for some of these intermediate features: we said a couple of slides ago that visualizing the weights of these intermediate layers is not so interpretable, but visualizing the activation maps of those intermediate layers actually is kind of interpretable in some cases. Take the example of AlexNet again. The conv5 features for any image form a 128 by 13 by 13 tensor, but we can think of that as 128 different 13 by 13 two-dimensional grids. So we can go and visualize each of those 13 by 13 slices of the feature map as a grayscale image, and this gives us some sense for what types of things in the input each of those features in that convolutional layer is looking for. There's a really cool interactive tool by Jason Yosinski that you can just download. I don't have the video here, but there's a video on his website. It runs a convolutional network on the input stream of a webcam and then visualizes, in real time, each of those slices of the intermediate feature map, giving you a sense of what it's looking for. Here the input image is this picture of a person in front of the camera, and most of these intermediate features are kind of noisy; not much is going on.
But there's this one highlighted intermediate feature, also shown larger here, that seems to be activating on the portions of the feature map corresponding to the person's face. That's really interesting, and it suggests that maybe this particular slice of the feature map, at this layer of this particular network, is looking for human faces or something like that, which is a nice and cool finding. Question? The question is: are the black activations dead ReLUs? You've got to be a little careful with terminology here. We usually say dead ReLU to mean something that's dead over the entire training data set. Here I would say that it's a ReLU that's not active for this particular input. Question? The question is: if there are no humans in ImageNet, how can it recognize a human face? There definitely are humans in ImageNet. I don't think person is one of the thousand categories for the classification challenge, but people definitely appear in a lot of these images, and that can be a useful signal for detecting other types of things. That's actually a kind of nice result, because it shows that the network can learn features that are useful for the classification task at hand, even ones that are maybe a little bit different from the explicit classification task we told it to perform. So it's a really cool result. Okay, question? So at each layer in the convolutional network, our input image is 3 by 224 by 224, and then it goes through many stages of convolution; after each convolutional layer there is some three-dimensional chunk of numbers, which are the outputs from that layer of the network. We call that entire three-dimensional chunk of numbers, the output of a convolutional layer, an activation volume, and one of its slices is an activation map. So the question is: if the image is K by K, will the activation map be K by K? Not always, because there can be subsampling due to strided convolution and pooling, but in general the size of each activation map will be linear in the size of the input image. Another useful thing we can do for visualizing intermediate features is visualizing what types of patches from input images cause maximal activation in different neurons. What we've done here is pick, again, maybe the conv5 layer of AlexNet, and remember each of these activation volumes at conv5 in AlexNet gives us a 128 by 13 by 13 chunk of numbers. Then we'll pick one of those 128 channels, maybe channel 17, and now we run many images through this convolutional network, and for each of those images record the conv5 features and look at the parts of that 17th feature map that are maximally activated over our data set of images. And because this is a convolutional layer, each of the neurons in it has some small receptive field in the input: each of those neurons is not looking at the whole image; they're only looking at a subset of the image.
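(A rough sketch of that recording step, assuming a pretrained torchvision AlexNet and an image `loader` yielding preprocessed batches; the names are placeholders, and the step of cropping out each neuron's receptive field from the original image is left out for brevity.)

import torch
import torchvision

model = torchvision.models.alexnet(pretrained=True).eval()
activations = {}

def hook(module, inp, out):
    activations['conv5'] = out.detach()          # (B, 256, 13, 13) in torchvision's single-tower AlexNet

model.features[10].register_forward_hook(hook)   # the fifth conv layer in torchvision's AlexNet

channel, records = 17, []
with torch.no_grad():
    for batch_idx, (imgs, _) in enumerate(loader):
        model(imgs)
        fmap = activations['conv5'][:, channel]  # (B, 13, 13) slice for the chosen channel
        vals, positions = fmap.flatten(1).max(dim=1)   # best activation and its spatial position per image
        for b in range(imgs.shape[0]):
            records.append((vals[b].item(), batch_idx, b, positions[b].item()))

records.sort(reverse=True)   # the top entries identify the maximally activating images and positions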
Then what we'll do is visualize the patches, from this large data set of images, corresponding to the maximal activations of that particular feature in that particular layer, and sort those patches by their activation at that layer. So here are some examples of these maximally activating patches; the exact network doesn't matter. Each row corresponds to one neuron from one layer of a network, and the entries in the row are the patches from some large data set of images that maximally activated that one neuron. These can give you a sense for what type of features these neurons might be looking for. For example, in this top row we see a lot of circly kinds of things in the patches: some eyes, mostly eyes, but also this kind of blue circly region. So maybe this particular neuron in this particular layer of this network is looking for blue circly things in the input. Or in the middle here we have neurons that seem to be looking for text in different colors, or maybe curving edges of different colors and orientations. Yeah, I've been a little bit loose with terminology here: I'm saying that a neuron is one scalar value in that conv5 activation map, but because it's convolutional, all the neurons in one channel are using the same weights. So we've chosen one channel, and you get a lot of neurons for each convolutional filter at any one layer, so these patches could have been drawn from anywhere in the image due to the convolutional nature of the thing. At the bottom we also see some maximally activating patches for neurons from a higher-up layer in the same network. Because they come from higher in the network they have a larger receptive field, so they're looking at larger patches of the input image, and we can see that they're looking for maybe larger structures in the input image. This second row from the bottom seems to be looking for humans or maybe human faces; there's maybe something looking for parts of cameras, or different types of larger, object-like things. Another cool experiment we can do, which comes from the Zeiler and Fergus ECCV 2014 paper, is this idea of an occlusion experiment. What we want to do is figure out which parts of the input image cause the network to make its classification decision. So we'll take our input image, in this case an elephant, block out some region of that input image, and replace it with the mean pixel value from the data set. Now we run that occluded image through the network and record the predicted probability for the occluded image. Then we slide this occlusion patch over every position in the input image, repeat the same process, and draw a heat map showing what the predicted probability output from the network was, as a function of which part of the input image we occluded. The idea is that if blocking out some part of the image causes the network score to change drastically, then probably that part of the input image was really important for the classification decision.
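(Here is a minimal sketch of that occlusion loop, assuming `model`, a preprocessed image tensor `img`, and a class index `target`; the patch size and stride are arbitrary choices for illustration, not values from the paper.)

import torch

def occlusion_heatmap(model, img, target, patch=32, stride=16, fill=0.0):
    # img: (1, 3, H, W); fill=0.0 roughly plays the role of the mean pixel
    # after standard mean/std normalization.
    model.eval()
    _, _, H, W = img.shape
    heatmap = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    with torch.no_grad():
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = img.clone()
                occluded[:, :, y:y+patch, x:x+patch] = fill        # gray out one square region
                probs = torch.softmax(model(occluded), dim=1)
                heatmap[i, j] = probs[0, target]                   # class probability with this region hidden
    return heatmap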
So here I've shown three different examples of this occlusion-type experiment. Take the example of a go-kart at the bottom: here red corresponds to a low probability and white and yellow correspond to a high probability. When we block out the region of the image corresponding to the go-kart in front, the predicted probability for the go-kart class drops a lot. That gives us some sense that the network really is relying on those pixels of the input image to make its classification decision. Question? Yes, the question is: what's going on in the background? The image is maybe a little bit too small to tell, but this is actually a go-kart track, and there are a couple of other go-karts in the background. So I think that when you block out those other go-karts in the background, that also influences the score; or maybe the horizon is there and the horizon is a useful feature for detecting go-karts. It's a little bit hard to tell sometimes, but this is a pretty cool visualization. Yeah, was there another question? Sorry, what was the first question? So for this example we're taking one image and then masking all parts of that one image. The second question was: how is this useful? Maybe you don't take this information and loop it directly into the training process; instead, this is a tool for humans to understand what types of computations these trained networks are doing. So it's more for your understanding than for improving performance per se. Another related idea is the concept of a saliency map, which is something you'll see in your homework. Again we have the same question: given an input image, of a dog in this case, and the predicted class label of dog, we want to know which pixels in the input image are important for classification. We saw that masking is one way to get at this question, but saliency maps are another angle for attacking the problem. One relatively simple idea, from Karen Simonyan's paper from a couple of years ago, is just to compute the gradient of the predicted class score with respect to the pixels of the input image. This directly tells us, in a first-order approximation sense, for each pixel in the input image: if we wiggle that pixel a little bit, how much will the classification score for the class change? This is another way to get at the question of which pixels in the input matter for the classification. When we compute a saliency map for this dog, we see a nice outline of the dog in the image, which tells us that these are probably the pixels the network is actually looking at for this image. And when we repeat this type of process for different images, we get some sense that the network is looking at the right regions, which is somewhat comforting. Question? The question is: do people use saliency maps for semantic segmentation? The answer is yes; you guys are really on top of it this lecture. That was another component of Karen's paper.
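(Before getting to that segmentation idea, here is a rough sketch of the basic saliency computation just described, assuming a trained `model`, a preprocessed image `img`, and a class index `target`.)

import torch

def saliency_map(model, img, target):
    # img: (1, 3, H, W)
    model.eval()
    img = img.clone().requires_grad_(True)
    scores = model(img)                       # unnormalized class scores
    scores[0, target].backward()              # gradient of the target class score w.r.t. the pixels
    # Take the absolute value and the max over the 3 color channels,
    # giving one saliency value per pixel.
    return img.grad.abs().max(dim=1)[0].squeeze(0)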
In that paper there's this idea that maybe you can use these saliency maps to perform semantic segmentation without any labeled data for the segments. Here they use the GrabCut segmentation algorithm, which I don't really want to get into the details of; it's an interactive segmentation algorithm that you can use. When you combine the saliency map with the GrabCut segmentation algorithm, you can in fact sometimes segment out the object in the image, which is really cool. However, I'd like to point out that this is a little bit brittle, and in general it will probably work much, much worse than a network which did have access to supervision at training time. So I'm not sure how practical this is, but it is pretty cool that it works at all. Another related idea is guided backpropagation. Again we want to answer a question for one particular image, but now, instead of looking at the class score, we want to pick some intermediate neuron in the network and ask which parts of the input image influence the value of that internal neuron. You could imagine computing a saliency map for this as well: rather than computing the gradient of the class score with respect to the pixels of the image, you could compute the gradient of some intermediate value in the network with respect to the pixels, and that would tell us which pixels in the input image influence the value of that particular neuron. That would be using normal backpropagation. But it turns out there's a slight tweak we can make to the backpropagation procedure that ends up giving somewhat cleaner images. That's this idea of guided backpropagation, which again comes from the Zeiler and Fergus 2014 paper. I don't really want to get into the details too much here, but it's a kind of funny tweak where you change the way you backpropagate through ReLU non-linearities: you only backpropagate positive gradients through the ReLUs, and you do not backpropagate negative gradients through them. So you're no longer computing the true gradient; instead, you're only keeping track of positive influences throughout the entire network. You should read through the referenced papers if you want a little more detail about why that's a good idea. But empirically, when you do guided backpropagation as opposed to regular backpropagation, you tend to get much cleaner, nicer images that tell you which pixels of the input image influence that particular neuron. So here we're seeing the same visualization we saw a few slides ago of the maximally activating patches, but now, in addition to visualizing those patches, we've also performed guided backpropagation to tell us exactly which parts of the patches influence the score of that neuron. Remember, for the example at the top we thought this neuron might be looking for circly type things in the input patch, because there are a lot of circly type patches.
Well, when we look at guided backpropagation, we can see that that intuition is somewhat confirmed, because it is indeed the circly parts of the input patch which are influencing that neuron's value. So this is a useful tool for understanding what these different intermediate features are looking for. But one interesting thing about guided backpropagation, or computing saliency maps, is that they're always a function of a fixed input image: they tell us, for a fixed input image, which pixels or which parts of that input image influence the value of the neuron. Another question you might ask is: can we remove this reliance on some input image, and instead just ask what type of input, in general, would cause this neuron to activate? We can answer this question using a technique called gradient ascent. Remember, we always use gradient descent to train our convolutional networks by minimizing the loss; now, instead, we fix the weights of our trained convolutional network and synthesize an image by performing gradient ascent on the pixels of the image, to try to maximize the score of some intermediate neuron or of some class. In this process of gradient ascent, we're no longer optimizing over the weights of the network; those weights remain fixed. Instead, we're trying to change the pixels of some input image so that this neuron value, or this class score, is maximized. In addition, we need some regularization term. Remember that before we saw regularization terms used to prevent the network weights from overfitting to the training data; now we need something similar to prevent the pixels of our generated image from overfitting to the peculiarities of the particular network. So here we'll often incorporate some regularization term, because we want the generated image to have two properties: we want it to maximally activate some score or some neuron value, but we also want it to look like a natural image, to have the kinds of statistics that we typically see in natural images. The regularization term in the objective is there to enforce that the generated image looks relatively natural, and we'll see a couple of different examples of regularizers as we go. The general strategy is actually pretty simple, and again, you'll implement a lot of things of this nature on assignment three. We start with some initial image, either initialized to zeros or to uniform or Gaussian noise, and then repeat: forward the image through the network and compute the score or neuron value we're interested in, backpropagate to compute the gradient of that score with respect to the pixels of the image, and then make a small gradient ascent update to the pixels of the image itself, to try to maximize that score. We repeat this process over and over again until we have a beautiful image. As for the image regularizer, a very simple idea is to penalize the L2 norm of the generated image. This is not so semantically meaningful; it just does something, and it was one of the earliest regularizers that appeared in the literature for these image-generation-type papers.
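(A minimal sketch of that gradient ascent loop with the simple L2 regularizer, assuming a frozen, pretrained `model` and a class index `target`; the step size, number of steps, and regularization strength are made-up values.)

import torch

def class_visualization(model, target, steps=200, lr=1.0, l2_reg=1e-3):
    model.eval()
    img = torch.randn(1, 3, 224, 224, requires_grad=True)     # start from random noise
    for _ in range(steps):
        scores = model(img)
        # Maximize the class score minus an L2 penalty on the image itself.
        objective = scores[0, target] - l2_reg * (img ** 2).sum()
        model.zero_grad()
        if img.grad is not None:
            img.grad.zero_()
        objective.backward()
        with torch.no_grad():
            img += lr * img.grad                               # gradient *ascent* step on the pixels
    return img.detach()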
When you run this on a trained network, we're trying to generate images that maximize the dumbbell score in the upper left-hand corner here, for example, and you can see in the synthesized image, though it's a little bit hard to see, that there are a lot of different dumbbell-like shapes all superimposed at different portions of the image. Or if we try to generate an image for cups, we can maybe see a bunch of different cups all superimposed. The Dalmatian is pretty cool, because you can see the black and white spotted pattern that's characteristic of Dalmatians, and for lemons we can see these different yellow splotches in the image. There are a couple more examples here; I think the goose is kind of cool, and the kit foxes actually maybe look like kit foxes. Question? The question is: why are these all rainbow-colored? In general, getting true colors out of this visualization is pretty tricky, because any actual image will be bounded in the range zero to 255, so this really should be a constrained optimization problem. But if we're using these generic methods for gradient ascent, it's an unconstrained problem, so maybe you use something like a projected gradient ascent algorithm, or you rescale the image at the end. So the colors you see in these visualizations sometimes can't be taken too seriously. Question? The question is: what happens if you let the thing loose and don't put any regularizer on it? Well, then you tend to get an image which maximizes the score and is confidently classified as the class you wanted, but usually it doesn't look like anything; it kind of looks like random noise. That's an interesting property in itself that we'll go into in much more detail in a future lecture. But that's why it doesn't help you so much for understanding what the network is looking for: if we want to understand why the network makes its decisions, it's useful to put a regularizer on there so the generated image looks more natural. Question in the back? Yeah, so the question is that we see a lot of multimodality here, and are there ways to combat that? Actually yes, we'll see that; this is kind of the first step in a whole line of work on improving these visualizations. So the angle here is to improve the regularizer in order to improve the visualized images. There's another paper from Jason Yosinski and some of his collaborators where they added some additional, more impressive regularizers. In addition to the L2 norm constraint, periodically during optimization they also do some Gaussian blurring of the image, and they also clip small pixel values, and pixel values with low gradients, all the way to zero. You can see this as a kind of projected gradient ascent algorithm, where periodically we project our generated image onto a nicer set of images with nicer properties, for example spatial smoothness via the Gaussian blurring. When you do this, you tend to get much nicer images that are much clearer to see. Now these flamingos look like flamingos, the ground beetle is starting to look more beetle-like, and this black swan maybe looks like a black swan.
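(A rough sketch of what one of these periodic "projection" steps might look like, interleaved with the gradient ascent loop sketched earlier; the thresholds and blur settings here are invented for illustration, not the values from the paper.)

import torch
import torchvision.transforms.functional as TF

def regularize(img, grad, blur_sigma=0.5, pixel_clip=1e-2, grad_clip=1e-4):
    # Encourage spatial smoothness by blurring the image a little.
    img = TF.gaussian_blur(img, kernel_size=5, sigma=blur_sigma)
    # Zero out pixels with tiny magnitude and pixels whose gradients are tiny.
    img = torch.where(img.abs() < pixel_clip, torch.zeros_like(img), img)
    grad = torch.where(grad.abs() < grad_clip, torch.zeros_like(grad), grad)
    return img, grad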
These billiard tables actually look kind of impressive now; you can definitely see the billiard table structure. So you can see that once you add in nicer regularizers, the generated images become a little bit cleaner. We can perform this procedure not only for the final class scores, but also for intermediate neurons: instead of trying to maximize the billiard table score, for example, we can maximize one of the neurons from some intermediate layer. Question? The question is: what's with the four images for each example? Remember that we're initializing our image randomly, so these four images correspond to different random initializations of the input image. Again, we can use this same type of procedure to synthesize images which maximally activate intermediate neurons of the network, and then you can get a sense of what some of those intermediate neurons are looking for: maybe at layer four there's a neuron that's looking for spirally things, or a neuron that's maybe looking for chunks of caterpillars; it's a little bit harder to tell. But in general, as you go higher up in the network, the receptive fields of these neurons get larger, so they're looking at larger patches of the image, and they tend to be looking for larger structures or more complex patterns in the input image. That's pretty cool. People have really gone crazy with this, basically improving these visualizations by tacking on extra features. So this was a cool paper explicitly trying to address the multimodality that someone asked a question about a few minutes ago. Here they try to explicitly take that multimodality into account in the optimization procedure: for each of the classes, they run a clustering algorithm to try to separate the class into different modes, and then initialize with something that is close to one of those modes. When you do that, you account for this multimodality. For intuition, on the right here these eight images are all of grocery stores, but the top row is close-up pictures of produce on the shelves, which are labeled as grocery store, and the bottom row shows people walking around grocery stores or at the checkout line or something like that, which are also labeled as grocery store, even though their visual appearance is quite different. So a lot of these classes end up being multimodal, and if you explicitly take this multimodality into account when generating images, you can get nicer results. When you look at some of their example synthesized images for classes, like the bell pepper, the cardoon, the strawberries, or the jack-o'-lantern, they end up with some very beautifully generated images. I don't want to get too much into the details of the next slide, but you can go even crazier and add an even stronger image prior and generate some very beautiful images indeed. These are all synthesized images that are trying to maximize the class score of some ImageNet class, but the general idea is that rather than directly optimizing the pixels of the input image, they instead optimize the FC6 representation of the image.
To do that they need to use a feature inversion network, and I don't want to get into the details here; you should read the paper, it's actually really cool. But the point is that when you start adding additional priors toward modeling natural images, you can end up generating some quite realistic images that give you some sense of what the network is looking for. So that's one cool thing we can do with this strategy. This idea of trying to synthesize images by using gradients on the image pixels is actually super powerful, and another really cool thing we can do with it is this concept of fooling images. What we can do is pick some arbitrary image, say a picture of an elephant, and then tell the network that we want to change the image to maximize the score of koala bear instead. So we try to change that image of an elephant to cause the network to classify it as a koala bear. What you might hope for is that the elephant would sort of morph into a koala bear, and maybe it would sprout cute little ears or something like that. But that's not what happens in practice, which is pretty surprising. Instead, if you take this picture of an elephant and try to change it so that it's classified as a koala bear, what you'll find is that the second image on the right is indeed classified as a koala bear, but it looks the same to us. That's pretty fishy and pretty surprising. Also, at the bottom we've taken this picture of a boat, the schooner class, and told the network to classify it as an iPod. The second example still looks just like a boat to us, but the network thinks it's an iPod, and the differences in pixels between the two images are basically nothing. If you magnify those differences, you don't really see any iPod-like or koala-like features in them; they're just kind of random patterns of noise. So the question is, what's going on here, and how can this possibly be the case? Well, we'll have a guest lecture from Ian Goodfellow in a week and a half or two weeks, and he's going to go into much more detail about this type of phenomenon; that will be really exciting. But I did want to mention it here because it is on your homework. Question? Yeah, so the question is: can we use fooled images as training data? I think Ian is going to go into much more detail on all of these types of strategies, because that's really a whole lecture unto itself. Question? The question is: why do we care about any of this stuff? Okay, maybe that was a mischaracterization, I'm sorry. Yeah, the question is: how does understanding these intermediate neurons help our understanding of the final classification? This whole field of trying to visualize intermediates is kind of a response to a common criticism of deep learning. The criticism is: you've got this big black-box network, you trained it with gradient descent, you get a good number, and that's great, but we don't trust the network because we don't understand, as people, why it's making the decisions that it's making.
So a lot of these visualization techniques were developed to try to address that, and to try to understand, as people, why the networks are making their various classification decisions. Because if you contrast a deep convolutional neural network with other machine learning techniques: linear models are much easier to interpret in general, because you can look at the weights and understand how much each input feature affects the decision, and if you look at something like a random forest or a decision tree, some of those other machine learning models end up being a bit more interpretable just by their very nature than these sort of black-box convolutional networks. So a lot of this is in response to that criticism, to say that, yes, these are large, complex models, but they are still doing some interesting and interpretable things under the hood; they are not just randomly classifying things; they are doing something meaningful. Another cool thing we can do with this gradient-based optimization of images is the idea of DeepDream. This was a really cool blog post that came out from Google a year or two ago. We've talked about scientific value; this is almost entirely for fun. The point of this exercise is mostly to generate cool images, and as a side effect you also get some sense for what features these networks are looking for. What we do is take our input image, run it through the convolutional network up to some layer, and then set the gradient at that layer equal to the activation values themselves, and now backpropagate back to the image, update the image, and repeat, repeat, repeat. This has the interpretation of trying to amplify existing features that were detected by the network in this image: whatever features existed at that layer, we set the gradient equal to the feature, and we just tell the network to amplify whatever it already saw in the image. By the way, you can also see this as trying to maximize the L2 norm of the features at that layer. When you do this, the code ends up looking really simple; the code for many of your homework assignments will probably be about this complex, or maybe even a little bit less so. But there are a couple of tricks here that you'll also see in your assignments. One trick is to jitter the image before you compute your gradients: rather than running the exact image through the network, you shift the image over by two pixels and wrap the other two pixels around. This acts as a kind of regularizer to encourage a little bit of extra spatial smoothness in the image. You'll also see that they use L1 normalization of the gradients, which is a useful trick sometimes in these image generation problems, and you'll see them clipping the pixel values once in a while. Again, images should actually be bounded between zero and 255, so this is a kind of projected gradient descent, where we project onto the space of actual valid images. When we do all this, we might start with some image of the sky and then get really cool results like this: we've taken these tiny features in the sky and they get amplified through this process.
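(A minimal sketch of a single DeepDream update with simplified versions of the jitter, L1 gradient normalization, and pixel clipping tricks; `model`, `layer`, and `img` are assumed to be supplied by the caller, and the clipping bounds are a crude stand-in for clipping normalized images rather than the released code's exact values.)

import torch

def deepdream_step(model, layer, img, lr=0.1, jitter=2):
    # img: (1, 3, H, W), a plain tensor with requires_grad=False
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(out=o))

    ox, oy = torch.randint(-jitter, jitter + 1, (2,)).tolist()
    img_j = torch.roll(img, shifts=(ox, oy), dims=(2, 3)).requires_grad_(True)   # jitter the image

    model(img_j)
    # Sending the activations back as the "gradient" is the same as maximizing
    # (1/2) * ||activation||^2, so we just backprop that objective explicitly.
    (acts['out'] ** 2).sum().mul(0.5).backward()

    g = img_j.grad
    g = g / (g.abs().mean() + 1e-8)              # L1-normalize the gradient
    with torch.no_grad():
        img_j += lr * g                          # gradient ascent on the pixels
    handle.remove()

    out = torch.roll(img_j.detach(), shifts=(-ox, -oy), dims=(2, 3))   # undo the jitter
    return out.clamp(-2.0, 2.0)                  # crude clipping to a plausible normalized pixel range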
In the generated images we can see things like different mutant animals starting to pop up, or these kinds of spiral shapes, or different kinds of houses and cars. That's all pretty interesting. There are a couple of patterns in particular that pop up all the time and that people have named: there's this admiral dog that shows up a lot, there's the pig-snail, the camel-bird, the dog-fish. These are kind of interesting, and the fact that dogs show up so much in these visualizations actually tells us something about the data on which the network was trained. This is a network that was trained for ImageNet classification; ImageNet has a thousand categories, but 200 of those categories are dogs. So it's not surprising, in a sense, that when you do these kinds of visualizations the network ends up hallucinating a lot of dog-like stuff in the image, often morphed with other types of animals. When you do this at other layers of the network, you get other types of results. Here we're taking one of the lower layers in the network; the previous example was relatively high up. We have this interpretation that lower layers are maybe computing edges and swirls and things like that, and that's kind of borne out when we run DeepDream at a lower layer. Or if you run this thing for a long time and maybe add in some multiscale processing, you can get some really, really crazy images. Here they're doing a kind of multiscale processing where they start with a small image, run DeepDream on the small image, then make it bigger and continue DeepDream on the larger image, and repeat with this multiscale processing; then maybe after you complete the final scale, you restart from the beginning and just go wild on the thing, and you can get some really crazy images. These examples were all from networks trained on ImageNet. There's another data set from MIT called the MIT Places data set; instead of 1,000 categories of objects, it has about 200 different types of scenes, like bedrooms and kitchens and stuff like that. If we repeat this DeepDream procedure using a network trained on MIT Places, we get some really cool visualizations as well. Now, instead of dogs and slugs and admiral dogs and that kind of stuff, we often get these roof shapes of Japanese-style buildings, or different types of bridges or mountain ranges. They're really, really cool, beautiful visualizations. The code for DeepDream is online, released by Google; you can go check it out and make your own beautiful pictures. Sorry, question? The question is: what are we taking the gradient of? Like I said, if you think of one half x squared, the gradient of that is x. So if you send back the activation values themselves as the gradient, that's equivalent to taking the gradient of one half the sum of the squared activations, which is equivalent to maximizing the norm of the features at that layer. But in practice, many implementations you'll see don't explicitly compute that objective; they just send the gradient back. Another useful thing we can do is this concept of feature inversion. This again gives us a sense for what types of elements of the image are captured at different layers of the network.
So what we're going to do now is take an image, run that image through the network, record the feature vector at one of the layers, and then try to reconstruct that image from its feature representation. Based on what the reconstructed image looks like, that gives us some sense of what type of information about the image was captured in that feature vector. Again, we can do this with gradient ascent and some regularizer. Now rather than maximizing some score, we instead want to minimize the distance between this cached feature vector and the computed features of our generated image, to try to synthesize a new image that matches the feature vector we computed before. Another regularizer that you frequently see here is the total variation regularizer, which you'll also see on your homework. The total variation regularizer penalizes differences between adjacent pixels, both left-to-right and top-to-bottom, to again encourage spatial smoothness in the generated image. Now if we do this idea of feature inversion, in this visualization on the left we're showing some original images, the elephants or the fruit, and then we run the image through a VGG-16 network, record the features at some layer, and try to synthesize a new image that matches the recorded features at that layer. This gives us a sense of how much information is stored in the features of different layers. For example, if we try to reconstruct the image based on the relu2_2 features of VGG-16, we see that the image gets almost perfectly reconstructed, which means that we're not really throwing away much information about the raw pixel values at that layer. But as we move up into the deeper parts of the network and try to reconstruct from relu4_3 or relu5_1, we see that the reconstructed image keeps the general spatial structure. You can still tell that it's an elephant or a banana or an apple, but a lot of the low-level details aren't exactly what the pixel values, colors, and textures were. Those low-level details are lost at these higher layers of the network. So that gives us some sense that as we move up through the layers of the network, it's throwing away low-level information about the exact pixels of the image, and instead keeping around a bit more semantic information that's a little invariant to small changes in color and texture and things like that. So we're building towards style transfer here, which is really cool. In addition to feature inversion, to understand style transfer we also need to talk about a related problem called texture synthesis. Texture synthesis is kind of an old problem in computer graphics. The idea is that we're given some input patch of texture, something like these little scales here, and we want to build some model that generates a larger piece of that same texture. So for example, we might want to generate a large image containing many scales that look like the input. This is, again, a pretty old problem in computer graphics.
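Here is a minimal sketch of that feature-inversion objective with a total-variation regularizer (editor's illustration; it assumes a torchvision VGG-16, and the layer index, TV weight, learning rate, and iteration count are illustrative, with random tensors standing in for real preprocessed images):

import torch
import torchvision.models as models

# Feature inversion (sketch): synthesize an image whose features at a chosen
# layer match the cached features of a target image, with a total-variation
# regularizer to encourage spatial smoothness.

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

LAYER = 8          # relu2_2 in torchvision's vgg16.features indexing (assumption worth double-checking)
TV_WEIGHT = 1e-4

def features(x, layer_idx=LAYER):
    for i, m in enumerate(vgg):
        x = m(x)
        if i == layer_idx:
            return x
    return x

def tv_loss(x):
    # Penalize squared differences between vertically and horizontally adjacent pixels.
    dh = (x[:, :, 1:, :] - x[:, :, :-1, :]).pow(2).sum()
    dw = (x[:, :, :, 1:] - x[:, :, :, :-1]).pow(2).sum()
    return dh + dw

target_img = torch.rand(1, 3, 224, 224)          # stand-in for a real, preprocessed photo
with torch.no_grad():
    target_feats = features(target_img)          # the cached feature vector to invert

x = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)

for it in range(500):
    optimizer.zero_grad()
    loss = (features(x) - target_feats).pow(2).sum() + TV_WEIGHT * tv_loss(x)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        x.clamp_(0, 1)                           # keep the image in a valid pixel range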
There are nearest-neighbor approaches to texture synthesis that work pretty well. There are no neural networks here; instead it's a fairly simple algorithm where we march through the generated image one pixel at a time in scan-line order, look at a neighborhood around the current pixel based on the pixels that we've already generated, compute a nearest neighbor of that neighborhood among the patches of the input image, and then copy over one pixel from the input image. You don't really need to understand the details here; the point is just that there are a lot of classical algorithms for texture synthesis. It's a pretty old problem, and you can do it without neural networks. When you run this kind of classical texture synthesis algorithm it actually works reasonably well for simple textures, but as we move to more complex textures, these simple methods of copying pixels directly from the input patch tend not to work so well. So in 2015 there was a really cool paper that tried to apply neural network features to this problem of texture synthesis, and ended up framing it as a kind of gradient ascent procedure, similar to the feature matching objectives that we've seen already. In order to perform neural texture synthesis, they use this concept of a Gram matrix. What we're going to do is take our input texture, in this case some pictures of rocks, pass it through some convolutional neural network, and pull out the convolutional features at some layer of the network. So maybe this convolutional feature volume is C by H by W at that layer of the network. You can think of this as an H by W spatial grid, and at each point of the grid we have a C-dimensional feature vector describing the rough appearance of the image at that point. Now we're going to use this activation map to compute a descriptor of the texture of the input image. What we do is pick out two of these feature columns in the input volume; each of these feature columns is a C-dimensional vector. We take the outer product between those two vectors, which gives us a C by C matrix. This C by C matrix tells us something about the co-occurrence of the different features at those two points in the image. If element i, j of the C by C matrix is large, that means elements i and j of those two input vectors were both large, something like that. So this captures some second-order statistics about which features in that feature map tend to activate together at different spatial positions. Now we repeat this procedure using all the different pairs of feature vectors from all the different points in this H by W grid, average them all out, and that gives us our C by C Gram matrix. This is then used as a descriptor to describe the texture of the input image. What's interesting about this Gram matrix is that it has thrown away all the spatial information that was in the feature volume, because we've averaged over all pairs of feature vectors at every point in the image. Instead, it's just capturing the second-order co-occurrence statistics between features. And this ends up being a nice descriptor for texture. And by the way, it's really efficient to compute.
So, if you have a C by H by W three-dimensional tensor, you can just reshape it to C by (H times W), multiply that by its own transpose, and compute the whole thing in one shot, so it's super efficient. You might be wondering why we don't use an actual covariance matrix or something like that instead of this funny Gram matrix, and the answer is that using true covariance matrices also works, but it's a little more expensive to compute. So in practice a lot of people just use this Gram matrix descriptor. Now, once we have this neural descriptor of texture, we use a similar type of gradient ascent procedure to synthesize a new image that matches the texture of the original image. This looks kind of like the feature reconstruction that we saw a few slides ago, but instead of trying to reconstruct the whole feature map of the input image, we're just going to try to reconstruct its Gram matrix texture descriptor. In practice, what this looks like is that you'll download some pretrained model; as with feature inversion, people often use the VGG networks for this. You'll take your texture image, feed it through the VGG network, and compute the Gram matrix at many different layers of the network. Then you'll initialize your new image from some random initialization, and it looks like gradient ascent again, just like the other methods we've seen. You take that image, pass it through the same VGG network, compute the Gram matrix at various layers, and compute your loss as the L2 distance between the Gram matrices of your input texture and your generated image. Then you backprop to get the gradient with respect to the pixels of your generated image, and make a gradient step to update the pixels a little bit. You repeat this process many times: go forward, compute your Gram matrices, compute your loss, backprop the gradient to the image, and repeat. Once you do this, eventually you'll generate a texture that matches your input texture quite nicely. This was all from a NIPS 2015 paper by a group in Germany, and they had some really cool results for texture synthesis. Here on the top we're showing four different input textures, and on the bottom we're showing the results of this texture synthesis approach by Gram matrix matching, computing the Gram matrix at different layers of this pretrained convolutional network. You can see that if we use the very low layers of the convolutional network, we generally get splotches of the right colors, but the overall spatial structure isn't preserved so much. As we move further down in the figure and compute these Gram matrices at higher layers, you see that they tend to reconstruct larger patterns from the input image, for example whole rocks or whole cranberries. So this works pretty well: we can synthesize new images that match the general spatial statistics of the inputs, but are quite different pixel-wise from the actual input itself. Question? So the question is, where do we compute the loss? In practice, to get good results people typically compute Gram matrices at many different layers, and then the final loss is a sum over all of those, potentially a weighted sum.
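A minimal sketch of the Gram matrix computed via that reshape trick, plus the texture-synthesis loop it drives (editor's illustration; the VGG layer indices, learning rate, and iteration count are illustrative, and a random tensor stands in for the real texture patch):

import torch
import torchvision.models as models

# Gram-matrix texture descriptor and a neural texture-synthesis loop (sketch).

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = [3, 8, 15, 22]     # a handful of relu layers in vgg16.features (illustrative)

def gram_matrix(feats):
    # feats: (1, C, H, W) -> (C, C); reshape to C x (H*W), multiply by its transpose,
    # and normalize by the number of spatial positions.
    _, C, H, W = feats.shape
    F = feats.view(C, H * W)
    return (F @ F.t()) / (H * W)

def layer_features(x, layer_ids):
    out = []
    for i, m in enumerate(vgg):
        x = m(x)
        if i in layer_ids:
            out.append(x)
    return out

texture = torch.rand(1, 3, 256, 256)          # stand-in for a real texture patch
with torch.no_grad():
    target_grams = [gram_matrix(f) for f in layer_features(texture, STYLE_LAYERS)]

x = torch.rand(1, 3, 256, 256, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)

for it in range(500):
    optimizer.zero_grad()
    grams = [gram_matrix(f) for f in layer_features(x, STYLE_LAYERS)]
    # L2 distance between Gram matrices, summed over the chosen layers.
    loss = sum((g - tg).pow(2).sum() for g, tg in zip(grams, target_grams))
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        x.clamp_(0, 1)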
But I think for this visualization, to pinpoint the effect of the different layers, they were doing the reconstruction from just one layer. Then they had a really brilliant idea after this paper, which is: what if we do this texture synthesis approach, but instead of using an image like rocks or cranberries, we set the input texture equal to a piece of artwork? So for example, if you do the same texture synthesis algorithm by matching Gram matrices, but now take Vincent van Gogh's Starry Night or The Muse by Picasso as the input texture and run that same texture synthesis algorithm, then we can see that the generated images tend to reconstruct interesting pieces from those artworks. And now something really interesting happens when you combine this idea of texture synthesis by Gram matrix matching with feature inversion by feature matching. That brings us to this really cool algorithm called style transfer. In style transfer, we take two images as input. One is a content image that will guide what we generally want our output to look like, and the other is a style image that tells us the general texture or style that we want our generated image to have. Then we generate a new image by jointly minimizing the feature reconstruction loss of the content image and the Gram matrix loss of the style image. When we do these two things, we get a really cool image that renders the content image in the artistic style of the style image. This is really cool, and you can get these really beautiful figures. Again, what this looks like is that you take your style image and your content image and pass them into your network to compute your Gram matrices and your features. Then you initialize your output image with some random noise, go forward to compute your losses, go backward to compute your gradients on the image, and repeat this process over and over, doing gradient ascent on the pixels of your generated image. After a few hundred iterations you'll generally get a beautiful image. I have an implementation of this online on my GitHub that a lot of people are using, and it's really cool. This gives you a lot more control over the generated image compared to DeepDream. In DeepDream you don't have much control over exactly what's going to come out at the end; you just pick different layers of the network, maybe set different numbers of iterations, and then dog slugs pop up everywhere. But with style transfer you get much more fine-grained control over what you want the result to look like. By picking different style images with the same content image, you can generate entirely different types of results, which is really cool. You can also play around with the hyperparameters here. Because we're minimizing this feature reconstruction loss of the content image and this Gram matrix reconstruction loss of the style image, if you trade off the weighting between those two terms in the loss, you can control how much you want to match the content versus how much you want to match the style. And there are a lot of other hyperparameters you can play with.
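Here is a sketch of that combined style-transfer objective (editor's illustration, not the exact implementation from the lecturer's repository; the content/style layer indices and the alpha/beta weighting are illustrative, and random tensors stand in for real preprocessed images):

import torch
import torchvision.models as models

# Style transfer loss (sketch): content (feature reconstruction) loss at one
# layer plus style (Gram matrix) losses at several layers, optimized over the
# pixels of the generated image.

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

CONTENT_LAYER = 15                 # illustrative choices
STYLE_LAYERS = [3, 8, 15, 22]
ALPHA, BETA = 1.0, 1e3             # content vs. style weighting

def gram_matrix(feats):
    _, C, H, W = feats.shape
    F = feats.view(C, H * W)
    return (F @ F.t()) / (H * W)

def run_vgg(x, layer_ids):
    out = {}
    for i, m in enumerate(vgg):
        x = m(x)
        if i in layer_ids:
            out[i] = x
    return out

content_img = torch.rand(1, 3, 256, 256)   # stand-ins for real, preprocessed images
style_img = torch.rand(1, 3, 256, 256)

with torch.no_grad():
    content_target = run_vgg(content_img, [CONTENT_LAYER])[CONTENT_LAYER]
    style_targets = {i: gram_matrix(f) for i, f in run_vgg(style_img, STYLE_LAYERS).items()}

# The lecture initializes the output from random noise; starting from the content image is a common variant.
x = torch.rand(1, 3, 256, 256, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)

for it in range(500):
    optimizer.zero_grad()
    feats = run_vgg(x, set(STYLE_LAYERS) | {CONTENT_LAYER})
    content_loss = (feats[CONTENT_LAYER] - content_target).pow(2).sum()
    style_loss = sum((gram_matrix(feats[i]) - style_targets[i]).pow(2).sum()
                     for i in STYLE_LAYERS)
    loss = ALPHA * content_loss + BETA * style_loss
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        x.clamp_(0, 1)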
For example, if you resize the style image before you compute the Gram matrix, that gives you some control over the scale of the features that get reconstructed from the style image. You can see that here: we've done the same reconstruction, and the only difference is how big the style image was before we computed the Gram matrix. So that gives you another axis along which you can control these things. You can also do style transfer with multiple style images, by matching multiple Gram matrices at the same time, and that's a pretty cool result. Another cool thing you can do: we talked about multiscale processing for DeepDream and saw how it can give you really cool high-resolution results, and you can do a similar type of multiscale processing in style transfer as well. Then we can compute images like this one, which is super high resolution, I think a 4K image of our favorite school rendered in the style of Starry Night. But this is actually super expensive to compute; I think this one took four GPUs. So, a little expensive. We can also use other style images and get some really cool results from the same content image, again at high resolution. Another fun thing is that you can actually do joint style transfer and DeepDream at the same time. Now we have three losses: the content loss, the style loss, and this DeepDream loss that tries to maximize the feature norm. And you get something like this: Van Gogh with the dog slugs coming out everywhere. [laughing] So that's really cool. But there's kind of a problem with these style transfer algorithms, which is that they're pretty slow. You need to compute a lot of forward and backward passes through your pretrained network in order to generate these images, and especially for the high-resolution results that we saw on the previous slide, each forward and backward pass of a 4K image takes a lot of compute and a lot of memory. If you need to do several hundred of those iterations, generating these images can take many minutes even on a powerful GPU. So it's really not so practical to apply these things in practice. The solution is to train another neural network to do the style transfer for us. I had a paper about this last year, and the idea is that we fix some style that we care about at the beginning, in this case Starry Night. Now, rather than running a separate optimization procedure for each image that we want to synthesize, we train a single feed-forward network that takes in the content image and directly outputs the stylized result. The way we train this network is that we compute the same content and style losses during training of our feed-forward network, and use that same gradient to update the weights of the feed-forward network. This thing takes maybe a few hours to train, but once it's trained, in order to produce stylized images you just need to do a single forward pass through the trained network. I have code for this online as well, and you can see that it ends up looking relatively comparable in quality, in some cases, to the very slow optimization-based method, but it runs in real time; it's about a thousand times faster. Here you can see a demo of it running live off my webcam.
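A minimal sketch of that feed-forward training setup (editor's illustration; the tiny three-layer transform network, layer indices, loss weights, and random stand-in data below are all placeholders, not the architecture or hyperparameters from the paper):

import torch
import torch.nn as nn
import torchvision.models as models

# Fast style transfer (sketch): train a feed-forward "transform" network with
# the same content + style losses, then stylize new images with a single forward pass.

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

CONTENT_LAYER, STYLE_LAYERS = 15, [3, 8, 15, 22]
ALPHA, BETA = 1.0, 1e3

def gram_matrix(feats):
    B, C, H, W = feats.shape
    F = feats.view(B, C, H * W)
    return torch.bmm(F, F.transpose(1, 2)) / (H * W)

def run_vgg(x, layer_ids):
    out = {}
    for i, m in enumerate(vgg):
        x = m(x)
        if i in layer_ids:
            out[i] = x
    return out

transform = nn.Sequential(                      # toy stand-in for the real transform net
    nn.Conv2d(3, 32, 3, padding=1), nn.InstanceNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.InstanceNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),           # keep outputs in [0, 1]
)
optimizer = torch.optim.Adam(transform.parameters(), lr=1e-3)

style_img = torch.rand(1, 3, 256, 256)          # stand-in for the one fixed style image
with torch.no_grad():
    style_targets = {i: gram_matrix(f) for i, f in run_vgg(style_img, STYLE_LAYERS).items()}

for step in range(1000):                        # would normally loop over a dataset of photos
    content_batch = torch.rand(4, 3, 256, 256)  # stand-in for a minibatch of content images
    optimizer.zero_grad()
    stylized = transform(content_batch)
    feats = run_vgg(stylized, set(STYLE_LAYERS) | {CONTENT_LAYER})
    with torch.no_grad():
        content_target = run_vgg(content_batch, [CONTENT_LAYER])[CONTENT_LAYER]
    content_loss = (feats[CONTENT_LAYER] - content_target).pow(2).sum()
    style_loss = sum((gram_matrix(feats[i]) - style_targets[i]).pow(2).sum()
                     for i in STYLE_LAYERS)
    loss = ALPHA * content_loss + BETA * style_loss
    loss.backward()                             # gradients update the transform net's weights
    optimizer.step()

# At test time, stylizing an image is just a single forward pass: transform(new_content_image)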
So, this is not running live right now, obviously, but if you have a big GPU you can easily run four different styles in real time, all simultaneously, because it's so efficient. There was another group from Russia that had a very similar paper concurrently, and their results are about as good. They also had a kind of tweak on the algorithm. This feed-forward network that we're training ends up looking a lot like the semantic segmentation networks that we saw: those networks do downsampling, then many layers, then some upsampling with transposed convolution, which makes them more efficient because most of the work happens at a lower resolution. The only difference is that the final layer produces a three-channel output for the RGB of the final image. Inside this network we have batch normalization at the various layers, but in that paper they swap out the batch normalization for something else called instance normalization, which tends to give much better results (a sketch of that idea follows below). One drawback of these types of methods is that we're now training one new style transfer network for every style that we want to apply, and that could be expensive if you need to keep a lot of different trained networks around. So there was a paper from Google that came out pretty recently that addressed this by using one trained feed-forward network to apply many different styles to the input image. They can train one network to apply many different styles at test time: it takes the content image as input, as well as the identity of the style you want to apply, and then uses that one network to apply many different types of styles. Again, it runs in real time. That same algorithm can also do a kind of style blending in real time with one trained network: once you've trained this network on, say, four different styles, you can actually specify a blend of those styles to be applied at test time, which is really cool. These kinds of real-time style transfer methods are in various apps, and you can see them out in practice a lot these days. So, a summary of what we've seen today: we've talked about many different methods for understanding CNN representations. We talked about activation-based methods like nearest neighbors, dimensionality reduction, maximal patches, and occlusion images, to try to understand what the features are looking for based on the activation values. We also talked about a bunch of gradient-based methods, where you use gradients to synthesize new images to understand your features, such as saliency maps, class visualizations, fooling images, and feature inversion. And we also had fun seeing how a lot of these same ideas can be applied to things like style transfer and DeepDream to generate really cool images. So, next time we'll talk about unsupervised learning: autoencoders, variational autoencoders, and generative adversarial networks. That should be a fun lecture.
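For reference, here is a minimal sketch of the instance-normalization and multi-style idea mentioned above, in the form of a conditional instance normalization layer with per-style scale and shift parameters (editor's illustration under stated assumptions; the sizes, the one-hot/soft-blend interface, and the class itself are hypothetical, not the paper's code):

import torch
import torch.nn as nn

# Conditional instance normalization (sketch): one set of conv weights shared
# across styles, with per-style scale and shift parameters selected (or blended) at test time.

class ConditionalInstanceNorm2d(nn.Module):
    def __init__(self, num_channels, num_styles):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(num_styles, num_channels))
        self.beta = nn.Parameter(torch.zeros(num_styles, num_channels))

    def forward(self, x, style_weights):
        # style_weights: (num_styles,). A one-hot vector selects a single style;
        # a soft distribution blends several styles at test time.
        g = (style_weights[:, None] * self.gamma).sum(dim=0)[None, :, None, None]
        b = (style_weights[:, None] * self.beta).sum(dim=0)[None, :, None, None]
        return self.norm(x) * g + b

# Usage: pick style 2 of 4, or blend styles 0 and 1 equally.
cin = ConditionalInstanceNorm2d(num_channels=32, num_styles=4)
x = torch.rand(1, 32, 64, 64)
y_single = cin(x, torch.tensor([0.0, 0.0, 1.0, 0.0]))
y_blend = cin(x, torch.tensor([0.5, 0.5, 0.0, 0.0]))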
Info
Channel: Stanford University School of Engineering
Views: 185,712
Rating: 4.9130435 out of 5
Id: 6wcs6szJWMY
Length: 75min 47sec (4547 seconds)
Published: Fri Aug 11 2017