Lecture 2 | Image Classification

Captions
Okay, so welcome to lecture two of CS231N. On Tuesday, just to recall, we, sort of, gave you the big picture view of what is computer vision, what is the history, and a little bit of the overview of the class. And today, we're really going to dive in, for the first time, into the details. And we'll start to see, in much more depth, exactly how some of these learning algorithms actually work in practice. So, the first lecture of the class is probably, sort of, the largest big picture vision. And the majority of the lectures in this class will be much more detail oriented, much more focused on the specific mechanics, of these different algorithms. So, today we'll see our first learning algorithm and that'll be really exciting, I think. But, before we get to that, I wanted to talk about a couple of administrative issues. One, is Piazza. So, I saw it when I checked yesterday, it seemed like we had maybe 500 students signed up on Piazza. Which means that there are several hundred of you who are not yet there. So, we really want Piazza to be the main source of communication between the students and the course staff. So, we've gotten a lot of questions to the staff list about project ideas or questions about midterm attendance or poster session attendance. And, any, sort of, questions like that should really go to Piazza. You'll probably get answers to your questions faster on Piazza, because all the TAs know to check that. And it's, sort of, easy for emails to get lost in the shuffle if you just send to the course list. It's also come to my attention that some SCPD students are having a bit of a hard time signing up for Piazza. SCPD students are supposed to receive a @stanford.edu email address. So, once you get that email address, then you can use the Stanford email to sign into Piazza. Probably that doesn't affect those of you who are sitting in the room right now, but it matters for those students listening on SCPD. The next administrative issue is about assignment one.
Assignment one will be up later today, probably sometime this afternoon, but I promise, before I go to sleep tonight, it'll be up. But, if you're getting a little bit antsy and really want to start working on it right now, then you can look at last year's version of assignment one. It'll be pretty much the same content. We're just reshuffling it a little bit to make it, like, for example, upgrading to work with Python 3, rather than Python 2.7. And some of these minor cosmetic changes, but the content of the assignment will still be the same as last year. So, in this assignment you'll be implementing your own k-nearest neighbor classifier, which we're going to talk about in this lecture. You'll also implement several different linear classifiers, including the SVM and Softmax, as well as a simple two-layer neural network. And we'll cover all this content over the next couple of lectures. So, all of our assignments are using Python and NumPy. If you aren't familiar with Python or NumPy, then we have written a tutorial that you can find on the course website to try and get you up to speed. But, this is, actually, pretty important. NumPy lets you write these very efficient vectorized operations that let you do quite a lot of computation in just a couple lines of code. So this is super important for pretty much all aspects of numerical computing and machine learning and everything like that, is efficiently implementing these vectorized operations. And you'll get a lot of practice with this on the first assignment. So, for those of you who don't have a lot of experience with Matlab or NumPy or other types of vectorized tensor computation, I recommend that you start looking at this assignment pretty early and also, read carefully through the tutorial. The other thing I wanted to talk about is that we're happy to announce that we're officially supported through Google Cloud for this class. So, Google Cloud is somewhat similar to Amazon AWS. 
You can go and start virtual machines up in the cloud. These virtual machines can have GPUs. We're working on the tutorial for exactly how to use Google Cloud and get it to work for the assignments. But our intention is that you'll be able to just download some image, and it'll be very seamless for you to work on the assignments on one of these instances on the cloud. And because Google has, very generously, supported this course, we'll be able to distribute to each of you coupons that let you use Google Cloud credits for free for the class. So you can feel free to use these for the assignments and also for the course projects when you want to start using GPUs and larger machines and whatnot. So, we'll post more details about that, probably, on Piazza later today. But, I just wanted to mention, because I know there had been a couple of questions about, can I use my laptop? Do I have to run on corn? Do I have to, whatever? And the answer is that, you'll be able to run on Google Cloud and we'll provide you some coupons for that. Yeah, so, those are, kind of, the major administrative issues I wanted to talk about today. And then, let's dive into the content. So, the last lecture we talked a little bit about this task of image classification, which is really a core task in computer vision. And this is something that we'll really focus on throughout the course of the class. Is, exactly, how do we work on this image classification task? So, a little bit more concretely, when you're doing image classification, your system receives some input image, which is this cute cat in this example, and the system is aware of some predetermined set of categories or labels. So, these might be, like, a dog or a cat or a truck or a plane, and there's some fixed set of category labels, and the job of the computer is to look at the picture and assign it one of these fixed category labels. 
This seems like a really easy problem, because so much of your own visual system in your brain is hardwired to doing these, sort of, visual recognition tasks. But this is actually a really, really hard problem for a machine. So, if you dig in and think about, actually, what does a computer see when it looks at this image, it definitely doesn't get this holistic idea of a cat that you see when you look at it. And the computer really is representing the image as this gigantic grid of numbers. So, the image might be something like 800 by 600 pixels. And each pixel is represented by three numbers, giving the red, green, and blue values for that pixel. So, to the computer, this is just a gigantic grid of numbers. And it's very difficult to distill the cat-ness out of this, like, giant array of thousands, or whatever, very many different numbers. So, we refer to this problem as the semantic gap. This idea of a cat, or this label of a cat, is a semantic label that we're assigning to this image, and there's this huge gap between the semantic idea of a cat and these pixel values that the computer is actually seeing. And this is a really hard problem because you can change the picture in very small, subtle ways that will cause this pixel grid to change entirely. So, for example, if we took this same cat, and if the cat happened to sit still and not even twitch, not move a muscle, which is never going to happen, but we moved the camera to the other side, then every single grid, every single pixel, in this giant grid of numbers would be completely different. But, somehow, it's still representing the same cat. And our algorithms need to be robust to this. But, not only viewpoint is one problem, another is illumination. There can be different lighting conditions going on in the scene. Whether the cat is appearing in this very dark, moody scene, or like is this very bright, sunlit scene, it's still a cat, and our algorithms need to be robust to that. Objects can also deform. 
I think cats are, maybe, among the more deformable of animals that you might see out there. And cats can really assume a lot of different, varied poses and positions. And our algorithms should be robust to these different kinds of transforms. There can also be problems of occlusion, where you might only see part of a cat, like, just the face, or in this extreme example, just a tail peeking out from under the couch cushion. But, in these cases, it's pretty easy for you, as a person, to realize that this is probably a cat, and you still recognize these images as cats. And this is something that our algorithms also must be robust to, which is quite difficult, I think. There can also be problems of background clutter, where maybe the foreground object of the cat, could actually look quite similar in appearance to the background. And this is another thing that we need to handle. There's also this problem of intraclass variation, that this one notion of cat-ness, actually spans a lot of different visual appearances. And cats can come in different shapes and sizes and colors and ages. And our algorithm, again, needs to work and handle all these different variations. So, this is actually a really, really challenging problem. And it's sort of easy to forget how hard this is because so much of your brain is specifically tuned for dealing with these things. But now if we want our computer programs to deal with all of these problems, all simultaneously, and not just for cats, by the way, but for just about any object category you can imagine, this is a fantastically challenging problem. And it's, actually, somewhat miraculous that this works at all, in my opinion. But, actually, not only does it work, but these things work very close to human accuracy in some limited situations. And take only hundreds of milliseconds to do so. 
So, this is some pretty amazing, incredible technology, in my opinion, and over the course of the rest of the class we will really see what kinds of advancements have made this possible. So now, if you, kind of, think about what is the API for writing an image classifier, you might sit down and try to write a method in Python like this. Where you want to take in an image and then do some crazy magic and then, eventually, spit out this class label to say cat or dog or whatnot. And there's really no obvious way to do this, right? If you're taking an algorithms class and your task is to sort numbers or compute a convex hull or, even, do something like RSA encryption, you, sort of, can write down an algorithm and enumerate all the steps that need to happen in order for these things to work. But, when we're trying to recognize objects, or recognize cats in images, there's no really clear, explicit algorithm that makes intuitive sense, for how you might go about recognizing these objects. So, this is, again, quite challenging, if you think about, if it was your first day programming and you had to sit down and write this function, I think most people would be in trouble. That being said, people have definitely made explicit attempts to try to write, sort of, hand-coded rules for recognizing different animals. So, we touched on this a little bit in the last lecture, but maybe one idea for cats is that, we know that cats have ears and eyes and mouths and noses. And we know that edges, from Hubel and Wiesel, we know that edges are pretty important when it comes to visual recognition. So one thing we might try to do is compute the edges of this image and then go in and try to categorize all the different corners and boundaries, and say that, if we have maybe three lines meeting this way, then it might be a corner, and an ear has one corner here and one corner there and one corner there, and then, kind of, write down this explicit set of rules for recognizing cats. 
But this turns out not to work very well. One, it's super brittle. And, two, say, if you want to start over for another object category, and maybe not worry about cats, but talk about trucks or dogs or fishes or something else, then you need to start all over again. So, this is really not a very scalable approach. We want to come up with some algorithm, or some method, for these recognition tasks which scales much more naturally to all the variety of objects in the world. So, the insight that, sort of, makes this all work is this idea of the data-driven approach. Rather than sitting down and writing these hand-specified rules to try to craft exactly what is a cat or a fish or what have you, instead, we'll go out onto the internet and collect a large dataset of many, many cats and many, many airplanes and many, many deer and different things like this. And we can actually use tools like Google Image Search, or something like that, to go out and collect a very large number of examples of these different categories. By the way, this actually takes quite a lot of effort to go out and actually collect these datasets but, luckily, there's a lot of really good, high quality datasets out there already for you to use. Then once we get this dataset, we train this machine learning classifier that is going to ingest all of the data, summarize it in some way, and then spit out a model that summarizes the knowledge of how to recognize these different object categories. Then finally, we'll use this trained model and apply it to new images, and it will then be able to recognize cats and dogs and whatnot. So here our API has changed a little bit. Rather than a single function that just inputs an image and recognizes a cat, we have these two functions. One, called, train, that's going to input images and labels and then output a model, and then, separately, another function called, predict, which will input the model and then make predictions for images. 
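To make the two-function API a bit more concrete, here is a minimal sketch in Python. The function names train and predict come from the lecture, but everything inside them is an assumption for illustration: the "model" here just remembers the most common label, which is a placeholder and not a real classifier, and the tiny arrays are made up.

```python
import numpy as np

def train(images, labels):
    """Placeholder 'learning' step: summarize the data as the most common label.

    A real classifier would learn something from the images; this toy
    version ignores them and only exists to show the shape of the API.
    """
    values, counts = np.unique(labels, return_counts=True)
    model = {"most_common_label": values[np.argmax(counts)]}
    return model

def predict(model, test_images):
    """Use the model to output one predicted label per test image."""
    return np.full(len(test_images), model["most_common_label"])

# Made-up dataset: four 2x2 "images", labels 0 = cat, 1 = dog
images = np.zeros((4, 2, 2))
labels = np.array([0, 1, 1, 1])

model = train(images, labels)                 # step 1: images + labels -> model
preds = predict(model, np.zeros((3, 2, 2)))   # step 2: model + new images -> labels
print(preds)
```

The point of the split is that all the knowledge about the categories lives in the returned model, so swapping in a better learning algorithm changes the bodies of these two functions but not how they are called.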
And this is, kind of, the key insight that allowed all these things to start working really well over the last 10, 20 years or so. So, this class is primarily about neural networks and convolutional neural networks and deep learning and all that, but this idea of a data-driven approach is much more general than just deep learning. And I think it's useful to, sort of, step through this process for a very simple classifier first, before we get to these big, complex ones. So, probably, the simplest classifier you can imagine is something we call nearest neighbor. The algorithm is pretty dumb, honestly. So, during the training step we won't do anything, we'll just memorize all of the training data. So this is very simple. And now, during the prediction step, we're going to take some new image and go and try to find the most similar image in the training data to that new image, and now predict the label of that most similar image. A very simple algorithm. But it, sort of, has a lot of these nice properties with respect to data-drivenness and whatnot. So, to be a little bit more concrete, you might imagine working on this dataset called CIFAR-10, which is very commonly used in machine learning, as kind of a small test case. And you'll be working with this dataset on your homework. So, the CIFAR-10 dataset gives you 10 different classes, airplanes and automobiles and birds and cats and different things like that. And across those 10 categories it provides 50,000 training images, evenly distributed across these 10 categories. And then 10,000 additional testing images that you're supposed to test your algorithm on. So here's an example of applying this simple nearest neighbor classifier to some of these test images on CIFAR-10. So, on this grid on the right, the leftmost column gives a test image in the CIFAR-10 dataset. And now on the right, we've sorted the training images and show the most similar training images to each of these test examples. 
And you can see that they look kind of visually similar to the test images, although they are not always correct, right? So, maybe on the second row, we see that the testing, this is kind of hard to see, because these images are 32 by 32 pixels, you need to really dive in there and try to make your best guess. But, this image is a dog and its nearest neighbor is also a dog, but this next one, I think is actually a deer or a horse or something else. But, you can see that it looks quite visually similar, because there's kind of a white blob in the middle and whatnot. So, if we're applying the nearest neighbor algorithm to this image, we'll find the closest example in the training set. And now, the closest example, we know its label, because it comes from the training set. And now, we'll simply say that this testing image is also a dog. You can see from these examples that this is probably not going to work very well, but it's still kind of a nice example to work through. But then, one detail that we need to know is, given a pair of images, how can we actually compare them? Because, if we're going to take our test image and compare it to all the training images, we actually have many different choices for exactly what that comparison function should look like. So, in the example in the previous slide, we've used what's called the L1 distance, also sometimes called the Manhattan distance. So, this is a really sort of simple, easy idea for comparing images. And that's that we're going to just compare individual pixels in these images. So, supposing that our test image is maybe just a tiny four by four image of pixel values, then we'll take this upper-left hand pixel of the test image, subtract off the value in the training image, take the absolute value, and get the difference in that pixel between the two images. And then, sum all these up across all the pixels in the image. 
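As a rough sketch of the two ideas just described (memorize the training data, then predict the label of the closest training image under the L1 distance), something like the following NumPy code would work. The class and method names, the 4x4 pixel values, and the integer labels are all illustrative assumptions, not the code or data shown in the lecture.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute pixel-wise differences between two images."""
    # Cast to a signed type first so that uint8 images do not wrap around.
    return int(np.sum(np.abs(a.astype(np.int64) - b.astype(np.int64))))

class NearestNeighbor:
    def train(self, X, y):
        """X is N x D (one flattened image per row), y holds the N labels."""
        # "Training" is just memorizing the data.
        self.X_train = X.astype(np.int64)
        self.y_train = y

    def predict(self, X):
        """Return the label of the closest training row for each row of X."""
        X = X.astype(np.int64)
        y_pred = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            # Vectorized L1 distance from test row i to every training row.
            distances = np.sum(np.abs(self.X_train - X[i]), axis=1)
            y_pred[i] = self.y_train[np.argmin(distances)]
        return y_pred

# Illustrative 4x4 single-channel images (made-up pixel values).
test_image = np.array([[56, 32, 10, 18],
                       [90, 23, 128, 133],
                       [24, 26, 178, 200],
                       [2, 0, 255, 220]])
train_image = np.array([[10, 20, 24, 17],
                        [8, 10, 89, 100],
                        [12, 16, 178, 170],
                        [4, 32, 233, 112]])
blank_image = np.zeros((4, 4), dtype=np.int64)

print(l1_distance(test_image, train_image))  # 456 for these values

nn = NearestNeighbor()
nn.train(np.stack([train_image.ravel(), blank_image.ravel()]),
         np.array([0, 1]))  # hypothetical labels: 0 = cat, 1 = other
pred = nn.predict(test_image.ravel()[None, :])
print(pred)  # [0], since train_image is closer than the blank image
```

Note that all of the work happens in predict: each test image is compared against every stored training image.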
So, this is kind of a stupid way to compare images, but it does some reasonable things sometimes. But, this gives us a very concrete way to measure the difference between two images. And in this case, we have this difference of 456 between these two images. So, here's some full Python code for implementing this nearest neighbor classifier and you can see it's pretty short and pretty concise because we've made use of many of these vectorized operations offered by NumPy. So, here we can see that this training function, that we talked about earlier, is, again, very simple, in the case of nearest neighbor, you just memorize the training data, there's not really much to do here. And now, at test time, we're going to take in our image and then go in and compare using this L1 distance function, our test image to each of these training examples and find the most similar example in the training set. And you can see that, we're actually able to do this in just one or two lines of Python code by utilizing these vectorized operations in NumPy. So, this is something that you'll get practice with on the first assignment. So now, a couple questions about this simple classifier. First, if we have N examples in our training set, then how fast can we expect training and testing to be? Well, training is probably constant because we don't really need to do anything, we just need to memorize the data. And if you're just copying a pointer, that's going to be constant time no matter how big your dataset is. But now, at test time we need to do this comparison step and compare our test image to each of the N training examples in the dataset. And this is actually quite slow. So, this is actually somewhat backwards, if you think about it. Because, in practice, we want our classifiers to be slow at training time and then fast at testing time. 
Because, you might imagine, that a classifier might go and be trained in a data center somewhere and you can afford to spend a lot of computation at training time to make the classifier really good. But then, when you go and deploy the classifier at test time, you want it to run on your mobile phone or in a browser or some other low power device, and you really want the testing time performance of your classifier to be quite fast. So, from this perspective, this nearest neighbor algorithm, is, actually, a little bit backwards. And we'll see that once we move to convolutional neural networks, and other types of parametric models, they'll be the reverse of this. Where you'll spend a lot of compute at training time, but then they'll be quite fast at testing time. So then, the question is, what exactly does this nearest neighbor algorithm look like when you apply it in practice? So, here we've drawn, what we call the decision regions of a nearest neighbor classifier. So, here our training set consists of these points in the two dimensional plane, where the color of the point represents the category, or the class label, of that point. So, here we see we have five classes and some blue ones up in the corner here, some purple ones in the upper-right hand corner. And now for each pixel in this entire plane, we've gone and computed what is the nearest example in these training data, and then colored the point of the background corresponding to what is the class label. So, you can see that this nearest neighbor classifier is just sort of carving up the space and coloring the space according to the nearby points. But this classifier is maybe not so great. And by looking at this picture we can start to see some of the problems that might come out with a nearest neighbor classifier. For one, this central region actually contains mostly green points, but one little yellow point in the middle. 
But because we're just looking at the nearest neighbor, this causes a little yellow island to appear in this middle of this green cluster. And that's, maybe, not so great. Maybe those points actually should have been green. And then, similarly we also see these, sort of, fingers, like the green region pushing into the blue region, again, due to the presence of one point, which may have been noisy or spurious. So, this kind of motivates a slight generalization of this algorithm called k-nearest neighbors. So rather than just looking for the single nearest neighbor, instead we'll do something a little bit fancier and find K of our nearest neighbors, according to our distance metric, and then take a vote among each of our neighbors. And then predict the majority vote among our neighbors. You can imagine slightly more complex ways of doing this. Maybe you'd vote weighted on the distance, or something like that, but the simplest thing that tends to work pretty well is just taking a majority vote. So here we've shown the exact same set of points using this K=1 nearest neighbor classifier, as well as K=3 and K=5 in the middle and on the right. And once we move to K=3, you can see that that spurious yellow point in the middle of the green cluster is no longer causing the points near that region to be classified as yellow. Now this entire green portion in the middle is all being classified as green. You can also see that these fingers of the red and blue regions are starting to get smoothed out due to this majority voting. And then, once we move to the K=5 case, then these decision boundaries between the blue and red regions have become quite smooth and quite nice. So, generally when you're using nearest neighbors classifiers, you almost always want to use some value of K, which is larger than one because this tends to smooth out your decision boundaries and lead to better results. Question? 
[student asking a question] Yes, so the question is, what is the deal with these white regions? The white regions are where there was no majority among the k-nearest neighbors. You could imagine maybe doing something slightly fancier and maybe taking a guess or randomly selecting among the majority winners, but for this simple example we're just coloring it white to indicate there was no majority at those points. Whenever we're thinking about computer vision I think it's really useful to kind of flip back and forth between several different viewpoints. One, is this idea of high dimensional points in the plane, and then the other is actually looking at concrete images. Because the pixels of the image actually allow us to think of these images as high dimensional vectors. And it's sort of useful to ping pong back and forth between these two different viewpoints. So then, sort of taking this k-nearest neighbor and going back to the images you can see that it's actually not very good. Here I've colored in red and green which images would actually be classified correctly or incorrectly according to their nearest neighbor. And you can see that it's really not very good. But maybe if we used a larger value of K then this would involve actually voting among maybe the top three or the top five or maybe even the whole row. And you could imagine that that would end up being a lot more robust to some of this noise that we see when retrieving neighbors in this way. So another choice we have when we're working with the k-nearest neighbor algorithm is determining exactly how we should be comparing our different points. For the examples shown so far we've talked about this L1 distance which takes the sum of the absolute differences between the pixels. But another common choice is the L2 or Euclidean distance where you take the square root of the sum of the squares and take this as your distance. 
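The majority vote and the choice between L1 and L2 can be sketched in a few lines of NumPy. This is only an illustration under some assumptions: the function name knn_predict, the metric argument, and the toy 2D points are made up here, and ties are broken toward the smallest label rather than being left "white" as in the picture.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, metric="l1"):
    """Predict the label of point x by majority vote among its k nearest neighbors."""
    diff = X_train - x
    if metric == "l1":
        dists = np.sum(np.abs(diff), axis=1)        # Manhattan distance
    elif metric == "l2":
        dists = np.sqrt(np.sum(diff ** 2, axis=1))  # Euclidean distance
    else:
        raise ValueError("metric must be 'l1' or 'l2'")
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = np.bincount(y_train[nearest])           # count labels among the neighbors
    return int(np.argmax(votes))                    # ties go to the smallest label

# Toy data: a cluster of label 0 with one stray label-1 point in the middle,
# plus two label-1 points far away.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                    [0.5, 0.5], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1])
query = np.array([0.6, 0.6])

print(knn_predict(X_train, y_train, query, k=1))               # 1: the stray point wins
print(knn_predict(X_train, y_train, query, k=3))               # 0: it is outvoted
print(knn_predict(X_train, y_train, query, k=3, metric="l2"))  # 0
```

With k=1 the single stray point decides the answer near it, which is the "island" effect from the decision region pictures; with k=3 the surrounding points outvote it.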
Choosing different distance metrics actually is a pretty interesting topic because different distance metrics make different assumptions about the underlying geometry or topology that you'd expect in the space. So, this L1 distance, underneath this, this is actually a circle according to the L1 distance and it forms this square shape thing around the origin. Where each of the points on this, on the square, is equidistant from the origin according to L1, whereas with the L2 or Euclidean distance then this circle is a familiar circle, it looks like what you'd expect. So one interesting thing to point out between these two metrics in particular, is that the L1 distance depends on your choice of coordinates system. So if you were to rotate the coordinate frame that would actually change the L1 distance between the points. Whereas changing the coordinate frame in the L2 distance doesn't matter, it's the same thing no matter what your coordinate frame is. Maybe if your input features, if the individual entries in your vector have some important meaning for your task, then maybe somehow L1 might be a more natural fit. But if it's just a generic vector in some space and you don't know which of the different elements, you don't know what they actually mean, then maybe L2 is slightly more natural. And another point here is that by using different distance metrics we can actually generalize the k-nearest neighbor classifier to many, many different types of data, not just vectors, not just images. So, for example, imagine you wanted to classify pieces of text, then the only thing you need to do to use k-nearest neighbors is to specify some distance function that can measure distances between maybe two paragraphs or two sentences or something like that. So, simply by specifying different distance metrics we can actually apply this algorithm very generally to basically any type of data. 
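The claim that L1 depends on the coordinate frame while L2 does not is easy to check numerically. Here is a small sketch, with the two points and the 45 degree rotation chosen purely for illustration.

```python
import numpy as np

def l1(a, b):
    return np.sum(np.abs(a - b))

def l2(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Rotate the coordinate frame by 45 degrees.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

p = np.array([1.0, 0.0])
q = np.array([0.0, 0.0])
p_rot, q_rot = R @ p, R @ q

print(l1(p, q), l1(p_rot, q_rot))  # 1.0 before, about 1.414 after rotation
print(l2(p, q), l2(p_rot, q_rot))  # 1.0 before, still 1.0 (up to rounding) after
```

The same pair of points is a distance of 1 apart under L2 in both frames, but under L1 the distance grows from 1 to the square root of 2 once the axes are rotated.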
Even though it's a kind of simple algorithm, in general, it's a very good thing to try first when you're looking at a new problem. So then, it's also kind of interesting to think about what is actually happening geometrically if we choose different distance metrics. So here we see the same set of points on the left using the L1, or Manhattan distance, and then, on the right, using the familiar L2, or Euclidean distance. And you can see that the shapes of these decision boundaries actually change quite a bit between the two metrics. So when you're looking at L1 these decision boundaries tend to follow the coordinate axes. And this is again because the L1 depends on our choice of coordinate system. Where the L2 sort of doesn't really care about the coordinate axis, it just puts the boundaries where they should fall naturally. My confession is that each of these examples that I've shown you is actually from this interactive web demo that I built, where you can go and play with this k-nearest neighbor classifier on your own. And this is really hard to work on a projector screen. So maybe we'll do that on your own time. So, let's just go back to here. Man, this is kind of embarrassing. Okay, that was way more trouble than it was worth. So, let's skip this, but I encourage you to go play with this in your browser. It's actually pretty fun and kind of nice to build intuition about how the decision boundary changes as you change the K and change your distance metric and all those sorts of things. Okay, so then the question is once you're actually trying to use this algorithm in practice, there's several choices you need to make. We talked about choosing different values of K. We talked about choosing different distance metrics. And the question becomes how do you actually make these choices for your problem and for your data? 
So, these choices, of things like K and the distance metric, we call hyperparameters, because they are not necessarily learned from the training data, instead these are choices about your algorithm that you make ahead of time and there's no way to learn them directly from the data. So, the question is how do you set these things in practice? And they turn out to be very problem-dependent. And the simple thing that most people do is simply try different values of hyperparameters for your data and for your problem, and figure out which one works best. There's a question? [student asking a question] So, the question is, where L1 distance might be preferable to using L2 distance? I think it's mainly problem-dependent, it's sort of difficult to say in which cases you think one might be better than the other. but I think that because L1 has this sort of coordinate dependency, it actually depends on the coordinate system of your data, if you know that you have a vector, and maybe the individual elements of the vector have meaning. Like maybe you're classifying employees for some reason and then the different elements of that vector correspond to different features or aspects of an employee. Like their salary or the number of years they've been working at the company or something like that. So I think when your individual elements actually have some meaning, is where I think maybe using L1 might make a little bit more sense. But in general, again, this is a hyperparameter and it really depends on your problem and your data so the best answer is just to try them both and see what works better. Even this idea of trying out different values of hyperparameters and seeing what works best, there are many different choices here. What exactly does it mean to try hyperparameters and see what works best? Well, the first idea you might think of is simply choosing the hyperparameters that give you the best accuracy or best performance on your training data. 
This is actually a really terrible idea. You should never do this. In the concrete case of the nearest neighbor classifier, for example, if we set K=1, we will always classify the training data perfectly. So if we use this strategy we'll always pick K=1, but, as we saw from the examples earlier, in practice it seems that setting K equals to larger values might cause us to misclassify some of the training data, but, in fact, lead to better performance on points that were not in the training data. And ultimately in machine learning we don't care about fitting the training data, we really care about how our classifier, or how our method, will perform on unseen data after training. So, this is a terrible idea, don't do this. So, another idea that you might think of, is maybe we'll take our full dataset and we'll split it into some training data and some test data. And now I'll try training my algorithm with different choices of hyperparameters on the training data and then I'll go and apply that trained classifier on the test data and now I will pick the set of hyperparameters that cause me to perform best on the test data. This seems like maybe a more reasonable strategy, but, in fact, this is also a terrible idea and you should never do this. Because, again, the point of machine learning systems is that we want to know how our algorithm will perform. So, the point of the test set is to give us some estimate of how our method will do on unseen data that's coming out from the wild. And if we use this strategy of training many different algorithms with different hyperparameters, and then, selecting the one which does the best on the test data, then, it's possible, that we may have just picked the right set of hyperparameters that caused our algorithm to work quite well on this testing set, but now our performance on this test set will no longer be representative of our performance of new, unseen data. 
So, again, you should not do this, this is a bad idea, you'll get in trouble if you do this. What is much more common, is to actually split your data into three different sets. You'll partition most of your data into a training set and then you'll create a validation set and a test set. And now what we typically do is go and train our algorithm with many different choices of hyperparameters on the training set, evaluate on the validation set, and now pick the set of hyperparameters which performs best on the validation set. And now, after you've done all your development, you've done all your debugging, after you've done everything, then you'd take that best performing classifier on the validation set and run it once on the test set. And now that's the number that goes into your paper, that's the number that goes into your report, that's the number that actually is telling you how your algorithm is doing on unseen data. And this is actually really, really important that you keep a very strict separation between the validation data and the test data. So, for example, when we're working on research papers, we typically only touch the test set at the very last minute. So, when I'm writing papers, I tend to only touch the test set for my problem in maybe the week before the deadline or so to really ensure that we're not being dishonest here and we're not reporting a number which is unfair. So, this is actually super important and you want to make sure to keep your test data quite under control. So another strategy for setting hyperparameters is called cross validation. And this is used a little bit more commonly for small data sets, not used so much in deep learning. 
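The train / validation / test procedure described above can be sketched roughly as follows. This is an illustrative sketch, not the course's assignment code: the synthetic two-blob dataset, the 200/50/50 split sizes, the candidate values of K, and the helper names are all assumptions.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    # Majority label among the k training points closest to query (squared L2).
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query))
    nearest = sorted(train, key=dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, points, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in points)
    return hits / len(points)

def sample(n, rng):
    # Hypothetical data: two overlapping Gaussian blobs, one per class.
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        center = 0.0 if label == 0 else 1.5
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data

rng = random.Random(0)
data = sample(300, rng)
rng.shuffle(data)  # random partition, as recommended in the lecture
train, val, test = data[:200], data[200:250], data[250:]

# Choose K using the validation set only.
candidate_ks = [1, 3, 5, 7, 9, 15]
val_scores = {k: accuracy(train, val, k) for k in candidate_ks}
best_k = max(val_scores, key=val_scores.get)

# Touch the test set exactly once, with the chosen K.
test_acc = accuracy(train, test, best_k)
print("validation accuracy per K:", val_scores)
print("chosen K:", best_k, "test accuracy:", test_acc)
```

The design point is that `test` appears in exactly one line, after `best_k` has been fixed, so the reported number is not influenced by the hyperparameter search.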
So here the idea is we're going to take our dataset, as usual, hold out some test set to use at the very end, and now, for the rest of the data, rather than splitting it into a single training and validation partition, instead, we can split our training data into many different folds. And now, in this way, we cycle through choosing which fold is going to be the validation set. So now, in this example, we're using five fold cross validation, so you would train your algorithm with one set of hyperparameters on the first four folds, evaluate the performance on fold five, and now go and retrain your algorithm on folds one, two, three, and five, evaluate on fold four, and cycle through all the different folds. And, when you do it this way, you get much higher confidence about which hyperparameters are going to perform more robustly. So this is kind of the gold standard to use, but, in practice in deep learning when we're training large models and training is very computationally expensive, this doesn't get used too much in practice. Question? [student asking a question] Yeah, so the question is, a little bit more concretely, what's the difference between the training and the validation set? So, if you think about the k-nearest neighbor classifier then the training set is this set of images with labels where we memorize the labels. And now, to classify an image, we're going to take the image and compare it to each element in the training data, and then transfer the label from the nearest training point. So now our algorithm will memorize everything in the training set, and now we'll take each element of the validation set and compare it to each element in the training data and then use this to determine what is the accuracy of our classifier when it's applied on the validation set. So this is the distinction between training and validation. 
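The fold-cycling procedure described above can be sketched like this. It is a minimal illustration under stated assumptions: the synthetic data, the choice of K values, and the helper names (`cross_validate` in particular) are hypothetical, and any leftover points that do not divide evenly into folds are simply dropped.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    # Majority label among the k training points closest to query (squared L2).
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query))
    nearest = sorted(train, key=dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, points, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in points)
    return hits / len(points)

def cross_validate(data, k, num_folds=5):
    # Split data into equal folds; each fold takes one turn as the validation set.
    fold_size = len(data) // num_folds
    folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(num_folds)]
    scores = []
    for i in range(num_folds):
        val_fold = folds[i]
        train_folds = [item for j, fold in enumerate(folds) if j != i for item in fold]
        scores.append(accuracy(train_folds, val_fold, k))
    return scores

# Hypothetical non-test data: two overlapping Gaussian blobs.
rng = random.Random(1)
data = []
for _ in range(100):
    label = rng.choice([0, 1])
    center = 0.0 if label == 0 else 1.5
    data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))

for k in (1, 5, 9):
    scores = cross_validate(data, k)
    mean = sum(scores) / len(scores)
    print(f"K={k}: fold accuracies={scores}, mean={mean:.3f}")
```

Each value of K yields five accuracies, one per validation fold, so you can look at both the mean and the spread before choosing.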
Where your algorithm is able to see the labels of the training set, but for the validation set, your algorithm doesn't have direct access to the labels. We only use the labels of the validation set to check how well our algorithm is doing. A question? [student asking a question] The question is, whether the test set, is it possible that the test set might not be representative of data out there in the wild? This definitely can be a problem in practice, the underlying statistical assumption here is that your data are all independently and identically distributed, so that all of your data points should be drawn from the same underlying probability distribution. Of course, in practice, this might not always be the case, and you definitely can run into cases where the test set might not be super representative of what you see in the wild. So this is kind of a problem that dataset creators and dataset curators need to think about. But when I'm creating datasets, for example, one thing I do, is I'll go and collect a whole bunch of data all at once, using the exact same methodology for collecting the data, and then afterwards you go and partition it randomly between train and test. One thing that can screw you up here is maybe if you're collecting data over time and you make the earlier data, that you collect first, be the training data, and the later data that you collect be the test data, then you actually might run into this shift that could cause problems. But as long as this partition is random among your entire set of data points, then that's how we try to alleviate this problem in practice. So then, once you've gone through this cross validation procedure, then you end up with graphs that look something like this. So here, on the X axis, we are showing the value of K for a k-nearest neighbor classifier on some problem, and now on the Y axis, we are showing what is the accuracy of our classifier on some dataset for different values of K. 
And you can see that, in this case, we've done five fold cross validation over the data, so, for each value of K we have five different examples of how well this algorithm is doing. And, actually, going back to the question about having some test sets that are better or worse for your algorithm, using K fold cross validation is maybe one way to help quantify that a little bit. And, in that, we can see the variance of how this algorithm performs on the different validation folds. And that gives you some sense of, not just what is the best, but, also, what is the distribution of that performance. So, whenever you're training machine learning models you end up making plots like this, where they show you what is your accuracy, or your performance as a function of your hyperparameters, and then you want to go and pick the model, or the set of hyperparameters, at the end of the day, that performs the best on the validation set. So, here we see that maybe about K=7 probably works about best for this problem. So, k-nearest neighbor classifiers on images are actually almost never used in practice, because of all of these problems that we've talked about. So, one problem is that it's very slow at test time, which is the reverse of what we want, which we talked about earlier. Another problem is that these things like Euclidean distance, or L1 distance, are really not a very good way to measure distances between images. These, sort of, vectorial distance functions do not correspond very well to perceptual similarity between images. How you perceive differences between images. So, in this example, we've constructed, there's this image on the left of a girl, and then three different distorted images on the right where we've blocked out her mouth, we've actually shifted down by a couple pixels, or tinted the entire image blue. 
And, actually, if you compute the Euclidean distance between the original and the boxed, the original and the shifted, and the original and the tinted, they all have the same L2 distance. Which is, maybe, not so good because it sort of gives you the sense that the L2 distance is really not doing a very good job at capturing these perceptual distances between images. Another, sort of, problem with the k-nearest neighbor classifier has to do with something we call the curse of dimensionality. So, if you recall back this viewpoint we had of the k-nearest neighbor classifier, it's sort of dropping paint around each of the training data points and using that to sort of partition the space. So that means that if we expect the k-nearest neighbor classifier to work well, we kind of need our training examples to cover the space quite densely. Otherwise our nearest neighbors could actually be quite far away and might not actually be very similar to our testing points. And the problem is, that actually densely covering the space, means that we need a number of training examples, which is exponential in the dimension of the problem. So this is very bad, exponential growth is always bad, basically, you're never going to get enough images to densely cover this space of pixels in this high dimensional space. So that's maybe another thing to keep in mind when you're thinking about using k-nearest neighbor. So, kind of the summary is that we're using k-nearest neighbor to introduce this idea of image classification. We have a training set of images and labels and then we use that to predict these labels on the test set. Question? [student asking a question] Oh, sorry, the question is, what was going on with this picture? What are the green and the blue dots? So here, we have some training samples which are represented by points, and the color of the dot maybe represents the category of the point, of this training sample. 
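To put rough numbers on the exponential growth mentioned above, here is a minimal sketch that counts how many samples a regular grid needs if you want a fixed number of samples along every axis. The figure of four samples per axis and the function name `samples_needed` are assumptions for illustration; the 32 by 32 by 3 image size is CIFAR-10's.

```python
def samples_needed(per_axis, dims):
    # A grid with per_axis samples along each of dims axes has per_axis ** dims points.
    return per_axis ** dims

counts = {d: samples_needed(4, d) for d in (1, 2, 3, 10)}
for d, n in counts.items():
    print(f"{d} dimension(s): {n} samples")

# Raw pixel space of a 32 x 32 RGB image.
pixel_dims = 32 * 32 * 3
digits = len(str(samples_needed(4, pixel_dims)))
print(f"{pixel_dims} dimensions: a number with {digits} decimal digits")
```

Even at only four samples per axis, the count for raw pixel space is far beyond any dataset that could ever be collected, which is the practical meaning of the curse of dimensionality for nearest neighbor methods.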
So, if we're in one dimension, then you maybe only need four training samples to densely cover the space, but if we move to two dimensions, then, we now need, four times four is 16 training examples to densely cover this space. And if we move to three, four, five, many more dimensions, the number of training examples that we need to densely cover the space, grows exponentially with the dimension. So, this is kind of giving you the sense, that maybe in two dimensions we might have this kind of funny curved shape, or you might have sort of arbitrary manifolds of labels in different dimensional spaces. Because the k-nearest neighbor algorithm doesn't really make any assumptions about these underlying manifolds, the only way it can perform properly is if it has quite a dense sample of training points to work with. So, this is kind of the overview of k-nearest neighbors and you'll get a chance to actually implement this and try it out on images in the first assignment. So, if there's any last minute questions about KNN, I'm going to move on to the next topic. Question? [student is asking a question] Sorry, say that again. [student is asking a question] Yeah, so the question is, why do these images have the same L2 distance? And the answer is that, I carefully constructed them to have the same L2 distance. [laughing] But it's just giving you the sense that the L2 distance is not a very good measure of similarity between images. And these images are actually all different from each other in quite disparate ways. If you're using KNN, then the only thing you have to measure distance between images, is this single distance metric. And this kind of gives you an example where that distance metric is actually not capturing the full description of distance or difference between images. So, in this case, I just sort of carefully constructed these translations and these offsets to match exactly. Question? 
[student asking a question] So, the question is, maybe this is actually good, because all of these things are actually having the same distance to the image. That's maybe true for this example, but I think you could also construct examples where maybe we have two original images and then by putting the boxes in the right places or tinting them, we could cause it to be nearer to pretty much anything that you want, right? Because in this example, we can kind of like do arbitrary shifting and tinting to kind of change these distances nearly arbitrarily without changing the perceptual nature of these images. So, I think that this can actually screw you up if you have many different original images. Question? [student is asking a question] The question is, whether or not it's common in real-world cases to go back and retrain on the entire dataset once you've found those best hyperparameters? So, people do sometimes do this in practice, but it's somewhat a matter of taste. If you're really rushing for that deadline and you've really got to get this model out the door then, if it takes a long time to retrain the model on the whole dataset, then maybe you won't do it. But if you have a little bit more time to spare and a little bit more compute to spare, and you want to squeeze out that maybe that extra 1% of performance, then that is a trick you can use. So we kind of saw that the k-nearest neighbor has a lot of the nice properties of machine learning algorithms, but in practice it's not so great, and really not used very much in images. So the next thing I'd like to talk about is linear classification. And linear classification is, again, quite a simple learning algorithm, but this will become super important and help us build up to whole neural networks and whole convolutional networks. So, one analogy people often talk about when working with neural networks is we think of them as being kind of like Lego blocks. 
That you can have different kinds of components of neural networks and you can stick these components together to build these large different towers of convolutional networks. One of the most basic building blocks that we'll see in different types of deep learning applications is this linear classifier. So, I think it's actually really important to have a good understanding of what's happening with linear classification. Because these will end up generalizing quite nicely to whole neural networks. So another example of kind of this modular nature of neural networks comes from some research in our own lab on image captioning, just as a little bit of a preview. So here the setup is that we want to input an image and then output a descriptive sentence describing the image. And the way this kind of works is that we have one convolutional neural network that's looking at the image, and a recurrent neural network that knows about language. And we can kind of just stick these two pieces together like Lego blocks and train the whole thing together and end up with a pretty cool system that can do some non-trivial things. And we'll work through the details of this model as we go forward in the class, but this just gives you the sense that, these deep neural networks are kind of like Legos and this linear classifier is kind of like the most basic building blocks of these giant networks. But that's a little bit too exciting for lecture two, so we have to go back to CIFAR-10 for the moment. [laughing] So, recall that CIFAR-10 has these 50,000 training examples, each image is 32 by 32 pixels and three color channels. In linear classification, we're going to take a bit of a different approach from k-nearest neighbor. So, the linear classifier is one of the simplest examples of what we call a parametric model. So now, our parametric model actually has two different components. 
It's going to take in this image, maybe, of a cat on the left, and this, that we usually write as X for our input data, and also a set of parameters, or weights, which is usually called W, also sometimes theta, depending on the literature. And now we're going to write down some function which takes in both the data, X, and the parameters, W, and this'll spit out now 10 numbers describing what are the scores corresponding to each of those 10 categories in CIFAR-10. With the interpretation that, like the larger score for cat, indicates a larger probability of that input X being cat. And now, a question? [student asking a question] Sorry, can you repeat that? [student asking a question] Oh, so the question is what is the three? The three, in this example, corresponds to the three color channels, red, green, and blue. Because we typically work on color images, that's nice information that you don't want to throw away. So, in the k-nearest neighbor setup there were no parameters, instead, we just kind of keep around the whole training data, the whole training set, and use that at test time. But now, in a parametric approach, we're going to summarize our knowledge of the training data and stick all that knowledge into these parameters, W. And now, at test time, we no longer need the actual training data, we can throw it away. We only need these parameters, W, at test time. So this allows our models to now be more efficient and actually run on maybe small devices like phones. So, kind of, the whole story in deep learning is coming up with the right structure for this function, F. You can imagine writing down different functional forms for how to combine weights and data in different complex ways, and these could correspond to different network architectures. But the simplest possible example of combining these two things is just, maybe, to multiply them. And this is a linear classifier. So here our F of X, W is just equal to the W times X. 
Probably the simplest equation you can imagine. So here, if you kind of unpack the dimensions of these things, we recall that our image was maybe 32 by 32 by 3 values. So then, we're going to take those values and then stretch them out into a long column vector that has 3,072 by one entries. And now we want to end up with 10 class scores. We want to end up with 10 numbers for this image giving us the scores for each of the 10 categories. Which means that now our matrix, W, needs to be ten by 3072. So that once we multiply these two things out then we'll end up with a single column vector 10 by one, giving us our 10 class scores. Also sometimes, you'll typically see this, we'll often add a bias term which will be a constant vector of 10 elements that does not interact with the training data, and instead just gives us some sort of data independent preferences for some classes over another. So you might imagine that if your dataset was unbalanced and had many more cats than dogs, for example, then the bias elements corresponding to cat would be higher than the other ones. So if you kind of think about pictorially what this function is doing, in this figure we have an example on the left of a simple image with just a two by two image, so it has four pixels total. So the way that the linear classifier works is that we take this two by two image, we stretch it out into a column vector with four elements, and now, in this example, we are just restricting to three classes, cat, dog, and ship, because you can't fit 10 on a slide, and now our weight matrix is going to be three by four, so we have four pixels and three classes. And now, again, we have a three element bias vector that gives us data independent bias terms for each category. Now we see that the cat score is going to be the inner product between the pixels of our image and this row in the weight matrix added together with this bias term. 
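The two by two example above can be sketched in a few lines. The pixel values, weights, and biases below are made-up numbers chosen to keep the arithmetic easy to follow, not the values from the lecture slide, and `linear_scores` is a hypothetical helper name.

```python
def linear_scores(W, x, b):
    # f(x, W) = W x + b: one inner product per class row, plus that class's bias.
    return [sum(w * p for w, p in zip(row, x)) + bias for row, bias in zip(W, b)]

classes = ["cat", "dog", "ship"]

# Hypothetical 2 x 2 grayscale image, stretched into a 4-element vector.
image = [[1, 2],
         [3, 4]]
x = [pixel for row in image for pixel in row]  # [1, 2, 3, 4]

# Hypothetical weights: 3 classes x 4 pixels, one row (template) per class.
W = [[1, 0, -1, 2],   # cat
     [0, 1, 1, 0],    # dog
     [2, -1, 0, 1]]   # ship
b = [1, 0, -2]        # one data independent bias per class

scores = linear_scores(W, x, b)
predicted = classes[scores.index(max(scores))]
print(dict(zip(classes, scores)), "->", predicted)
```

For CIFAR-10 the same function applies unchanged, with `x` holding 3,072 values, `W` holding 10 rows of 3,072 weights, and `b` holding 10 biases.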
So, when you look at it this way you can kind of understand linear classification as almost a template matching approach. Where each of the rows in this matrix correspond to some template of the image. And now the inner product or dot product between the row of the matrix and the column giving the pixels of the image, computing this dot product kind of gives us a similarity between this template for the class and the pixels of our image. And then bias just, again, gives you this data independent scaling offset to each of the classes. If we think about linear classification from this viewpoint of template matching we can actually take the rows of that weight matrix and unravel them back into images and actually visualize those templates as images. And this gives us some sense of what a linear classifier might actually be doing to try to understand our data. So, in this example, we've gone ahead and trained a linear classifier on our images. And now on the bottom we're visualizing what are those rows in that learned weight matrix corresponding to each of the 10 categories in CIFAR-10. And in this way we kind of get a sense for what's going on in these images. So, for example, in the left, on the bottom left, we see the template for the plane class, kind of consists of this like blue blob, this kind of blobby thing in the middle and maybe blue in the background, which gives you the sense that this linear classifier for plane is maybe looking for blue stuff and blobby stuff, and those features are going to cause the classifier to like planes more. Or if we look at this car example, we kind of see that there's a red blobby thing through the middle and a blue blobby thing at the top that maybe is kind of a blurry windshield. But this is a little bit weird, this doesn't really look like a car. No individual car actually looks like this. So the problem is that the linear classifier is only learning one template for each class. 
So if there's sort of variations in how that class might appear, it's trying to average out all those different variations, all those different appearances, and use just one single template to recognize each of those categories. We can also see this pretty explicitly in the horse classifier. So in the horse classifier we see green stuff on the bottom because horses are usually on grass. And then, if you look carefully, the horse actually seems to have maybe two heads, one head on each side. And I've never seen a horse with two heads. But the linear classifier is just doing the best that it can, because it's only allowed to learn one template per category. And as we move forward into neural networks and more complex models, we'll be able to achieve much better accuracy because they no longer have this restriction of just learning a single template per category. Another viewpoint of the linear classifier is to go back to this idea of images as points in high dimensional space. And you can imagine that each of our images is something like a point in this high dimensional space. And now the linear classifier is putting in these linear decision boundaries to try to draw linear separation between one category and the rest of the categories. So maybe up on the upper-left hand side we see these training examples of airplanes and throughout the process of training the linear classifier will go and try to draw this blue line to separate out with a single line the airplane class from all the rest of the classes. And it's actually kind of fun if you watch during the training process these lines will start out randomly and then go and snap into place to try to separate the data properly. But when you think about linear classification in this way, from this high dimensional point of view, you can start to see again what are some of the problems that might come up with linear classification. 
And it's not too hard to construct examples of datasets where a linear classifier will totally fail. So, one example, on the left here, is that, suppose we have a dataset of two categories, and these are all maybe somewhat artificial, but maybe our dataset has two categories, blue and red. And the blue category is any image where the number of pixels greater than zero is odd. And anything where the number of pixels greater than zero is even, we want to classify as the red category. So if you actually go and draw what these different decision regions look like in the plane, you can see that our blue class with an odd number of pixels is going to be these two quadrants in the plane, and even will be the opposite two quadrants. So now, there's no way that we can draw a single line to separate the blue from the red. So this would be an example where a linear classifier would really struggle. And this is maybe not such an artificial thing after all. Instead of counting pixels, maybe we're actually trying to count whether the number of animals or people in an image is odd or even. So this kind of a parity problem of separating odds from evens is something that linear classification really struggles with traditionally. Other situations where a linear classifier really struggles are multimodal situations. So here on the right, maybe our blue category has these three different islands of where the blue category lives, and then everything else is some other category. So, for something like horses, we saw on the previous example, is something where this actually might be happening in practice. Where there's maybe one island in the pixel space of horses looking to the left, and another island of horses looking to the right. And now there's no good way to draw a single linear boundary between these two isolated islands of data. 
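The quadrant-style failure case described above can be checked with a small brute-force experiment. This is a simplified sketch, not the lecture's dataset: it uses one hypothetical point per quadrant, with opposite quadrants sharing a label, and searches a coarse grid of lines for the best achievable accuracy.

```python
import itertools

# One hypothetical point per quadrant; opposite quadrants share a class.
points = [((1, 1), "blue"), ((-1, -1), "blue"),
          ((1, -1), "red"), ((-1, 1), "red")]

def accuracy(w1, w2, b):
    # Linear decision rule: predict blue when w1*x + w2*y + b > 0, otherwise red.
    hits = 0
    for (x, y), label in points:
        predicted = "blue" if w1 * x + w2 * y + b > 0 else "red"
        hits += predicted == label
    return hits / len(points)

# Coarse grid of candidate weights and biases from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
best = max(accuracy(w1, w2, b) for w1, w2, b in itertools.product(grid, repeat=3))
print("best accuracy of any line on the grid:", best)
```

No line gets all four points right: classifying both blue points correctly forces the bias to be positive, while classifying both red points correctly forces it to be non-positive, so the search tops out at three out of four.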
So anytime where you have multimodal data, like one class that can appear in different regions of space, is another place where linear classifiers might struggle. So there's kind of a lot of problems with linear classifiers, but it is a super simple algorithm, super nice and easy to interpret and easy to understand. So you'll actually be implementing these things on your first homework assignment. At this point, we kind of talked about what is the functional form corresponding to a linear classifier. And we've seen that this functional form of matrix vector multiply corresponds to this idea of template matching and learning a single template for each category in your data. And then once we have this trained matrix you can use it to actually go and get your scores for any new example. But what we have not told you is how do you actually go about choosing the right W for your dataset. We've just talked about what is the functional form and what is going on with this thing. So that's something we'll really focus on next time. And next lecture we'll talk about what are the strategies and algorithms for choosing the right W. And this will lead us to questions of loss functions and optimization and eventually ConvNets. So, that's a bit of the preview for next week. And that's all we have for today.
Info
Channel: Stanford University School of Engineering
Views: 646,726
Rating: 4.9270997 out of 5
Id: OoUX-nOEjG0
Length: 59min 31sec (3571 seconds)
Published: Fri Aug 11 2017