CS231n Winter 2016: Lecture 2: Data-driven approach, kNN, Linear Classification 1

Video Statistics and Information

Captions
And we're recording as well. Okay, great. Just to remind you: we are recording the classes, so if you're uncomfortable speaking in front of the camera, you're not in the picture, but your voice might end up on the recording. As you can also see, the screen is wider than it should be and I'm not sure how to fix it; luckily your visual cortex is very good, it's very invariant to stretching, so this is not a problem.

Okay, so we'll start out with some administrative things before we dive into the class. The first assignment will come out tonight or early tomorrow. It is due on January 20, so you have exactly two weeks. You will be writing a k-nearest neighbor classifier, a linear classifier, and a small two-layer neural network, and you'll be writing the entirety of the backpropagation algorithm for that two-layer network. We'll cover all of that material in the next two weeks. A warning, by the way: the assignments from last year are still around, and we're changing the assignments, so please do not complete a 2015 assignment; that's something to be aware of.

For your computation we'll be using Python and numpy, and we're also offering Terminal.com, which is basically virtual machines in the cloud that you can use if you don't have a very good laptop; I'll go into the details of that in a bit. I'd just like to point out that for the first assignment we assume you're relatively familiar with Python, and you'll be writing optimized numpy expressions where you manipulate matrices and vectors in very efficient forms. So if you see code like this and it doesn't mean anything to you, please have a look at our Python/numpy tutorial, which is up on the website as well; it's written by Justin and it's very good. Go through it and familiarize yourself with the notation, because you'll be writing a lot of code that looks like this, where we do these optimized operations so they're fast enough to run on a CPU.
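To give a feel for the style of vectorized numpy code the assignments expect, here is a made-up illustration (not taken from the assignment itself): computing all pairwise Euclidean distances between two sets of vectors without any explicit Python loops.

```python
import numpy as np

# Made-up example: all pairwise Euclidean distances between two sets of
# flattened "images", with no explicit Python loops.
X = np.random.randn(500, 3072)  # 500 vectors
Y = np.random.randn(100, 3072)  # 100 more

# (a - b)^2 = a^2 + b^2 - 2ab, evaluated for every pair at once via broadcasting
sq = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2 * X.dot(Y.T)
dists = np.sqrt(np.maximum(sq, 0))  # clamp tiny negatives from round-off
print(dists.shape)                  # (500, 100)
```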
Now, in terms of Terminal, what this amounts to is that we'll give you a link to the assignment; you'll go to a web page and you'll see something like this. This is a virtual machine in the cloud that has been set up with all the dependencies of the assignment already installed, and all the data is already there. You click on "launch a machine" and it brings you to something like this, running in your browser; it's basically a thin UI layer on top of an AWS machine. You have an IPython notebook and a little terminal, and you can look around; it's just like a machine in the cloud. They have CPU machines and also some GPU machines you can use. You normally have to pay for Terminal, but we'll be distributing credits to you: you email a specific TA (we'll decide which one in a bit) and ask for money, we'll send you money, and we keep track of how much we've sent to everyone, so you have to be responsible with the funds. So this is an option for you to use if you like. Any questions about this? [Student question.] You can read it if you like; it's not required for your assignment, but you can probably get that to run.

Okay, so I'm just going to dive into the lecture now. Today we'll be talking about image classification, and in particular we'll start on linear classifiers. When we talk about image classification, the basic task is this: we have some number of fixed categories, say dog, cat, truck, plane, and so on; we get to decide what these are. The task is to take an image, which is a giant grid of numbers, and transform it into one of these labels; we have to bin it into one of the categories. This is the image classification problem. We'll spend most of our time talking about this one specifically, but if you'd like to do any other task in computer vision, such as object detection, image captioning, segmentation, or whatever else, you'll find that once you know how image classification is done, everything else is just a tiny delta on top of it. So you'll be in a great position to do any of the other tasks; it's really good for conceptual understanding, and we'll work through it as a specific example to simplify things in the beginning.

Now, why is this problem hard? The problem is what we refer to as the semantic gap. This image here is a giant grid of numbers: the way images are represented in the computer, this is roughly a 300 by 100 by 3 pixel array, a three-dimensional array, where the 3 is for the three color channels, red, green, and blue. When you zoom in on a part of the image, it's basically a giant grid of numbers between 0 and 255. That's what we have to work with; these numbers indicate the amount of brightness in all three color channels at every single position in the image. The reason image classification is difficult is that we have to work with millions of numbers of that form, and when you have to classify things like cats, the complexity of the task becomes apparent. For example, the camera can be rotated around this cat; it can be zoomed in and out, rotated, shifted; the focal properties and intrinsics of the camera can be different. Think about what happens to the brightness values in this grid as you do all these transformations with the camera: all the patterns shift completely, and we have to be robust to all of this.

There are also many other challenges, for example challenges of illumination. Here we have a cat on a cat (we actually have two of them, but you can almost not see the other one): one cat is illuminated quite a bit and the other is not, but you can still recognize two cats. Think again about the brightness values at the level of the grid and what happens to them as you change the lighting, across all the possible lighting schemes we can have in the world; we have to be robust to all of that. There are issues of deformation: many classes come in lots of strange arrangements, and cats in particular come in very different poses. (By the way, these slides are quite dry when I create them, there's a lot of math and so on, so this is the only time I get to have fun; that's why I pile on the cat pictures.) We have to be robust to all these deformations; you can still recognize that there's a cat in all these images despite its arrangement. There are problems of occlusion: sometimes we might not see the full object, but you still recognize that's a cat behind a curtain, that's a cat behind a water bottle, and there's also a cat there inside a couch, even though you're seeing just tiny pieces of the class. There are problems of background clutter: things can blend into the environment, and we have to be robust to that. And there's also what we refer to as intra-class variation.
Cats, for example: there's a huge number of cat species, and they can look very different; we have to be robust to all of that. So I'd just like you to appreciate the complexity of the task. Any one of these variations considered independently is difficult, but when you consider the full cross-product of all of them, and the fact that our algorithms have to work across all of that, it's actually quite amazing that anything works at all. In fact, not only does it work, it works really, really well, almost at human accuracy: we can recognize thousands of categories like this, and we can do it in a few dozen milliseconds with current technology. That's what you'll learn about in this class.

So what does an image classifier look like? Basically we're taking this 3D array of pixel values and we'd like to produce a class label. What I'd like you to notice is that there's no obvious way of actually writing any of these classifiers. There's no simple algorithm: say you're taking an algorithms class early in your computer science curriculum and you're writing bubble sort, or writing something else for a particular task; you can intuit all the possible steps, enumerate them, list them, play with them, analyze them. But here there's no algorithm for detecting a cat under all these variations, or it's extremely difficult to think about how you'd write it up: what is the sequence of operations you would perform on an arbitrary image to detect a cat?

That's not to say people haven't tried. Especially in the early days of computer vision there were these explicit approaches, as I like to call them, where you think: okay, for a cat, maybe we'd like to look for little ear pieces, so we'll detect all the edges, trace out the edges, classify the different shapes of edges and their junctions, create libraries of these, and try to find their arrangements, and if we ever see anything like that we'll detect a cat; or if we see some particular texture at some particular frequencies we'll detect a cat. You can come up with some rules, but the problem is that once I tell you, okay, now I'd like you to recognize a boat, or a person, you have to go back to the drawing board: what makes a boat exactly, what's the arrangement of edges? It's a completely unscalable approach to classification.

So the approach we're adopting in this class, and the approach that works much better, is the data-driven approach, which we like to put in the framework of machine learning. Just to point out: in the early days they did not have the luxury of using data. At that point in time you had grayscale images at very low resolution, you had five images, and you were trying to recognize things; it's obviously not going to work. But with the availability of the internet there's a huge amount of data: I can search, for example, for "cat" on Google and I get lots of cats everywhere, and we know these are cats based on the surrounding text in their web pages, so that gives us lots of data. The way this now looks is that we have a training phase where you give me lots of training examples of cats and you tell me they're cats, and you give me lots of examples of every other category. I go away and I train a model (in the code it will literally be a class with a train method and a predict method), and I can then use that model to classify new test data.
When I'm given a new image, I can look at my training data and do something based on pattern matching and statistics. As a simple first example within this framework, consider the nearest neighbor classifier. The way the nearest neighbor classifier works is that we're given this giant training set, and at training time all we do is remember all of the training data: I have all the training data, I put it here, and I remember it. Now, when you give me a test image, we compare that test image to every single one of the images we saw in the training data, and we just transfer the label over; I just look through all the images.

As I go through this I'd like to be as concrete as possible, so we'll work with a specific case: the CIFAR-10 dataset. CIFAR-10 has ten labels (these are the labels), there are 50,000 training images that you have access to, and there's a test set of 10,000 images on which we'll evaluate how well the classifier works. These images are quite tiny; it's a little toy dataset of 32 by 32 thumbnail images. So the way a nearest neighbor classifier works is: we take all the training data that's given to us. Now, at test time, suppose we have these ten different examples; these are test images along the first column here. For every one of them independently we look up the nearest neighbors in the training set, the things most similar to that image. There you see a ranked list of the images in the training data that are most similar to every one of those ten test images. In the first row the test image is a truck, I think, and there are quite a few images that look similar to it. (We'll see how exactly we define similarity in a bit.) But you can see that the first retrieved result is in fact a horse, not a truck, and that's because of the arrangement of the blue sky, which was similar enough. So you can see this will probably not work very well.

Now, how do we define the distance metric; how do we actually do the comparison? There are several ways. One of the simplest might be the Manhattan distance, an L1 distance; I'll use the two terms interchangeably. What it does is simple: you have a test image you're interested in classifying and one single training image you want to compare it to. We compare all the pixel values element-wise: we form the absolute value differences, and then we just add it all up. So we're looking at every single pixel position, subtracting, seeing what the difference is at every spatial position, and adding it all up; that's our similarity. In the slide's example, these two images are 456 apart, and we would get a zero for two identical images.
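As a tiny illustration of that comparison, here's the same element-wise L1 computation on two made-up 2x2 "images" (arbitrary pixel values, not the slide's):

```python
import numpy as np

# Two tiny made-up 2x2 grayscale "images" (arbitrary pixel values)
I1 = np.array([[56, 32], [10, 18]])
I2 = np.array([[10, 20], [24, 17]])

# L1 / Manhattan distance: element-wise absolute differences, summed up
d = np.sum(np.abs(I1 - I2))
print(d)  # 46 + 12 + 14 + 1 = 73
```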
To show you code specifically: here is a full implementation of a nearest neighbor classifier in numpy and Python, where I've filled in the actual bodies of the two methods I talked about. At training time we're given the dataset X, the images, and y, which usually denotes the labels; all we do is assign them to instance attributes, so we just remember the data, and nothing else is done. At predict time we get a new test set of images X, and while I won't go through the full details, you can see there's a for loop over every single test image: independently, we get the distances to every single training image, and notice that that's only a single line of vectorized Python code. In a single line we're comparing that test image to every single training image in the database, computing the distance from the previous slide; it's vectorized code, so we didn't have to expand out all the for loops involved in computing that distance. Then we find the instance that is closest: we take the argmin to get the index of the training example with the lowest distance, and we predict for this image the label of whatever was nearest.

Here's a question for you: in terms of the nearest neighbor classifier, how does its speed depend on the training data size? What happens as I scale up the training data? It gets slower; in fact, it's linearly slower, because I have to compare to every single training example independently. And you'll notice, as we go through the class, that this is actually backwards, because what we really care about in most practical applications is the test-time performance of these classifiers: we want the classifier to be very efficient at test time. So there's a trade-off between how much compute we put into the train method and how much we put into the predict method. Nearest neighbor is instant at train time but expensive at test time, and as we'll soon see, convolutional networks flip this completely the other way around: we'll do a huge amount of compute at train time, but the test-time performance will be super efficient. In fact, it will be a constant amount of compute per test image, no matter whether you have a million, a billion, or a trillion training images; no matter how large your training dataset, we do a constant amount of compute to classify any single test example, and that's very nice practically speaking. I'd also like to point out that there are ways of speeding up nearest neighbor classifiers: there are approximate nearest neighbor methods, and FLANN is an example of a library that people often use in practice to speed up this matching process; but that's just a side note.

Okay, so let's go back to the design of the nearest neighbor classifier. We've defined this distance, and I arbitrarily chose to show you the Manhattan distance, which compares the absolute values of the differences. There are in fact many ways you can formulate a distance metric, many different choices of exactly how we do this. Another choice people like to use in practice is the Euclidean, or L2, distance, which instead sums up the squares of these differences between the images. This choice of exactly how we compute the distance is a discrete choice we have control over; it's something we call a hyperparameter. It's not really obvious how to set it; it's a hyperparameter that we'll have to decide how to set later on.
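Putting those pieces together, a minimal sketch of the whole classifier in the spirit of the slide code (assuming images arrive already flattened into rows; not the exact assignment code, and the L2 variant just mentioned is noted in a comment):

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # X: (N, D) training images, one flattened image per row; y: (N,) labels.
        # "Training" is just memorizing the data.
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        num_test = X.shape[0]
        y_pred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # one vectorized line: L1 distance to every training image
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            # L2 variant: np.sqrt(np.sum((self.Xtr - X[i, :]) ** 2, axis=1))
            min_index = np.argmin(distances)  # index of the closest example
            y_pred[i] = self.ytr[min_index]   # transfer its label
        return y_pred
```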
Another hyperparameter that I'll talk about in the context of the nearest neighbor classifier comes up when we generalize it to what we call the k-nearest neighbor classifier. In a k-nearest neighbor classifier, instead of retrieving the single nearest training example for every test image, we retrieve several nearest examples and have them take a majority vote over the classes to classify each test instance. So with a five-nearest-neighbor classifier we'd retrieve the five most similar images in the training data and do a majority vote over their labels.

Here's a simple two-dimensional dataset to illustrate the point: a three-class dataset in 2D. I'm drawing what we call the decision regions of a nearest neighbor classifier: we show the training data, and we color the entire 2D plane by the class this nearest neighbor classifier would assign to every single point. Suppose you had a test example somewhere here; this is just saying it would be classified as the blue class, based on its nearest neighbor. You can note, for example, that there's a green point inside the blue cluster, and it has its own little region of influence where it would classify a lot of test points around it as green, because if any point fell there, that green point would be the nearest neighbor. Now, when you move to higher values of k, such as a five-nearest-neighbor classifier, you find that the boundaries start to smooth out. It's a nice effect: even though that one green point sits as noise, an outlier in the blue cluster, it's not corrupting the predictions around it too much, because we always retrieve five nearest neighbors, and they get to overwhelm the green point. In practice you'll find that k-nearest-neighbor classifiers usually offer better performance at test time; but again, the choice of k is a hyperparameter, and I'll come back to that in a bit. Just to show you an example of what this would look like: here I'm retrieving the ten most similar examples, ranked by their distance, and I would do a majority vote over these training examples to classify each test example.

Okay, so let's do a bit of a quiz, just for fun. What is the accuracy of the nearest neighbor classifier on the training data, when we're using the Euclidean distance? In other words, suppose our test set is exactly the training data: how often would we get the correct answer? [Murmurs.] 100%, that's correct: we always find a training example exactly on top of the test example, with zero distance, and its label gets transferred over. Good. What if we use the Manhattan distance instead, which uses sums of absolute values of differences rather than sums of squares? It's a trick question: it would still be 100%, for the same reason. Okay, good, so we're paying attention. What is the accuracy of a k-nearest-neighbor classifier on the training data, say with k equal to five? Is it 100%? Not necessarily, right: the points around you could overwhelm you, even though your best match is the example itself.

Okay, good. So we've discussed two different hyperparameters here: we have the distance metric, which is a hyperparameter, and this k, which we're not sure how to set. Should it be one, two, three, ten?
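Whatever k we end up picking, the prediction rule itself is simple. A sketch of it for a single test image (a hypothetical helper, L1 distance assumed):

```python
import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x, k=5):
    """Sketch of k-NN for one flattened test image x: majority vote over
    the k closest training images (L1 distance assumed)."""
    distances = np.sum(np.abs(Xtr - x), axis=1)  # distance to every training image
    nearest = np.argsort(distances)[:k]          # indices of the k closest
    votes = Counter(ytr[nearest].tolist())       # count their labels
    return votes.most_common(1)[0][0]            # majority label wins
```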
We're not exactly sure how to set these; in fact, they're problem-dependent. You'll find that there is no consistently best choice for these hyperparameters: in some applications one k might work better than in others. So here's an idea: we basically have to try out lots of different hyperparameters. What I'll do is take my training data and my test data, and try out k equals one, two, three, four, five, twenty, a hundred; I'll try all the different distance metrics, and whatever works best, that's what I'll take. That will work very well, right? Why is that not a good idea?

Basically: the test data is your proxy for the generalization of your algorithm, and you should not touch the test data. In fact, you should forget you ever had test data: once you're given your dataset, always set the test data aside and pretend you don't have it. It tells you how well your algorithm generalizes to unseen data points, and this is important because you're developing your algorithm in the hope of eventually deploying it in some setting, and you'd like an understanding of exactly how well you should expect it to work in practice. You'll see, for example, that you can sometimes perform very well on training data but not generalize well to test data, when you're overfitting and so on. (By the way, CS229 is a requirement for this class, so you should be quite familiar with this; to a large extent this is a review for you.)

So use the test data very sparingly; forget you have it. Instead, what we do is separate our training data into what we call folds. Say we use five-fold validation: we use 20% of the training data as an imagined test set, the validation set; we train on only the remaining part, and we try out the choices of hyperparameters on the validation set. So I train on my four folds and try out all the different k's, all the different distance metrics, and whatever else (if you're using approximate nearest neighbor you have many other choices), and I see what works best on that validation data. If you're feeling uncomfortable because you have very few training data points, people also sometimes use cross-validation, where you iterate the choice of the validation fold across all the possibilities: I first use folds one through four for training and try things out on fold five, then I cycle the choice of validation fold across all five choices, look at what works best across all of them, and take whatever works best across all the scenarios. That's referred to as cross-validation.

In practice, say we're cross-validating over k for a k-nearest-neighbor classifier: we try different values of k, and this plot shows performance across the five choices of fold, so for every single k we have five data points. The y-axis is accuracy, so higher is better, and I'm plotting a line through the mean, with bars for the standard deviations. What we see is that performance across the validation folds goes up as k increases, but at some point it starts to decay. For this particular dataset it seems that k equal to seven is the best choice, so that's what I'd use.
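In code, that cross-validation loop might look roughly like this (a sketch; `knn_predict` is the hypothetical helper sketched earlier):

```python
import numpy as np

def cross_validate(Xtr, ytr, k_choices, num_folds=5):
    """Sketch of num_folds-fold cross-validation for choosing k."""
    X_folds = np.array_split(Xtr, num_folds)
    y_folds = np.array_split(ytr, num_folds)
    mean_acc = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # fold i is the validation set; the remaining folds train
            X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            preds = np.array([knn_predict(X_tr, y_tr, x, k) for x in X_folds[i]])
            accs.append(np.mean(preds == y_folds[i]))
        mean_acc[k] = np.mean(accs)
    return max(mean_acc, key=mean_acc.get)  # k with best mean validation accuracy
```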
I'd do this for all my hyperparameters, also for the distance metric and so on: I do my cross-validation, I find the best hyperparameters, I set them, I fix them, and then I evaluate a single time on the test set. Whatever number I get, that's what I report as the accuracy of a k-nearest-neighbor classifier on this dataset. That's what goes into a paper, that's what goes into your final report and so on; that's the final generalization result of what you've done. Okay, any questions about this?

[In response to a question about how k behaves on different data:] I'll be careful with that terminology, but basically it's about the statistics of the distribution of the data points in your data and label space, so it's hard to say in general. Roughly what's happening in this picture is that you get more smoothing with higher k, and it just depends on how clumpy your dataset is; that's really what it comes down to, how blobby it is. I know that's a very hand-wavy answer, but that's roughly it: different datasets have different clumpiness. [On skewed datasets:] That's a technical question I may not want to get into right now, but we'll probably address it later in the class. [On whether the best hyperparameters transfer between datasets:] No, not at all: your hyperparameters are just choices you're not sure how to set, and different datasets will require different choices, so you need to see what works best. In fact, when you try out different algorithms, because you're not sure what will work best on your data, the choice of algorithm is itself kind of like a hyperparameter: different approaches give you different generalization boundaries, they'll look different, and some datasets have different structure than others, so some things work better than others; you just have to try it out.

Okay, cool. I'd just like to point out that k-nearest neighbors on images is something basically no one uses; I'm going through it just to get you used to this approach, with training splits and so on. The reason it's never used is, first of all, that it's very inefficient, but second of all, distance metrics on images, which are very high-dimensional objects, act in very unnatural, unintuitive ways. Here's what I've done: I've taken an original image and changed it in three different ways, but all three modified images have exactly the same distance to the original in the L2, Euclidean sense. This one is slightly shifted to the left, basically cropped slightly, and its pixel values are completely different because the pixels don't match up exactly, which introduces errors everywhere and gives you some distance. This one is slightly darkened, so you get a small delta across all spatial locations. And this one is untouched everywhere except for a few patches that have been blanked out, so you get zero error everywhere except in those positions, even though it removes critical pieces of the image. The nearest neighbor classifier would not really be able to tell the difference between these settings, because it's based on distances that just don't work very well here. Very unintuitive things happen when you try to impose distances on very high-dimensional objects; that's partly why we don't use this.
So, in summary so far: we're looking at image classification as a specific case, and we'll move to other settings later in the class. I've introduced the nearest neighbor classifier, the idea of having different splits of your data, and these hyperparameters that we need to pick, for which we use cross-validation. In practice, most of the time people don't use full cross-validation; they just have a single validation set, and they try out the hyperparameters on the validation set to see what works best; once you have the best hyperparameters, you evaluate a single time on the test set.

Okay, so I'm going to go into linear classification now. Any questions at this point? Great. We're going to look at linear classification, and this is the point where we start working toward convolutional neural networks: over a series of lectures we'll start with linear classification and build up to an entire convolutional network analyzing an image. Now, I'd just like to say that we motivated the class yesterday from a task-specific view: this is a computer vision class, and we're interested in giving machines sight. Another way to motivate this class is from a model-based point of view: we're teaching you about deep learning and neural networks, and these are wonderful algorithms you can apply to many different data domains, not just vision. In particular, over the last few years we've seen that neural networks can not only see, which is what you'll learn a lot about in this class, but they can also hear: they're used quite a bit in speech recognition now, so when you talk to your phone, that's a neural network. They can also do machine translation: you feed a neural network a set of words one by one in English, and it produces the translation in French, or whatever other target language you have. They can also perform control: we've seen neural network applications in robot manipulation and in playing Atari games, where the networks learn to play just by seeing the raw pixels of the screen. We've seen neural networks be very successful in a variety of domains, even more than I've put here, and we're uncertain exactly where this will take us. I'd also like to say that we're exploring ways for neural networks to think, but this is very hand-wavy, just wishful thinking; there are some hints that maybe they can do that as well.

Now, neural networks are very nice because they're just fun, modular things to play with. When I think about working with neural networks, this picture comes to mind for me: here we have a neural networks practitioner, and she's building what looks to be roughly a ten-layer convolutional neural network at this point. These are very fun; really the best way to think about playing with neural networks is like Lego blocks. You'll see that we build these little function pieces, these Lego blocks, that we can stack together to create entire architectures, and they very easily talk to each other, so we can create these modules, stack them together, and play with this very easily. One piece of work that I think exemplifies this is my own work on image captioning from roughly a year ago. Here the task was: you take an image, and you try to get the neural network to produce a sentence description of that image. For example, in the top left (these are test-set results), the neural network would say this is a "man in black shirt is playing guitar", or a "construction worker in orange safety vest is working on road", and so on.
So the neural network can look at the image and create a description of every single image. When you go into the details of this model, the way it works is that there are two modules in this system diagram for the image captioning model: we take a convolutional neural network, which we know can see, and we take a recurrent neural network, which we know is very good at modeling sequences, in this case sequences of words that will describe the image. Then, just as if we were playing with Legos, we take those two pieces and stick them together; that corresponds to the arrow between the two modules. These networks learn to talk to each other, and in the process of trying to describe the images, gradients flow through the convolutional network, and the full system adjusts itself to see the images better in order to describe them at the end; the whole system works together as one. We'll be working toward this model, and we'll actually cover it in class, so you'll have a full understanding of exactly both this part and that part at roughly the halfway point of the course, and you'll see how the image captioning model works. That's just the motivation for what we're building up to; these are really nice models to work with.

Okay, but for now, back to CIFAR-10 and linear classification. Again, just to remind you, we're working with this dataset: 50,000 images, 10 labels. We're going to approach linear classification with what we call a parametric approach. The k-nearest neighbor classifier we just discussed is an instance of what we call a non-parametric approach: there are no parameters we're going to optimize over. This distinction will become clear in a few minutes. In the parametric approach, we think about constructing a function that takes an image and produces the scores for the classes. That's what we want to do: we want to take an image and figure out which of the ten classes it is, so we'd like to write down a function, an expression, that takes an image and gives you those ten numbers. But the expression is not only a function of that image; critically, it will also be a function of these parameters, which I'll call W, sometimes also called the weights. So it's really a function that goes from the 3072 numbers that make up the image to ten numbers. We'll go through several choices of this function: in the first case we'll look at linear functions, then we'll extend that to get neural networks, and then we'll extend that to get convolutional neural networks. Intuitively, what we're building up to is this: when we pipe this cat image through our function, among the ten numbers that correspond to the scores of the ten classes, we'd like the number corresponding to the cat class to be high and all the other numbers to be low. We don't have a choice over x; that x is our image, which is given. But we do have a choice over W: we're free to set it however we want, and we want to set it so that this function gives the correct answers for every single image in our training data. That's roughly the approach we're building toward. So suppose we use the simplest function, arguably: just a linear classifier.
Here x is our image. What I'm doing in this case is taking this array, the image that makes up the cat, and stretching out all the pixels of that image into a giant column vector; so x is a column vector of 3072 numbers. If you know your matrix-vector operations, which you should (it's a prerequisite for this class), this is just a matrix multiplication, which you should be familiar with. We're taking x, a 3072-dimensional column vector, and we're trying to get ten numbers out; since it's a linear function, you can go backwards and figure out that the dimensions of W must be 10 by 3072. So there are 30,720 numbers in W, and that's what we have control over; that's what we have to tweak to find what works best on our data. Those are the parameters. What I'm leaving out is that there's sometimes also an appended "+ b": a bias. These biases are ten more parameters that we also have to find, so usually in your linear classifier you have a W and a b, and we have to find exactly what works best. This b is not a function of the image; it's an independent set of weights expressing how likely each class is a priori. To go back to an earlier question: if you have a very unbalanced dataset, say mostly cats but some dogs or something like that, then you might expect the bias for the cat class to be slightly higher, because by default the classifier wants to predict the cat class unless something in the image convinces it otherwise.

To make this more concrete, I'd just like to break it down; of course I can't visualize it explicitly with 3072 numbers, so imagine our input image had only four pixels, stretched out into the column x, and imagine we have three classes: a red, green, and blue class, or rather a cat, dog, and ship class. In this case W will only be a 3 by 4 matrix, and we're computing the scores for this image x: there's a matrix multiplication going on here to give us the output of f, the three scores for the three different classes. This is a random setting of W, just random weights, and we carry out the matrix multiplication to get some scores. In particular you can see that this setting of W is not very good: with it, our cat score, roughly negative 96, is much lower than the scores of the other classes, so this training image was not correctly classified; it's not a very good classifier. We want to use a different W so that the cat score comes out higher than the other ones, but we have to do that consistently across the entire training set of examples.

One thing to notice here as well is that this function is evaluating all ten classifiers in parallel, but really there are ten independent classifiers, to some extent. Every one of these classifiers, say the cat classifier, is just a row of W: the first row (with the first bias) gives you the cat score, the dog classifier is the second row of W, and the ship classifier is the third row. So this W matrix has all these different classifiers stacked in its rows, and they're all being dot-producted with the image to give you the scores.
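As a concrete sketch of this score function at full CIFAR-10 scale (random, untrained weights purely for illustration):

```python
import numpy as np

# Sketch of the linear score function f(x, W) = Wx + b at CIFAR-10 scale.
num_classes, D = 10, 32 * 32 * 3            # ten classes, 3072-dim images

W = 0.01 * np.random.randn(num_classes, D)  # (10, 3072): one classifier per row
b = np.zeros(num_classes)                   # (10,): one bias per class

image = np.random.randint(0, 256, size=(32, 32, 3))  # stand-in for a CIFAR image
x = image.reshape(-1).astype(float)         # stretch into a 3072-dim column

scores = W.dot(x) + b                       # ten class scores
print(scores.shape)                         # (10,)
```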
So here's a question for you: what does a linear classifier do, in English? We saw the functional form; it's taking these images and doing this funny operation; how do we really interpret, in English, what this is doing? [One answer.] Okay, good: you're thinking about it in the spatial domain, where x is a high-dimensional data point and W is really putting planes through that high-dimensional space; I'll come back to that interpretation in a bit. What other ways can you think about this? [Another answer.] Okay, so you're thinking about it more in a template way, where every single row of W is effectively a template that we dot-product with the image; and a dot product is really a way of matching up, of seeing what aligns. What other ways? [Another.] Yes, what you're referring to is that W basically has the capacity to care or not care about different spatial positions in the image: if we have zero weights at some of the spatial positions of x, the classifier doesn't care what's in that part of the image, nothing affects it; but for other parts of the image you have positive or negative weights, so something there will contribute to the score. Any other ways of describing it? [Another.] Yes, you can also think of it as a mapping from image space to label space; that's a nice interpretation too.

[On how the image is stretched into a column:] Good question. This image is a three-dimensional array where we have all these channels; you just stretch it out in whatever way you like. Say you stack the red, green, and blue portions side by side: that's one way. You can stretch it out any way you like, but in a consistent way across all the images: you fix a way to serialize, to read off the pixels, and you stretch them out into a column. [On the four-pixel figure:] Good point; let's say we have a four-pixel grayscale image, because this is a terrible example otherwise, you're right, thank you. I didn't want to confuse people, especially because someone pointed out to me after I made this figure that red, green, and blue usually mean the color channels, but here red, green, and blue correspond to classes. So this is a complete screw-up on my part, and I apologize: these are not color channels, nothing to do with color; they're just three differently colored classes. Sorry about that.

[On images of different sizes:] Good question: what if my images have different sizes, some small, some large; how do we make them all a single-sized column vector? The answer is that we always resize images to be the same size. We can't easily deal with differently sized images (or we can, and we might go into that later), but the simplest thing is to resize every single image to the exact same size, because we want to ensure all of them are comparable, so that we can build these columns and analyze statistical patterns that are aligned in image space. In fact, state-of-the-art methods always work on square images, so if you have a very long image these methods will actually work worse; what many of them do is just squash it, and that still works fairly well.
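On the stretching question above: any fixed serialization works, as long as every image uses the same one. A made-up illustration of two valid choices:

```python
import numpy as np

img = np.random.randint(0, 256, size=(32, 32, 3))  # stand-in image

# one serialization: all red pixels, then all green, then all blue
x = np.concatenate([img[:, :, c].ravel() for c in range(3)])

# another: numpy's default row-major flatten (channels interleaved per pixel)
x_alt = img.reshape(-1)

# either is fine, as long as every image in the dataset uses the same one
assert x.shape == x_alt.shape == (3072,)
```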
So if you have very long panorama images and you try to put them somewhere on some online service, chances are things will work worse, because when the image is put through a convnet they'll probably make it square; these convnets always work on squares. You can make them work on other shapes, but that's just what usually happens in practice. Any other questions? [A student offers another interpretation.] Okay, great. Another way to put it, one that I didn't hear but which is also a relatively nice way of looking at it, is that every single score is just a weighted sum of all the pixel values in the image. We get to choose those weights eventually, but it's just a giant weighted sum; really, all it's doing is counting up colors at different spatial positions.

One way that was brought up to interpret this W, to make it concrete, is that it's a bit like template matching. Here's what I've done: I trained a classifier (I haven't shown you how to do that yet, but I trained my weight matrix W, and I'll come back to that in a second), and I'm taking out every single one of the rows we've learned, every single classifier, and reshaping it back into an image so I can visualize it. So I take what was originally a giant row of 3072 numbers and reshape it back into an image, undoing the distortion I'd done, and then I have all these templates. For example, what you see for "plane" is a blue blob. The reason you see a blue blob is that if you looked at the color channels of this plane template, you'd see lots of positive weights in the blue channel: those positive weights, when they see blue values, interact with them and contribute to the score. So this plane classifier is really just counting up the amount of blue stuff in the image across all these spatial locations; and if you looked at the red and green channels of the plane classifier, you might find zero or even negative values. Then we have classifiers for all the other classes: for the frog, you can almost see the template of a frog there, some green stuff with positive weights in the middle, and some brownish blobs on the sides, so if that gets placed over an image and dot-producted, you get a high score.

One thing to note: look at the car classifier; that's not a very nice template of a car. And the horse looks a bit weird too. What's up with that? Why do they look strange? Basically, that's what's in the data: some of the horses face left and some face right, and this classifier is really not a very powerful classifier, so it has to combine the two modes, to do both things at the same time, and you end up with this two-headed horse. You can in fact guess just from this result that there are probably more left-facing horses in CIFAR-10, because that side is stronger. Likewise for the car: a car can be tilted 45 degrees left or right, or face the front, and this classifier is the optimal way of merging all those modes into a single template, because that's what we're forcing it to do.
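The template visualization just described could be produced with something like the following sketch (assuming a trained W of shape (10, 3072) and a list of class names; both are assumptions here, not given code):

```python
import numpy as np
import matplotlib.pyplot as plt

def show_templates(W, class_names):
    """Sketch: render each learned row of a (10, 3072) weight matrix W
    as a 32x32x3 'template' image."""
    for i, name in enumerate(class_names):
        tmpl = W[i].reshape(32, 32, 3)
        # rescale the weights into [0, 255] so they display as an image
        tmpl = 255.0 * (tmpl - tmpl.min()) / (tmpl.max() - tmpl.min())
        plt.subplot(1, len(class_names), i + 1)
        plt.imshow(tmpl.astype('uint8'))
        plt.axis('off')
        plt.title(name)
    plt.show()
```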
Once we actually move to convnets and neural networks, they won't have this downside: in principle they can have a template for this car and that car and combine across them; we're giving them more power to carry out this classification properly. But for now we're constrained by this.

[Question about jittered copies of training images:] Yes, what you're referring to is something we call data augmentation. At training time we don't take just the exact images; we jitter them, stretch them, skew them, and pipe all of that in. That's going to be a huge part of getting convnets to work very well: for every single training example we hallucinate many additional training examples through shifts, rotations, and skews, and that works much better. [Question: could you set the templates by taking the average image of each class?] I see, so you want to explicitly set the template by taking the average across all the images of a class. The linear classifier would do something vaguely similar, but I'd guess it would work worse: when you look at the mathematical form of what the linear classifier optimizes, I don't think it has a minimum at the mean of the images. But that would be an intuitively decent heuristic, perhaps for initializing the weights. [A follow-up:] Yes, there's something related to that; LDA is probably what you're referring to; there are several such methods. [Question about colors, e.g. a yellowish car:] Good point: cars come in many different colors, and here the template happened to pick up on red, which suggests there are probably more red cars in the dataset. It might not work for yellow; in fact, yellow cars might score as "frog" for this classifier. This thing just does not have the capacity to do all of that, which is why it's not powerful enough: it can't capture all the different modes correctly, so it will just go after the numbers, and if there are more red cars, that's where it will go. If this were grayscale, I'm not sure that would work better; I'll come back to that in a bit. [Question: would the bias be higher for an overrepresented class?] As I mentioned, for unbalanced datasets, if you have lots of cats, you might expect the cat bias to be higher, but we'd have to go into the loss function to see exactly how that plays out, so it's hard to say right now.

Okay, another interpretation of the linear classifier, which someone pointed out earlier and which I'd like to expand on: you can think of these images as very high-dimensional points in a 3072-dimensional pixel space. Every image is a point, and these linear classifiers describe gradients of score across that space: each score goes from negative to positive along some linear direction. For example, for the car classifier, I take the first row of W, which is the car class, and the line here indicates the zero level set of the classifier; in other words, along that line the car classifier has zero score. The arrow indicates the direction along which the car score grows more and more positive.
Similarly, the three different classifiers in this example all correspond to such score gradients over the space. You have all these points in this space, and we initialize the linear classifiers randomly, so this car classifier would start with its level set at random; then, when we actually do the optimization, you'll see it start to shift and turn as it tries to isolate the car class. It's really fun to watch these classifiers train: it will rotate, snap in toward the car class, start to jiggle, and try to separate out all the cars from all the non-cars. It's really amusing to watch. So that's another way of interpreting it.

Okay, so here's a question for you: given all these interpretations, what would be a test set on which you'd expect a linear classifier to really not work well? [Answer: concentric circles.] I see, so in this spatial interpretation, the images of one class would form a blob and the other class would sit around it. I'm not sure exactly what that would look like if you visualized it in pixel space, but you're right that a linear classifier would not be able to separate those. But what about in terms of what the images themselves look like, such that you could clearly say a linear classifier will probably not do well? [An answer about averages and least squares.] Right: if you're doing something like ordinary least squares, you're essentially maximizing projections of all the x's onto your row space, so training images of scooters and training images of motorcycles would effectively yield the same centroid. That's a pretty good one. [Another answer: negative images.] I see, so you're pointing out that if I took an image of the airplane class and a trained classifier, and then gave the classifier the negative of that image, you'd still see the edges and you'd say, that's obviously an airplane by its shape; but for the linear classifier all the colors would be exactly wrong, so it would hate that airplane. Good example. [Another: the same image translated or scaled into a different place for each class.] I see, so you're saying one class is, say, dogs in the center and another class is dogs on the right, but otherwise on a white background or something like that. Would that be a problem? It actually wouldn't be a problem; why wouldn't it be?
Because a linear classifier can put weight on something in the center for one class and something on the right for the other; it doesn't need an understanding of spatial layout for that. It would actually be fine, relatively easy even, because one classifier would have positive weights in the middle and the other would have positive weights on the right. [A clarification from the student.] Oh, sorry, maybe I misunderstood you. Okay, another one? [Another answer.] Possibly, yes. I think many of you are getting the main point, which is (skipping ahead here) that what this really does is count up colors at spatial positions; anything that messes with that will be really hard. Actually, to go back to an earlier point: a grayscale dataset would not work very well with linear classifiers. If you took CIFAR-10 and made it all grayscale, doing the exact same classification on the grayscale images would probably work really terribly, because you can't pick up on the colors; you'd have to pick up on textures and fine details, and you just can't localize them, because they can be in arbitrary positions and you can't consistently count across them. That would be kind of a disaster. Another example would be different textures: if, say, all your classes were blue but had different types of textures, and those textures could be anywhere spatially, that would be terrible for linear classifiers.

Okay, good. So, just to remind you where we are: we've defined this linear function, and with a specific setting of W we're looking at some example images and getting some scores out. Looking ahead to where we're headed: with this setting of W, for this image, we see that the cat score is 2.9, but some classes got a higher score, like dog, which is not very good; on the other hand, some classes got negative scores, which is good for this image; so this is a kind of medium result for these weights on this image. Here we see that the car class, which is the correct one, has the highest score, which is good: this setting of W worked well on this image. And here the frog class got a very low score, so W works terribly on that image. Where we're headed now is to define what we call a loss function, and this loss function will quantify this intuition of what we consider good or bad. Right now we're just eyeballing these numbers and saying what's good and what's bad; we have to actually write down a mathematical expression that tells us exactly how bad this setting of W is across our training set: 12.5 bad, or 1.0 bad. Once we have it defined precisely, we're going to look for W's that minimize the loss. It will be set up in such a way that if the loss is very low, say even zero, then you're correctly classifying all your images, but if the loss is very high, everything is messed up and W is not good at all. So we're going to define a loss function, a way to quantify how bad each W is on our dataset, and then look for different W's that do very well across all of it. The loss function is a function of your entire training set and your weights: we don't have control over the training set, but we do have control over the weights.
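To make the setup concrete, here's a minimal sketch of the pieces so far, with a made-up placeholder loss; the actual loss functions are the subject of the next lecture, so this stand-in is purely illustrative:

```python
import numpy as np

def f(x, W, b):
    """The linear score function from above: flattened image in, scores out."""
    return W.dot(x) + b

def placeholder_loss(scores, y):
    """A made-up stand-in for a real loss function: zero when the correct
    class y scores highest, growing as the best wrong class beats it.
    (The actual losses covered next lecture will differ.)"""
    best_wrong = np.max(np.delete(scores, y))
    return max(0.0, best_wrong - scores[y])

# usage on a fake image whose "correct" label we assume is 3
W, b = 0.01 * np.random.randn(10, 3072), np.zeros(10)
x = np.random.randn(3072)
print(placeholder_loss(f(x, W, b), y=3))
```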
Then we'll look at the process of optimization: how do we efficiently find the set of weights W that works well across all the images and gives us a very low loss? And then, eventually, we'll go back to this expression, the linear classifier we saw, and start meddling with the function f: we'll extend f to not be that simple linear expression; we'll make it slightly more complex and get a neural network out, and then slightly more complex again and get a convolutional network out. Otherwise the entire framework will stay unchanged: we'll always be computing scores, going from an image to some set of scores through some function whose form changes and becomes more elaborate over time; we'll define some loss function, and we'll look for the weights, the parameters, that give us a very low loss. That's the setup we'll be working with going forward. Next class we'll look into loss functions, and then we'll go on to neural networks and convnets. I guess this is my last slide, so I can take any last questions.

[Question: why do we use an iterative approach in the optimization?] In these optimization settings, the way this will work is that we always start off with a random W, which gives us some loss, and we don't have a process for finding the best set of weights right away. What we do have a process for is iteratively improving the weights slightly: as we'll see when we look at the loss function, we find a gradient in the parameter space and march down. So what we know how to do is slightly improve a set of weights; we don't know how to solve the problem of just finding the best set of weights right away, especially when these functions are very complex, like entire convnets: it's a huge landscape, and it's just a very intractable problem. Is that your question? Okay, thank you.

[Question: how do we deal with the color problem, e.g. the red car template?] So we saw that the linear classifier for "car" was this red template. With a neural network, you can look at it as stacking linear classifiers to some degree: what it will end up doing is having all these little templates for red cars, yellow cars, green cars, cars going this way or that way; there will be a neuron assigned to detecting every one of those different modes, and then they'll be combined on the second layer. So you'll have these neurons looking for different types of cars, and the next neuron effectively says: I just take a weighted sum of you guys; I'm doing an OR operation over you. And then we can detect cars in all of their modes and all of their positions, if that makes sense. That's roughly how it will work. Okay, awesome.
Info
Channel: Andrej Karpathy
Views: 133,625
Rating: 4.9604354 out of 5
Keywords: convolutional, neural, networks, visual, recognition, image, classification, linear, stanford, class
Id: 8inugqHkfvE
Length: 57min 28sec (3448 seconds)
Published: Wed Jan 06 2016