Okay, so welcome to lecture two of CS231N. On Tuesday, just to recall,
we, sort of, gave you the big picture view of
what is computer vision, what is the history, and a little bit of the
overview of the class. And today, we're really going
to dive in, for the first time, into the details. And we'll start to see,
in much more depth, exactly how some of
these learning algorithms actually work in practice. So, the first lecture of the class is probably, sort of, the
largest big picture vision. And the majority of the
lectures in this class will be much more detail oriented, much more focused on
the specific mechanics, of these different algorithms. So, today we'll see our
first learning algorithm and that'll be really exciting, I think. But, before we get to that, I wanted to talk about a couple
of administrative issues. One is Piazza. So, when I checked yesterday, it seemed like we had maybe 500 students signed up on Piazza. Which means that there
are several hundred of you who are not yet there. So, we really want Piazza
to be the main source of communication between the
students and the core staff. So, we've gotten a lot of
questions to the staff list about project ideas or questions
about midterm attendance or poster session attendance. And, any, sort of, questions like that should really go to Piazza. You'll probably get answers
to your questions faster on Piazza, because all the
TAs know to check that. And it's, sort of, easy
for emails to get lost in the shuffle if you just
send to the course list. It's also come to my attention
that some SCPD students are having a bit of a hard
time signing up for Piazza. SCPD students are supposed to receive an @stanford.edu email address. So, once you get that email address, then you can use the Stanford
email to sign into Piazza. Probably that doesn't
affect those of you who are sitting in the room right now, but it matters for those students listening on SCPD. The next administrative issue
is about assignment one. Assignment one will be up later today, probably sometime this afternoon, but I promise, before
I go to sleep tonight, it'll be up. But, if you're getting a little bit antsy and really want to start
working on it right now, then you can look at last year's version of assignment one. It'll be pretty much the same content. We're just reshuffling it
a little bit, for example, upgrading it
to work with Python 3, rather than Python 2.7, along with some other minor cosmetic changes, but the content of the
assignment will still be the same as last year. So, in this assignment you'll
be implementing your own k-nearest neighbor classifier, which we're going to talk
about in this lecture. You'll also implement several
different linear classifiers, including the SVM and Softmax, as well as a simple
two-layer neural network. And we'll cover all this content over the next couple of lectures. So, all of our assignments
are using Python and NumPy. If you aren't familiar
with Python or NumPy, then we have written a
tutorial that you can find on the course website to
try and get you up to speed. But, this is, actually, pretty important. NumPy lets you write these
very efficient vectorized operations that let you do
quite a lot of computation in just a couple lines of code. So efficiently implementing these vectorized operations is super important for pretty much all aspects of numerical
computing and machine learning and everything like that.
And you'll get a lot of practice with this on the first assignment. So, for those of you who
don't have a lot of experience with Matlab or NumPy or
other types of vectorized tensor computation, I recommend
that you start looking at this assignment pretty early and also, read carefully
through the tutorial. The other thing I wanted to talk about is that we're happy to announce that we're officially supported
through Google Cloud for this class. So, Google Cloud is somewhat
similar to Amazon AWS. You can go and start virtual
machines up in the cloud. These virtual machines can have GPUs. We're working on the tutorial
for exactly how to use Google Cloud and get it to
work for the assignments. But our intention is that
you'll be able to just download some image, and it'll be very seamless for you to work on the assignments on one of these instances on the cloud. And because Google has, very generously, supported this course, we'll be able to distribute to each of you coupons that let you use
Google Cloud credits for free for the class. So you can feel free to use
these for the assignments and also for the course projects when you want to start using
GPUs and larger machines and whatnot. So, we'll post more details about that, probably, on Piazza later today. But, I just wanted to mention, because I know there had
been a couple of questions about, can I use my laptop? Do I have to run on corn? Do I have to, whatever? And the answer is that,
you'll be able to run on Google Cloud and we'll provide
you some coupons for that. Yeah, so, those are, kind of, the
major administrative issues I wanted to talk about today. And then, let's dive into the content. So, the last lecture
we talked a little bit about this task of image classification, which is really a core
task in computer vision. And this is something
that we'll really focus on throughout the course of the class: exactly how do we work on this
image classification task? So, a little bit more concretely, when you're doing image classification, your system receives some input image, which is this cute cat in this example, and the system is aware
of some predetermined set of categories or labels. So, these might be, like,
a dog or a cat or a truck or a plane, and there's some
fixed set of category labels, and the job of the computer
is to look at the picture and assign it one of these
fixed category labels. This seems like a really easy problem, because so much of your own
visual system in your brain is hardwired to do these, sort of, visual recognition tasks. But this is actually a
really, really hard problem for a machine. So, if you dig in and
think about, actually, what does a computer see
when it looks at this image, it definitely doesn't get
this holistic idea of a cat that you see when you look at it. And the computer really
is representing the image as this gigantic grid of numbers. So, the image might be something
like 800 by 600 pixels. And each pixel is
represented by three numbers, giving the red, green, and
blue values for that pixel. So, to the computer, this is just a gigantic grid of numbers. And it's very difficult
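To make the grid of numbers concrete, here is a minimal NumPy sketch of what the computer is holding. The 800 by 600 size comes from the example above; the random pixel values and the variable name `image` are placeholders invented for illustration, not data from the lecture.

```python
import numpy as np

# Stand-in for an 800x600 RGB image: height x width x 3 color channels.
# Random values are used purely for illustration.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)

print(image.shape)   # (600, 800, 3)
print(image.size)    # 1440000 individual numbers
print(image[0, 0])   # red, green, blue values of the top-left pixel
```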
to distill the cat-ness out of this, like, giant array
of thousands, or whatever, very many different numbers. So, we refer to this
problem as the semantic gap. This idea of a cat, or
this label of a cat, is a semantic label that
we're assigning to this image, and there's this huge gap between
the semantic idea of a cat and these pixel values that the
computer is actually seeing. And this is a really hard problem because you can change the picture
in very small, subtle ways that will cause this pixel
grid to change entirely. So, for example, if we took this same cat, and if the cat happened to sit still and not even twitch, not move a muscle, which is never going to happen, but we moved the camera to the other side, then every single grid,
every single pixel, in this giant grid of numbers would be completely different. But, somehow, it's still
representing the same cat. And our algorithms need
to be robust to this. But viewpoint is only one problem; another is illumination. There can be different
lighting conditions going on in the scene. Whether the cat is appearing
in this very dark, moody scene, or in this very bright,
sunlit scene, it's still a cat, and our algorithms need
to be robust to that. Objects can also deform. I think cats are, maybe,
among the more deformable of animals that you might see out there. And cats can really assume a
lot of different, varied poses and positions. And our algorithms should
be robust to these different kinds of transforms. There can also be problems of occlusion, where you might only see part
of a cat, like, just the face, or in this extreme example,
just a tail peeking out from under the couch cushion. But, in these cases, it's pretty
easy for you, as a person, to realize that this is probably a cat, and you still recognize
these images as cats. And this is something that our algorithms also must be robust to, which is quite difficult, I think. There can also be problems
of background clutter, where maybe the foreground
object of the cat, could actually look quite
similar in appearance to the background. And this is another thing
that we need to handle. There's also this problem
of intraclass variation, that this one notion of
cat-ness, actually spans a lot of different visual appearances. And cats can come in
different shapes and sizes and colors and ages. And our algorithm, again, needs to work and handle all these different variations. So, this is actually a really,
really challenging problem. And it's sort of easy to
forget how hard this is because so much of your
brain is specifically tuned for dealing with these things. But now if we want our computer programs to deal with all of these
problems, all simultaneously, and not just for cats, by the way, but for just about any object
category you can imagine, this is a fantastically
challenging problem. And it's, actually, somewhat miraculous that this works at all, in my opinion. But, actually, not only does it work, but these things work very
close to human accuracy in some limited situations. And take only hundreds
of milliseconds to do so. So, this is some pretty
amazing, incredible technology, in my opinion, and over the
course of the rest of the class we will really see what
kinds of advancements have made this possible. So now, if you, kind of, think about what is the API for writing
an image classifier, you might sit down and try
to write a method in Python like this. Where you want to take in an image and then do some crazy magic and then, eventually,
spit out this class label to say cat or dog or whatnot. And there's really no obvious
way to do this, right? If you're taking an algorithms class and your task is to sort numbers or compute a convex hull or, even, do something
like RSA encryption, you, sort of, can write down an algorithm and enumerate all the
steps that need to happen in order for these things to work. But, when we're trying
to recognize objects, or recognize cats in images, there's no really clear,
explicit algorithm that makes intuitive sense, for how you might go about
recognizing these objects. So, this is, again, quite challenging, if you think about, if it was your first day programming and you had to sit down
and write this function, I think most people would be in trouble. That being said, people have definitely
made explicit attempts to try to write, sort
of, hand-coded rules for recognizing different animals. So, we touched on this a
little bit in the last lecture, but maybe one idea for cats is that, we know that cats have ears
and eyes and mouths and noses. And, from Hubel and Wiesel,
we know that edges are pretty important when it comes to visual recognition. So one thing we might try to do is compute the edges of this image and then go in and try to
categorize all the different corners and boundaries, and
say that, if we have maybe three lines meeting this way,
then it might be a corner, and an ear has one corner
here and one corner there and one corner there, and then, kind of, write down
this explicit set of rules for recognizing cats. But this turns out not to work very well. One, it's super brittle. And, two, say, if you want
to start over for another object category, and maybe
not worry about cats, but talk about trucks or dogs
or fishes or something else, then you need to start all over again. So, this is really not a
very scalable approach. We want to come up with some
algorithm, or some method, for these recognition tasks which scales much more
naturally to all the variety of objects in the world. So, the insight that, sort
of, makes this all work is this idea of the data-driven approach. Rather than sitting down and
writing these hand-specified rules to try to craft exactly
what is a cat or a fish or what have you, instead, we'll go out onto the internet and collect a large
dataset of many, many cats and many, many airplanes
and many, many deer and different things like this. And we can actually use tools
like Google Image Search, or something like that, to go out and collect a very
large number of examples of these different categories. By the way, this actually
takes quite a lot of effort to go out and actually
collect these datasets but, luckily, there's a lot
of really good, high quality datasets out there already for you to use. Then once we get this dataset, we train this machine learning classifier that is going to ingest all of the data, summarize it in some way, and then spit out a model that summarizes the
knowledge of how to recognize these different object categories. Then finally, we'll
use this trained model and apply it to new images, so that it will then be able to recognize cats and dogs and whatnot. So here our API has changed a little bit. Rather than a single function that just inputs an image
and recognizes a cat, we have these two functions. One, called, train, that's
going to input images and labels and then output a model, and then, separately, another
function called, predict, which will input the model
and then make predictions for images. And this is, kind of, the key insight that allowed all these things
to start working really well over the last 10, 20 years or so. So, this class is primarily
about neural networks and convolutional neural networks and deep learning and all that, but this idea of a data-driven
approach is much more general than just deep learning. And I think it's useful to, sort of, step through this process for a very simple classifier first, before we get to these big, complex ones. So, probably, the simplest
classifier you can imagine is something we call nearest neighbor. The algorithm is pretty dumb, honestly. So, during the training
step we won't do anything, we'll just memorize all
of the training data. So this is very simple. And now, during the prediction step, we're going to take some new image and go and try to find
the most similar image in the training data to that new image, and now predict the label
of that most similar image. A very simple algorithm. But it, sort of, has a lot
of these nice properties with respect to
data-drivenness and whatnot. So, to be a little bit more concrete, you might imagine working on
this dataset called CIFAR-10, which is very commonly
used in machine learning, as kind of a small test case. And you'll be working with
this dataset on your homework. So, the CIFAR-10 dataset gives
you 10 different classes, airplanes and automobiles and
birds and cats and different things like that. And it provides 50,000 training images in total, roughly evenly distributed
across these 10 categories. And then 10,000 additional testing images that you're supposed to
test your algorithm on. So here's an example
of applying this simple nearest neighbor classifier
to some of these test images on CIFAR-10. So, on this grid on the right, the leftmost column gives a test image in
the CIFAR-10 dataset. And now on the right, we've
sorted the training images and show the most similar training images to each of these test examples. And you can see that they
look kind of visually similar to the test images, although they are not
always correct, right? So, maybe on the second row,
we see that the testing, this is kind of hard to see, because these images are 32 by 32 pixels, you need to really dive in there and try to make your best guess. But, this image is a dog and
its nearest neighbor is also a dog, but this next one,
I think is actually a deer or a horse or something else. But, you can see that it
looks quite visually similar, because there's kind of a
white blob in the middle and whatnot. So, if we're applying the
nearest neighbor algorithm to this image, we'll find the closest
example in the training set. And now, the closest
example, we know its label, because it comes from the training set. And now, we'll simply say that
this testing image is also a dog. You can see from these
examples that this is probably not going to work very well, but it's still kind of a
nice example to work through. But then, one detail
that we need to know is, given a pair of images, how can we actually compare them? Because, if we're going to take
our test image and compare it to all the training images, we actually have many different choices for exactly what that comparison
function should look like. So, in the example in the previous slide, we've used what's called the L1 distance, also sometimes called
the Manhattan distance. So, this is a really
sort of simple, easy idea for comparing images. And that's that we're going to
just compare individual pixels in these images. So, supposing that our test
image is maybe just a tiny four by four image of pixel values, then we'll take this upper-left hand pixel of the test image, subtract off the value
in the training image, take the absolute value, and get the difference in that
pixel between the two images. And then, sum all these
up across all the pixels in the image. So, this is kind of a stupid
way to compare images, but it does some reasonable
things sometimes. But, this gives us a very concrete way to measure the difference
between two images. And in this case, we have
this difference of 456 between these two images. So, here's some full Python code for implementing this
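The code from the slide is not part of this transcript, so what follows is only a minimal sketch of such a classifier, assuming each image has been flattened into one row of a NumPy array. The class name `NearestNeighbor`, the cast to floating point, and the toy data are assumptions made for illustration.

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        """X is N x D (one flattened image per row); y holds the N labels."""
        # Training just memorizes the data. Casting to float avoids
        # wraparound when subtracting unsigned 8-bit pixel values.
        self.Xtr = np.asarray(X, dtype=np.float64)
        self.ytr = np.asarray(y)

    def predict(self, X):
        """X is M x D; returns the predicted label for each of the M rows."""
        X = np.asarray(X, dtype=np.float64)
        y_pred = np.empty(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            # L1 distance from test row i to every training row at once.
            distances = np.sum(np.abs(self.Xtr - X[i]), axis=1)
            # Copy the label of the closest training example.
            y_pred[i] = self.ytr[np.argmin(distances)]
        return y_pred

# Toy data: two training "images" with four pixels each.
Xtr = np.array([[0, 0, 0, 0], [10, 10, 10, 10]])
ytr = np.array([0, 1])

nn = NearestNeighbor()
nn.train(Xtr, ytr)
predictions = nn.predict(np.array([[1, 2, 0, 1], [9, 8, 10, 12]]))
print(predictions)  # [0 1]
```

The single line that computes `distances` relies on NumPy broadcasting: the test row is subtracted from every training row in one vectorized operation, so the per-image cost still grows with the number of training examples.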
nearest neighbor classifier and you can see it's pretty
short and pretty concise because we've made use of
many of these vectorized operations offered by NumPy. So, here we can see that
this training function, that we talked about earlier, is, again, very simple, in
the case of nearest neighbor, you just memorize the training data, there's not really much to do here. And now, at test time, we're
going to take in our image and then go in and compare
using this L1 distance function, our test image to each of
these training examples and find the most similar
example in the training set. And you can see that, we're
actually able to do this in just one or two lines of Python code by utilizing these vectorized
operations in NumPy. So, this is something that
you'll get practice with on the first assignment. So now, a couple questions
about this simple classifier. First, if we have N examples
in our training set, then how fast can we expect
training and testing to be? Well, training is probably constant because we don't really
need to do anything, we just need to memorize the data. And if you're just copying a pointer, that's going to be constant time no matter how big your dataset is. But now, at test time we need
to do this comparison step and compare our test image to each of the N training
examples in the dataset. And this is actually quite slow. So, this is actually somewhat backwards, if you think about it. Because, in practice, we want our classifiers to
be slow at training time and then fast at testing time. Because, you might imagine,
that a classifier might go and be trained in a data center somewhere and you can afford to
spend a lot of computation at training time to make
the classifier really good. But then, when you go and deploy the
classifier at test time, you want it to run on your mobile phone or in a browser or some
other low power device, and you really want the
testing time performance of your classifier to be quite fast. So, from this perspective, this
nearest neighbor algorithm, is, actually, a little bit backwards. And we'll see that once we move to convolutional neural networks, and other types of parametric models, they'll be the reverse of this. Where you'll spend a lot of
compute at training time, but then they'll be quite
fast at testing time. So then, the question is, what exactly does this
nearest neighbor algorithm look like when you apply it in practice? So, here we've drawn, what
we call the decision regions of a nearest neighbor classifier. So, here our training set
consists of these points in the two dimensional plane, where the color of the point
represents the category, or the class label, of that point. So, here we see we have five classes and some blue ones up in the corner here, some purple ones in the
upper-right hand corner. And now for each pixel
in this entire plane, we've gone and computed
what is the nearest example in these training data, and then colored the
point of the background corresponding to what is the class label. So, you can see that this
nearest neighbor classifier is just sort of carving up the space and coloring the space
according to the nearby points. But this classifier is maybe not so great. And by looking at this picture we can start to see some of the
problems that might come out with a nearest neighbor classifier. For one, this central
region actually contains mostly green points, but one little yellow point in the middle. But because we're just looking
at the nearest neighbor, this causes a little
yellow island to appear in the middle of this green cluster. And that's, maybe, not so great. Maybe those points actually
should have been green. And then, similarly we also
see these, sort of, fingers, like the green region
pushing into the blue region, again, due to the presence of one point, which may have been noisy or spurious. So, this kind of motivates
a slight generalization of this algorithm called
k-nearest neighbors. So rather than just looking for
the single nearest neighbor, instead we'll do something
a little bit fancier and find K of our nearest neighbors, according to our distance metric, and then take a vote among
each of our neighbors. And then predict the majority vote among our neighbors. You can imagine slightly more
complex ways of doing this. Maybe you'd vote weighted on the distance, or something like that, but the simplest thing that
tends to work pretty well is just taking a majority vote. So here we've shown the
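The majority vote can be sketched in a few lines of NumPy. This is an illustrative sketch, not the assignment's implementation: the function name `knn_predict`, the use of L1 distance, the assumption that labels are non-negative integers, and the rule that ties go to the smallest label are all choices made here for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of one point x by majority vote among its k nearest
    training points under the L1 distance. Labels are assumed to be
    non-negative integers."""
    distances = np.sum(np.abs(X_train - x), axis=1)
    nearest = np.argsort(distances)[:k]      # indices of the k closest points
    votes = np.bincount(y_train[nearest])    # count the votes per label
    return int(np.argmax(votes))             # ties go to the smallest label

# One noisy label-1 point sitting inside a cluster of label-0 points.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [5.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1])
query = np.array([0.5, 0.5])

print(knn_predict(X_train, y_train, query, k=1))  # 1: the noisy point wins
print(knn_predict(X_train, y_train, query, k=3))  # 0: the vote smooths it out
```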
exact same set of points using this K=1 nearest
neighbor classifier, as well as K=3 and K=5 in
the middle and on the right. And once we move to K=3, you
can see that that spurious yellow point in the middle
of the green cluster is no longer causing the
points near that region to be classified as yellow. Now this entire green
portion in the middle is all being classified as green. You can also see that these fingers of the red and blue regions are starting to get smoothed out due to this majority voting. And then, once we move to the K=5 case, then these decision boundaries between the blue and red regions have become quite smooth and quite nice. So, generally when you're
using nearest neighbor classifiers, you almost always want
to use some value of K, which is larger than one because this tends to
smooth out your decision boundaries and lead to better results. Question? [student asking a question] Yes, so the question is, what is the deal with these white regions? The white regions are
where there was no majority among the k-nearest neighbors. You could imagine maybe doing
something slightly fancier and maybe taking a guess
or randomly selecting among the majority winners, but for this simple example
we're just coloring it white to indicate there was no majority among the nearest neighbors at those points. Whenever we're thinking
about computer vision I think it's really useful to kind of flip back and forth between
several different viewpoints. One, is this idea of high
dimensional points in the plane, and then the other is actually
looking at concrete images. Because the pixels of the image actually allow us to think of these
images as high dimensional vectors. And it's sort of useful to
ping pong back and forth between these two different viewpoints. So then, sort of taking
this k-nearest neighbor and going back to the images you can see that it's
actually not very good. Here I've colored in red and green which images would actually
be classified correctly or incorrectly according
to their nearest neighbor. And you can see that it's
really not very good. But maybe if we used a larger value of K then this would involve
actually voting among maybe the top three or the top five or maybe even the whole row. And you could imagine that
that would end up being a lot more robust to some
of this noise that we see when retrieving neighbors in this way. So another choice we
have when we're working with the k-nearest neighbor algorithm is determining exactly
how we should be comparing our different points. For the examples so far, we've talked about this L1 distance, which takes the sum of the absolute values of the differences between the pixels. But another common choice is
the L2 or Euclidean distance where you take the square
root of the sum of the squared differences and take this as your distance. Choosing different
distance metrics actually is a pretty interesting topic because different distance metrics make different assumptions
about the underlying geometry or topology that
you'd expect in the space. So, this L1 distance, underneath
this, this is actually a circle according to the L1 distance and it forms this square shape thing around the origin. Where each of the points
on this, on the square, is equidistant from the
origin according to L1, whereas with the L2 or Euclidean distance then this circle is a familiar circle, it looks like what you'd expect. So one interesting thing to
point out between these two metrics in particular, is that the L1 distance
depends on your choice of coordinates system. So if you were to rotate
the coordinate frame that would actually change the L1 distance between the points. Whereas changing the coordinate
frame in the L2 distance doesn't matter, it's the
same thing no matter what your coordinate frame is. Maybe if your input features,
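A small numerical check of this coordinate dependence, as a sketch: the two points, the 45-degree rotation, and the helper names `l1` and `l2` are arbitrary choices made for illustration.

```python
import numpy as np

def l1(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return np.sum(np.abs(a - b))

def l2(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return np.sqrt(np.sum((a - b) ** 2))

# Two points, and a 45-degree rotation of the coordinate frame.
p = np.array([0.0, 0.0])
q = np.array([1.0, 0.0])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(l1(p, q), l1(R @ p, R @ q))  # 1.0 before, about 1.414 after: L1 changes
print(l2(p, q), l2(R @ p, R @ q))  # 1.0 before and after, up to rounding
```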
if the individual entries in your vector have some important meaning for your task, then maybe somehow L1 might
be a more natural fit. But if it's just a generic
vector in some space and you don't know which
of the different elements, you don't know what they actually mean, then maybe L2 is slightly more natural. And another point here is that by using different distance metrics we can actually generalize
the k-nearest neighbor classifier to many, many
different types of data, not just vectors, not just images. So, for example, imagine you
wanted to classify pieces of text, then the only
thing you need to do to use k-nearest neighbors is to specify some distance function that can measure distances
between maybe two paragraphs or two sentences or something like that. So, simply by specifying
different distance metrics we can actually apply this
algorithm very generally to basically any type of data. Even though it's a kind
of simple algorithm, in general, it's a very
good thing to try first when you're looking at a new problem. So then, it's also kind of
interesting to think about what is actually happening geometrically if we choose different distance metrics. So here we see the same
set of points on the left using the L1, or Manhattan distance, and then, on the right,
using the familiar L2, or Euclidean distance. And you can see that the
shapes of these decision boundaries actually change quite a bit between the two metrics. So when you're looking at
L1 these decision boundaries tend to follow the coordinate axes. And this is again because
the L1 depends on our choice of coordinate system. Whereas the L2 sort of doesn't
really care about the coordinate axis, it
just puts the boundaries where they should fall naturally. My confession is that
each of these examples that I've shown you is
actually from this interactive web demo that I built, where you can go and play
with this k-nearest neighbor classifier on your own. And this is really hard to
work on a projector screen. So maybe you'll do that on your own time. So, let's just go back to here. Man, this is kind of embarrassing. Okay, that was way more
trouble than it was worth. So, let's skip this, but I encourage you to go play with this in your browser. It's actually pretty fun and kind of nice to build intuition about how the decision boundary changes as you change the K and change your distance metric and all those sorts of things. Okay, so then the question is once you're actually trying
to use this algorithm in practice, there are several choices you need to make. We talked about choosing
different values of K. We talked about choosing
different distance metrics. And the question becomes how do you actually make
these choices for your problem and for your data? So, these choices, of things
like K and the distance metric, we call hyperparameters, because they are not necessarily
learned from the training data, instead these are choices about
your algorithm that you make ahead of time and there's no way to learn
them directly from the data. So, the question is how
do you set these things in practice? And they turn out to be
very problem-dependent. And the simple thing that
most people do is simply try different values of
hyperparameters for your data and for your problem, and
figure out which one works best. There's a question? [student asking a question] So, the question is, where L1
distance might be preferable to using L2 distance? I think it's mainly problem-dependent, it's sort of difficult to say in which cases you think
one might be better than the other. But I think that because L1
has this sort of coordinate dependency, it actually depends
on the coordinate system of your data, if you know that you have a vector, and maybe the individual
elements of the vector have meaning. Like maybe you're classifying
employees for some reason and then the different elements
of that vector correspond to different features or
aspects of an employee. Like their salary or the
number of years they've been working at the company
or something like that. So I think when your
individual elements actually have some meaning, is where I think maybe using
L1 might make a little bit more sense. But in general, again,
this is a hyperparameter and it really depends on
your problem and your data so the best answer is
just to try them both and see what works better. Even this idea of trying
out different values of hyperparameters and
seeing what works best, there are many different choices here. What exactly does it mean
to try hyperparameters and see what works best? Well, the first idea you might think of is simply choosing the
hyperparameters that give you the best accuracy or best performance on your training data. This is actually a really terrible idea. You should never do this. In the concrete case
of the nearest neighbor classifier, for example, if we set K=1, we will always
classify the training data perfectly. So if we use this strategy
we'll always pick K=1, but, as we saw from the examples earlier, in practice it seems that
setting K equal to larger values might cause us to misclassify
some of the training data, but, in fact, lead to better performance on points that were not
in the training data. And ultimately in machine learning we don't care about
fitting the training data, we really care about how our classifier, or how our method, will perform on unseen
data after training. So, this is a terrible
idea, don't do this. So, another idea that you might think of, is maybe we'll take our full dataset and we'll split it into some training data and some test data. And now I'll try training
my algorithm with different choices of hyperparameters
on the training data and then I'll go and apply
that trained classifier on the test data and now I will pick the set of hyperparameters
that cause me to perform best on the test data. This seems like maybe a
more reasonable strategy, but, in fact, this is also a terrible idea and you should never do this. Because, again, the point
of machine learning systems is that we want to know how
our algorithm will perform. So, the point of the test set is to give us some estimate
of how our method will do on unseen data that's
coming out from the wild. And if we use this strategy
of training many different algorithms with different hyperparameters, and then, selecting the
one which does the best on the test data, then, it's possible, that
we may have just picked the right set of hyperparameters that caused our algorithm
to work quite well on this testing set, but now our performance on this test set will no longer be representative of our performance on new, unseen data. So, again, you should not
do this, this is a bad idea, you'll get in trouble if you do this. What is much more common, is
to actually split your data into three different sets. You'll partition most of
your data into a training set and then you'll create a validation set and a test set. And now what we typically do
is go and train our algorithm with many different
choices of hyperparameters on the training set, evaluate on the validation set, and now pick the set of hyperparameters which performs best on the validation set. And now, after you've
done all your development, you've done all your debugging, after you've done everything, then you'd take that best
performing classifier on the validation set and run it once on the test set. And now that's the number
that goes into your paper, that's the number that
goes into your report, that's the number that
actually is telling you how your algorithm is doing on unseen data. And this is actually
really, really important that you keep a very
strict separation between the validation data and the test data. So, for example, when we're
working on research papers, we typically only touch the test set at the very last minute. So, when I'm writing papers, I tend to only touch the
test set for my problem in maybe the week before
the deadline or so to really ensure that we're not being dishonest here and
we're not reporting a number which is unfair. So, this is actually super important and you want to make sure
to keep your test data quite under control. So another strategy for
setting hyperparameters is called cross validation. And this is used a
little bit more commonly for small data sets, not used
so much in deep learning. So here the idea is we're
going to take our dataset, as usual, hold out some test
set to use at the very end, and now, for the rest of the data, rather than splitting it
into a single training and validation partition, instead, we can split our training data into many different folds. And now, in this way, we
cycle through choosing which fold is going to be the validation set. So now, in this example, we're using five fold cross validation, so you would train your
algorithm with one set of hyperparameters on the first four folds, evaluate the performance on fold five, and now go and retrain
your algorithm on folds one, two, three, and five, evaluate on fold four, and cycle through all the different folds. And, when you do it this way, you get much higher confidence about which hyperparameters are going to perform more robustly. So this is kind of the
gold standard to use, but, in practice in deep learning when we're training large models and training is very
computationally expensive, this doesn't get used
too much in practice. Question? [student asking a question] Yeah, so the question is, a little bit more concretely, what's the difference
between the training and the validation set? So, if you think about the
k-nearest neighbor classifier then the training set is this
set of images with labels where we memorize the labels. And now, to classify an image, we're going to take the image
and compare it to each element in the training data, and then transfer the label
from the nearest training point. So now our algorithm
will memorize everything in the training set, and now we'll take each
element of the validation set and compare it to each
element in the training data and then use this to
determine what is the accuracy of our classifier when it's
applied on the validation set. So this is the distinction
between training and validation. Where your algorithm is
able to see the labels of the training set, but for the validation set, your algorithm doesn't have
direct access to the labels. We only use the labels
of the validation set to check how well our algorithm is doing. A question? [student asking a question] The question is, whether the test set, is it possible that the
test set might not be representative of data
out there in the wild? This definitely can be
a problem in practice, the underlying statistical
assumption here is that your data are all independently
and identically distributed, so that all of your data points should be drawn from the same underlying
probability distribution. Of course, in practice, this
might not always be the case, and you definitely can run into cases where the test set might
not be super representative of what you see in the wild. So this is kind of a problem
that dataset creators and dataset curators need to think about. But when I'm creating
datasets, for example, one thing I do, is I'll go and collect a whole
bunch of data all at once, using the exact same methodology
for collecting the data, and then afterwards you go
and partition it randomly between train and test. One thing that can screw you up here is maybe if you're collecting data over time and you make the earlier
data, that you collect first, be the training data, and the later data that you
collect be the test data, then you actually might
run into this shift that could cause problems. But as long as this partition is random among your entire set of data points, then that's how we try
to alleviate this problem in practice. So then, once you've gone through this cross validation procedure, then you end up with graphs
that look something like this. So here, on the X axis, we
are showing the value of K for a k-nearest neighbor
classifier on some problem, and now on the Y axis, we are
showing what is the accuracy of our classifier on some dataset for different values of K. And you can see that, in this case, we've done five fold cross
validation over the data, so, for each value of K we
have five different examples of how well this algorithm is doing. And, actually, going back
to the question about having some test sets
that are better or worse for your algorithm, using K fold cross validation is maybe one way to help
quantify that a little bit. And, in that, we can see the
variance of how this algorithm performs on the different
validation folds. And that gives you some sense of, not just what is the best, but, also, what is the
distribution of that performance. So, whenever you're training
machine learning models you end up making plots like this, where they show you what is your accuracy, or your performance as a
function of your hyperparameters, and then you want to
go and pick the model, or the set of hyperparameters, at the end of the day, that performs the best
on the validation set. So, here we see that maybe
about K=7 probably works about best for this problem. So, k-nearest neighbor
classifiers on images are actually almost
never used in practice, because of all of these
problems that we've talked about. So, one problem is that
it's very slow at test time, which is the reverse of what we want, which we talked about earlier. Another problem is that these things like Euclidean
distance, or L1 distance, are really not a very good way to measure distances between images. These, sort of, vectorial
distance functions do not correspond very well
to perceptual similarity between images. How you perceive
differences between images. So, in this example, we've constructed, there's this image on the left of a girl, and then three different
distorted images on the right where we've blocked out her mouth, we've actually shifted
the image down by a couple pixels, or tinted the entire image blue. And, actually, if you compute
the Euclidean distance between the original and the boxed, the original and the shifted, and the original and the tinted, they all have the same L2 distance. Which is, maybe, not so good because it sort of
gives you the sense that the L2 distance is really
not doing a very good job at capturing these perceptual
distances between images. Another, sort of, problem
with the k-nearest neighbor classifier has to do with
something we call the curse of dimensionality. So, if you recall back this
viewpoint we had of the k-nearest neighbor classifier, it's sort of dropping paint
around each of the training data points and using that to
sort of partition the space. So that means that if we
expect the k-nearest neighbor classifier to work well, we kind of need our training
examples to cover the space quite densely. Otherwise our nearest neighbors
could actually be quite far away and might not actually
be very similar to our testing points. And the problem is, that actually densely covering the space, means that we need a number
of training examples, which is exponential in the
dimension of the problem. So this is very bad, exponential
growth is always bad, basically, you're never
going to get enough images to densely cover this space of pixels in this high dimensional space. So that's maybe another
thing to keep in mind when you're thinking about
using k-nearest neighbor. So, kind of the summary
is that we're using k-nearest neighbor to introduce this idea of image classification. We have a training set
of images and labels and then we use that to predict these labels on the test set. Question? [student asking a question] Oh, sorry, the question is, what was going on with this picture? What are the green and the blue dots? So here, we have some training samples which are represented by points, and the color of the dot
maybe represents the category of the point, of this training sample. So, if we're in one dimension, then you maybe only need
four training samples to densely cover the space, but if we move to two dimensions, then, we now need, four times
four is 16 training examples to densely cover this space. And if we move to three, four,
five, many more dimensions, the number of training
examples that we need to densely cover the space, grows exponentially with the dimension. So, this is kind of giving you the sense, that maybe in two dimensions we might have this kind
of funny curved shape, or you might have sort of
arbitrary manifolds of labels in different dimensional spaces. Because the k-nearest neighbor algorithm doesn't really make any
assumptions about these underlying manifolds, the only way it can perform properly is if it has quite a dense
sample of training points to work with. So, this is kind of the
overview of k-nearest neighbors and you'll get a chance
to actually implement this and try it out on images
in the first assignment. So, if there are any last minute
questions about KNN, I'm going to move on to the next topic. Question? [student is asking a question] Sorry, say that again. [student is asking a question] Yeah, so the question is, why do these images have
the same L2 distance? And the answer is that, I
carefully constructed them to have the same L2 distance. [laughing] But it's just giving you the
sense that the L2 distance is not a very good measure
of similarity between images. And these images are
actually all different from each other in quite disparate ways. If you're using KNN, then the only thing you
have to measure distance between images, is this single distance metric. And this kind of gives
you an example where that distance metric is
actually not capturing the full description of
distance or difference between images. So, in this case, I just sort
of carefully constructed these translations and these
offsets to match exactly. Question? [student asking a question] So, the question is, maybe this is actually good, because all of these things are actually having the
same distance to the image. That's maybe true for this example, but I think you could also
construct examples where maybe we have two original images and then by putting the
boxes in the right places or tinting them, we could cause it to be
nearer to pretty much anything that you want, right? Because in this example, we
can kind of like do arbitrary shifting and tinting to kind of change these
distances nearly arbitrarily without changing the perceptual
nature of these images. So, I think that this
can actually screw you up if you have many
different original images. Question? [student is asking a question] The question is, whether or not it's
common in real-world cases to go back and retrain on the entire dataset once you've found those
best hyperparameters? So, people do sometimes
do this in practice, but it's somewhat a matter of taste. If you're really rushing for that deadline and you've really got to
get this model out the door then, if it takes a long
time to retrain the model on the whole dataset, then maybe you won't do it. But if you have a little
bit more time to spare and a little bit more compute to spare, and you want to squeeze out
maybe that extra 1% of performance, then that
is a trick you can use. So we kind of saw that
the k-nearest neighbor has a lot of the nice properties of machine learning algorithms, but in practice it's not so great, and really not used very much in images. So the next thing I'd
like to talk about is linear classification. And linear classification is,
again, quite a simple learning algorithm, but this will
become super important and help us build up to
whole neural networks and whole convolutional networks. So, one analogy people often talk about when working with neural networks is we think of them as being
kind of like Lego blocks. That you can have different
kinds of components of neural networks and you
can stick these components together to build these
large different towers of convolutional networks. One of the most basic
building blocks that we'll see in different types of
deep learning applications is this linear classifier. So, I think it's actually
really important to have a good understanding
of what's happening with linear classification. Because these will end up
generalizing quite nicely to whole neural networks. So another example of kind
of this modular nature of neural networks comes from some research in our
own lab on image captioning, just as a little bit of a preview. So here the setup is that
we want to input an image and then output a descriptive sentence describing the image. And the way this kind of works is that we have one convolutional
neural network that's looking at the image, and a recurrent neural network that knows about language. And we can kind of just stick
these two pieces together like Lego blocks and train
the whole thing together and end up with a pretty cool system that can do some non-trivial things. And we'll work through the
details of this model as we go forward in the class, but this just gives you the sense that, these deep neural networks
are kind of like Legos and this linear classifier is kind of like one of the most
basic building blocks of these giant networks. But that's a little bit too
exciting for lecture two, so we have to go back to
CIFAR-10 for the moment. [laughing] So, recall that CIFAR-10 has
these 50,000 training examples, each image is 32 by 32 pixels
and three color channels. In linear classification,
we're going to take a bit of a different approach
from k-nearest neighbor. So, the linear classifier is
one of the simplest examples of what we call a parametric model. So now, our parametric model
actually has two different components. It's going to take in this image,
maybe, of a cat on the left, which we usually write
as X for our input data, and also a set of parameters, or weights, which is usually called
W, also sometimes theta, depending on the literature. And now we're going to
write down some function which takes in both the data,
X, and the parameters, W, and this'll spit out now
10 numbers describing what are the scores
corresponding to each of those 10 categories in CIFAR-10. With the interpretation that,
like the larger score for cat, indicates a larger probability
of that input X being a cat. And now, a question? [student asking a question] Sorry, can you repeat that? [student asking a question] Oh, so the question is what is the three? The three, in this example,
corresponds to the three color channels, red, green, and blue. Because we typically work on color images, that's nice information that
you don't want to throw away. So, in the k-nearest neighbor setup there were no parameters, instead, we just kind of keep around
the whole training data, the whole training set, and use that at test time. But now, in a parametric approach, we're going to summarize our
knowledge of the training data and stick all that knowledge
into these parameters, W. And now, at test time, we
no longer need the actual training data, we can throw it away. We only need these
parameters, W, at test time. So this allows our models
to now be more efficient and actually run on maybe
small devices like phones. So, kind of, the whole
story in deep learning is coming up with the
right structure for this function, F. You can imagine writing down
different functional forms for how to combine weights
and data in different complex ways, and these
could correspond to different network architectures. But the simplest possible example of combining these two things is just, maybe, to multiply them. And this is a linear classifier. So here our F of X, W is
just equal to the W times X. Probably the simplest
equation you can imagine. So here, if you kind of unpack the
dimensions of these things, we recall that our image was
maybe 32 by 32 by 3 values. So then, we're going to take
those values and then stretch them out into a long column vector that has 3,072 by one entries. And now we want to end
up with 10 class scores. We want to end up with
10 numbers for this image giving us the scores for
each of the 10 categories. Which means that now our matrix, W, needs to be ten by 3072. So that once we multiply
these two things out then we'll end up with
a single column vector 10 by one, giving us our 10 class scores. Also, you'll typically see that we add a bias term, which will be a constant
vector of 10 elements that does not interact
with the training data, and instead just gives us
some sort of data independent preferences for some classes over others. So you might imagine that
if your dataset was unbalanced and had many
more cats than dogs, for example, then the bias
elements corresponding to cat would be higher
than the other ones. So if you kind of think about pictorially what this function is doing, in this figure we have
an example on the left of a simple image with
just a two by two image, so it has four pixels total. So the way that the
linear classifier works is that we take this two by two image, we stretch it out into a column vector with four elements, and now, in this example,
we are just restricting to three classes, cat, dog, and ship, because you can't fit 10 on a slide, and now our weight matrix is
going to be three by four, so we have three classes and four pixels. And now, again, we have a
three element bias vector that gives us data independent bias terms for each category. Now we see that the cat score
is going to be the inner product between the pixels of our image and this row in the weight matrix added together with this bias term. So, when you look at it this way you can kind of understand
linear classification as almost a template matching approach. Where each of the rows in this matrix corresponds to some template of the image. And now the inner product or dot product between the row of the
matrix and the column giving the pixels of the image, computing this dot
product kind of gives us a similarity between this
template for the class and the pixels of our image. And then bias just,
again, gives you this data independent offset
to each of the classes. If we think about linear classification from this viewpoint of template matching we can actually take the
rows of that weight matrix and unravel them back into images and actually visualize
those templates as images. And this gives us some
sense of what a linear classifier might actually be doing to try to understand our data. So, in this example, we've
gone ahead and trained a linear classifier on our images. And now on the bottom we're visualizing what are those rows in
that learned weight matrix corresponding to each of the 10 categories in CIFAR-10. And in this way we kind
of get a sense for what's going on in these images. So, for example, in the
left, on the bottom left, we see the template for the plane class, kind of consists of this like blue blob, this kind of blobby thing in the middle and maybe blue in the background, which gives you the sense
that this linear classifier for plane is maybe looking for blue stuff and blobby stuff, and those
features are going to cause the classifier to like planes more. Or if we look at this car example, we kind of see that
there's a red blobby thing through the middle and a
blue blobby thing at the top that maybe is kind of a blurry windshield. But this is a little bit weird, this doesn't really look like a car. No individual car
actually looks like this. So the problem is that
the linear classifier is only learning one
template for each class. So if there's sort of
variations in how that class might appear, it's trying to average out all
those different variations, all those different appearances, and use just one single template to recognize each of those categories. We can also see this pretty
explicitly in the horse classifier. So in the horse classifier we
see green stuff on the bottom because horses are usually on grass. And then, if you look
carefully, the horse actually seems to have maybe two
heads, one head on each side. And I've never seen a
horse with two heads. But the linear classifier
is just doing the best that it can, because it's
only allowed to learn one template per category. And as we move forward
into neural networks and more complex models, we'll be able to achieve
much better accuracy because they no longer
have this restriction of just learning a single
template per category. Another viewpoint of the linear classifier is to go back to this idea of images as points in a high dimensional space. And you can imagine
that each of our images is something like a point in
this high dimensional space. And now the linear classifier
is putting in these linear decision boundaries
to try to draw linear separation between one category and the rest of the categories. So maybe up on the upper-left hand side we see these training
examples of airplanes and throughout the process of training the linear classifier will
go and try to draw this blue line to separate
out with a single line the airplane class from all
the rest of the classes. And it's actually kind of
fun if you watch during the training process these
lines will start out randomly and then go and snap into
place to try to separate the data properly. But when you think about
linear classification in this way, from this high
dimensional point of view, you can start to see again
what are some of the problems that might come up with
linear classification. And it's not too hard
to construct examples of datasets where a linear
classifier will totally fail. So, one example, on the left here, is that, suppose we have a
dataset of two categories, and these are all maybe
somewhat artificial, but maybe our dataset has two categories, blue and red. And the blue category
is any image where the number of pixels which are
greater than zero is odd. And anything where the
number of pixels greater than zero is even, we want to
classify as the red category. So if you actually go and
draw what these different decision regions look like in the plane, you can see that our blue class
with an odd number of pixels is going to be these two
quadrants in the plane, and even will be the
opposite two quadrants. So now, there's no way that we
can draw a single line to separate the blue from the red. So this would be an example
where a linear classifier would really struggle. And this is maybe not such an
artificial thing after all. Instead of counting pixels, maybe we're actually trying
to count whether the number of animals or people in
an image is odd or even. So this kind of a parity problem of separating odds from evens is something that linear classification really struggles with traditionally. Other situations where a linear
classifier really struggles are multimodal situations. So here on the right, maybe our blue category has
these three different islands of where the blue category lives, and then everything else
is some other category. So, something like horses, which we saw in the previous example, is a case where this
actually might be happening in practice. Where there's maybe one
island in the pixel space of horses looking to the left, and another island of
horses looking to the right. And now there's no good
way to draw a single linear boundary between these two
isolated islands of data. So anytime where you have multimodal data, like one class that can appear in
different regions of space, is another place where linear
classifiers might struggle. So there's kind of a lot of problems with linear classifiers, but it
is a super simple algorithm, super nice and easy to interpret
and easy to understand. So you'll actually be
implementing these things on your first homework assignment. At this point, we kind of talked about what is the functional
form corresponding to a linear classifier. And we've seen that this functional form of matrix vector multiply corresponds to this idea of template matching and learning a single
template for each category in your data. And then once we have this trained matrix you can use it to actually
go and get your scores for any new example. But what we have not told you is how do you actually go
about choosing the right W for your dataset. We've just talked about
what is the functional form and what is going on with this thing. So that's something we'll
really focus on next time. And next lecture we'll talk about what are the strategies and algorithms for choosing the right W. And this will lead us to questions of loss functions and optimization and eventually ConvNets. So, that's a bit of the
preview for next week. And that's all we have for today.
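As a recap, the linear classifier described above can be sketched in a few lines of NumPy. This is only a minimal sketch under stated assumptions: the function name linear_scores, the random weights, and the random stand-in image are all made up for illustration, and nothing here is trained. It just checks that the shapes work out the way we described: a 32 by 32 by 3 image stretched into 3,072 numbers, a 10 by 3,072 weight matrix, a bias of 10 elements, and 10 class scores coming out.

```python
import numpy as np


def linear_scores(x, W, b):
    """Compute class scores f(x, W) = W x + b for one flattened image.

    Hypothetical helper for illustration; not taken from the course code.
    """
    return W @ x + b


rng = np.random.default_rng(0)  # fixed seed, for illustration only

# Stand-in for one CIFAR-10 image: 32 x 32 pixels, 3 color channels.
image = rng.integers(0, 256, size=(32, 32, 3))

# Stretch the image out into a long vector of 32 * 32 * 3 = 3,072 numbers.
x = image.reshape(-1).astype(np.float64)

# Untrained, random parameters (assumption): one row of W per class.
W = 0.001 * rng.standard_normal((10, 3072))
b = np.zeros(10)  # data independent bias, one entry per class

scores = linear_scores(x, W, b)            # 10 numbers, one per class
predicted_class = int(np.argmax(scores))   # index of the largest score

# Template matching view: each row of W can be unraveled back into an image.
template_0 = W[0].reshape(32, 32, 3)
```

With random weights these scores are meaningless, of course. The sketch only shows the functional form; how to actually choose a good W is the topic of the next lecture.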