Today I want to talk about Machine Learning:
A very cool topic that's getting used more and more these days. Part of the reason that machine learning seems
weird is that we have this idea of machines as being this sort of rigid and unadaptable
kind of thing, just sort of blindly doing whatever they do. But the whole idea of learning is that you
take in some data, some observations about the world and try to find patterns in them
that you can exploit to do better next time. So a machine, for a machine to learn means
it has to be able to, number one, change its behavior, and, number two, observe and find
out things about the world: Sort of the opposite of blindly doing the same thing. There's a lot of different ways that we can
sort of look at learning problems; here's four of them, and you already know all of
them. Supervised learning is, you get some input,
and then a teacher, a supervisor, tells you what they are. Like, you know, "This is a fleeb." And "This is a gloob." OK, now, so here's a quiz: What is this? It's a fleeb: Supervised learning. Unsupervised learning, there's no teacher. You're just sort of seeing stuff and saying,
"Oh, this thing and this thing, oh these are pretty similar, I'm going to put them together. Oh, this other thing, that's very different,
I'm going to put it in a separate category." So, unsupervised learning, you just observe
some data and sort it into groups based on your sense of similarity amongst the things
that you see. In reinforcement learning, you're actually
out doing things, not just looking, but acting in the world: "I wonder what happens if I
do this? Oh! Got a shock!" Negative reinforcement. Or, "Ooh, tasty!" Positive reinforcement. And then you're trying to learn to do things
that give positive reinforcement. Temporal; temporal reinforcement learning,
it's the same as reinforcement learning except there's sort of like a big time delay. It's like "Oh, I shouldn't have eaten that
mushroom yesterday!" Games: Chess, checkers, whatever, are typically examples of temporal reinforcement learning tasks. You have to take a whole bunch of moves; you win or lose, and it's not clear which of the moves were necessarily responsible for
that. Given these problem formulations, there's
lots of different learning algorithms: Specific things to have the machine do to change its
behavior in response to its observations, and try to do better. Nearest neighbor and k-means clustering, those
are unsupervised learning algorithms, the other ones are mostly supervised. This is a current area of research; there's
lots of different algorithms here; lots of different ones that could be put in. Which is good, so, there's a lot of activity
happening -- but it also kind of suggests that, you know, this is definitely not a solved
problem. And, you know, there's a sense in which the
whole idea of learning is not really a solvable problem. It depends on making assumptions about the
world: That some pattern that used to hold, will continue to hold. OK? Alright. So, regardless of what learning algorithm
-- or, which problem formulation, really -- you adopt, there's some things that come up over
and over again, and I want to talk about two of them here, and try to do demos. The first one is this idea of exploration
versus exploitation. If you need to explore out in the world to
find out what your options are -- to find out what's good and what's bad -- then a question
comes up: "OK, this thing seems to be pretty good; should I just cash out and keep doing
this thing? Or should I try doing some other stuff in
hopes of finding something better?" Which may cause me to give up payoff, reward,
good things, that I could have had by doing what I already knew. Trying new stuff is exploration, cashing out
the best thing that you know so far: Exploitation. There's an inherent tradeoff between them. Alright. Let's try an example. This is the Chili Monster. It's an alien robot Chili Monster. Maybe it's a zombie too. And the way it works is: You poke this guy
in the eye and he gives you a cookie. So, there: Plus one; that's a good cookie;
that's a payoff. And so the question is: Well, so should we
punch him in the red eye again, or should we go for green? I'm going to go for green. Green, OK, that was bad: Minus one. So, here we are. We're now the learning creature. We've got a good one from red, a bad one from
green, so we figure we'll do the red one again. Great, so two and one. Do this guy maybe a little more. Now we're two and one either way. OK? So what choice do we want to go with next? Who knows? Pick a red one. OK, three and one. Alright, that's worse. So, here we are with a learning problem: Do
we want to just commit to red, and do that all the time? Or do we want to try green some more and see
if it catches up? Now, as humans we have a real tendency to
try to make up immediate explanations for everything. Like, alright, we'll try red again. OK, good, so red is just better. Try green, well, green was better too. Maybe, you know, we're just being extra lucky
today, and we should alternate back and forth.. well, guess not. So one way that we can avoid sort of trying
to say: "Well, the reason that red is better is that I'm wearing my lucky red socks," is, instead of making up explanations immediately, to sit back and collect data. We need to keep count of how many things went
well and how many things went badly. One of the mistakes we can make as people
is to only count one side: Optimists count the good side and pessimists look at the bad
side. So we've got a little collector cup, here,
that we can use. And the way it works, we just sort of put
it here and it vacuums up whatever it is. So, so far, on the red side we've got four
good ones and two bad ones. We got a guy for the green side. He's paying off at three and two; they're
both pretty good. OK? So, you know, which one should we pick? Well, we've got this guy; we've got a learning
creature that is connected to our cups, so he's aware of the data he gets here. And he doesn't have a mouse so he throws water
balloons. So he threw it at green, and got a
negative payoff. Right. So if we go again, well, got a negative payoff
on that side. So, the learning algorithm for this guy has
to decide what to do. Which way is he going to go? Let's just let him run for a while. Red, got a payoff. So, he's focusing on red. Why is he focusing on red? Well, red has won sixty-something percent
of the time. Green, well now, green is up there as
well. He's going back and forth between the two. So, how does this guy work, inside? And here, it's pretty cool actually. It's very simple, and it's an algorithm that
is being used out in the world -- we'll talk about that in a sec. Alright. So here, this graph, I'm going to explain
it in complete detail because it's a little bit involved. It's called a "beta distribution". And the way it works is.. So, how much is that red guy worth? We've tried him thirty-one times, we got sixteen
wins and fifteen losses, how good is that? Well, we know the average, you know, if you divide it out: fifty-one percent, whatever it is. But is that really what it's worth? Could it actually be sort of paying off fifty-fifty, and we haven't seen it yet? Could it even be paying off fifty-five percent and we're just getting particularly unlucky? Well, this thing called a beta distribution
-- it comes out of statistics -- you can feed in the two numbers: The number of wins and
the number of losses. And I'm fluffing over some details here. And it gives you back this curve. And the curve doesn't say what the worth of
the choice is, but it says the distribution of it. So, right now, this green curve corresponds to the fact that we got seventeen wins and thirteen losses, which is actually a little better than the red curve, at sixteen wins and fifteen losses. But both the curves are very broad, meaning
we're not really sure how much this thing is worth at all. I mean, the green guy, he has some height
all the way out at point-seven, and all the way down to maybe point-three, or something
like that. So, the white guy's a little demo rod, and
he's completely flat. So if we've never tried him at all, it's completely
flat. But as we try it, and, start to get wins,
the curve gets sharper and sharper toward the wins. So if we had eleven wins and zero losses,
we'd start to think "Wow, it's pretty likely that this thing pays off all the time!" But then the instant we get a loss, well we
know it can't be paying off all the time, and so on. So the beta distribution, you feed in the
wins and losses, and it gives you this estimate of how much.. the distribution of possible
values of the choice.
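If you want to play with that idea, here's a rough sketch in Python using scipy's beta distribution. The sixteen-and-fifteen numbers are the red guy from the demo; the function name and everything else here are just mine:

```python
from scipy.stats import beta

def payoff_curve(wins, losses):
    # Beta distribution for one choice, starting from the completely
    # flat curve you get at zero wins and zero losses.
    return beta(wins + 1, losses + 1)

# The red guy: sixteen wins, fifteen losses -- a broad curve around one-half.
red = payoff_curve(16, 15)
print(red.mean())          # about 0.52
print(red.interval(0.95))  # wide: roughly 0.35 to 0.68

# Three-and-three and three-hundred-and-three-hundred are both "fifty-fifty",
# but the second curve is much, much narrower.
print(payoff_curve(3, 3).interval(0.95))
print(payoff_curve(300, 300).interval(0.95))
```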
And, so now, the nice thing about this is that it's better than just saying the average number of times it won. Because, you know, if we had three wins and
three losses, that's fifty-fifty. But if we had three hundred wins and three
hundred losses, that's also fifty-fifty, but we know more about it when we have three hundred
wins and losses than when we had three. So you see, as we try to, like, go from zero-zero,
it's completely flat; as we increase the number of wins and losses, if I can go straight here,
it stays at, sort of, fifty-fifty, but the distribution gets narrower and narrower, 'cause
we got more and more data saying "It's really likely that it's fifty-fifty." So, what this guy is doing, is he's got those
two curves -- let's get rid of the white guy -- he's got the two curves for the data he's
seen, and every time he makes a choice, the curve for the choice he made is the one that gets updated with the new data. And what he does is he throws a random number weighted by the green curve, and then throws a random number weighted by the red curve, and whichever one comes out higher, he picks that.
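That throw-two-random-numbers-and-take-the-higher rule is what's usually called Thompson sampling. Here's a minimal sketch of the whole loop in Python; the poke() function and the two payoff rates are stand-ins I made up to play the part of the Chili Monster, not anything taken from the actual demo:

```python
import random

# Hidden payoff rates: in the demo one choice pays off 52% of the time and
# the other 48%, scrambled at random -- these numbers are just stand-ins.
TRUE_PAYOFF = {"red": 0.52, "green": 0.48}

wins = {"red": 0, "green": 0}
losses = {"red": 0, "green": 0}
total = 0

def poke(arm):
    # The Chili Monster: a cookie (+1) or not (-1), at the arm's hidden rate.
    return +1 if random.random() < TRUE_PAYOFF[arm] else -1

for trial in range(1000):
    # Draw one random number from each choice's beta curve...
    draws = {arm: random.betavariate(wins[arm] + 1, losses[arm] + 1)
             for arm in TRUE_PAYOFF}
    # ...and go with whichever choice's draw comes out higher.
    choice = max(draws, key=draws.get)
    payoff = poke(choice)
    total += payoff
    if payoff > 0:
        wins[choice] += 1      # only the chosen arm's curve gets updated
    else:
        losses[choice] += 1

print(total, wins, losses)
```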
OK, so now at the moment the red and green curves are sort of overlapping quite a bit, which means it's going to pick one, pick the other, and so on. But as we get more data for these things, the distributions start to get more narrow, and, if they don't overlap.. if they overlap less,
then it's going to pick one over the other. Let's let this go a little faster. All right. Something like that. OK. So, green is paying off fifty-four percent,
something like that; red's paying off fifty percent.. I can tell you now that I programmed this
so that one of red and green pays off forty-eight percent of the time; the other one pays off
fifty-two percent of the time. But I don't even know which is which; the
program scrambles them randomly when it starts. So, I mean, if I had.. no, I don't know.. I was going to say it was probably green,
but now they're very similar, and, in fact, they're both paying off more than their actual
payoff, just 'cause of random variations -- just 'cause of luck. Actually speed this up a little more; I just
figured this out the other day: You can get two of them going at once.. See if I can get it going.. There we go. OK. So now he's throwing the next one before he
even got the cookie from the previous one -- which means he's making his decisions about
what to do next based on the throw from two ago, but, you know, we're trying to get lots
of data, so who cares? And now it's looking like the green is significantly
better, so the learning algorithm is choosing to exploit, to dump it into the green guy. Still trying the red one every so often, because
it might be wrong. Still exploring a little bit, but shifting
more and more towards exploitation. This number in the middle here is its total
payoff over all choices red and green put together. So, it's up thirty-three, thirty-one, whatever
it is. This kind of problem, where you've got to do something out in the world, has two goals. One is to get knowledge about what's happening in the world; that's why you explore. And the other one is to actually make something
good for yourself; that's the exploit. It's a fundamental problem of learning algorithms. And this kind of algorithm, this kind of learning
problem -- this is really not called the "Alien Robot Chili Monster Problem"; it's more often
called the "Two-armed Bandit Problem" or the "Multi-armed Bandit Problem", where each action
you got is like a different slot machine; you're trying to figure out which is the best
slot machine -- if they're any different. And this problem comes up all over the internet. It comes up with advertising. Now, the actions are what ads should you put
on a web page, say. And the Chili Monster is..you. It's trying to get you to click on one of
those ads -- that's a plus one. If you don't click on an ad, that's minus
one for all of them. So, the system behind the ad system -- or,
like in the video website, when you're watching videos, at the end of the video you get this
little grid of other ones to watch? Same thing. The choices of which ones to put up there are based on this type of algorithm. It's trying to pick the ones that, on the one hand, you'll pick and you'll watch, but on the other hand it's also trying to get knowledge about whether this particular video is one that people like. Alright. So.. Still not completely clear.. we can let this
go a little bit more and then come back when we do the next demo. Alright. So exploration versus exploitation. Even when we're in other problem.. other versions
of learning, that's still behind the scenes. I mean, in supervised learning, the teacher
is just telling you do this, do this, do this. It's not so obvious how there's the explore/exploit
tradeoff, but once you learn something you're still going to have to go out and do it in the
world and then that comes back again. OK. So, you know, if the video site is showing
me a set of videos that I might want to watch.. it's like I'm going to make one choice and
then I'm gone. It's not like I'm the Chili Monster sitting
there getting treated over and over and over again. So in fact, really, I'm just a teeny little
piece of the Chili Monster, and all the other people who are looking at videos and getting
presented with choice screens, and so forth, are all more parts of the Chili Monster. Which raises the question: You know, I don't
like the same videos that somebody else does, and if they show me something that someone
else likes, it's not necessarily going to work that well for me. I mean, they might be able to pick the videos
that everybody likes. You know, Gangnam Azalea, whatever it is. But you could do better if you could recognize
that I'm similar to other, you know, geeky, computer-y people, whatever I am. And somebody else is similar to that too. So what we can try to do is, we want to do
a categorization task, and based on some knowledge that we have, we want to make a distinction
between one versus the other. And, this ultimately leads into this question
of knowledge representation: How do we divide up the things that we see in the world into
categories that we can make predictions about them? Things in this category are going to tend
to like these videos; things in this category tend to like those videos. OK? So, let's do an example of that. Well, let's see how the learning guy is doing. They're still both extremely close together,
and, as a result -- and this is good -- as a result it's still trying both of them. It looks like red's got a little bit of the
edge here. Fifty-two percent, and green is falling away
a little bit, although.. Did we lose our, did we lose our double.. We lost our double clocking there, so we weren't
going as well. Well, I guess we're not going to know this
story for sure, exactly which one ends up better. Maybe if there's time we'll come back to it
at the end; dunno. But we're going to have to stop him now. And, alright. Let's look at this. One of the earliest learning algorithms for
making distinctions -- supervised learning now -- is the perceptron. We get input; the teacher's going to tell us this
is this, this is that, and we have to learn it. In this case, we're going to learn pictures. OK? I went out and found on the web a bunch of
news photographs that were used for face recognition research, and took a dozen pictures of former
President Bush, and I looked in my camera and I got a dozen pictures of my cats, so
we're going to try to learn to tell former President from cat. Alright? And the way it works is this: Up here we've
got our input. This is two-hundred and fifty by two-hundred
and fifty individual pixels that make up a black-and-white picture. So they can be zero, which is black. They can be one, which is white. Anywhere in between, which is gray. Down here at the bottom we have another whole
set of two hundred fifty by two hundred fifty -- sixty-two thousand something -- weights. Numbers, that represent what we've learned. And if we start this thing up.. Alright, so first.. the first thing we do
is clear out the weights. So, this level of gray is a weight value of
zero. And if the weight value goes negative, it's
going to be darker than that; if the weight goes positive it's going to be brighter than
that. So the way this machine works is we put in
a picture. Alright; there's a picture: Former President
Bush. And in order to evaluate this we take each
pixel, in turn, two hundred fifty by two hundred fifty, multiply it by the corresponding weight,
and add them all up. And if it's greater than zero, we say the
category is 'President', and if it's less than or equal to zero we say the category
is 'cat'. OK.
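In code, that evaluation step is just a weighted sum and a threshold. Here's a rough sketch in Python; I'm assuming the picture arrives as a 250-by-250 array of grays and shifting it so mid-gray counts as zero (bright positive, dark negative). The talk doesn't pin that scaling down, but it matches the bright-spots-get-positive-weights picture that comes up in a minute:

```python
import numpy as np

SIZE = 250                         # 250 x 250 pixels, one weight per pixel
weights = np.zeros((SIZE, SIZE))   # cleared out: every weight starts at zero

def predict(picture, weights):
    # Multiply each pixel by its corresponding weight, add them all up,
    # and threshold: greater than zero means 'President', otherwise 'cat'.
    total = np.sum((picture - 0.5) * weights)   # assumed scaling: gray = 0
    return "President" if total > 0 else "cat"
```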
Now at the moment, all the weights are zero, so no matter what picture it is, we're going to get something times zero plus something times zero plus something times zero -- the whole result is going to be zero. So the first prediction is.. the sum is less than or equal to zero: In this case it's equal to zero.. so what the perceptron says is: Cat. And the supervisor says: Wrong. OK. Now the perceptron learning algorithm, what
it does is say, OK, you're saying this thing should be greater than zero. It should have a sum greater than zero. So it goes through each of its weights and nudges them -- all the inputs that are high, it makes the corresponding weight more positive; all the inputs that are low, it makes them more negative. As a result, the whole sum goes up. And, I didn't even really expect this when
I first implemented this, but, alright.. Now we're going to modify our brains to make
this thing look.. produce a more positive sum. And actually it makes kind of a shadow of
the input we just saw, because the brighter spots get more positive weights, the darker
spots get more negative weights. Alright. Now here's a cat. But since we just changed the weights to make
the sum bigger, it comes out greater than zero, so, again it makes the wrong guess. Now we do the same learning, except now we
want to make the sum smaller -- less than or equal to zero -- so it's like we memorize
a negative of the picture, in order to cancel it out. Like that.
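Continuing the sketch from above, here's roughly what that update looks like. The step size and the shifted pixels are my assumptions; the talk only says that high inputs push their weights up and low inputs push them down:

```python
def learn(picture, weights, correct_label):
    # Perceptron update: leave the weights alone when the guess is right,
    # and shift them toward or away from the picture when it's wrong.
    if predict(picture, weights) == correct_label:
        return False              # no mistake, so no learning
    signed = picture - 0.5        # bright pixels positive, dark negative
    if correct_label == "President":
        weights += signed         # push the sum up: memorize a 'shadow'
    else:
        weights -= signed         # push the sum down: memorize a negative
    return True
```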
So now we've got sort of half of one picture in positive; half the other picture in negative. And so on. So we can go through.. We got a dozen of one category, a dozen of
the other category, and we're just going through them in a random order. Whenever we get the answer right -- OK, cat
-- we don't change the weights. We don't have to learn when we didn't make
a mistake. But when we do make a mistake, then we update
the weights again. OK? We go all the way through all twenty-four
cases, we see how well we did. If we got them all right we're done, otherwise
we shuffle them again and go back to the beginning.
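Put together, the whole training procedure is just repeated passes of that. This sketch assumes the predict and learn functions above and a list of (picture, label) pairs; how you load the two dozen pictures is up to you:

```python
import random

def train(examples, weights, max_passes=100):
    # Keep making shuffled passes over the examples until one whole pass
    # goes by with no mistakes (or we give up after max_passes).
    for _ in range(max_passes):
        random.shuffle(examples)
        mistakes = sum(learn(picture, weights, label)
                       for picture, label in examples)
        if mistakes == 0:         # got them all right: done
            break
    return weights
```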
So let's speed this up. Alright. So, it's getting some right, getting some wrong. Alright, so now we're at the end of the pass,
and we got half right. Same thing we would have gotten by flipping
a coin, or always saying 'Cat', but let's see what happens in the second pass. Alright. Yes. Well, no, still making some mistakes. Nope. Yes. Yes. Hey, yeah. Uh-uh. Yes. Yes. Starting to learn. OK? So if we let this go a few passes -- and because
we only learn when we make a mistake, the actual pattern of the weights depends on the order of presentation, and if we happen to get a run of them right we don't do any learning -- it can take a different number of passes before we actually categorize all the inputs correctly. And there's -- oh, we got it. OK? So, is that cool? We learned all these things. Now, one of the things that got people excited
about the perceptron, way back when in the '50s when it was first studied, is that there
was a proof that if there was any way for it to learn something, it would learn it,
so in this case it learned it, no problem. But we would like more than having it just
memorize a bunch of pictures that we showed it. We want it to learn the concept of former
President versus cat. So that we could feed in new pictures.. Maybe pictures that the supervisor hadn't
even categorized, and get useful answers out of this thing. So that's called how well the algorithm generalizes. So not just learning; not just getting it
right -- you know, "teaching to the test" -- but then being able to apply knowledge
to other cases. So I've got a couple other inputs here, that
the perceptron never saw. So, a generalization test. So, here's a different picture of the cats,
that the perceptron never saw. What do you think? Is it going to get it right? No. Got it wrong. Another picture of a cat? Got it right. Fifty-fifty. Another picture of the former President; never
saw? No, sorry. Not so good. OK? The perceptron, as we've used it here, is
not very good at generalization. Because really, if we look at these weights,
you know, it's learning extremely specific little details about the pictures that we
happened to show it, and there's really no reason to expect that it would actually apply
to other pictures. I mean, even if we took one of the same pictures
we trained it on, and just sort of moved it a little bit, it would line up with different
weights and get probably a different answer. So, the knowledge representation problem is:
How do you actually build algorithms that, like the perceptron, are making categorizations, but making better categorizations, ones that are more like the way we do it? And one thing we do; one of the algorithms
that was mentioned at the beginning -- deep belief networks -- is, it's like it's got
a whole bunch of perceptrons, but instead of going straight from the input to the output,
it's got layers of them. Trying to get more and more general, abstract
things. And, a lot of this stuff is working pretty
well now. As we go forward, one of the things that is
driving machine learning being used now is the combination of two things: Computers
are much more powerful than they used to be, number one, and because everything is connected
to the internet, in particular, there's tons and tons of data to learn from. There's lots of pictures; there's people typing
stuff; people speaking, and doing speech recognition; there's tons of data, and just like if we
had many many more pictures that we were trying to make a distinction between, we'd have a
better chance that our learning algorithm would sort out what's really important and
what isn't. Assuming our knowledge representation, and
the way we built the learning algorithm, is even capable of representing the concepts
that we want it to learn. In the general case, the explore/exploit tradeoff
does not go away. The theoretical framework it gets studied
in is called 'minimizing regret'. In the learning algorithm world, the idea of 'no regrets' really isn't possible. 'Regret' means -- if you could predict everything
perfectly, you'd get some amount of payoff, and whatever you get less than that because
you have to explore is your regret.
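Written out, the usual definition looks something like this, where T is the number of choices you make, mu-star is the payoff rate of the best choice, and r_t is the payoff you actually got on choice t (the notation here is mine, not from the talk):

```latex
R_T = T\,\mu^{*} - \sum_{t=1}^{T} r_t
```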
And finally, knowledge representation goes to the core of what learning needs to produce in order to be successful -- and also, how
we even see things, how we understand the world, as a result of our experiences. And, you know, this perceptron, this stupid
little perceptron: One thing that it's really bad at is recognizing what it doesn't know. It's totally happy to make a judgment about
any possible input you give it. So for example, if we give it something that's
not a cat or a President, like: This is my dad, at Sky City years ago. What do we say that is? It says my dad's a cat. And here's a picture of a dentist light that
looks like a robot, so I took a picture of it: There you go.