NMCS4ALL: Machine Learning (full version)

Video Statistics and Information

Reddit Comments

Wow, that was great. Easily the best introduction to the multi-armed bandit (exploration vs. exploitation) problem I've seen, thanks.

👍︎︎ 7 👤︎︎ u/e_j_white 📅︎︎ Feb 29 2020 🗫︎ replies
Captions
Today I want to talk about Machine Learning: A very cool topic that's getting used more and more these days. Part of the reason that machine learning seems weird is that we have this idea of machines as being this sort of rigid and unadaptable kind of thing, just sort of blindly doing whatever they do. But the whole idea of learning is that you take in some data, some observations about the world, and try to find patterns in them that you can exploit to do better next time. So a machine, for a machine to learn means it has to be able to, number one, change its behavior, and, number two, observe and find out things about the world: Sort of the opposite of blindly doing the same thing.

There's a lot of different ways that we can sort of look at learning problems; here are four of them, and you already know all of them. Supervised learning is, you get some input, and then a teacher, a supervisor, tells you what they are. Like, you know, "This is a fleeb." And "This is a gloob." OK, now, so here's a quiz: What is this? It's a fleeb: Supervised learning. Unsupervised learning, there's no teacher. You're just sort of seeing stuff and saying, "Oh, this thing and this thing, oh these are pretty similar, I'm going to put them together. Oh, this other thing, that's very different, I'm going to put it in a separate category." So, unsupervised learning, you just observe some data and sort it into groups based on your sense of similarity amongst the things that you see. In reinforcement learning, you're actually out doing things, not just looking, but acting in the world: "I wonder what happens if I do this? Oh! Got a shock!" Negative reinforcement. Or, "Ooh, tasty!" Positive reinforcement. And then you're trying to learn to do things that give positive reinforcement. Temporal; temporal reinforcement learning, it's the same as reinforcement learning except there's sort of like a big time delay. It's like "Oh, I shouldn't have eaten that mushroom yesterday!" Games: Chess, checkers, whatever, are typical examples of temporal reinforcement learning tasks. You have to take a whole bunch of moves; you win or lose, and it's not clear how the combination of moves was responsible for that.

Given these problem formulations, there's lots of different learning algorithms: Specific things to have the machine do to change its behavior in response to its observations, and try to do better. Nearest neighbor and k-means clustering, those are unsupervised learning algorithms; the other ones are mostly supervised. This is a current area of research; there's lots of different algorithms here; lots of different ones that could be put in. Which is good, so, there's a lot of activity happening -- but it also kind of suggests that, you know, this is definitely not a solved problem. And, you know, there's a sense in which the whole idea of learning is not really a solvable problem. It depends on making assumptions about the world: That some pattern that used to hold will continue to hold.

OK? Alright. So, regardless of what learning algorithm -- or, which problem formulation, really -- you adopt, there's some things that come up over and over again, and I want to talk about two of them here, and try to do demos. The first one is this idea of exploration versus exploitation. If you need to explore out in the world to find out what your options are -- to find out what's good and what's bad -- then a question comes up: "OK, this thing seems to be pretty good; should I just cash out and keep doing this thing? 
Or should I try doing some other stuff in hopes of finding something better?" Which may cause me to give up payoff, reward, good things, that I could have had by doing what I already knew. Trying new stuff is exploration; cashing out the best thing that you know so far: Exploitation. There's an inherent tradeoff between them.

Alright. Let's try an example. This is the Chili Monster. It's an alien robot Chili Monster. Maybe it's a zombie too. And the way it works is: You poke this guy in the eye and he gives you a cookie. So, there: Plus one; that's a good cookie; that's a payoff. And so the question is: Well, so should we punch him in the red eye again, or should we go for green? I'm going to go for green. Green, OK, that was bad: Minus one. So, here we are. We're now the learning creature. We've got a good one from red, a bad one from green; figure we'll do the red one again. Great, so two and one. Do this guy maybe a little more. Now we're two and one either way. OK? So what choice do we want to go with next? Who knows? Pick a red one. OK, three and one. Alright, that's worse. So, here we are with a learning problem: Do we want to just commit to red, and do that all the time? Or do we want to try green some more and see if it catches up?

Now, as humans we have a real tendency to try to make up immediate explanations for everything. Like, alright, we'll try red again. OK, good, so red is just better. Try green, well, green was better too. Maybe, you know, we're just being extra lucky today, and we should alternate back and forth.. well, guess not. So one way that we can avoid sort of trying to say "Well, the reason that red is better is because I'm wearing my lucky red socks." is, instead of making up explanations immediately, we should lay back and collect data. We need to keep count of how many things went well and how many things went badly. One of the mistakes we can make as people is to only count one side: Optimists count the good side and pessimists look at the bad side.

So we've got a little collector cup, here, that we can use. And the way it works, we just sort of put it here and it vacuums up whatever it is. So, so far, on the red side we've got four good ones and two bad ones. We got a guy for the green side. He's paying off at three and two; they're both pretty good. OK? So, you know, which one should we pick? Well, we've got this guy; we've got a learning creature that is connected to our cups, so he's aware of the data he gets here. And he doesn't have a mouse, so he throws water balloons. So he threw it at green, and got a, got a negative payoff. Right. So if we go again, well, got a negative payoff on that side. So, the learning algorithm for this guy has to decide what to do. Which way is he going to go? Let's just let him run for a while. Red, got a payoff. So, he's focusing on red. Why is he focusing on red? Well, red has won sixty-something percent of the time. Green, well now, green is up there too as well. He's going back and forth between the two.

So, how does this guy work, inside? And here, it's pretty cool actually. It's very simple, and it's an algorithm that is being used out in the world -- we'll talk about that in a sec. Alright. So here, this graph, I'm going to explain it in complete detail because it's a little bit involved. It's called a "beta distribution". And the way it works is.. So, how much is that red guy worth? We've tried him thirty-one times, we got sixteen wins and fifteen losses, how good is that? 
Well, we know the average, you know, if you divide it out, fifty-one percent, whatever it is. But is that really what it's worth? Could it actually be sort of paying off fifty-fifty, and we haven't seen it yet? Could it even be paying off fifty-five percent and it's just we're getting particularly low? Well, this thing called a beta distribution -- it comes out of statistics -- you can feed in the two numbers: The number of wins and the number of losses. And I'm fluffing over some details here. And it gives you back this curve. And the curve doesn't say what the worth of the choice is, but it says the distribution of it. So, right now, we've got this green curve, which corresponds to the fact that we got seventeen wins and thirteen losses, which is actually a little better than the red curve at plus sixteen and minus fifteen. But both the curves are very broad, meaning we're not really sure how much this thing is worth at all. I mean, the green guy, he has some height all the way out at point-seven, and all the way down to maybe point-three, or something like that.

So, the white guy's a little demo rod, and he's completely flat. So if we've never tried him at all, it's completely flat. But as we try it and start to get wins, the curve gets sharper and sharper toward the wins. So if we had eleven wins and zero losses, we'd start to think "Wow, it's pretty likely that this thing pays off all the time!" But then the instant we get a loss, well, we know it can't be paying off all the time, and so on. So the beta distribution, you feed in the wins and losses, and it gives you this estimate of how much.. the distribution of possible values of the choice. And, so now, the nice thing about this is that it's better than just saying the average number of times it won. Because, you know, if we had three wins and three losses, that's fifty-fifty. But if we had three hundred wins and three hundred losses, that's also fifty-fifty, but we know more about it when we have three hundred wins and losses than when we had three. So you see, as we try to, like, go from zero-zero, it's completely flat; as we increase the number of wins and losses, if I can go straight here, it stays at, sort of, fifty-fifty, but the distribution gets narrower and narrower, 'cause we got more and more data saying "It's really likely that it's fifty-fifty."

So, what this guy is doing, is he's got those two curves -- let's get rid of the white guy -- he's got the two curves for the data he's seen, and every time he makes a choice, the one curve that gets more data gets updated. And what he does is he throws a random number weighted by the green curve, and then throws a random number weighted by the red curve, and whichever one comes out higher, he picks that. OK, so now at the moment the red and green curves are sort of overlapping quite a bit, which means it's going to pick one, pick the other, and so on. But as we get more data for these things, the distributions start to get more narrow, and, if they don't overlap.. if they overlap less, then it's going to pick one over the other.

Let's let this go a little faster. All right. Something like that. OK. So, green is paying off fifty-four percent, something like that; red's paying off fifty percent.. I can tell you now that I programmed this so that one of red and green pays off forty-eight percent of the time; the other one pays off fifty-two percent of the time. But I don't even know which is which; the program scrambles them randomly when it starts.
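Here is a minimal sketch, in Python, of the choice rule just described, commonly known as Thompson sampling: keep win and loss counts for each option, treat Beta(wins+1, losses+1) as the distribution of what that option might be worth, draw one random sample from each curve, and go with whichever sample comes out higher. The 48% and 52% payoff rates are the ones mentioned in the talk; the variable names and the number of trials are illustrative, not the demo's actual code.

    import random

    true_win_rates = [0.48, 0.52]   # red and green; the demo scrambles which is which
    wins = [0, 0]                   # good cookies collected per choice
    losses = [0, 0]                 # shocks collected per choice
    total_payoff = 0

    for _ in range(1000):
        # Sample a plausible worth for each choice from Beta(wins+1, losses+1),
        # then pick whichever sample comes out higher.
        samples = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(2)]
        choice = samples.index(max(samples))
        if random.random() < true_win_rates[choice]:   # the monster pays off +1 or -1
            wins[choice] += 1
            total_payoff += 1
        else:
            losses[choice] += 1
            total_payoff -= 1

    print(wins, losses, total_payoff)

Early on, the two Beta curves overlap a lot, so both options keep getting tried; as the counts grow, the curves narrow and the plays shift more and more toward whichever option looks better, while the other one still gets explored occasionally.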
So, I mean, if I had.. no, I don't know.. I was going to say it was probably green, but now they're very similar, and, in fact, they're both paying off more than their actual payoff, just 'cause of random variations -- just 'cause of luck. Actually, speed this up a little more; I just figured this out the other day: You can get two of them going at once.. See if I can get it going.. There we go. OK. So now he's throwing the next one before he even got the cookie from the previous one -- which means he's making his decisions about what to do next based on the throw from two ago, but, you know, we're trying to get lots of data, so who cares?

And now it's looking like the green is significantly better, so the learning algorithm is choosing to exploit, to dump it into the green guy. Still trying the red one every so often, because it might be wrong. Still exploring a little bit, but shifting more and more towards exploitation. This number in the middle here is its total payoff over all choices, red and green put together. So, it's up thirty-three, thirty-one, whatever it is.

This kind of problem is one where you've got to do something out in the world, and you have two goals. One is to get knowledge about what's happening in the world; that's why you explore. And the other one is to actually make something good for yourself; that's the exploit. It's a fundamental problem of learning algorithms. And this kind of algorithm, this kind of learning problem -- this is really not called the "Alien Robot Chili Monster Problem"; it's more often called the "Two-armed Bandit Problem" or the "Multi-armed Bandit Problem", where each action you've got is like a different slot machine; you're trying to figure out which is the best slot machine -- if they're any different.

And this problem comes up all over the internet. It comes up with advertising. Now, the actions are what ads should you put on a web page, say. And the Chili Monster is.. you. It's trying to get you to click on one of those ads -- that's a plus one. If you don't click on an ad, that's minus one for all of them. So, the system behind the ad system -- or, like in the video website, when you're watching videos, at the end of the video you get this little grid of other ones to watch? Same thing. The choice of which ones to put up there is based on this type of algorithm. It's trying to show the ones that, on the one hand, you'll pick and you'll watch, but on the other hand it's also trying to get knowledge about whether this particular video is one that people like.

Alright. So.. Still not completely clear.. We can let this go a little bit more and then come back when we do the next demo. Alright. So, exploration versus exploitation. Even when we're in other problem.. other versions of learning, that's still behind the scenes. I mean, in supervised learning, the teacher is just telling you do this, do this, do this. It's not so obvious how there's the explore/exploit tradeoff, but once you learn something you're still going to have to go out and do it in the world, and then that comes back again. OK. So, you know, if the video site is showing me a set of videos that I might want to watch.. it's like I'm going to make one choice and then I'm gone. It's not like I'm the Chili Monster sitting there getting treated over and over and over again. So in fact, really, I'm just a teeny little piece of the Chili Monster, and all the other people who are looking at videos and getting presented with choice screens, and so forth, are all more parts of the Chili Monster. 
Which raises the question: You know, I don't like the same videos that somebody else does, and if they show me something that someone else likes, it's not necessarily going to work that well for me. I mean, they might be able to pick the videos that everybody likes. You know, Gangnam Azalea, whatever it is. But you could do better if you could recognize that I'm similar to other, you know, geeky, computer-y people, whatever I am. And somebody else is similar to that too. So what we can try to do is, we want to do a categorization task, and based on some knowledge that we have, we want to make a distinction between one versus the other. And, this ultimately leads into this question of knowledge representation: How do we divide up the things that we see in the world into categories so that we can make predictions about them? Things in this category are going to tend to like these videos; things in this category tend to like those videos.

OK? So, let's do an example of that. Well, let's see how the learning guy is doing. They're still both extremely close together, and, as a result -- and this is good -- as a result it's still trying both of them. It looks like red's got a little bit of the edge here. Fifty-two percent, and green is falling away a little bit, although.. Did we lose our, did we lose our double.. We lost our double clocking there, so we weren't going as well. Well, I guess we're not going to know this story for sure, exactly which one ends up better. Maybe if there's time we'll come back to it at the end; dunno. But we're going to have to stop him now.

And, alright. Let's look at this. One of the earliest learning algorithms for making distinctions -- supervised learning now. We get input; the teacher's going to tell us this is this, this is that, and we have to learn it. In this case, we're going to learn pictures. OK? I went out and found on the web a bunch of news photographs that were used for face recognition research, and took a dozen pictures of former President Bush, and I looked in my camera and I got a dozen pictures of my cats, so we're going to try to learn between former president and cat. Alright?

And the way it works is this: Up here we've got our input. This is two hundred fifty by two hundred fifty individual pixels that make up a black-and-white picture. So they can be zero, which is black. They can be one, which is white. Or anywhere in between, which is gray. Down here at the bottom we have another whole set of two hundred fifty by two hundred fifty -- sixty-two thousand something -- weights. Numbers that represent what we've learned. And if we start this thing up.. Alright, so first.. the first thing we do is clear out the weights. So, this level of gray is a weight value of zero. And if the weight value goes negative, it's going to be darker than that; if the weight goes positive it's going to be brighter than that. So the way this machine works is we put in a picture. Alright; there's a picture: Former President Bush. And in order to evaluate this we take each pixel, in turn, two hundred fifty by two hundred fifty, multiply it by the corresponding weight, and add them all up. And if it's greater than zero, we say the category is 'President', and if it's less than or equal to zero we say the category is 'cat'. OK. Now at the moment, all the weights are zero, so no matter what picture it is, we're going to get something times zero plus something times zero plus something times zero -- the whole result is going to be zero. So the first prediction is.. 
the sum is less than or equal to zero: In this case it's equal to zero.. so what the perceptron says is: Cat. And the supervisor says: Wrong. OK. Now the perceptron learning algorithm, what it does, is say, OK, you're saying this thing should be greater than zero. It should have a sum greater than zero. So it goes through each of its weights, and all the inputs that are high, it makes the weight more positive; all the inputs that are low, it makes the weight more negative. As a result, the whole sum goes up. And, I didn't even really expect this when I first implemented this, but, alright.. Now we're going to modify our brains to make this thing look.. produce a more positive sum. And actually it makes kind of a shadow of the input that went in, because the brighter spots get more positive weights, the darker spots get more negative weights.

Alright. Now here's a cat. But since we just changed the weights to make everything positive, it comes out with a sum greater than zero, so, again, it makes the wrong guess. Now we do the same learning, except now we want to make the sum smaller -- less than or equal to zero -- so it's like we memorize a negative of the picture, in order to cancel it out. Like that. So now we've got sort of half of one picture in positive; half the other picture in negative. And so on. So we can go through.. We got a dozen of one category, a dozen of the other category, and we're just going through them in a random order. Whenever we get the answer right -- OK, cat -- we don't change the weights. We don't have to learn when we didn't make a mistake. But when we do make a mistake, then we update the weights again. OK? We go all the way through all twenty-four cases, and we see how well we did. If we got them all right we're done; otherwise we shuffle them again and go back to the beginning. So let's speed this up.

Alright. So, it's getting some right, getting some wrong. Alright, so now we're at the end of the pass, and we got half right. Same thing we would have gotten by flipping a coin, or always saying 'Cat', but let's see what happens in the second pass. Alright. Yes. Well, no, still making some mistakes. Nope. Yes. Yes. Hey, yeah. Uh-uh. Yes. Yes. Starting to learn. OK? So if we let this go a few passes -- and because the actual pattern of the weights depends on the order of presentation, because we only learn when we make a mistake, and if we just went in one direction so that it gets several more of them right, then we don't do any learning -- it can take a different number of passes before we actually categorize all the inputs correctly. And there's -- oh, we got it. OK? So, is that cool? We learned all these things.

Now, one of the things that got people excited about the perceptron, way back when in the '50s when it was first studied, is that there was a proof that if there was any way for it to learn something, it would learn it; so in this case it learned it, no problem. But we would like more than having it just memorize a bunch of pictures that we showed it. We want it to learn the concept of former President versus cat. So that we could feed in new pictures.. Maybe pictures that the supervisor hadn't even categorized, and get useful answers out of this thing. So that's what we call how well the algorithm generalizes. So not just learning; not just getting it right -- you know, "teaching to the test" -- but then being able to apply knowledge to other cases. 
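Here is a minimal sketch, in Python, of the perceptron just described, assuming the inputs are flattened 250-by-250 grayscale pictures with pixel values between zero and one, and labels of +1 for 'President' and -1 for 'cat'. It uses the textbook mistake-driven perceptron update (move the weights toward the input when the sum should have been bigger, away from it when it should have been smaller); the demo as described also pushes weights negative over dark pixels, which would correspond to centering the pixel values first, but the idea is the same. The names here are illustrative, not the demo's actual code.

    import random

    N = 250 * 250                       # one weight per pixel

    def predict(weights, pixels):
        # Multiply each pixel by its corresponding weight and add them all up;
        # a sum greater than zero means 'President' (+1), otherwise 'cat' (-1).
        total = sum(w * x for w, x in zip(weights, pixels))
        return 1 if total > 0 else -1

    def train(examples, max_passes=100):
        # examples is a list of (pixels, label) pairs; the weights start cleared to zero.
        weights = [0.0] * N
        for _ in range(max_passes):
            random.shuffle(examples)    # present the cases in a random order
            mistakes = 0
            for pixels, label in examples:
                if predict(weights, pixels) != label:
                    mistakes += 1
                    # Only learn on a mistake: nudge each weight toward the input
                    # when the sum should have been bigger (label +1), and away
                    # from it when it should have been smaller (label -1).
                    for i in range(N):
                        weights[i] += label * pixels[i]
            if mistakes == 0:           # a whole pass with no errors: done
                break
        return weights

The convergence proof mentioned above says that if any set of weights can separate the training pictures, this loop will eventually stop making mistakes on them; it says nothing about how well those weights will do on pictures the perceptron has never seen.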
So I've got a couple other inputs here that the perceptron never saw. So, a generalization test. So, here's a different picture of the cats, that the perceptron never saw. What do you think? Is it going to get it right? No. Got it wrong. Another picture of a cat? Got it right. Fifty-fifty. Another picture of the former President; never saw? No, sorry. Not so good. OK? The perceptron, as we've used it here, is not very good at generalization. Because really, if we look at these weights, you know, it's learning extremely specific little details about the pictures that we happened to show it, and there's really no reason to expect that it would actually apply to other pictures. I mean, even if we took one of the same pictures we trained it on, and just sort of moved it a little bit, it would line up with different weights and probably get a different answer.

So, the knowledge representation problem is: How do you actually build algorithms that, like the perceptron, are making categorizations, but make better categorizations -- ones that are more like the way we do it? And one thing we do; one of the algorithms that was mentioned at the beginning -- deep belief networks -- it's got a whole bunch of perceptrons, but instead of going straight from the input to the output, it's got layers of them, trying to get more and more general, abstract things. And, a lot of this stuff is working pretty well now.

As we go forward, one of the things driving the use of machine learning now is the combination of two things: Computers are much more powerful than they used to be, number one, and because everything is connected to the internet, in particular, there's tons and tons of data to learn from. There's lots of pictures; there's people typing stuff; people speaking, and doing speech recognition; there's tons of data, and just like if we had many, many more pictures that we were trying to make a distinction between, we'd have a better chance that our learning algorithm would sort out what's really important and what isn't. Assuming our knowledge representation, and the way we built the learning algorithm, is even capable of representing the concepts that we want it to learn.

In the general case, the explore/exploit tradeoff does not go away. The theoretical framework it gets studied in is called 'minimizing regret'. In the learning algorithm world, the idea of 'no regrets' really isn't possible. 'Regret' means: If you could predict everything perfectly, you'd get some amount of payoff, and whatever you get less than that because you have to explore is your regret. And finally, knowledge representation goes to the core of what learning needs to produce in order to be successful -- and also, how we even see things, how we understand the world, as a result of our experiences. And, you know, this perceptron, this stupid little perceptron: One thing that it's really bad at is recognizing what it doesn't know. It's totally happy to make a judgment about any possible input you give it. So for example, if we give it something that's not a cat or a President, like: This is my dad, at Sky City years ago. What does it say that is? It says my dad's a cat. And here's a picture of a dentist light that looks like a robot, so I took a picture of it: There you go.
Info
Channel: Dave Ackley
Views: 10,507
Rating: 4.9759035 out of 5
Keywords: Machine Learning (Software Genre), Computer Science (Field Of Study), supervised learning, unsupervised learning, reinforcement learning, two-armed bandit, multi-armed bandit
Id: OQsn1c92pdY
Length: 26min 22sec (1582 seconds)
Published: Thu Aug 14 2014