[MUSIC PLAYING] NICHOLAS THOMPSON: Hello,
I'm Nicholas Thompson. I'm the editor in
chief of "Wired." It is my honor today
to get the chance to interview Geoffrey Hinton. There are a couple--
well, there are many things I love about him. But two that I'll just
mention in the introduction. The first is that he persisted. He had an idea that
he really believed in that everybody
else said was bad. And he just kept at it. And it gives a lot of faith to
everybody who has bad ideas, myself included. Then the second, as someone
who spends half his life as a manager
adjudicating job titles, I was looking at his job
title before the introduction. And he has the most
non-pretentious job title in history. So please welcome Geoffrey
Hinton, the engineering fellow at Google. [APPLAUSE] Welcome. GEOFFREY HINTON: Thank you. NICHOLAS THOMPSON: So
nice to be here with you. All right, so let us start. 20 years ago, when you wrote some
of your early, very influential papers, everybody started
to say, it's a smart idea, but we're not actually going
to be able to design computers this way. Explain why you persisted, why
you were so confident that you had found something important. GEOFFREY HINTON: So actually
it was 40 years ago. And it seemed to me there's no
other way the brain could work. It has to work by learning
the strengths of connections. And if you want to make a
device do something intelligent, you've got two options. You can program it,
or it can learn. And we certainly
weren't programmed. So we had to learn. So this had to be
the right way to go. NICHOLAS THOMPSON:
So explain, though-- well, let's do this. Explain what neural
networks are. Most of the people here
will be quite familiar. But explain the
original insight and how it developed in your mind. GEOFFREY HINTON: So you have
relatively simple processing elements that are very
loosely models of neurons. They have connections coming in. Each connection
has a weight on it. That weight can be
changed to do learning. And what a neuron does
is take the activities on the connections times the
weights, adds them all up, and then decides whether
to send an output. And if it gets a big enough
sum, it sends an output. If the sum is negative,
it doesn't send anything. That's about it. And all you have to
do is just wire up a gazillion of those with
a gazillion squared weights and just figure out how
to change the weights, and it'll do anything. It's just a question of
how you change the weights.
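A minimal sketch of that unit in Python; the function name, the zero threshold, and returning 0.0 for "nothing" are illustrative assumptions, not details from the talk:

    def unit_output(incoming_activities, weights, threshold=0.0):
        # Multiply each incoming activity by the weight on its connection
        # and add them all up.
        total = sum(a * w for a, w in zip(incoming_activities, weights))
        # If the sum is big enough, send an output; otherwise send nothing.
        return total if total > threshold else 0.0

Learning is then entirely a matter of how those weights get changed.

NICHOLAS THOMPSON: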
So when did you come to understand that this was
an approximate representation of how the brain works? GEOFFREY HINTON: Oh, it was
always designed as that. NICHOLAS THOMPSON: Right. GEOFFREY HINTON: It was designed
to be like how the brain works. NICHOLAS THOMPSON: But
let me ask you this. So at some point
in your career, you start to understand
how the brain works. Maybe it was when you were 12. Maybe it was when you were 25. When do you make the
decision that you will try to model
computers after the brain? GEOFFREY HINTON:
Sort of right away. That was the whole point of it. The whole idea was to have
a learning device that learned like the
brain, the way people think the brain learns, by
changing connection strengths. And this wasn't my idea. Turing had the same idea. Turing, even though
he invented a lot of the basis of standard
computer science, he believed that the brain
was this unorganized device with random weights. And it would use
reinforcement learning to change the connections. And it would learn
everything, and he thought that was the best
route to intelligence. NICHOLAS THOMPSON: And so you
were following Turing's idea that the best way to make
a machine is to model it after the human brain. This is how a human brain works. So let's make a
machine like that. GEOFFREY HINTON: Yeah, it
wasn't just Turing's idea. Lots of people thought
that back then. NICHOLAS THOMPSON: All
right, so you have this idea. Lots of people have this idea. You get a lot of credit. In the late '80s,
you start to come to fame with your published
work, is that correct? GEOFFREY HINTON: Yes. NICHOLAS THOMPSON: When
is the darkest moment? When is the moment
where other people who had been working on it, who agreed
with this idea from Turing start to back away and yet
you continue to plunge ahead? GEOFFREY HINTON:
There were always a bunch of people who kept
believing in it, particularly in psychology. But among computer
scientists, I guess in the '90s, what happened was
data sets were quite small. And computers weren't that fast. And on small data sets,
other methods like things called support vector machines,
worked a little bit better. They didn't get confused
by noise so much. And so that was very depressing
because we developed back propagation in the '80s. We thought it would
solve everything. And we were a bit puzzled about
why it didn't solve everything. And it was just a
question of scale. But we didn't really
know that then. NICHOLAS THOMPSON:
And so why did you think it was not working? GEOFFREY HINTON: We
thought it was not working because we didn't have
quite the right algorithms. We didn't have quite the
right objective functions. I thought for a long
time it's because we were trying to do
supervised learning where you have to label data. And we should have been doing
unsupervised learning, where you just learn from the
data with no labels. It turned out it was
mainly a question of scale. NICHOLAS THOMPSON: Oh,
that's interesting. So the problem was you
didn't have enough data. You thought you had the
right amount of data, but you hadn't
labeled it correctly. So you just misidentified
the problem? GEOFFREY HINTON: I thought
that using labels at all was a mistake. You would do most
of your learning without making any
use of labels just by trying to model the
structure in the data. I actually still believe that. I think as computers get faster,
for any given size data set, if you make computers
fast enough, you're better off doing
unsupervised learning. And once you've done the
unsupervised learning, you'll be able to learn
from fewer labels. NICHOLAS THOMPSON:
So in the 1990s, you're continuing
with your research. You're in academia. You are still publishing, but
it's not coming to acclaim. You aren't solving big problems. When do you start-- well, actually, was
there ever a moment where you said, you know
what, enough of this. I'm going to go
try something else? GEOFFREY HINTON: Not really. NICHOLAS THOMPSON: Not that
I'm going to go sell burgers, but I'm going to figure out a
different way of doing this. You just said we're going
to keep doing deep learning. GEOFFREY HINTON: Yes, something
like this has to work. I mean, the connections in the
brain are learning somehow. And we just have
to figure it out. And probably there's a bunch
of different ways of learning connection strengths. The brain's using one of them. There may be other
ways of doing it. But certainly, you have to
have something that can learn these connection strengths. And I never doubted that. NICHOLAS THOMPSON: OK,
so you never doubt it. When does it first start
to seem like it's working? OK, you know, we've got this. I believe in this idea, and
actually, if you look at that, if you squint, you
can see it's working. When did that happen? GEOFFREY HINTON: OK, so one
of the big disappointments in the '80s was if
you made networks with lots of hidden layers,
you couldn't train them. That's not quite true because
convolutional networks designed by Yann LeCun, you could
train for fairly simple tasks like recognizing handwriting. But most of the deep nets, we
didn't know how to train them. And in about 2005,
I came up with a way of doing unsupervised
training of deep nets. So you take your
input, say your pixels, and you'd learn a bunch
of feature detectors that were just good at
explaining why the pixels were behaving like that. And then you treat those
feature detectors as the data and then you learn another
bunch of feature detectors. So they've got to explain why
those feature detectors have those correlations. And you keep learning
layers and layers. And what was interesting
was you could do some math and prove that each time
you learned another layer, you didn't necessarily have
a better model of the data, but you had a bound on
how good your model was. And you could get a
better bound each time you added another layer. NICHOLAS THOMPSON: What do you
mean you had a bound on how good your model was? GEOFFREY HINTON: OK, so
once you got a model, you can say how surprising
does a model find this data? You showed some
data and you say, is that the kind of thing
you believe in or is that surprising? And you can sort of measure
something that says that. And what you'd like
to do is have a model, a good model is one that looks
at the data and says yeah, I knew that. It's unsurprising, OK? And it's often very
hard to compute exactly how surprising
this model finds the data. But you can compute
a bound on that. You can say this model finds the
data less surprising than this. And you could show that, as
you add extra layers of feature detectors, you get a model. And each time you
add a layer, the bound
on how surprising it finds the data gets better.
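A rough sketch of that layer-by-layer recipe; fit_one_layer and the activities method are assumed placeholders, not routines Hinton names:

    # Greedy, layer-by-layer unsupervised learning, roughly as described:
    # learn feature detectors for the pixels, then treat those detectors'
    # activities as the data for the next layer, and so on.
    def pretrain_stack(data, layer_sizes, fit_one_layer):
        layers = []
        for size in layer_sizes:
            layer = fit_one_layer(data, size)   # learn feature detectors for this "data"
            layers.append(layer)
            data = layer.activities(data)       # their activities become the next layer's data
        return layers

Each extra layer trained this way is what gives the improved bound he describes on how surprising the model finds the data.

NICHOLAS THOMPSON: Oh, I see. OK, so that makes sense. So you're making observations,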
and they're not correct. But you know they're closer
and closer to being correct. I'm looking at the audience. I'm making some generalization. It's not correct, but I'm
getting better and better at it, roughly? GEOFFREY HINTON: Roughly. NICHOLAS THOMPSON: OK,
so that's about 2005 where you come up with that
mathematical breakthrough? GEOFFREY HINTON: Yeah. NICHOLAS THOMPSON: When do
you start getting answers that are correct and what
data are you working on? Is this speech data, where you
first have your breakthrough? GEOFFREY HINTON: This was
just handwritten digits. Very simple data. Then around the same time,
they started developing GPUs. And the people doing
neural networks started using GPUs
in about 2007. I had one very
good student called Vlad Mnih, who started
using GPUs for finding roads in aerial images. He wrote some code that was
then used by other students for using GPUs to recognize
phonemes in speech. And so they were using
this idea of pre-training. And after they'd done
all this pre-training, then they'd just stick labels
on top and use back propagation. And it turned out that way,
you could have a very deep net that was pre-trained this way. And you could then
use back propagation, and it actually worked. And it sort of
beat the benchmarks for speech recognition
initially just by a little bit. NICHOLAS THOMPSON: It beat the
best commercially available speech recognition. It beat the best academic
work on speech recognition. GEOFFREY HINTON: On a relatively
small data set called TIMIT, it did slightly better than
the best academic work. There was also work done at IBM. And very quickly people
realized that this stuff, since it was beating
standard models that had taken 30 years to develop,
with a bit more development it would do really well. And so my graduate students
went off to Microsoft and IBM and Google. And Google was the fastest
to turn it into a production speech recognizer. And by 2012, that work
that was first done in 2009 came out in Android. And Android suddenly got much
better in speech recognition. NICHOLAS THOMPSON: So tell me
about that moment where you've had this idea for
40 years, you've been publishing on
it for 20 years, and you're finally better
than your colleagues? What did that feel like? GEOFFREY HINTON:
Well, back then I'd only had the idea for 30 years. NICHOLAS THOMPSON: Correct,
correct, sorry, sir. Just a new idea. It's fresh. GEOFFREY HINTON:
It felt really good that it finally got the state
of the art on a real problem. NICHOLAS THOMPSON:
And do you remember where you were when you first
got the revelatory data? GEOFFREY HINTON: No. NICHOLAS THOMPSON: No, no, OK. All right, so you realize it
works on speech recognition. When do you start applying
it to other problems? GEOFFREY HINTON: So then
we start applying it to all sorts of other problems. So George Dahl, who was
one of the people who did the original work on speech
recognition, applied it to-- I give you a lot of
descriptors of a molecule and you want to predict if that
molecule will bind to something to act as a good drug. And there was a
competition on Kaggle. And he just applied our
standard technology design for speech recognition
to predicting the activity of drugs and
it won the competition. So that was a sign that this
stuff was sort of fairly universal. And then I had a student
called [INAUDIBLE], who said, you know, Geoff,
this stuff is going to work for image recognition. And Fei-Fei Li has created
the correct data set for it, and it's a public competition. We have to do that. And so what we did was take an
approach originally developed by Yann LeCun. A student called Alex
Krizhevsky was a real wizard. He could make GPUs do anything. Programmed the GPUs
really, really well. And we got results
that were a lot better than standard computer vision. That was 2012. And it was a coincidence I
think of the speech recognition coming out in the Android. So you knew this stuff could
solve production problems. And on vision in 2012,
it had done much better than the standard
computer vision. NICHOLAS THOMPSON: So those are
three areas where it succeeded. So modeling chemicals, speech,
voice, where was it failing? GEOFFREY HINTON: The failure is
only temporary, you understand. [LAUGHTER] NICHOLAS THOMPSON:
Where was it failing? GEOFFREY HINTON: For things
like machine translation, I thought it would be a very
long time before we could do that because
machine translation, you've got a string
of symbols comes in and a string of
symbols goes out. And it's fairly plausible to say
in between you do manipulations on strings of symbols, which
is what classical AI is. Actually, it doesn't
work like that. Strings of symbols come in. You turn those into great
big vectors in your brain. These vectors interact
with each other. And then you convert it back
into strings of symbols to go out. And if you told me in 2012
that in the next five years, we'll be able to translate
between many languages using just the same technology,
recurrent nets, but just the stochastic
gradient descent from random initial weights,
I wouldn't have believed you. It happened much
faster than expected. NICHOLAS THOMPSON: But
so what distinguishes the areas where it
works the most quickly and the areas where it
will take more time? It seems like the visual
processing, speech recognition, sort of core human
things that we do with our sensory perception
seem to be the first barriers to clear. Is that correct? GEOFFREY HINTON: Yes and no
because there's other things we do like motor control. We're very good
at motor control. Our brains are clearly
designed for that. And that's only just now
neural nets are beginning to compete with the best
other technologies there. They will win in the end. But they're only
just winning now. I think things like
reasoning, abstract reasoning, they're the kind of last
things we learn to do. And I think they'll be
among the last things these neural nets learn to do. NICHOLAS THOMPSON:
And so you keep saying that neural nets will
win at everything eventually. GEOFFREY HINTON: Well, we
are neural nets, right? Anything we can do they can do. NICHOLAS THOMPSON:
Right, but just because the human brain is not
necessarily the most efficient computational
machine ever created. GEOFFREY HINTON:
Almost certainly not. NICHOLAS THOMPSON: So
why could there not be-- certainly not my human brain. Couldn't there be a way
of modeling machines that is more efficient
than the human brain? GEOFFREY HINTON:
Philosophically, I have no objection to the idea
there could be some completely different way to do all this. It could be that if
you start with logic and you're trying
to automate logic, and you make some really
fancy theorem prover, and you do reasoning,
and then you decide you're going to do visual
perception by doing reasoning, it could be that that
approach would win. It turned out it didn't. But I have no philosophical
objection to that winning. It's just we know
that brains can do it. NICHOLAS THOMPSON: Right,
but there are also things that our brains can't do well. Are those things
that neural nets also won't be able to do well? GEOFFREY HINTON:
Quite possibly, yes. NICHOLAS THOMPSON:
And then there's a separate problem, which
is we don't know entirely how these things work, right? GEOFFREY HINTON: No, we really
don't know how they work. NICHOLAS THOMPSON: We don't
understand how top down neural networks work. There is even a
core element of how neural networks work that
we don't understand, right? GEOFFREY HINTON: Yes. NICHOLAS THOMPSON:
So explain that and then let me ask
the obvious follow up, which is, we don't know
how these things work. How can those things work? GEOFFREY HINTON: OK, you ask
that when I finish explaining. NICHOLAS THOMPSON: Yes. GEOFFREY HINTON: So if you
look at current computer vision systems, most of them, they're
basically feed forward. They don't use
feedback connections. There's something else about
current computer vision systems, which is they're
very prone to adversarial examples. You can change a
few pixels slightly and something that was
a picture of a panda and still looks exactly
like a panda to you, it suddenly says
that's an ostrich. Obviously, the way you
change the pixels is cleverly designed to fool it into
thinking it's an ostrich. But the point is it still
looks just like a panda to you. And initially, we thought
these things work really well. But then when
confronted with the fact that they look at a panda and
be confident it's an ostrich, you get a bit worried. And I think part of
the problem there is that they're not trying to
reconstruct from the high level representations. They're trying to do descriptive
learning where you just learn layers of
feature detectors and the whole, whole objective
is just to change the weights. So you get better at
getting the right answer. They're not doing things
like at each level of feature detectors, check that
you can reconstruct the data in the layer below from
the activities of these feature detectors. And recently in Toronto,
we've been discovering, or Nick Frost's
been discovering, that if you introduce
reconstruction then it helps you be more resistant
to adversarial attack. So I think in
human vision, to do the learning we do in
reconstruction and also because we're doing
a lot of learning by doing reconstructions,
we are much more resistant to adversarial attack. NICHOLAS THOMPSON:
But you believe that top down communication
in a neural network is how you test,
how you reconstruct, how you test and make sure
it's a panda not an ostrich? GEOFFREY HINTON: I think
that's crucial, yes. Because I think if you-- NICHOLAS THOMPSON:
But brain scientists are not entirely agreed
on that, correct? GEOFFREY HINTON:
Brain scientists all agreed on the
idea that if you have two areas of the cortex
in a perceptual pathway, if there's connections
from one to the other, they'll always be backwards
connections, not necessarily point to point. But there will always
be a backwards pathway. They're not agreed
on what it's for. It could be for attention. It could be for learning, or
it could be for reconstruction, or it could be for all three. NICHOLAS THOMPSON: And
so we don't know what the backwards communication is. You are building your new neural
networks on the assumption that-- or you're building
backwards communication that is for reconstruction
into your neural networks even though we're not sure
that's how the brain works. GEOFFREY HINTON: Yes. NICHOLAS THOMPSON:
Isn't that cheating? GEOFFREY HINTON: Not at all NICHOLAS THOMPSON:
If you're trying to make it like
the brain, you're doing something we're not
sure is like the brain. GEOFFREY HINTON: Not at all. NICHOLAS THOMPSON: OK. GEOFFREY HINTON: There's two-- I'm not doing computational
neuroscience. That is, I'm not trying to make
a model of how the brain works. I'm looking at the brain
and saying this thing works. And if we want to make
something else that works, we should sort of look
to it for inspiration. So this is neuro inspired,
not a neural model. So the neurons we
use, they're inspired by the fact neurons have
a lot of connections and they change the strengths. NICHOLAS THOMPSON:
That's interesting. So if I were in
computer science and I was working on neural
networks, and I wanted to beat Geoff
Hinton, one thing I could do is I could build in
top down communication and base it on other
models of brain science. So based on learning,
not on reconstructing. GEOFFREY HINTON: If they were
better models, then you'd win, yeah. NICHOLAS THOMPSON: That's
very, very interesting. All right, so let's move
to a more general topic. So neural networks will be able
to solve all kinds of problems. Are there any mysteries of the
human brain that will not be captured by neural
networks or cannot? For example, could the emotion-- GEOFFREY HINTON: No. NICHOLAS THOMPSON: No. So love could be reconstructed
by a neural network? Consciousness can
be constructed? GEOFFREY HINTON: Absolutely,
once you've figured out what those things mean-- we are neural networks, right? Now consciousness is something
I'm particularly interested in. I get by fine without it. But um-- [LAUGHTER] So people don't really
know what they mean by it. There's all sorts of
different definitions. And I think it's a
pre-scientific term. So 100 years ago, if you
ask people what is life? They would have said, well,
living things have vital force. And when they die, the
vital force goes away. And that's the difference
between being alive and being dead, whether you got
vital force or not. And now we don't
think that sort of-- we don't have vital force. We just think it's a
pre-scientific concept. And once you understand
some biochemistry and molecular biology, you
don't need vital force anymore. You understand how
it actually works. And I think it's going to be
the same with consciousness. I think consciousness
is an attempt to explain mental phenomena with
some kind of special essence. And this special essence,
you don't need it. Once you can really
explain it, then you'll explain how we do
the things that make people think we're conscious. And you'll explain all
these different meanings of consciousness without
having some special essence as consciousness. NICHOLAS THOMPSON: Right,
so there's no emotion that couldn't be created. There's no thought that
couldn't be created. There's nothing
that a human mind can do that couldn't
theoretically be recreated by a fully
functioning neural network once we truly understand
how the brain works. GEOFFREY HINTON: There's
something in a John Lennon song that sounds very like
what you just said. [LAUGHTER] NICHOLAS THOMPSON: And you're
100% confident of this? GEOFFREY HINTON:
No, I'm a Bayesian. So I'm 99.9% confident. NICHOLAS THOMPSON: OK,
and what is the point one? GEOFFREY HINTON: Well, we
might, for example, all be part of a big simulation. NICHOLAS THOMPSON:
True, fair enough, OK. [LAUGHTER] [APPLAUSE] That actually makes me think
it's more likely that we are. All right, so what are
we learning as we do this and as we study the brain
to improve computers? How does it work in reverse? What are we learning
about the brain from our work in computers? GEOFFREY HINTON: So I think what
we've learned in the last 10 years is that if you take
a system with billions of parameters, and you'd
use stochastic gradient descent in some
objective function, and the objective function
might be to get the right labels or it might be to fill in
the gap in a string of words, or any objective function,
it works much better than it has any right to. It works much better
than you would expect. You would have thought, and
most people in conventional AI thought, take a system
with a billion parameters, start them off
with random values, measure the gradient of
the objective function. That is, for each
parameter figure out how the objective function
would change if you change that parameter a little bit. And then change it in that
direction that improves the objective function. You would have
thought that would be a kind of hopeless
algorithm that will get stuck. And it turns out, it's
a really good algorithm. And the bigger you scale
things, the better it works. And that's just an
empirical discovery really. There's some theory
coming along, but it's basically an
empirical discovery. Now because we've
discovered that, it makes it far more
plausible that the brain is computing the gradient
of some objective function and updating the weights
of strength of synapses to follow that gradient. We just have to figure out
how it gets the gradient and what the
objective function is.
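A minimal sketch of that recipe; gradient stands in for whatever returns the objective's gradient on one batch of data, and the objective is treated here as a loss to be minimized:

    # Start from random parameter values and repeatedly nudge each parameter
    # a little in the direction that improves the objective.
    def stochastic_gradient_descent(parameters, batches, gradient, learning_rate=0.01):
        for batch in batches:
            grads = gradient(parameters, batch)
            parameters = [p - learning_rate * g for p, g in zip(parameters, grads)]
        return parameters

NICHOLAS THOMPSON: But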
we didn't understand that about the brain. We didn't understand the
re-weighted [INAUDIBLE].. GEOFFREY HINTON:
It was a theory. It was-- I mean,
a long time ago, people thought
that's a possibility. But in the background,
there was always sort of conventional computer
scientists saying, yeah, but this idea of
everything's random, you just learn it all
by gradient descent, that's never going to work
for a billion parameters. You have to wire in
a lot of knowledge. NICHOLAS THOMPSON:
All right, so-- GEOFFREY HINTON: And we
know now that's wrong. You can just put in
random parameters and learn everything. NICHOLAS THOMPSON: So
let's expand this out. So as we learn more and
more, we will presumably continue to learn more and
more about how the human brain functions as we run these
massive tests on models based on how we
think it functions. Once we understand
it better, is there a point where we
can, essentially, rewire our brains to
be more like the most efficient machines or
change the way we think? GEOFFREY HINTON:
You'd have thought-- NICHOLAS THOMPSON: If it's a
simulation that should be easy, but not in a simulation. GEOFFREY HINTON:
You'd have thought that if we really
understand what's going on, we should be able to make things
like education work better, and I think we will. NICHOLAS THOMPSON: We will? GEOFFREY HINTON: Yeah. It would be very odd
if you could finally understand what's
going on in your brain and how it learns and not be
able to adapt the environment so you can learn better. NICHOLAS THOMPSON:
Well, OK, I don't want to go too far into the future. But a couple of
years from now, how do you think we will be
using what we've learned about the brain and about how
deep learning works to change how education functions? How would you change a class? GEOFFREY HINTON: In
a couple of years, I'm not sure we'll learn much. I think it's going to
change education. It's just going to take longer. But if you look
at it, Assistants are getting pretty smart now. And once Assistants can really
understand conversations, Assistants can have
conversations with kids and educate them. So already, I think most of
the new knowledge I acquire comes from me
thinking, I wonder, and typing something
to Google and Google tells me, if I could
just have a conversation, I'd acquire knowledge
even better. NICHOLAS THOMPSON:
And so theoretically, as we understand
the brain better, and as we set our children
up in front of Assistants. Mine right now almost certainly
based on the time in New York is yelling at Alexa to play
something on Spotify, probably "Baby Shark"-- you will program the Assistants
to have better conversations with the children based on
how we know they'll learn? GEOFFREY HINTON: Yeah, I haven't
really thought much about this. It's not what I do. But it seems quite
plausible to me. NICHOLAS THOMPSON: Will
we be able to understand how dreams work, one
of the great mysteries? GEOFFREY HINTON: Yes, I'm
really interested in dreams. NICHOLAS THOMPSON: Good,
well, let's talk about that, GEOFFREY HINTON:
I'm so interested. I have at least four
different theories of dreams. NICHOLAS THOMPSON:
Let's hear them all-- 1, 2, 3, 4. GEOFFREY HINTON: So a long time
ago, there were things called-- OK, a long time ago there
was Hopfield networks. And they would learn
memories as local attractors. And Hopfield discovered
that if you try and put too many memories in,
they get confused. They'll take two
local attractors and merge them into an attractor
sort of halfway in between. Then Francis Crick and Graeme
Mitchison came along and said, we can get rid of these false
minima by doing unlearning. So we turn off the input. We put the neural network
into a random state. We let it settle down,
and we say that's bad. Change the connections so you
don't settle to that state. And if you do a bit
of that, it will be able to store more memories. And then Terry Sejnowski and
I came along and said, look, if we have not just the
neurons where you're storing the memories, but
lots of other neurons, too, can we find an
algorithm that will use all these other neurons
to help you store memories? And it turned out in the end,
we came up with the Boltzmann machine learning algorithm. And the Boltzmann machine
learning algorithm had a very interesting property
which is I show you data. That is, I fixed the states
of the observable units. And it sort of rattles
around the other units until it's got a
fairly happy state. And once it's done
that, it increases the strength of all
the connections based on if two units
are both active, it increases connection strength. That's a kind
of Hebbian learning. But if you just do that,
the connection strengths just get bigger and bigger. You also have to have a
phase where you cut it off from the input. You let it rattle
around to settle into a state it's happy with. So now it's having a fantasy. And once it's had
the fantasy you say, take all pairs of
neurons that are active and decrease the strength
of the connection. So I'm explaining the algorithm
to you just as a procedure. But actually that algorithm is
the result of doing some math and saying, how should you
change these connection strengths so that this
neural network with all these hidden units finds
the data unsurprising? And it has to have
this other phase. It has to have this what we
call the negative phase, when it's running with no input. And it's canceling out-- it's unlearning whatever
state it settles into. Now what Crick pointed
out about dreams is that, we know that you dream
for many hours every night. And if I wake you
up at random, you can tell me what you were just
dreaming about because it's in your short term memory. So we know you dream
for many hours. But in the morning,
you wake up, you can remember the
last dream, but you can't remember all the others,
which is lucky because you might mistake them for reality. So why is it that we don't
remember our dreams at all? And Crick's view was it's
the whole point of dreaming is to unlearn those things
so you put the learning rule in reverse. And Terry Sejnowski and I
showed that actually that is a maximum [INAUDIBLE]
learning procedure for Boltzmann machines. So that's one
theory of dreaming.
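A toy sketch of that two-phase Hebbian rule; in the real Boltzmann machine algorithm these correlations are averaged over many settled states, and the names and learning rate here are illustrative:

    import numpy as np

    def boltzmann_update(weights, wake_states, dream_states, learning_rate=0.01):
        # Positive (wake) phase: strengthen connections between units that
        # are active together while the data is clamped on.
        positive = np.outer(wake_states, wake_states)
        # Negative (dream) phase: weaken connections between units that are
        # active together while the network runs with no input.
        negative = np.outer(dream_states, dream_states)
        return weights + learning_rate * (positive - negative)

NICHOLAS THOMPSON: You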
showed that theoretically? GEOFFREY HINTON: Yeah,
we showed theoretically that's the right thing
to do if you want to change the weights so that
your big neural network finds the observed data
less surprising. NICHOLAS THOMPSON: And I want
to go to your other theories, but before we lose
this thread, you've proved that it's efficient. Have you actually set any of
your deep learning algorithms to essentially dream? Right, study this image data
set for a period of time, resort, study again,
resort versus a machine that's running continuously? GEOFFREY HINTON: So yes, we had
machine learning algorithms. Some of the first
algorithms that could learn what to
do with hidden units were Boltzmann machines. They were very inefficient. But then later on, I found a
way of making approximations to them that was efficient. And those were
actually the trigger for getting deep
learning going again. Those were the things that
learned one layer feature detector at a time. And it was efficient form of
restricted Boltzmann machine. And so it was doing
this kind of unlearning. But rather than going
to sleep, that one would just fantasize
for a little bit after each data point. NICHOLAS THOMPSON: So Androids
do dream of electric sheep. So let's go to
theories 2, 3, and 4. GEOFFREY HINTON: OK,
theory 2 was called the wake-sleep algorithm. And you want to learn
a generative model. So you have the idea
that you're going to have a model that can generate data. It has layers of
features detectors. And it activates the high level
ones and the low level ones and so on, until it activates
pixels, and that's an image. You also want to
learn the other way. You want to learn
to recognize data. And so you're going to have an
algorithm that has two phases. In the wake phase,
data comes in. It tries to recognize it. And instead of learning
the connections it is using for
recognition, it's learning the
generative connections. So data comes in. I activate the hidden
units, and then I learn to make those hidden
units be good at reconstructing that data. So it's learning to
reconstruct at every layer. But the question is, how do you
learn the forward connection? So the idea is, if you knew
the forward connections, you could learn the backward
connections because you could learn to reconstruct. NICHOLAS THOMPSON: Yeah. GEOFFREY HINTON: Now
it also turns out that if you knew the
backward connections, you could learn the
forward connections because what you could
do is start at the top and just generate some data. And because you
generated the data, you'd know the states of
all the hidden layers. And so you could learn
the forward connections to recover those states. So that would be
the sleep phase. When you turn off the input,
you just generate data and then you try and
reconstruct the hidden units that generated the data. And so if you know the
top down connections, you'd learn the bottom up ones. If you know the
bottom up ones, you could learn the top down ones. And so what's going
to happen if you start with random connections and
try doing both-- alternate both kinds of learning and it works. Now to make it
work well, you have to do all sorts of
variations of it. But it works.
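A high-level sketch of one wake-sleep step; every method name below is an assumed placeholder for the two kinds of learning he describes, not an implementation from the talk:

    def wake_sleep_step(data, recognition_net, generative_net):
        # Wake phase: recognize real data with the bottom-up connections,
        # then train the top-down (generative) connections to reconstruct
        # each layer from the layer above.
        hidden_states = recognition_net.infer(data)
        generative_net.learn_to_reconstruct(hidden_states, data)
        # Sleep phase: fantasize data with the top-down connections, then
        # train the bottom-up (recognition) connections to recover the
        # hidden states that produced the fantasy.
        fantasy_states, fantasy_data = generative_net.generate()
        recognition_net.learn_to_recover(fantasy_data, fantasy_states)

NICHOLAS THOMPSON: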
All right, that is-- do you want to go through
the other two theories? We only have eight minutes left. I think we should probably jump
through some other questions. We'll deal with-- GEOFFREY HINTON: If you
give me another hour, I could do the
other two theories. [LAUGHTER] NICHOLAS THOMPSON: All
right, well, Google I/O 2020. So let's talk about
what comes next. So where is your
research headed? What problem are you
trying to solve now? GEOFFREY HINTON: The main
thing I'm trying to solve, which I've been doing for
a number of years now-- actually, I'm reminded
of a soccer commentator. You may notice
soccer commentators, they always say things like
they're doing very well, but they always go
wrong on the last pass. And they never seem to sort
of notice there's something funny about that. It's a bit circular. So I'm working--
eventually, you're going to end up working on
something you don't finish. And I think I may
well be working on the thing I never finish. But it's called
capsules, and it's a theory of how you do visual
perception using reconstruction and also how you
root information to the right places. And the two main
motivating factors were in standard neural
nets, the information-- the activity in the layer just
automatically goes somewhere. You don't make decisions
about where to send it. The idea of capsules
was to make decisions about where to send information. Now since I started
working on capsules, some other very smart
people at Google invented transformers, which
are doing the same thing. They're deciding where
to route information, and that's a big win. The other thing that motivated
capsules was coordinate frames. So when humans do
visual, they're always using coordinate frames. And if they impose the wrong
coordinate frame on an object, they don't even
recognize the object. So I'll give you a little task. Imagine a tetrahedron. It's got a triangular base
and three triangular faces, all equilateral triangles. Easy to imagine, right? Now imagine slicing
it with a plane. So you get a square
cross-section. That's not so easy, right? Every time you slice
it, you get a triangle. It's not obvious how
you get a square. It's not at all obvious. OK, but I give you the same
shape described differently. I need your pen. Imagine, the shape you
get, if you take a pen like that, another pen that
right angles like this, and you connect all
points on this pen to all points on this pen. That's a solid tetrahedron. OK, you're seeing it relative
to a different coordinate frame where the edges of
the tetrahedron-- these two line up with
the coordinate frame. And for this, if you think
of the tetrahedron that way, it's pretty obvious
that at the top, you've got a long
rectangle this way. At the bottom, you get a
long rectangle that way. And there's [INAUDIBLE]
that you've got to get a square in the middle. So it's pretty obvious how you
can slice it to get a square. But that's only obviously
if you think of it with that coordinate frame. So it's obvious that for
humans, coordinate frames are very important for perception. And they're not at all
important for conv nets. For conv nets, if I
show you a tilted square and an upright diamond, which
is actually the same thing, they look the same
to a conv net. It doesn't have two
alternative ways of describing the same thing. NICHOLAS THOMPSON: But how
is adding coordinate frames to your model not
the same as the error you were making
in the '90s where you were trying to put rules
into the system as opposed to letting the system
be unsupervised? GEOFFREY HINTON: It
is exactly that error. And because I am so adamant
that that's a terrible error, I'm allowed to do
a tiny bit of it. It's sort of like Nixon
negotiating with China. [LAUGHTER] Actually that puts
me in a bad role. Anyway, so if you
look at conv nets, they're just neural
nets where you wired in a tiny bit of knowledge. You add in the knowledge that
if a feature detector is good here, it's good over there. And people would love to
wire in just a little bit more knowledge about
scale and orientation. But if you do it
in the obvious way of having a 4D grid
instead of a 2D grid, the whole thing blows up on you. But you can get in that
knowledge about what viewpoint does to an image by using
coordinate frames the same way they do them in graphics. So now you have a
representation in one layer. When you try and reconstruct the
parts of an object in the layer below, when you do
that reconstruction, you can take the coordinate
frame of the whole object and multiply it by the
part whole relationship to get the coordinate
frame of the part. And you can wire that
into the network. You can wire into the
network the ability to do those coordinate
transformations. And that should make it
generalize much, much better. It should mean the networks
just find viewpoint very easy to deal with. Current neural networks
find viewpoint other than translation very
hard to deal with.
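A small sketch of that coordinate-frame multiplication, done the way graphics does it; the 4x4 homogeneous transforms are illustrative assumptions:

    import numpy as np

    # The pose of a part is the pose of the whole object multiplied by the
    # fixed part-whole relationship.
    def pose_of_part(pose_of_whole, part_whole_relationship):
        # pose_of_whole: maps the object's coordinate frame to the viewer's.
        # part_whole_relationship: maps the part's frame to the object's frame.
        return pose_of_whole @ part_whole_relationship

NICHOLAS THOMPSON: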
So your current task is specific to
visual recognition, or it is a more general way
of improving or coming up with the rule set for
coordinate frames? GEOFFREY HINTON: OK, it could
be used for other things. But I'm really interested in
the use for visual recognition. NICHOLAS THOMPSON:
OK, last question. I was listening to a podcast
you gave the other day. And in it, you said that the
people whose ideas you value most are the young graduate
students who come into your lab because they aren't locked
into the old perceptions. They have fresh ideas, and
yet they also know a lot. Is there anything that you, sort
of looking outside yourself, you think you might be locked
into that a new graduate student or somebody in this
room who came to work with you would shake up? GEOFFREY HINTON: Yeah,
everything I said. NICHOLAS THOMPSON:
Everything you said. [LAUGHTER] Take out those coordinate units. Work on feature three,
work on feature four. I wanted to ask you
a separate question. So deep learning used
to be a distinct thing, and then it became sort of
synonymous with the phrase AI. And then AI is now
a marketing term that basically means using a
machine in any way whatsoever. How do you feel
about the terminology as the man who
helped create this? GEOFFREY HINTON: Well,
I was much happier when there was AI, which
meant you're logic-inspired and you do manipulations
on symbol strings. And there was neural
nets, which means you want to do learning
in a neural network. And they were completely
different enterprises that really sort of
didn't get along too well and fought for money. That's how I grew up. And now I see sort
of people who spent years saying neural
networks are nonsense, saying I'm an AI professor. So I need money. And it's annoying. NICHOLAS THOMPSON: So
your field succeeded kind of ate or subsumed
the other field, which then gave them an advantage
in asking for money, which is frustrating? GEOFFREY HINTON: Yeah,
now it's not entirely fair because a lot of them
have actually converted. NICHOLAS THOMPSON:
Right, so wonderful. Well, I've got time
for one more question. So in that same interview,
you were talking about AI. And you said, think of it
like a backhoe, a backhoe that can build a hole, or if
not constructed properly, can wipe you out. And the key is when you
work on your backhoe to design it in such a way
that it's best to build a hole and not to clock
you in the head. As you think about
your work, what are the choices
you make like that? GEOFFREY HINTON: I guess
I would never deliberately work on making weapons. I mean, you could
design a backhoe that was very good at
knocking people's heads off. And I think that would be
a bad use of a backhoe, and I wouldn't work on it. NICHOLAS THOMPSON: All right,
well, Geoffrey Hinton-- extraordinary interview. All kinds of information--
will be back next year to talk about dreams
theories three and four. That was so much fun. Thank you. [MUSIC PLAYING]