Geoffrey Hinton: The Foundations of Deep Learning

Captions
[Music] [Applause] I'm going to talk about some basics here. I imagine there are some people in the audience who don't really have a good grip on what the backpropagation algorithm is, so I'm going to explain that very briefly so you know what we're talking about, and then I'll show a few examples of what it can do. These are all things that are now a little bit out of date.

If you want a computer to do something, the old way is to write a program: you figure out how you'd do it yourself, and in excruciating detail you tell the computer exactly what to do, and the computer is like you but faster. The new way is you tell the computer to pretend to be a neural network with a learning algorithm in it; that's the programming, and after that, if you want to solve a particular problem, you just show it examples. So suppose you want to solve this problem: I give you all the pixels in an image (three numbers per pixel for the color, and let's say a million pixels), and you have to turn those three million numbers into a string of words that says what's in the image. That's a tricky program to write. People tried in AI for 50 years and didn't even come close, but now a neural net can just do it, and I'll show you how. That is, we have no idea how to write that program, but a neural net can do it.

So we're going to make our neural net out of artificial neurons. An artificial neuron has some input lines that come from the sensors or from other neurons. On each input line there's a weight that can be positive or negative, and the neuron adapts by changing the strengths of those weights. The way it behaves is that it takes the values on the input lines, multiplies each value by its weight, and adds it all up; that's the total input. Then it gives an output that's a nonlinear function of that total input. The function is shown on the right: if the total input isn't big enough, the neuron stays silent; as soon as the total input gets bigger than that, it starts giving a response, and the response gets bigger as the total input gets bigger. For 30 years we used a different kind of neuron that didn't work as well, and then we tried this one and it works better. That gives you some idea of the state of the art in the field: there are very simple changes that make things work much better that people haven't explored.

OK, we're going to hook those guys up into networks with multiple layers, and we're going to learn the connections on the inputs to the neurons in all the layers. Then the problem is solved: all we need to do is figure out how to adapt those connections and we're done, because these networks can do anything, so it's just a question of changing the connections.

And there's a very interesting and simple algorithm that occurs to anybody who believes in evolution, which hopefully is most of you. You take one of the connections and a small batch of training examples, you run the batch through the network, and you see how well the network does: how similar the outputs of the network are to the outputs you think are the correct answers on this training data. Then you change that one weight, run the batch through again, and see if that improves things. If it improves things, you keep the change; if it doesn't, you don't keep the change, you leave it like it was. That's it, and that algorithm works. It's a very naive view of evolution (it's like making mutations), but it works, and if you just do that for long enough you'll get a network that does good things.
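Here's a minimal sketch of that mutation-style algorithm on a tiny network of these rectified-linear neurons, written in plain numpy. The task, the layer sizes, and the perturbation size are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # The nonlinearity described above: silent until the total
    # input is big enough, then the response grows with the input.
    return np.maximum(0.0, x)

def forward(weights, inputs):
    # Each neuron multiplies its inputs by weights, adds them up,
    # and applies the nonlinearity; the output layer is linear.
    w1, w2 = weights
    return relu(inputs @ w1) @ w2

def loss(weights, inputs, targets):
    return np.mean((forward(weights, inputs) - targets) ** 2)

# Invented toy task: learn y = x0 + x1 from a small batch.
X = rng.normal(size=(32, 2))
y = (X[:, 0] + X[:, 1]).reshape(-1, 1)
weights = [rng.normal(scale=0.5, size=(2, 4)),
           rng.normal(scale=0.5, size=(4, 1))]

# The "evolution" algorithm: nudge ONE weight, rerun the batch,
# keep the change only if things improved.
for step in range(5000):
    layer = rng.integers(len(weights))
    idx = tuple(rng.integers(s) for s in weights[layer].shape)
    before = loss(weights, X, y)
    old = weights[layer][idx]
    weights[layer][idx] = old + rng.normal(scale=0.01)
    if loss(weights, X, y) >= before:
        weights[layer][idx] = old   # no improvement: undo the tinker

print(loss(weights, X, y))   # improves, but one weight at a time
```

Note that each tinker needs the batch run through the network twice, once before and once after the nudge, which is exactly the cost the talk complains about next.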
The problem is that you have to run a whole batch of examples through the network, actually twice, for each weight, and you might have a billion weights. That's the basic algorithm: you tinker with the weights and keep the tinkerings that help, but it's hard to do efficiently. So now we're going to use some calculus and do the same thing efficiently. Because we know the weights in the network (your brain knows the weights in your brain), we don't actually have to tinker with a weight and see the effect; we can imagine tinkering with the weight and figure out what the effect would be. If you know all the weights in the network, you can say: if I were to change this one, the output would change this way, and that would be good, so therefore I want to change it. So what we do is look at the discrepancy between what we want and what we got at the output, and from that discrepancy we send information backwards through the net, computing for every weight how a small increase in that weight would improve or worsen the output. And we can do that for all the weights in parallel. So in the same amount of time the mutation algorithm takes to figure out what to do with one weight, we can figure out what to do with all the weights, and if there are a million weights that's a million times more efficient. A million times more efficient is enough to make a difference.

Backpropagation had great promise, but by the 1990s people in machine learning had given up on it, because they had relatively small data sets and other algorithms worked better. What's more, you could prove things about those other algorithms; with backpropagation you couldn't prove it would work, and when different people ran it they got different answers. So if you're obsessed with there being only one correct answer and with being able to prove you get it, backpropagation is not for you (nor is life, actually). One of the reasons people lost interest is that it didn't work so well: the naive algorithm didn't work well in deep networks, and it didn't work in recurrent networks, which I'll explain in a minute. Then a few technical advances were made in Canada (by Canada I mean Toronto and Montreal and New York). I won't be concerned with the details of those advances; the main message is that if you give it a lot of labeled data and a lot of compute power, backpropagation now works amazingly well, and the rest of the talk will be trying to convince you of that.
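Here's the same toy network trained the efficient way: one forward pass and one backward pass compute, for every weight at once, how a small change would affect the error. Again, this is just an illustrative numpy sketch on the same invented task, not anyone's production code:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Same invented task and network shapes as the mutation sketch.
X = rng.normal(size=(32, 2))
y = (X[:, 0] + X[:, 1]).reshape(-1, 1)
w1 = rng.normal(scale=0.5, size=(2, 4))
w2 = rng.normal(scale=0.5, size=(4, 1))

for step in range(200):
    # Forward pass, remembering the intermediate activities.
    h_in = X @ w1
    h = relu(h_in)
    out = h @ w2
    err = out - y                    # discrepancy at the output

    # Backward pass: send the discrepancy back through the net,
    # computing for EVERY weight how a small increase would
    # change the error (the gradient).
    grad_w2 = h.T @ err / len(X)
    grad_h = err @ w2.T
    grad_h[h_in <= 0] = 0.0          # silent neurons pass nothing back
    grad_w1 = X.T @ grad_h / len(X)

    # Adjust all the weights in parallel, slightly downhill.
    w1 -= 0.1 * grad_w1
    w2 -= 0.1 * grad_w2

print(np.mean(err ** 2))             # falls far faster per pass
```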
So here was the first practical problem it made a big impact on. That's not quite fair, but since this was done in Toronto I'll pretend it was; actually, for handwritten digit recognition it made a big impact, but people said that's an easy problem, and speech recognition is a tough problem. So a couple of graduate students at the University of Toronto took an algorithm I'd been working on and applied it to speech recognition. You take some coefficients that describe the sound wave and put them through multiple layers of hidden units, and these were lots of hidden units: there were only a few million training examples, and between each pair of layers there are four million parameters, because it's fully connected. So a statistician, or anyone who took statistics 101, will know this cannot possibly work, because there are many more parameters than training examples; it's crazy. If you want to critique statistics 101, you might say: in your lifetime you make about 10 billion fixations, and you have about 10,000 times more synapses than that, so you have about 10,000 synapses for each fixation you make; you don't satisfy statistics 101 either.

OK, so we trained this up (or rather, they trained it up) on sound waves, trying to predict which piece of which phoneme was being said. Imagine a little bit of a spectrogram, which is essentially what's at the bottom, and you're looking at one piece of it and saying: in the middle of the spectrogram, which piece of which phoneme is this person trying to say? You get a probabilistic answer, and then you take all those probabilistic answers and string them together with something else to find a plausible utterance. Nowadays you don't do that: you just put sound waves in and get the transcription out, and the only thing in there is a neural network, a recurrent neural network. But back then this was just the front end of the system. We replaced the front end of speech recognizers with this and it worked better. It worked just a little bit better, but good speech people, particularly down at Microsoft, realized right away that if this works a little bit better, and two graduate students did it in a few months, it's going to completely wipe out the existing state of the art. And indeed, over the next couple of years it did.
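To make that parameter count concrete, here's a hypothetical version of such a fully connected front end: a window of spectrogram coefficients goes in, and a probability distribution over phoneme pieces comes out. All the sizes here are invented for illustration, not the ones the Toronto students used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented shapes: 11 frames x 40 spectral coefficients in,
# a softmax over 120 phoneme pieces out, 2000-unit hidden layers.
N_IN, N_HID, N_OUT = 11 * 40, 2000, 120

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Fully connected layers: between one pair of 2000-unit hidden
# layers alone there are 4 million parameters.
w1 = rng.normal(scale=0.01, size=(N_IN, N_HID))
w2 = rng.normal(scale=0.01, size=(N_HID, N_HID))
w3 = rng.normal(scale=0.01, size=(N_HID, N_OUT))
print(sum(w.size for w in (w1, w2, w3)))   # about 5.1 million

def phoneme_probs(window):
    # window: (batch, 440) slice of spectrogram centred on a frame.
    return softmax(relu(relu(window @ w1) @ w2) @ w3)

probs = phoneme_probs(rng.normal(size=(4, N_IN)))
print(probs.shape, probs.sum(axis=1))      # each row sums to 1
```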
Navdeep Jaitly, a grad student at Toronto, really wanted to take this technology to RIM, and I talked to them, and they said they weren't interested in speech recognition. I don't know what became of them. By 2012 Google was using it in Android, and there was a big increase in speech recognition performance; it suddenly got better than Siri. Now everybody's using this algorithm, in more updated versions, and all the best speech recognition systems were trained with backpropagation in a neural net. Some are just end to end: in those systems there's nothing else, you just train it on data, and all that stuff about how you pronounce things and what the words are, forget it; it will learn all that.

Then in 2012, two more graduate students of mine took this further. The trick to all this is you always have to get graduate students who are smarter than you; there's no point having a graduate student dumber than you, because you could have done that yourself. So two other graduate students, Ilya Sutskever (who recently got given a billion dollars by OpenAI to run a lab, which is slightly depressing because that's a lot more than I ever got) and Alex Krizhevsky, took on the ImageNet competition, where there were a million images, a thousand in each class, and you had to recognize the objects. It was a public competition with a secret test set, so you couldn't cheat. The person who ran our system on the test set told me, when I met him at a conference, that he didn't believe the results, so he went back and ran it again. He still didn't believe them; he had to run it three times before he believed the performance, because it was so much better than anybody else's. So here are the results: in 2012, all of the conventional computer vision systems that didn't use neural nets had plateaued at about a twenty-five percent error rate, and our system got almost half that. As with speech recognition, as soon as people switched, you got thousands of smart graduate students and thousands of experienced developers making this work really well. By 2015 people had reached human levels on that data set: one hero, Andrej Karpathy, actually did the task himself, which took a lot of time, and got 5% error. Now systems are down below 3% error, a tenth of the error rate of the previous computer vision systems. So this made a big impact. Speech had worked already, but people thought that was a niche; speech worked first because those were the guys who had big labeled data sets. When this worked, people got all excited, and it was very good for IP lawyers.

OK, so these are examples of the kinds of images, and notice they're not images with one nicely centered object in a canonical viewpoint: most of the cheetah is missing. The red bar is what the system thought was its best bet, and it gets told it's right if the right answer is in its top five bets, because it's not always clear what the right answer is. You'll notice it gets the bullet train right even though it's only a small fraction of the image. The hand glass it gets wrong: it thinks it's scissors, and if you look at the other things it thinks it might be, a stethoscope or a frying pan, you can see why it thinks that, and you can see that it needs glasses. But the point is, it's got the visual appearance of something: if you look at its wrong answers, they tell you more than looking at the right answers.

Now I'm going to go on to recurrent nets. These feed-forward nets were very good at recognizing a phoneme in speech or an object in an image, but for dealing with sequences you want a recurrent net. The kinds of recurrent nets people use now are based on work by Hochreiter and Schmidhuber in 1997 that I'm not going to explain; I'm going to pretend these recurrent nets are simpler than they are. It would be nice if they were this simple, but they're not.

So here's how a recurrent net works. It has a bunch of input neurons (not just one like it shows here, but a bunch) representing the data at each time step: it might represent a word in a sentence, or an image in a video. It also has a bunch of hidden neurons, and these hidden neurons connect to themselves. If you look at the second time slice here, the second column, and look at that middle unit: it gets some input from the input to the system, which might be the video frame or the word, and it also gets input from the previous state of all the hidden neurons. So it's remembering and accumulating information. You can train this thing with backpropagation: you feed it the inputs, the hidden units accumulate information, and when you get to the end you see if it can produce the right answer. If it can't, you backpropagate information; you just go backwards through all those arrows. One thing you'll notice about those arrows is that they form a directed acyclic graph: you cannot go around in a circle following the arrows, and that means you can do backpropagation, going backwards without ever getting tied in a knot, basically.
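Here's a minimal sketch of the simplified recurrent net described above. Real systems use the Hochreiter and Schmidhuber LSTM units he mentions; the sizes and the tanh nonlinearity here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# At each time step the hidden units get input from the current
# word or frame AND from their own previous state.
N_IN, N_HID = 8, 16
w_in = rng.normal(scale=0.1, size=(N_IN, N_HID))
w_rec = rng.normal(scale=0.1, size=(N_HID, N_HID))

def run(sequence):
    h = np.zeros(N_HID)              # hidden state starts empty
    for x in sequence:               # one input vector per time step
        # The new state mixes the current input with the remembered
        # previous state, accumulating information over time.
        h = np.tanh(x @ w_in + h @ w_rec)
    return h                         # the accumulated summary

final_state = run(rng.normal(size=(5, N_IN)))   # a 5-step sequence
print(final_state.shape)
```

Unrolled over the five time steps, this computation is exactly the directed acyclic graph described above, which is why backpropagation can run backwards through it.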
So Ilya Sutskever and Oriol Vinyals and Quoc Le, and pretty much in parallel Yoshua Bengio with Bahdanau and Cho in Montreal, developed a way of using these networks for machine translation, and initially it seemed crazy. What we're going to do is have an encoder network that reads the sentence in one language and turns it into a thought, and then we're going to take the thought and turn it into a sentence in another language. Of course, to do that you need to know what a thought is. Most people in AI (in fact, still most people in AI) made a very naive mistake: strings of symbols come in as words, and when you say something, strings of symbols come out, so they think what's in between must be something like a string of symbols. That's the same stupid thinking as: pixels come in, and when you print something, pixels come out, so everything in between must be pixels. In fact, the symbolic AI people laughed at that view; the idea that it's all pixels in between was the view of someone they considered particularly naive, Stephen Kosslyn, and they laughed at it. But they made exactly the same mistake: they thought the stuff that comes in and the stuff that comes out, which is the only stuff we know about from outside, must be the same kind of stuff in the middle, even though what's in the brain is just big vectors of neural activity. There are no symbols in there, there are certainly no symbolic expressions in there, and there certainly aren't rules for manipulating symbolic expressions. At least, after many years of high school there might be a few of those rules that you can't really follow very well, but that's not the basic way of doing business.

So you're going to put words into this encoder network one at a time. It first turns those words into a vector representation, a whole bunch of features, and it learns to do that: all of these connections are learned by backpropagation. It will basically make, say, the vector for Tuesday very similar to the vector for Wednesday and very different from the vector for 'although'. As the words come in, it accumulates information in its hidden units, and at the end of the English sentence (at the top there) there'll be a state of the hidden units that I will call a thought. That's not meant to be a joke; that's what I believe a thought is. A thought is an activity pattern in a big bunch of neurons, and it's an activity pattern that causes things to happen. So I can say to you, John thought 'damn', or John thought 'it's snowing outside'. Anything you can put in quotes, John can think, and what's more, John can say it. If John thought 'it's snowing outside', he might say to you, 'it's snowing'. So it's obvious that the way you get at thoughts, the way I tell you what I'm thinking, is either by the words that would have caused the thought or by the words that the thought would have caused. It's hooked up at both ends, but the thought itself doesn't look anything like words; it's something completely different inside. In fact, it looks like that (and it's not necessarily red).

You take that thought vector and give it to a decoder network, and the decoder network says: OK, that was the thought; let's suppose it's doing English to French; what's the first word in French? It takes the thought and says: the first word is probably this one, but it might be that one, and it might be something else. It gives you probabilities for all the various words. One way of decoding the thought (not the best way, but one way to do it) is to take the words it thought were reasonably possible, pick one of them according to how probable it thought they were, and then lie to the network: tell it, OK, that was actually the right answer, you got that right; what do you think comes next?
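Here's a minimal sketch of that decoding loop. The decoder is a random stand-in rather than a trained network; the point is the structure of the loop: sample a word from the probabilities, pretend it was right, and feed it back in:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["le", "la", "maison", "chat", "."]   # invented tiny vocabulary

def decoder_step(thought, words_so_far):
    # A trained decoder would compute these probabilities from the
    # thought vector and the words emitted so far; here they're random.
    logits = rng.normal(size=len(VOCAB))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def decode(thought, max_len=10):
    words = []
    while len(words) < max_len:
        probs = decoder_step(thought, words)
        # Pick a word according to how probable the net thought it
        # was, then "lie" to the net: treat the pick as correct and
        # ask what comes next.
        word = rng.choice(VOCAB, p=probs)
        words.append(word)
        if word == ".":              # a full stop ends the output
            break
    return " ".join(words)

print(decode(thought=np.zeros(512)))
```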
Then it gives you a prediction for the next word, and you say, OK, you got that right, what do you think comes next? That way it gives you a string of words until it eventually gives you a full stop, and then that's the translation. What's amazing is that this actually works if you train the whole thing with backpropagation. Google Translate used to have huge tables of phrases (this phrase in English tends to go to that phrase in French), and you'd try to put all these tables together to get a plausible French sentence. It turns out it works much better to have a system with no linguistic knowledge whatsoever put in; that is, it's actually got lots of linguistic knowledge, but it wasn't put in by people.

So now the way Google Translate works, on many pairs of languages and soon all of them, I think, is this. You take a language and automatically break its words into 32,000 fragments. Fragments for English would include whole words like 'the'; they'd also include the individual letters, and things like 'ing' and 'ed' and 's'. You represent the input string by a string of symbols from this alphabet of 32,000 symbols. You feed it the English sentence (you have a big database of translations), it produces the French sentence, and it has probabilities for the words it produces. At each point in time while it's producing the French sentence, you look at the probability it assigns to the correct word, and then you backpropagate through all those connections you see in the net: you send information backwards, computing how a small change in each connection strength would increase the probability of the right word. So you start with random weights, you send all this information back to change the strengths very slightly, so as to increase the probability of the correct word, and you do that for a long time; and hey presto, it's pretty much the best machine translation system there is.

One big improvement, made by the researchers in Montreal, is attention (there's a small sketch of the idea below). The system I described turns English into a thought and then turns the thought into French, but that's not what a real translator does. I mean, he could do that, but he'll do better if, as he's producing the French, he looks back at the English. So they made their networks look back, not at the input words of the English, but at the hidden states the net went through as it was taking in the English words, and they made it learn where to look. That's pretty fancy: it's an extra module in the network that's trying to learn where to attend in the English sentence as it's producing the French sentence. It successfully does that, and it makes the whole thing work better and able to be trained on much less data.

That's one thing. The other is the word fragments I already described: don't use words, use pieces of words. It will also work if you use individual letters. In fact, here's an amazing thing. If you're translating Chinese to English and I give you the following choice: you could have a big list of Chinese symbols, because they're symbols for whole words, or I could give you bitmaps of the Chinese symbols. Which would you rather have as input? It turns out it works better if I give you the bitmaps, because backpropagation learns that Chinese symbols actually have componential structure. In a Chinese symbol there's, you know, a man running to a house or something (I know nothing about Chinese), but those little bits actually have morphemic structure, and it will learn that from bitmaps. So this is rather bad news for linguists.
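Here's the promised sketch of the attention idea. The shapes and the dot-product scoring rule are assumptions; the Montreal version learns a small extra network to do the scoring:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins: one hidden state per English word from the encoder,
# plus the decoder's current state while producing the French.
N_HID = 16
encoder_states = rng.normal(size=(7, N_HID))   # 7 English words
decoder_state = rng.normal(size=N_HID)

# Score every English position against the current decoder state,
# then normalise: this is the "learning where to look" part.
weights = softmax(encoder_states @ decoder_state)

# The context is a weighted mix of encoder states, dominated by
# the positions the decoder currently attends to.
context = weights @ encoder_states

print(weights.round(2))   # attention over the 7 English words
print(context.shape)      # used when predicting the next French word
```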
The number of linguists you need to make a really good speech recognition system is zero. Actually, that's entirely unfair: you need a well curated data set, and linguists know a lot about how to get a well curated data set, but you don't need them telling the neural network what to do.

Now let's combine that with the vision work from before. We're going to take our net that recognizes objects in images, trained on ImageNet, and we're going to say: when we translate, we get a thought and then we say that thought; well, what if we got a percept and then said that percept? Instead of using English to get the thought, we use the ImageNet net to look at an image and get a thought, and from that thought we produce the output. The thought, or percept, that the net has is simply the activity of all the units just before the answer, the last hidden layer, because what the net has really done is turn pixels into activity that's to do with objects, with lots of objects in the image; then it makes a choice and says the name of one object, but before it says the name, it has stuff to do with lots of objects. So we use that percept as the encoding and then we decode it; obviously, to get percepts said as sentences, you need some training. You take that last layer of ImageNet, and you take a big database that Microsoft kindly supplied, a few hundred thousand images each with several possible captions, and you train the decoder to turn the percept into a sentence.

Then it does things like this: you show it that, and it says 'a group of people shopping at an outdoor market'. The actual caption, the correct answer according to the database, is 'people are crouched around in an open market', which is better because it's got the crouching in it. And then the one you saw at the beginning; so we reach closure, and you now know how this worked. We trained something up on ImageNet, we then trained the other half of the thing to produce sentences in English, and it says 'a close-up of a child holding a stuffed animal'. The real caption, 'a young girl asleep on the sofa cuddling a stuffed bear', is somewhat better, but I was just completely blown away by this. When Oriol Vinyals and Samy Bengio and other researchers at Google showed me that this worked, I thought: well, that's the dream of AI, to be able to look at a picture and say what's in it. That's sort of basic AI; if you can do that, you're really onto something. And it worked. And then within about a week, lots of other people had done similar things; I think ours was just slightly better.
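Here's a minimal sketch of how the captioning setup rearranges the pieces: the encoder is replaced by the last hidden layer of an object-recognition net, and the decoder is the same sampling loop as in the translation sketch. Everything here is a stand-in, not the Google system:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["a", "group", "of", "people", "market", "."]   # invented

def cnn_percept(image):
    # Stands in for the last hidden layer of an ImageNet-trained
    # net: activity to do with lots of objects, before the choice.
    return rng.normal(size=2048)

def decoder_step(percept, words_so_far):
    # A trained decoder would compute these from the percept and
    # the words so far; here they're random.
    logits = rng.normal(size=len(VOCAB))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def caption(image, max_len=8):
    percept = cnn_percept(image)     # the "thought" is now a percept
    words = []
    while len(words) < max_len:
        word = rng.choice(VOCAB, p=decoder_step(percept, words))
        words.append(word)
        if word == ".":
            break
    return " ".join(words)

print(caption(image=None))
```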
There are all sorts of implications for document processing. If you can convert a sentence into a thought and then model the structure of those thoughts, you can get natural reasoning. You might not want natural reasoning, because most people's natural reasoning isn't much good, but at least you can model it. I think to do this properly we'll need a number of parameters comparable with the brain's, which is a hundred trillion, and our neural networks currently have a few billion. There's a puzzle here: we can translate between multiple pairs of languages using a few billion weights, which is less than one voxel of a brain scan. So either the brain is amazingly much better than what we can do, or it's using a different algorithm, or it's using backpropagation but inefficiently, and I don't know which.

In medical images, these nets will very soon be better than radiologists. Already, for skin cancers, there's a system that's comparable with dermatologists, and as soon as it's trained on more images it will be significantly better: it was trained on of the order of 100,000 images, and trained on 10 million it will be better. One thing to bear in mind, something doctors often worry about, is where you get the correct answers. Here's something interesting: you train a neural network on labels produced by doctors, and the neural network can end up much better than the doctors. The doctors all disagree (they only have 70 percent agreement), yet the neural network gets what's going on, and it can be much better than the labels you used to train it. That seems paradoxical, but it's not. We don't actually need the ground truth; we just need something related enough to the ground truth that the neural network can figure out what's going on, which the doctors couldn't, and then we can do better. (There's a small numerical illustration of this point after the transcript.)

I want to finish with one story about the same student, George Dahl, who was involved in the speech recognition in 2009. In 2012 (or 2011; I think 2012) he entered a competition on Kaggle, which he entered quite late. The competition was: I give you a few thousand properties of molecules, and you have to predict whether a molecule will bind to something. The drug companies would like to know this, and they'd like to do it without synthesizing the molecule, so they'd like to predict which ones are good candidates for binding. George basically threw our standard neural network at it: multiple layers of rectified linear units, far more parameters than there were training cases (he was using probably a million parameters with 15,000 training cases), and it worked. He combined it with other methods, but he didn't need to; it would actually have won the competition without being combined with other methods. And they were surprised. There was a $20,000 prize, so George said, OK, give me the prize, and Merck said, well, it's part of the competition that you have to tell us what QSAR you used. And George said, what's QSAR? This is slightly embarrassing, because there's a field called QSAR, quantitative structure-activity relationships: how does the structure give rise to the activity? It's been going for like 20 years; it has a journal, it has an annual conference, it has a whole bunch of people for whom that's what they do. And George won it without even knowing the name of the field. OK, that's it. [Applause] [Music] [Applause] [Music]
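Finally, the promised toy illustration of the noisy-labels point (not from the talk): train a simple classifier on labels that agree with the truth only 70% of the time, then compare its predictions with the true labels. The task and the model are invented for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# True rule: positive when x0 + x1 > 0. The "doctors" label each
# case, but disagree with the truth 30% of the time.
N = 20000
x = rng.normal(size=(N, 2))
truth = (x[:, 0] + x[:, 1] > 0).astype(float)
flipped = rng.random(N) < 0.3
noisy = np.where(flipped, 1 - truth, truth)

# Logistic regression trained by gradient descent on the NOISY labels.
w = np.zeros(2)
for _ in range(500):
    p = 1 / (1 + np.exp(-(x @ w)))
    w -= 0.5 * (x.T @ (p - noisy)) / N

pred = (x @ w > 0).astype(float)
print("training labels agree with truth:", (noisy == truth).mean())  # ~0.70
print("trained model agrees with truth: ", (pred == truth).mean())   # much higher
```

Because the label errors are not systematic, they largely average out over enough examples, so the model can end up more accurate than the labels it was trained on.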
Info
Channel: Elevate SIX
Views: 65,541
Rating: 4.9326859 out of 5
Keywords: Artificial intelligence, Machine learning, Deep learning, Neural Networks, Technology, Canada, Toronto, Image recognition, Speech recognition, Back propagation, video, sharing, upload, technology future, technology news, technology 2019, innovation, entrepreneur motivation, entrepreneur ideas, festival 2019, Toronto 2019, gadgets, celebrity
Id: zl99IZvW7rE
Length: 28min 21sec (1701 seconds)
Published: Wed Feb 07 2018