Does the brain do backpropagation? CAN Public Lecture - Geoffrey Hinton - May 21, 2019

Hi, I think we're ready to start. My name is Paul Frankland; I'm a neuroscientist here at SickKids and also the program chair for the Canadian neuroscience meeting that takes place in Toronto all this week. The traditional curtain-raiser for the CAN meeting is the public lecture, and this year we decided to focus on the interface between neuroscience and AI. We did that for two reasons. The first is that Toronto is one of the main hubs in the world for AI research, and the second is that it's also home to one of the true pioneers in this field, also known as the godfather of deep learning: Geoff Hinton. When we asked Geoff a year and a half ago if he'd participate in this event and he said yes, we were super excited. At this point I want to hand over to Blake Richards. Blake is an associate professor at the University of Toronto Scarborough, and he's going to host this evening's event.

Thanks, Paul. Just as a brief introduction, I wanted to tell you a little bit more about Geoff. You might be surprised to learn that the godfather of deep learning, associated mostly with AI, originally got his BA in experimental psychology from Cambridge. He then went on to do his PhD in artificial intelligence in Edinburgh, so he did get started relatively early, and throughout his early career he contributed to the first wave of what was known at the time as parallel distributed processing, or connectionism, which brought back to the fore the idea of using neural networks both as models of the mind and for artificial intelligence. Geoff got his first tenure-track position at Carnegie Mellon in the eighties, but we were able to steal him away in the late eighties, which I understand was largely because of his ethical objections to DARPA funding; so once again Canada's political bent has helped us in our research endeavors. Over the course of the nineties Geoff continued to push neural networks and machine learning forward, and in 1998 he went to University College London to found the Gatsby Computational Neuroscience Unit. We might have lost him, but thankfully we pulled him back: he returned to Toronto in 2001, became a University Professor in 2006, and then an emeritus professor in 2014. I think you all know that Geoff is a monumental figure within artificial intelligence and machine learning. He's been critical to the founding of the Vector Institute here in Toronto and to putting Toronto on the map for AI, and he's had incredible recognition for his work, most recently the Turing Award, which he shared with Yoshua Bengio and Yann LeCun, as well as the Order of Canada. He's also a distinguished fellow of the Canadian Institute for Advanced Research, which I highlight because they were one of the organizations that continued to support neural networks through the time when it wasn't as fashionable. But for all his successes in his technical endeavors, I think one of the most important things to understand about Geoff is the impact he's had on other scientists. Geoff has molded the careers of so many people and changed the way they think about things; when you look at the people who have been his graduate students or postdocs, it really is the who's who of artificial intelligence, including people like Max Welling and Yann LeCun. The phrase I use to describe it is: have you drunk Geoff's Kool-Aid?
Because once you've drunk Geoff's Kool-Aid, there is no going back: you see neural networks and AI differently, and I would argue you also see neuroscience differently. For me, my understanding of the brain has been largely shaped by Geoff and his work. But we're at the point now where computer science has drunk Geoff's Kool-Aid: he's got an h-index of 145, and according to Google Scholar his work has been cited 270,000 times, which is more than Einstein, Ramón y Cajal, and Alan Turing combined. That's largely from computer scientists, but if my prediction is correct, neuroscience thirty or forty years from now will also have drunk Geoff's Kool-Aid, and maybe you're going to get your first taste tonight. So with that, I hand you over to Geoffrey Hinton. [Applause]

Thank you very much. Great, I can give you some more Kool-Aid today; it's Kool-Aid produced by one of my former students, Ilya Sutskever. First I want to show you a little bit about the history of deep learning in AI. Can I just ask before I start: how many people here know what the backpropagation algorithm is? Put your hands up. So some people don't. I'll explain it very quickly, and I'll explain it in such a way that you'll be able to explain it to other people; if you do know what it is, follow the explanation from the point of view of how you would explain it.

There was a war between two paradigms for AI. There were people who thought that the essence of intelligence was reasoning, and logic is what does reasoning, so we should base artificial intelligence on taking strings of symbols and manipulating them to arrive at conclusions. And then there were the people who looked at the brain and said: no, no, intelligence is all about adapting connections in the brain so it gets smarter. This war went on for a long time, and eventually the people who were trying to figure out how to change connections between fake neurons to make these networks smarter got to be able to do things that the people doing symbolic AI just couldn't do at all. And now there's a different way of getting a computer to do what you want: instead of programming it, which is tedious, you just show it examples and it figures it out. Of course you have to write the program that figures it out, but that's just one program that will then do everything.

Here's an example of what it can do. For the image, just think of the numbers: they're the RGB values of pixels, and that's the input to the computer, lots of pixel values, just real numbers saying how bright the red channel is, and you have to turn those numbers into a string of words that says "a close-up of a child holding a stuffed animal". Imagine writing that program. Well, people tried to write that program and they couldn't, partly because they didn't know how we do it; we still don't know how we do it. But we can get artificial neural networks to do it now, and do a pretty good job. And my prediction is that within ten years, if you go and get a CT scan, a computer will look at the CT scan and produce the written report that the radiologist currently produces. Radiologists don't like this.

Okay, here's a simplified model of a neuron. It's very simple: it gets some input, which is just the activity on the input lines times the weights; it adds it all up, which is called the depolarization; and then it gives an output that's proportional to how much input it gets, as long as it gets enough input. For now we won't have spiking neurons; these will just be neurons that send real values.
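To make that simplified neuron concrete, here is a minimal sketch in Python; the function name and numbers are mine, not from the talk, and I've read "output proportional to its input once there is enough of it" as a rectified linear unit:

```python
import numpy as np

def neuron_output(inputs, weights):
    # Weighted sum of the input-line activities: the "depolarization".
    depolarization = np.dot(inputs, weights)
    # Output proportional to the total input, once there is enough of it
    # (here: a threshold at zero, i.e. a rectified linear unit).
    return max(0.0, depolarization)

x = np.array([0.5, 1.0, 0.2])   # real-valued activities on the input lines
w = np.array([0.8, -0.3, 1.5])  # connection weights
print(neuron_output(x, w))      # 0.4
```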
We're going to make networks of them by hooking them up into layers. You put some pixels on the input neurons, you go forwards through the net until you get outputs, and then you compare those outputs with what you ought to have got, so you have to know what the right answer is. What we'd like to do is train the weights, these red and green dots, so that the network gives the right output.

Now I'm going to show you a way of training the weights that everybody can understand. Basically, you start with random weights, you show the network some inputs, and you measure how well it does. Then you change one weight a tiny bit, show the same inputs again, and see if it does better or worse. If it does better, I keep the change; if it does worse, I make the change in the opposite direction. That's an easy algorithm to understand, and it works; it's just incredibly slow, because you have to show it lots of examples to change one weight, then lots more examples to change another weight, and every weight has to be changed many times. If you use calculus, you can go millions of times faster. The trick is that this mutation-style algorithm has to measure the effect of the weight change on the performance, and we don't really need to measure it: when I change one of these weights, the effect it has on the output is determined by the network; it just depends on the other weights. It's not like normal evolution, where the effect of a gene depends on the environment you're in; this is all internal to the brain, so changing one of these weights has a predictable effect, and I ought to be able to predict how changing the weight will help get the right output. What backpropagation does, basically, is compute, using an algorithm whose details I won't tell you, for every weight all at the same time, how changing that weight would improve the output. Then I can change all the weights a little bit, so every weight changes in the direction that improves the output, the output improves quite a bit, and then you do it all again.

Now, that lets me compute for every weight what direction I'd like to change it in, and the question is: should I show the network all of the examples and then update the weights? That is, should you live your whole life with the synapse strengths you were born with, then update your weights a little bit, then live your whole life again and update the weights a little more, until they seem very good? Or should you take one case, or a few cases, figure out how you'd like to update the weights, update them, and then take more cases? That's the online algorithm, and that's what we do. The amazing thing is that it works: you take one case at a time, or a small batch of cases, you update the weights, and these networks get better. It's very surprising how well it works on big datasets.
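A minimal sketch of the contrast just described, on a toy linear problem of my own invention (the learning rates and sizes are assumptions): the mutation-style method tests one weight at a time against the data, while the calculus-based update moves every weight at once.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 examples, 5 inputs
y = X @ rng.normal(size=5)         # targets from some unknown true weights

def error(w):
    return np.mean((X @ w - y) ** 2)

# Mutation-style learning: perturb ONE weight, re-show the examples,
# keep the change only if the error went down. Painfully slow.
w = rng.normal(size=5)
for step in range(5000):
    i = rng.integers(5)
    trial = w.copy()
    trial[i] += rng.choice([-0.01, 0.01])
    if error(trial) < error(w):
        w = trial

# Calculus-style learning: one pass gives the helpful direction for
# EVERY weight at the same time, so all weights move together.
w2 = rng.normal(size=5)
for step in range(100):
    grad = 2 * X.T @ (X @ w2 - y) / len(y)
    w2 -= 0.1 * grad

print(error(w), error(w2))         # gradient version wins with far fewer passes
```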
For a long time people thought you were never going to be able to learn something complicated like, for example, taking a string of words in English, feeding them into a neural net, and outputting a string of words in French that mean the same thing. You're never going to be able to do that starting from a big neural net with random weights; it's asking too much for the neural net to organize itself so it can do translation, because you have to understand, in some sense, what the English says. People predicted this was completely impossible, that you'd have to put in lots of prior knowledge. Well, they were wrong.

In 2009 my students in Toronto showed that you could actually improve speech recognizers using these neural nets that started with random weights. They were just trying to predict, in a spectrogram, which piece of which phoneme you were trying to say in the middle of the spectrogram, and there was more to the system that wasn't neural nets. Now we've got rid of all the stuff that wasn't neural nets, and you can have sound waves coming in and transcriptions coming out, or, even better, sound waves coming in and sound waves coming out in another language with the same accent. They can do that now. So that's speech recognition done.

Then in 2012 two of my students took a big database of images and used essentially the same algorithm, with a few clever tricks, to say what was in the image: not a full caption, just the class of the most obvious object. They did much better than conventional computer vision, which had been honed over many years, and since then all the best recognizers have used neural nets. In 2011 you couldn't publish a paper about neural nets in the main computer vision conference, because they said neural nets were rubbish; by 2014 you couldn't publish a paper that wasn't about neural nets.

And in 2014 they did something that I didn't expect. This was done by people at Google, and by Yoshua Bengio and his group in Montreal, particularly by Bahdanau and Cho. They managed to get a neural net where you feed in fragments of words in one language (you have 32,000 possible fragments, so a word like "that" in English would be one fragment, as would things like "in" and "on") and what comes out is fragments of words in the other language, and it's a pretty good translation. That's how Google does translation now. So it did translation better than symbolic AI.

So what changed between 1986 and 2009? Basically, computers got faster; that was the main change. Datasets got bigger, and we developed some clever tricks, and we like to emphasize those, but it was really the computers getting faster and the data getting bigger. I'll emphasize the clever tricks nonetheless: I can tell you about transformers, and I can tell you about better ways of stopping your networks from overfitting.

But first I want to show you an example of what neural nets can do now. A team at OpenAI took work on transformers that was originally done at Google, developed it a little further, and applied it to big neural nets with 1.5 billion learnable connection strengths. So they're learning 1.5 billion numbers; that's the knowledge of the system. They trained it on billions of words of English text, and all the net is trying to do is predict the next word, or fragment of a word: given some words leading up to a point, it gives you probabilities for the next word. Once the net is trained, you can look at those probabilities, and if it says there's a probability of 0.4 that the next word is "the", you pick "the" with probability 0.4, and if it says "fish" with probability 0.01, you pick "fish" with probability 0.01. You just pick from its distribution, and then you tell the neural net: okay, the one I picked was the next word; what do you think comes after that?
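A minimal sketch of that sampling loop; `model` here is a stand-in for the trained next-word predictor, not the actual API of the OpenAI system:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(model, prompt, n_words):
    """Sample from a trained next-word predictor: at each step, pick the
    next token in proportion to the model's probabilities, then feed the
    pick back in as if it had been right all along."""
    tokens = list(prompt)
    for _ in range(n_words):
        probs = model(tokens)          # e.g. P("the") = 0.4, P("fish") = 0.01, ...
        choice = rng.choice(len(probs), p=probs)
        tokens.append(choice)          # condition the next prediction on the pick
    return tokens

# A toy stand-in model over a 3-token vocabulary, just to run the loop:
toy = lambda toks: np.array([0.5, 0.4, 0.1])
print(generate(toy, [0], 5))
```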
In this way you can get it to reveal what it really believes about the world: you're getting it to predict words one at a time, and every time it makes a prediction you say "you were right", and it just gets more and more carried away. So they initialized it with some interesting text, and the question is whether the neural net will then produce stuff that's related. The first question is whether it will produce English words at all; then whether the words will have decent syntax, whether they'll have any meaning, whether they'll be related to the prompt. If you were really optimistic, you might ask whether they'll relate to the fundamental problem here, which is how these unicorns can speak English.

Okay, so here goes; this is what the neural net produced. Now, this was cherry-picked; this was one of their better examples. The neural net just made this up. It made up Dr. Jorge Pérez; there is no such person at the University of La Paz. But it's pretty plausible, because it is South America, and I believe La Paz has a university. So that's the first bit of what it made up, and it carries on, and it gets better. The next bit sounds a bit like one of those fantasy games. It's remembered about unicorns and herds of unicorns: they walk up and there's this strange valley, a very strange valley, and they find the herds of unicorns. It has something about seeing them from the air and being able to touch them, which isn't quite right; people in symbolic AI leap on this and say "you see, it doesn't understand". Well, sure, there are little bits that it doesn't get right, but notice that it's remembered that these unicorns have to speak English, and so it tells you that they spoke some fairly regular English. It doesn't know the difference between "dialect" and "dialectic", but my kids don't know that either; in fact, I'm not sure I do. It attributes the unicorns to Argentina, even though Dr. Pérez comes from Bolivia.
And it actually understands about magic realism: the descendants of a lost race. I love the bit at the end where it says that in South America such incidents seem to be quite common. It has an ability to just make up something that fits your prejudices and sounds moderately plausible, like a certain president. And it finally gets to the point, which is that if you really want to know whether these unicorns were produced by breeding with this strange lost race of people, you ought to do a DNA test. It understands that.

So that's what neural nets can do now. This was a neural net with 1.5 billion connections, trained on a lot of hardware, and we look at what it says and we laugh at how it's pretty good but hasn't got it quite right. What they've done now is train a neural net with 50 billion connections on Google's latest cloud hardware, which is like having several of the world's biggest supercomputers going for you for months. I haven't seen any text from the net with 50 billion connections yet, but my prediction is that it's sitting around laughing at how cute what we produce is.

One thing about that net is that it's clearly very well aware of the initial context, these unicorns in a valley that speak English, and it's remembering that initial context a long time later. A recurrent neural net can't do that; a recurrent net would have forgotten about the initial stuff and wouldn't produce such good context-dependent text. The way this works is that a word comes in, the neural net makes a pattern of activity in the hidden units, and that pattern of activity goes and compares itself with previous patterns at earlier times. When it finds a pattern at an earlier time that's a bit similar, it takes advice from that previous hidden pattern about how to affect the next layer. So when a word comes in, how a pattern of activity in the bottom layer of hidden neurons affects the next layer depends on what happened previously, and it depends in quite a complicated way.

This seems very implausible for a brain, because what's happening in the computer is that you're storing all these activity patterns (they're meant to be like neural activity patterns) and you're comparing against them, and that looks hopeless. But actually, all you need to do is this: every time you have an activity pattern and you use the outgoing weights to affect the next layer, just change the weights slightly with Hebbian learning. Now the weight matrix that comes out of that activity pattern has been modified slightly. When I get a new activity pattern, if it's orthogonal to the previous activity pattern, then any modifications you made in the weight matrix due to that previous pattern can't make any difference; but if it lines up with the previous activity pattern, if it's similar, then the temporary modifications you made back there will cause this new activity pattern to have a different effect. So you'll get that long temporal context. The way to store a long temporal context is not to keep copies of neural activity patterns; it's to make temporary changes to the weights, which I call fast weights. You temporarily change them, and the changes decay over time, so you'll have a memory.
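A minimal sketch of the fast-weights idea as described; the decay and learning-rate constants, and the class name, are my assumptions rather than anything from the talk:

```python
import numpy as np

class FastWeightLayer:
    """Outgoing weights with a slow part plus a fast, decaying Hebbian part.
    New activity patterns that overlap earlier ones get steered by the stored
    outer products; orthogonal patterns are untouched, which is what gives
    the long, content-dependent temporal context."""
    def __init__(self, n_hidden, n_next, decay=0.9, eta=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (n_hidden, n_next))  # slow weights
        self.F = np.zeros((n_hidden, n_next))              # fast weights
        self.decay, self.eta = decay, eta

    def step(self, h):
        out = h @ (self.W + self.F)        # effect depends on recent history
        # Hebbian temporary change: pre activity times post activity,
        # decaying over time so the memory fades.
        self.F = self.decay * self.F + self.eta * np.outer(h, out)
        return out
```

Note the orthogonality point in the talk falls out directly: a later pattern `h2` that is orthogonal to `h` picks up nothing from the stored `np.outer(h, out)` term, since `h2 @ np.outer(h, out)` scales with the dot product of `h2` and `h`.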
So if you ask where in your brain the memory is of what I said a few minutes ago (I'll ask the younger people this, because for the older people it's nowhere; for the younger people it's somewhere): if I say, a few minutes ago I talked about these big neural nets laughing at us, you remember I said that. Where was that memory? I think it's in the temporary changes to the weights, because that's got much bigger capacity than the activities of neurons, and you don't need to use up neurons just sitting there remembering. And those temporary changes don't need to be driven by backpropagation; they can just be Hebbian.

So I've tried to relate these wonderful nets that can make up stories to an idea about short-term memory in the brain, and now I'll talk about whether the cortex can do backpropagation. Twenty years ago, neuroscientists said: don't be ridiculous, of course the brain can't do backpropagation. They interpreted it very literally as sending signals backwards down the same axons, and axons don't do that. But now we know that backpropagation works really well for solving tough practical problems, and that rather changes the balance. When backpropagation was just a theory of how you might get computers to learn, and it only learned some simple things, it wasn't imperative to understand whether the brain did it. But now we know you can do all these things with backpropagation. What's more, we know that backpropagation is the right thing to do: if you have a sensory pathway, you want to adapt the early feature detectors so that their outputs are more helpful for making the right decision later on in the system. What you really need to do is ask: how should I change the receptive field of this early detector so that its output helps with the decision? And the efficient way to compute that is backpropagation. I think it would be crazy if the brain weren't somehow doing this.

So why did neuroscientists think it's impossible, apart from silly objections like things not going backwards down axons, at least not at the right speed? (It wants me to update things... oh, it's just died. I'm going to go out of presenter mode and back in. Okay.) Here are some reasons people say the brain can't do backpropagation. The first is that it doesn't get the supervision signal. They're imagining that the supervision signal is like taking a micropipette, putting it into the inferotemporal cortex, and injecting the right answer, and the brain doesn't have anything like that. But actually, take that language model: it didn't need labeled data, it was just trying to predict the next word. You can often use part of the input, maybe a future part of the input, or maybe a small part of an image, as the right answer. So you can get supervision signals easily; there's no problem with supervision signals. The second reason is that neurons don't send real-valued activities, they send spikes, and backpropagation uses real-valued activities so you can get nice smooth derivatives; so backpropagation can't possibly be what's going on in the brain. The third objection is that neurons would have to send two signals: they have to send their activity forwards and they have to send error derivatives backwards. The signal they have to send backwards is: how sensitive am I to changes in my input? Or rather: if you change my input, how much does that help with the final answer?
And the last objection is about neurons needing reciprocal connections: when you send things backwards, if it's a different neuron doing it, it has to use the same weight as the forward weight. I'm not going to tell you how you can overcome that, but you can, easily.

So supervision signals aren't really a problem; there are many ways to get a supervision signal, and the simplest is predicting what comes next. Now, the question of whether neurons can communicate real values. The first thing to notice about backpropagation is that if you have very noisy estimates of the gradient, it works just as well; it's very, very tolerant of noise, as long as it's unbiased noise. So, for example, the signal you send forwards can be one bit, one stochastic bit, and the signal you send backwards can be two bits: if they have the right average value, if the expected values are correct, then they're just the expected value plus some noise, and the whole system still works fine. In the brain, at any instant a neuron has an underlying firing rate and it produces spikes; for now let's suppose it produces spikes according to a Poisson process, so there's a probability of producing a spike in each small interval, and that probability is the underlying firing rate. Suppose we treated the neuron as if it could send that underlying firing rate: the spike it actually sends is just a very noisy version of the underlying rate; it's a 1 or a 0, but its expected value is the underlying firing rate. So, how well do neural networks work if we send very noisy signals?
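A minimal sketch of that stochastic one-bit signal; the interval count is an arbitrary choice of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def spike_train(rate, n_intervals):
    # In each small interval, spike with probability equal to the
    # underlying firing rate: a 1-bit, very noisy sample of that rate.
    return (rng.random(n_intervals) < rate).astype(float)

rate = 0.3
s = spike_train(rate, 100_000)
print(s.mean())   # ~0.3: the noise is unbiased, so the expected value
                  # is the underlying rate and learning still works
```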
Here I'm going to have a statistics digression. If you do Statistics 101, they tell you that you shouldn't have more parameters in your model than you have data points; you really ought to have quite a few data points for each parameter. It turns out this is completely wrong; the Bayesians knew it was wrong. The brain is not in the regime of Statistics 101: in the brain you're fitting about 10^14 parameters and you have about 10^9 seconds. Even at ten experiences per second, so even if 100 milliseconds is the time for an experience (that's roughly the backward masking time), you have something like 10,000 synapses to fit per 100 milliseconds of your life. You're fitting a lot of parameters, and if your mother just kept saying "good, bad, good, bad", she couldn't possibly provide enough information to learn all those 10^14 parameters.

Here's what they teach you wrong in statistics. Everybody knows that if you've got a model of a given size, with a given number of parameters, the more data you have, the better you'll generalize; for a given size of model it's always better to have more data, and in fact the best thing you can do is get more data. But that doesn't mean that if you've got a fixed amount of data you should make it look like a lot by having a small model, which is what they tell you in Statistics 101. Big models are good if you regularize them, if you stop them doing crazy things. We can see that using a lot of parameters is good: you can always win by having more parameters, and the way you do that is to say, I'm going to have a committee; I'm going to learn lots of different little neural nets. You give me more parameters, I learn more different neural nets, and then I average what they all say, and you'll always win. It's a declining win, but if you have enough of them you'll win by having more. So it's always better to have more parameters. It turns out that if you have a fixed amount of data and you have enough computation power, which the brain has, you should always use such a big model that the amount of data looks small. That's the regime you ought to be in: if you take the limit where the amount of data is fixed and you have unlimited computation, and ask how big you'd like your model to be, you'd like your model to be much bigger than the data. Now, that only works if you have a good regularizer, and I'm now going to tell you about a very good regularizer called dropout.

Dropout is for use in neural networks where you have a lot more parameters than you have data points to train them on. You could learn an ensemble of little models; this is a way of learning an ensemble of many more models, where the models in the ensemble share parameters with each other. The idea is this: suppose we have just one hidden layer in the net. We put the data in, and each time we show a data vector we randomly remove half the hidden neurons; we only use the ones that remain, and it's a different subset we remove each time. When we do use a neuron, we use it with the same weights each time. So if you've got H hidden neurons, you've got 2^H different subsets of neurons you might use; you actually have 2^H different models, exponentially many models. Most of the models are never used, a few of the models will see one example, essentially no model will see two examples, and yet they can learn, because they're all sharing parameters. This idea of sharing parameters in the neural network is very effective: you've really got all these different models sharing parameters, and you train it up and it generalizes really well.

So what we know is that if you get rid of a fraction of the neurons each time and treat them as though they weren't there, it works really well. That's just a form of noise, and basically this is an example of: if you have a very big model and you add a lot of noise, the noise allows it to generalize well, and it's better to have a big model with a lot of noise than a small model with no noise. And that's what the brain wants: because it's got such a big model compared with the amount of data it operates on, it wants a lot of noise. So now a Poisson neuron is kind of ideal: it's got a firing rate, it adds a whole lot of noise to that, and it sends a 1 or a 0, and that actually makes you generalize much better. So the argument is that the reason neurons don't send real values is that they don't want to: they want to send things with a lot of noise in them, and that makes them generalize better. So that's not an argument against backpropagation; these dropout models are trained with backpropagation, and the random spikes are really just a way of adding noise to the signal to get better generalization.
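A minimal sketch of dropout for one hidden layer, under the half-dropped setting described; the test-time scaling line is the standard trick rather than anything stated in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_layer(x, W, p_drop=0.5, training=True):
    h = np.maximum(0.0, x @ W)                 # hidden activities
    if training:
        # Silence a random half of the H hidden units for this data vector:
        # each mask is one of 2^H sub-models, all sharing the same W.
        mask = rng.random(h.shape) >= p_drop
        return h * mask
    # At test time use every unit, scaled so the expected input matches.
    return h * (1.0 - p_drop)
```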
Now for the last objection I'm going to address; I'm going to keep going till Blake stops me, and I figure I've got about another five minutes before he gets really ratty. The output of a neuron represents the presence of a feature in the current input, so it's obvious that the same output can't represent the error derivative, right? You couldn't have a neuron that said to higher layers "this is the value of my feature" and said to lower layers "this is my error derivative"; it couldn't be done. So the neurons that send things backwards would need to be different neurons. Except that that's nonsense. Here's my claim, which Yoshua Bengio later picked up; I first made it in 2007, and I still believe it, even though nobody's managed to make it work really well in a neural net yet. The idea is that a neuron has a firing rate; the firing rate is its real output, which is communicated stochastically by a spike. That underlying firing rate is actually changing over time, and the rate of change of the firing rate is used to represent the error derivative. The nice thing about a rate of change is that it can be positive or negative, so we can represent positive or negative derivatives without a neuron having to change the sign of its synapses. What the derivative represents is the derivative of the error with respect to the input to the neuron, and that gets sent back to earlier neurons. If I had enough time I could show you a whole bunch of slides about how this would do backpropagation, but I want to show you one consequence of it. Look, here we have a nice equation, because it's got Leibniz on one side and Newton on the other: Leibniz's notation for the derivatives that are not with respect to time, and Newton's dot notation for the derivative with respect to time. What we're saying is that how fast the output y_j of neuron j is changing over a short time interval is the error derivative. This is just a hypothesis, you understand; but it's true. [Laughter]
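Written out, the slide's equation and the update it licenses look like this (my transcription from the description, so treat the exact form as an assumption; the sign issue is the one Hinton mentions needing to fix):

```latex
% Newton's dot on the left, Leibniz on the right: the rate of change of
% neuron j's output over a short interval represents the error derivative
% with respect to that neuron's input.
\dot{y}_j \propto \frac{\partial E}{\partial x_j}
% The synaptic update this gives, up to sign: presynaptic activity times
% the rate of change of the postsynaptic activity.
\Delta w_{ij} \propto y_i \, \dot{y}_j
```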
Jay McClelland and I first used a version of this in 1988, before we knew about spike-timing-dependent plasticity (I'm not sure when that was discovered). This is where I need the cursor. You take some input and you send it to some hidden units by the green connections, and from the hidden units it comes back to the input, so you reconstruct the input; then you send it around again, not all the way around, but back up to the hidden units, using the same connections. The learning rule, which you'll notice doesn't involve explicit backpropagation, says: for these hidden neurons, I change the incoming weights by the activity of the presynaptic neuron down here, times the difference between what the neuron's activation was the first time around and the second time around. So the rate of change of the activation of the neuron is what's used to communicate an error derivative. Unfortunately this version has the wrong sign, but later on we fixed that.

And here's a theory from 2007 that still hasn't been conclusively proved wrong, and it sort of worked, though not quite as well as we hoped, about how you could get a brain to do backpropagation. First you learn a stack of autoencoders: you get each layer to activate features in the layer above from which you can reconstruct the layer below, then you treat those features as data and learn features of those features, and so on, so you've built a big stack of autoencoders. Once you've built the stack, each layer's activity can reconstruct the activity in the layer below. Then you do top-down passes: you put in an input, activity goes forward through the layers, you predict something at the output, and you do a top-down pass that gives you reconstructed activities at every layer. Then you take your output, change it to be more like the desired output, and do another top-down pass, which gives you slightly different reconstructions. The difference between those two reconstructions is exactly the signal you need for backpropagation. So if you do that, the learning rule is that you should change your synapse by the presynaptic activity in the layer below, times the rate of change of the activity of the postsynaptic neuron in the layer above. It's a very simple learning rule: change the weight in proportion to the presynaptic activity times the rate of change of the postsynaptic activity.

Now it turns out that if you're using spiking neurons that represent underlying firing rates that are changing, that amounts to a learning rule that looks like this. You take a presynaptic spike and you ask whether the postsynaptic spike came before or after it, because what you're interested in is the rate of change of the postsynaptic firing rate around the time of the presynaptic spike. If the postsynaptic spike occurs often just after it and seldom just before it, that suggests the firing rate is going up; if the postsynaptic spike occurs often just before the presynaptic one and less often just after it, that means the firing rate of the postsynaptic neuron is going down. So if you want your learning rule to be gated by the presynaptic activity, you'll only learn when you get a presynaptic spike, and then you ask: did the postsynaptic spike occur afterwards or before? If it occurred afterwards, I should raise the weight, and if it occurred before, I should lower the weight. So your learning rule will look like this, and this thing is actually a derivative filter, centered at zero. What it's really doing is measuring the rate of change of the postsynaptic firing rate; of course, it's sampling it. You have the postsynaptic neuron's underlying firing rate, you see spike trains, and you're asking whether the spikes are getting closer together or further apart, and this is a way to measure that. You can do the learning on individual spikes, and the learning rule is then the implementation of this idea that the rate of change of the postsynaptic firing rate is the error signal: if the postsynaptic spike comes after the presynaptic one, increase the strength, otherwise decrease it, and have the whole effect fall off as the spikes get further apart, because we're really only interested in the rate of change of the firing rate around the time of the presynaptic spike.

Now, there's one consequence of this. If you're going to use the rate of change of a neuron's activity to represent not what the neuron is representing, but an error derivative, then you've used up temporal derivatives for communicating error derivatives, so you cannot use temporal derivatives to communicate the temporal derivative of whatever the neuron represents. If I have a neuron that represents position, I can't use how fast that's changing to represent velocity. And that's true of real neurons: if you want to represent velocity, you have to have a neuron whose output represents velocity; you can't do it with the rate of change of a position neuron. If I kill the velocity neurons and keep the position neurons, then when I watch a car moving, the position neurons will change but I won't see any motion. Similarly, you can't use the rate of change of a velocity neuron to represent acceleration.
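A minimal sketch of that spike-timing rule as a derivative filter; the exponential falloff, time constant, and learning rate are my assumptions about the shape described, not values from the talk:

```python
import numpy as np

def stdp_weight_change(pre_spike_t, post_spike_times, lr=0.01, tau=0.02):
    """Antisymmetric spike-timing rule acting as a derivative filter:
    postsynaptic spikes just AFTER a presynaptic spike increase the weight,
    spikes just BEFORE it decrease it, with the effect falling off as the
    spikes get further apart. The net change estimates the rate of change
    of the postsynaptic firing rate around the presynaptic spike."""
    dw = 0.0
    for t_post in post_spike_times:
        dt = t_post - pre_spike_t
        dw += lr * np.sign(dt) * np.exp(-abs(dt) / tau)
    return dw

# Postsynaptic rate rising: more spikes after the presynaptic spike at t=0.
print(stdp_weight_change(0.0, [-0.015, 0.005, 0.012, 0.025]))  # positive
```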
So that's more evidence in favor of the idea that the temporal derivatives of neurons are used up representing error derivatives. Now I'll summarize the main arguments against backpropagation. The fact that neurons send spikes rather than real numbers: that's just because a lot of noise regularizes things. You can represent error derivatives with temporal derivatives, so the same neuron can communicate error derivatives backwards and activities forwards. And the fact that in the brain you do get spike-timing-dependent plasticity seems to be evidence in favor of that representation of error derivatives. I'm done. [Applause]

Thalamus you, Geoff, for a great talk. Sorry, that was a Twitter joke. Anyway, what we're going to do now is a brief Q&A between myself and Geoff, and then after I've had my chance to ask some questions I'm going to open it up to you. Now, I had originally sent Geoff a few questions, which I'll rely on partially, but his talk has made me want to ask a few others, so I'm sorry, I'm going to throw a few loops at you as well. But let's start with some of the ones I told you about. (Is there something funny with my mic? I don't know. I'll just not look down.) Okay, the first question, which I'd like to ask just because it's something I spend far too long arguing with people about online: you're in the computer science department, and you've come here and given us a talk that's largely about brains, but many people seem to object to the idea that computers have anything to tell us about brains, or indeed to the idea that the brain is a computer, despite the fact that neuroscientists often refer to computation in the brain. So my question to you is: is the brain a computer? I'll just hand that over to you first.

Yes.

Good, okay. And for the record, I didn't tell him to say that, in case Twitter is watching. And second: can you maybe give an intuitive understanding of why the answer is yes, despite the fact that obviously our brains are very different from our laptops or our cell phones?

Right. So there are many ways you can do computation with physical stuff. You could get some silicon and make transistors, run them at a much higher voltage than needed so they behave digitally, and then, if you wanted to represent a number, you could have bits, and so on; you could create multipliers and adders, put a lot of them together, have some bits that tell you where in memory to find stuff, and make a conventional computer. Or you could make little devices that have some input lines with adaptive weights on the input lines, like the early neural nets. Marvin Minsky made neural nets out of feedback controllers that were used in, I think, B-52 bombers, or B-29s, some kind of American bomber. So you can make computers in lots of different ways. When I was a kid I used to make computers like this: you take a six-inch nail and saw the head off, and you wrap copper wire around it; then you take a razor blade and break it in half so you have a nice flexible strip, and you wrap a bit of copper wire around the razor blade; and when the current goes through the nail, it makes the razor blade pull down and make a contact.
So now you've got a relay, and then you can put a bunch of those together and make logic gates. I never got more than about two logic gates that way, but the point is you can make computers in lots of different ways, and the brain is clearly made in a different way from normal computers, with different strengths and weaknesses. It's much slower, but on the other hand it's much more parallel. And it has one special property, which I think is what makes us mortal, which is that every brain is different. I can't take the weights from my brain and put them in Blake's brain and hope it'll work, because he just doesn't have connections in the same places.

Have you tried?

Well, there's a way of doing it where I take the weights in my brain, turn them into strings of words, Blake absorbs those strings of words, and they create different weights in his brain. It's pretty lucky our brains are all different, because otherwise rich people would grab poor people's brains so they could live forever.

Okay, so following on from that: what do you think about the quests to fully characterize the brain's connectome? Do you think that's a scientifically worthwhile endeavor?

Yes, I do, in part because some of the people doing it are my friends.

Ignoring your loyalty to Sebastian Seung and so on?

Well, in that case... no, it seems to me it is very worth doing, but you don't have to do it in order to begin to understand the principles. For things like the retina, though, which has a lot of hardwired stuff in it, I think it's really important to do.

Okay, that actually leads on to my next question; I wanted to ask you about hardwiring. Another thing that many people who study the brain find difficult about artificial neural networks as a model for the brain is that, as you say, you start with random weights and you train on a lot of data and you get these things out, but we know that there are some pre-wired things in many brains. The classic example is that a horse can run pretty much right out of the womb, and even within humans there are arguably some things that we find easier to learn than others. So what do you think is the place for innate behavior within neural networks as a model of cognition?

Okay. It used to be, when I was a student, that if you were interested in language, people would tell you that it was all innate and just matured as you got older, and maybe you learned twelve parameters that characterized your particular language, whether it was subject-verb-object or some other order. There's a Nova documentary, probably made about twenty years ago, that has all the leading linguists, all of them educated by Chomsky, and they look straight at the camera and say: there's a lot we don't know about language, but one thing we know for sure is that it's not learned.

So Chomsky had really good Kool-Aid.

He did, but it's over, because we now know that if you want to translate, you just learn it all. The number of linguists required to get a system that can turn a string of symbols in English into a string of symbols in French is roughly zero. I mean, linguists are involved in preparing the databases for training and making sure you get a variety of grammatical structures and so on, but basically you don't need linguists, you just need data. So you don't need much innate structure. On the issue of what is innate: it seems to me there's not much point putting in stuff innately if you can learn it quickly.
For example, the ability to get 3D structure from motion is actually very easy to learn, so I don't believe it's innate, even though a child can do it at about two days old. You show them a sort of W made of paper and rotate it in a consistent way and they get bored, and as soon as you move it in a way that's not consistent with rigid rotation, their interest perks up. But I think they can learn it in two days; it's really easy to learn.

Okay, interesting. Now, I remember long ago you told me that one of your career goals, at least earlier in your career, was to prove that everything psychologists thought about the brain was wrong. So my question is: what was it that they had wrong, are they still getting it wrong, and is neuroscience getting that same thing wrong?

It was mainly to do with this conviction psychologists had, partly based on Chomsky, that there was an awful lot of innate stuff, that you couldn't just learn a whole bunch of representations from scratch; there was this innate framework, and learning was a little bit of tuning of that innate knowledge. I think that's a completely wrong-headed approach. In fact, I want to go the other way: I want to say that the stuff in the brain that's innate wasn't discovered by evolution; the stuff in the brain that's innate was discovered by learning. Do I have time to do that digression?

Yeah, yeah.

Okay. So imagine we have a little neural network with 20 connections in it, and each of those connections has a switch that can be on or off, letting stuff through or not. You've got to make 20 binary decisions, so your chance of making all of them correctly by chance is about one in a million. Now, this little neural circuit is a mating circuit: the organism goes into a singles bar and runs this circuit, and if it's got the connections right it has lots of offspring, and if it hasn't got the connections right it doesn't have offspring, or doesn't have as many. Suppose the connections start off random and you just do mutation: you'd have to build about a million organisms before you got a good one. And suppose you have sexual recombination; let's have a really simple biology in which each connection has its own gene, and that gene has two alleles, on and off. If you do mating, you might have an organism that got all 20 connections right, and it mates with one that has a few wrong, and the offspring gets a few wrong ones, and now it's wiped out; it doesn't have lots of offspring anymore. So it seems like a complete disaster, and it would obviously take at least a million organisms to expect to get a good one, even with parthenogenesis, where you don't have sexual reproduction.

Now I'll show you how to build a good organism in only about thirty thousand tries, and the way you do it is this. For each connection you have three alleles: turn the connection on genetically, turn the connection off genetically, or leave the connection to learning; that's the third allele. You start off with a population in which about half of the connections are genetically determined and the other half are left to learning. So ten connections are genetically determined, and there's about a one in a thousand chance that you luck out genetically and get those ten right.
Then, during the organism's lifetime, let's have a really dumb learning algorithm, like the one I talked about, where it just randomly flips the connections that are left to learning. It'll take about a thousand trials to get the combination right, but the point is that it can do those trials without building a whole organism; it can just go into the singles bar and fiddle around a bit with its connections. So what we've done is replace a million trials of evolution, building a million organisms, with building a thousand organisms (let's say thirty thousand, just to be safe), where each of those fiddles around with its connections. The whole search still examines about a million combinations, but the way you get the million combinations is a thousand organisms each doing a thousand learning trials, so almost all the work is done by learning. Now, if an organism happens to have more things set right genetically, say it's got twelve of them set right, it will learn faster, so if you mate organisms there's genetic pressure to get more and more of these alleles set genetically. But the pressure only arises because learning can get all 20 set right, so the organism can mate and have lots of offspring: the fact that learning can find a solution creates genetic pressure to hardwire these things. What's happened is that a thousand things were done by evolution, and a million minus a thousand things were done by learning, and that created a landscape for evolution that allowed evolution to gradually hardwire more and more; but these things were first found by learning. So I think a lot of the structure in the brain that's hardwired was first found by learning and gradually got backed up into the hardwiring. But to get the evolutionary pressure that says "that's good", you have to be able to do the learning; if you just hardwired things, you'd never find anything that was good.

Okay, great, thank you.

That's called the Baldwin effect, by the way. It's named after a psychology professor at the University of Toronto in the 1890s called Baldwin, who invented this effect. He didn't do any computer simulations.
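A minimal simulation of the thought experiment, using the numbers Hinton gives (20 connections, roughly half left to learning, random flipping as the learner); the early-exit on wrong hardwired bits is my shortcut, since no amount of flipping can repair those:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20                                   # binary connections in the circuit
target = rng.integers(0, 2, N)           # the one setting that works

successes = 0
for organism in range(30_000):
    learnable = rng.random(N) < 0.5      # third allele: leave it to learning
    genome = rng.integers(0, 2, N)       # hardwired bits are set at random
    fixed = ~learnable
    if not np.array_equal(genome[fixed], target[fixed]):
        continue                         # a wrong hardwired bit can never be learned away
    # Dumb lifetime learning: randomly re-flip the learnable bits ~1000 times.
    for trial in range(1000):
        genome[learnable] = rng.integers(0, 2, int(learnable.sum()))
        if np.array_equal(genome, target):
            successes += 1
            break

# A few dozen of the 30,000 organisms succeed, versus roughly 1 in a
# million per organism if every bit had to be right genetically.
print(successes)
```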
So I want to do one follow-up question on that, and then ask my final question before handing it to the audience. My follow-up is this: I think one of the things that's unclear about the success of deep learning is exactly how much of it was purely the compute versus some clever things. I've seen both cases argued, and today you suggested it was mostly the compute. But we know that if you build networks with particular architectures and particular learning rules, you are effectively making learning faster if you do it right, and arguably a lot of the success of deep learning has been the result of people thinking about good designs for their networks and good ways of making learning faster. So would you say that we have seen the process you just described actually occur within AI over the last ten years, with the learning backing up into the hardwiring?

I need to think about that. What we've seen... I mean, Yann LeCun invented convolutional neural nets in the late 1980s, but computers weren't fast enough to really do a lot with them, so they were used for handwriting recognition, and for reading 10% of all the checks in America, but they didn't really take off. They really took off when the computer hardware came along to make them really efficient. So that's a case where the ideas came first, but without the hardware they didn't work.

So you obviously need both.

Yes.

Okay, so now my last question before I hand it to the audience: what do you see as the future of the interaction between neuroscience and AI? Do you think there is space for a sort of new cognitive science where we study general intelligence but with brain-centric models rather than logic-based models, or will we see the two streams depart over the next few decades?

The way I like to think of it is this: we'd like to understand how the brain does computation. You've got brains, and you've got what we can simulate on computers, and they look pretty different to begin with, because there are many different ways to do computation, and with a conventional digital computer you can get it to pretend to be anything; so we're getting it to pretend to be some other kind of computer, an artificial neural net. We'd like to bridge this huge gap between brains and what we can simulate. Neuroscientists are doing experiments, and good computational neuroscientists are looking at those experiments to try to see how the computation could be done; I think of myself, at this end, as simulating things with artificial neural nets to see how you can make them more biological. We're trying to build a bridge: the computational neuroscientists are building from that end, I'm building from the other end, and obviously if you want to build a bridge to somewhere, you need to look at where you're going. So I'm trying to build a bridge that does computation more and more like the brain does it, or like I guess the brain does it from what my neuroscientist friends tell me. And then there's conventional AI, which isn't trying to build a bridge like that at all.

Great, okay, thank you. So now I'm going to open it up to questions from the audience. We've got an interesting system here: rather than you putting up your hands and me selecting you, you can nominate yourself to ask a question by pressing the button on your microphone, and it's first-come, first-served, so you'll be queued up. One last thing: when you're done asking your question, please turn off your microphone, because that opens up the slot for the next person in the queue. Okay, go ahead. If your red light is flashing, that means you're on and you get to ask a question. Oh, solid, sorry; you were faster. Okay.

So, here's a Clifton Suspension Bridge analogy for your bridge, if you're interested. You mentioned Hebbian synapses briefly. As neuroscientists we have a good understanding of how they work at a molecular level, so my question is: to what extent is our understanding of biological memory mechanisms like Hebbian synapses implemented in AI and deep learning, in the sorts of systems you're describing?

At present people don't use Hebbian synapses for most deep learning; they're using backpropagation, so it's an error-correction rule, as opposed to something where a synapse gets stronger just by being used. But if you want a short-term memory for things like transformers remembering a temporal context, just a simple Hebbian synapse is a good thing to have.
But Hebbian synapses can encode memories in humans that last a lifetime, so is this something AI is working towards using, or are we just going to bypass Hebbian synapses and come up with something superior?

Okay. If you think about what's been successful in the last few years, it's error-correction learning, with either labeled data or trying to predict what comes next; not Hebbian synapses. Now, people like me, who do this kind of learning but are interested in the brain, know this isn't right; we're much more interested in unsupervised learning, we just can't make it work very well yet. I would love to get learning to work as well as it does with backpropagation without using biologically implausible things, and one place we can do that is with temporary memories: if your synapses have a fast component, you can use Hebbian learning for that fast component, and that will actually help your net work better even if you're using backpropagation for the slow component. That didn't really answer your question, but, you know, it filled the time. [Laughter]

Hello. I read in the reinforcement learning book that dopamine is used as a reward prediction error signal, so I was wondering: do you see it being used as a supervisory signal like the ones you mentioned earlier?

Okay, so for reinforcement learning there is some lovely work done by Peter Dayan, who is the theoretician, together with experimentalists, showing that the data from neuroscience fits a theory that started with Rich Sutton. Peter Dayan did the work of showing that dopamine corresponds to something in a particular learning algorithm, and it doesn't correspond to the reward itself: it corresponds to, I think, the difference between the reward you were expecting and the reward you get. So if you're a monkey and you're expecting a grape and I give you a piece of cucumber, that's a negative error, and there will be a big negative hit of dopamine. But that's not the kind of learning that's been really successful so far: if you're willing to burn a lot of computer time, reinforcement learning will solve some problems, but it's not the kind of learning that's been most successful in AI. The difference is that in reinforcement learning you get a single scalar, one number, whereas in error-correction learning you typically get a whole vector of numbers.
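A minimal sketch of the prediction-error quantity being referred to, in the standard temporal-difference form; the grape and cucumber values are made-up numbers for illustration:

```python
def td_error(reward, value_now, value_next, gamma=0.9):
    # Temporal-difference error: what you got (plus the discounted new
    # prediction) minus what you were predicting. This single scalar is
    # the dopamine-like signal, not the reward itself.
    return reward + gamma * value_next - value_now

# Monkey expected a grape (prediction 1.0) but cucumber is only worth 0.2:
print(td_error(reward=0.2, value_now=1.0, value_next=0.0))  # -0.8, a dopamine dip
```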
Hey, so you mentioned the bridge analogy: your goal is to go from computers and try to get to the brain. Okay, so let's say that makes sense, and let's think about more general AI, because I'd say humans are decently general, and neuroscientists are trying to get from the other end of the bridge, the brain, to general intelligence. So you have these two kinds of debates, and this happens quite often: is it correct to go from general AI to the brain, first understanding general AI and then understanding the brain, or from the brain to general AI? What would you say is the most practical way to approach general AI?

I don't like the phrase "general AI". If you want intelligent devices, I don't think you want to produce a sort of general-purpose android; I think you want to produce different devices that are smart in different ways. Basically, if you want intelligent machines that do things: you have a vacuum cleaner and you have a backhoe; you don't try to make one thing that's both a vacuum cleaner and a backhoe. It doesn't make sense.

What about connecting them, like the different cognitive areas in the brain?

I think it's the same with cognition too. The neural net that does machine translation isn't the same neural net that does vision. My guess is that people are thinking too much about making one neural net that does everything, and not thinking enough about making more modular neural nets that are good at different things, some more universal than others. But I think that's how progress has been made so far: not by the people talking about general AI, but by people saying, how can I get a neural net to do vision, or how can I get it to do machine translation? Thank you.

Hi, I'm just wondering about the role of hierarchy in general. I mean, there are different hierarchies, like the different layers of a neural network, and you mentioned there's fast memory and slow memory. So I wonder, are there more ways to add hierarchy to neural networks, to make them more useful or to emulate the actual brain?

Yes, probably. In vision, for example, you have multiple layers, and those are the multiple cortical areas in the visual pathway. But that's a very different kind of hierarchy from what you need for dealing with the sort of structure we see out in reality. There's the universe, and maybe there are many of them, but one's enough; and then there are galaxies, and in the galaxies there are stars, and then there are solar systems, and then there are planets, and so on, and you can go all the way down to atoms. You can imagine, or you can represent, all of that in your brain, and clearly what's going on is that out in the world there's this hierarchy that goes over many, many orders of magnitude, from the universe down to quarks, or whatever the smallest thing is now, and you don't want that kind of hierarchy in your brain. What you've got in your brain is the ability to deal with a little window of hierarchy, where there's a sort of object and its parts, and to deal with the whole universe what you do is take this window and map it at the scale of the universe, where there's the universe and the galaxies; or at the scale of the galaxies and the stars; or at the scale of the atom and the electrons. So you're using the same neural hardware, but mapping reality onto it differently. I think whenever we have to deal with anything complicated we use hierarchies, but the way the brain uses them is by varying the mapping from reality onto the brain, and it can really only operate with a small window on a hierarchy, which you can move up and down, much like you only have a small region of high resolution in vision, which you move around.

Is it still logarithmic? Sorry: like a logarithm, compressing a big range into something much more manageable. Is that what you're talking about?

I wasn't thinking of it like that. I was thinking of it as: you have some fixed hardware, and when I'm thinking about the solar system, my fixed hardware couldn't possibly deal with the universe, which is much too big, and it couldn't possibly deal with an atom, which is much too small, but it's fine dealing with the sun and the planets and maybe a moon or two. What I'm trying to get at is that we need to make a big distinction between hierarchical structures in the real world and how we deal with them cognitively, where we use attention and only ever deal with a bit of the hierarchy at a time.
That's not the same for, say, aspects of language. With vision I can use the same neurons for representing the sun and for representing a nucleus; it's just an analogy, but it's the same neurons I'm using. Now, if I'm processing language, I've got things that find phonemes, and things that turn phonemes into words, and things that turn words into sentences, and I can't move a window like that: it's a fixed hierarchy, with the phonemes and the words and the phrases and the sentences, and that's all sort of fixed in the brain. That's not a flexible mapping: you can't move the sentences down to where the words were, or move the words down to where the phonemes were; that doesn't work. So some hierarchies really do relate to sets of neurons in the brain, like the layers in the connectionist models, and there are other hierarchies, like the whole spatial structure of the universe, where what's in the brain is a window you move over the hierarchy. Thank you.

Thanks, Dr. Hinton, for an excellent talk and excellent ideas about the feasibility of backpropagation. My question may be more boring; it's about the statistical comments you made. Is that Dale? Pardon me, are you Dale? No, I'm Kyle. You sound like Dale Schuurmans. I'm at the University of Alberta. Well, that's a coincidence, that you sound just like him. It's the university; they train us all to speak the same. Are you a student of Dale's? No. Good, because if you were a student of Dale's I'd need to watch out; it would be a very tricky question. It's not. The question is this: I was trained with the intuition that you can't over-parameterize your models, that if you're trying to fit a line you need two points, if you're trying to fit a curve you need three, and so on, and that scales up, so you should always have a few more data points than parameters. I know that you, and the field, have shown clearly that that's not true. What were the statisticians getting wrong in their logic?

It has to do with regularization: you need the model to be highly regularized. But first of all, I'll show you something. If you want to fit three data points, you would have told me you want a polynomial with only three degrees of freedom, so a constant, a slope and a curvature, because that's all you can afford with three data points. That's wrong. Now, this is where we need a pen. Okay, can you see that? We're going to have three data points, and if you're a statistician you'd probably say that for three data points you ought to fit a straight line, like this, because I could fit a parabola and the parabola would fit exactly, and that's a bit suspicious. In other words, the parabola fits exactly, but do you really believe it? If you were to ask, when x is zero, what's the value of y, do you really believe that value for y? A straight line is far more conservative. So that's the position where they'd fit a straight line. However, that would be a frequentist statistician. If you were a Bayesian statistician, and this is in Chris Bishop's machine learning textbook, with a nice picture of it, I think it's that book, you would say: okay, let's try fitting fifth-order polynomials. With fifth-order polynomials we might even fit ones that don't exactly go through the data, but for now let's make them go through the data. So we fit a fifth-order polynomial, and it goes kind of one, two, three, in some order, and we fit another one, oh, that one went off the page, and we keep fitting these guys, and we fit a gazillion of them. What you see at the end is that, in between the data points, these gazillion polynomials are kind of all over the place, and their average is in a sensible place, but their variance is big. What they're telling you is: if you give me this x-coordinate, I'm rather uncertain about the y-coordinate, but this is a good bet; and similarly here. And if you go out here, these polynomials are just all over the place, and they'll tell you: if you give me this x-value, then y could be pretty much anything, but here's a bet. That's a much better answer than you get from a straight line. So by fitting a very large number of different polynomials and then averaging, you get good mean answers, and you also get a sense of the variance.
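Here is a minimal numerical sketch of that picture, assuming NumPy; the specific data points and the way the family of interpolating fifth-order polynomials is sampled are invented for illustration, not taken from the talk or from Bishop's book:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([-1.0, 0.5, 2.0])   # three data points (hypothetical)
y = np.array([1.0, 0.0, 1.5])

grid = np.linspace(-2, 3, 200)
preds = []
for _ in range(1000):
    # Sample the high-order coefficients at random, then solve for the
    # constant, slope and curvature that make the polynomial pass
    # exactly through the three points.
    free = rng.normal(scale=0.5, size=3)            # coeffs of x^3, x^4, x^5
    A = np.vander(x, 3, increasing=True)            # columns 1, x, x^2
    high = np.vander(x, 6, increasing=True)[:, 3:] @ free
    low = np.linalg.solve(A, y - high)
    coeffs = np.concatenate([low, free])
    preds.append(np.vander(grid, 6, increasing=True) @ coeffs)

preds = np.array(preds)
mean, std = preds.mean(axis=0), preds.std(axis=0)
# Near the data points std is ~0; between and beyond them it grows,
# which is the "could be pretty much anything, but here's a good bet"
# answer, versus the single overconfident line.
```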
Okay, now, dropout is doing something like that? Yes. Yeah, and that's brilliant; thank you for coming up with dropout. Many of us here are working in a regime of sparse data, so we have a couple of channels, a couple of signals, a couple of voxels, and you've convinced us that we need more. Is there a way forward in AI that can manage with sparser data, or is big data the only regime that's going to succeed?

So the really big successes have been on big databases, and I think we should be using even bigger models. But you can't get away from the fact that if you're going to have something that starts off random and sucks all its knowledge from the data, you'd better have enough data to suck all that knowledge from. The bigger your models, the better, if you regularize them, but you still need a lot of data. The way you should think about it is this: if you've got a hundred thousand data points, that's small.

I know, and that's very depressing if you're a neuroscientist. Actually, it's not just depressing, it seems impossible: if you want to personalize medicine for one individual, and you want to train a model on data from their brain, it seems like there's going to be a disconnect between what these models can do and how they might help someone in the future.

Yes and no. If I train a model on a very large number of people and then apply that model to one person, that's a form of personalized medicine that really works.

Great, thanks. Oh, and say hi to Dale for me.

Thank you for the presentation. My question is about dropout. The thing is that you randomly drop some parts of the network, and then you say, okay, it works better, and I would accept that. But do we have any intuition as to why? For example, if we embed the network in a graph and compare the different subgraphs we get, are there any similarities between them, or do we just do it randomly? I guess the problem with the randomness is that we're putting the burden of prediction on the random part of the computation.

I didn't hear the whole question, but certainly in dropout what we do is randomly leave out units. Now, you can also do block dropout: you can take groups of units and randomly leave out the groups, and what that does is allow the units within a group to collaborate with one another, while between groups they have to be fairly independent. That's called block dropout, and it works quite well too. But I didn't really hear the rest of your question.
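For reference, a minimal sketch of the two schemes just described, assuming NumPy; the layer size, group size and drop rate are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=12)   # hidden-layer activations
p = 0.5                   # drop probability

# Standard dropout: each unit is kept independently with probability
# 1 - p, and survivors are rescaled so the expected activation is
# unchanged at test time.
mask = rng.random(12) >= p
h_drop = h * mask / (1 - p)

# Block dropout: the units are partitioned into groups of 4, and each
# group is kept or dropped as a whole, so units within a surviving
# group can collaborate while the groups stay fairly independent.
groups = rng.random(3) >= p        # one keep/drop decision per group
block_mask = np.repeat(groups, 4)  # expand to one entry per unit
h_block = h * block_mask / (1 - p)
```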
Okay, the question was about... You'll have to talk closer to the microphone, because I'm partially deaf. Okay, the question is this: imagine that you have a dropout rate of 50%, so you're getting rid of 50% of the nodes in the network. If we do this iteratively, would we find any similarity in the structure of the networks that produce the best results, and if so, would that structure correspond to something physical? For example, if you're doing vision, would it correspond to something in the brain?

Yeah, lots of people have thought about whether you can do better than random in dropout, and there's some work on that, like block dropout, which works for some things, but I don't really have much to say about it. I don't really know the answer to whether there's something much more sensible than dropout, something a lot more structured. There might well be. Thank you.

Okay, so with that we're going to have to end, so please join me in thanking Geoff. [Music] I'd also like to thank Blake for hosting this event. I felt the questions could have gone on all night, but tip-off is in half an hour, so some of us have to move on. So thank you, Blake. [Applause]
Info
Channel: Canadian Association for Neuroscience
Views: 9,302
Keywords: Geoffrey Hinton, AI, neuroscience, backpropagation, brain, canadian neuroscience
Id: qIEfJ6OBGj8
Length: 82min 3sec (4923 seconds)
Published: Mon Jun 10 2019