Heroes of Deep Learning: Andrew Ng interviews Geoffrey Hinton

Captions
Welcome, Geoff, and thank you for doing this interview with deeplearning.ai.

Thank you for inviting me.

I think that at this point you, more than anyone else on this planet, have invented so many of the ideas behind deep learning, and a lot of people have been calling you the godfather of deep learning, although it wasn't until we were chatting a few minutes ago that I realized you think I was the first one to call you that, which I'm quite happy to have done. Because many people know you as a legend, I want to ask about the personal story behind the legend. Going way back, how did you get involved in AI, machine learning, and neural networks?

So when I was at high school I had a classmate who was always better than me at everything; he was a brilliant mathematician. He came into school one day and said, "Did you know the brain uses holograms?" I guess that was about 1966, and I said, "Sort of. What's a hologram?" He explained that in a hologram you can chop off half of it and you still get the whole picture, and that memories in the brain might be distributed over the whole brain. I guess he'd read about Lashley's experiments, where you chop out bits of a rat's brain and discover it's very hard to find one bit that stores one particular memory. That's what first got me interested in how the brain stores memories.

Then when I went to university I started off studying physiology and physics; I think I was the only undergraduate at Cambridge doing physiology and physics. Then I gave up on that and tried philosophy, because I thought that might give me more insight, but philosophy seemed to me to lack ways of telling when somebody had said something false. So then I switched to psychology, and in psychology they had very, very simple theories that seemed hopelessly inadequate for explaining what the brain was doing. Then I took some time off and worked as a carpenter, and then I decided to try AI, and I went off to Edinburgh to study AI with Christopher Longuet-Higgins. He had done very nice work on neural networks, but he'd just given up on neural networks, having been very impressed by Winograd's thesis, so when I arrived he thought I was doing old-fashioned stuff and that I ought to work on symbolic AI. We had a lot of fights about that, but I just kept doing what I believed in, and eventually I got a PhD in AI. Then I couldn't get a job in Britain, but I saw a very nice advertisement for Sloan fellowships in California, and I managed to get one of those and went to California, and everything was different there. In Britain neural nets were regarded as kind of silly, but in California Don Norman and David Rumelhart were very open to ideas about neural nets. It was the first time I'd been somewhere where thinking about how the brain works, and how that might relate to psychology, was seen as a very positive thing, and it was a lot of fun. Collaborating with David Rumelhart in particular was great.

So this was when you were at UCSD, and you and Rumelhart, around 1982, wound up writing the seminal backprop paper, right?

Actually, it was more complicated than that. What happened was that in early 1982 David Rumelhart and me and Ron Williams between us developed the backprop algorithm; it was mainly David Rumelhart's idea. We discovered later that many other people had invented it. David Parker had invented it, probably after us but before we published.
Paul Werbos had published it already quite a few years earlier, but nobody paid it much attention, and there were other people who developed very similar algorithms. It's not entirely clear what's meant by backprop; using the chain rule to get derivatives was not a novel idea.

Why do you think it was your paper that helped the community latch on to backprop so much? It feels like your paper marked an inflection point in the acceptance of the algorithm.

So we managed to get a paper into Nature in 1986, and I did quite a lot of political work to get the paper accepted. I figured out that one of the referees was probably going to be Stuart Sutherland, who was a well-known psychologist in Britain, and I went and talked to him for a long time and explained exactly what was going on. He was very impressed by the fact that we showed that backprop could learn representations for words: you could look at those representations, which were little vectors, and you could understand the meaning of the individual features. We actually trained it on little triples of words about family trees, like "Mary has mother Victoria": you'd give it the first two words and it would have to predict the last word. After you trained it, you could see all sorts of features in the representations of the individual words, like the nationality of the person, which generation they were, which branch of the family tree they were in, and so on. That was what made Stuart Sutherland really impressed with it, and I think that was why the paper got accepted.

So very early these were word embeddings, and you were already seeing features with semantic meaning emerge from the training algorithm.

Yes. From a psychologist's point of view, what was interesting was that it unified two completely different strands of ideas about what knowledge was like. There was the old psychologists' view that a concept is just a big bundle of features, and there's lots of evidence for that; and then there was the AI view of the time, which was a far more structuralist view: a concept is how it relates to other concepts, and to capture a concept you'd have to use something like a graph structure or a semantic net. What this backpropagation example showed was that you could give it the information that would go into a graph structure, in this case a family tree, and it could convert that information into features in such a way that it could then use the features to derive new, consistent information, that is, generalize. The crucial thing was this to-and-fro between the graphical, tree-structured representation of the family tree and the representation of the people as big feature vectors: from the graph-like representation you could get to the feature vectors, and from the feature vectors you could get more of the graph-like representation.

So this was 1986. In the early 90s Bengio showed that you could take real data, English text, apply the same techniques, and get embeddings for real words from English text, and that impressed people a lot. Recently we've been talking a lot about how fast computers, GPUs and supercomputers, are driving deep learning; I didn't realize that back between 1986 and the early 90s, between you and Bengio, there was already the beginning of this trend.

Yes, it was a huge advance. In 1986 I was using a Lisp machine that was less than a tenth of a megaflop, and by about 1993 or thereabouts people were getting around ten megaflops, so it was a factor of a hundred, and that's the point at which it became much easier to use, because computers were just getting faster.
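The family-tree setup Hinton describes above, where a network is given the first two elements of a triple and trained to predict the third and the learned vectors are then inspected, is easy to reproduce in miniature. The sketch below is a toy reconstruction in Python, not the 1986 architecture: the triples, embedding size, and training settings are invented for illustration.

```python
# Toy sketch of the "family trees" idea: learn vector representations
# (embeddings) for people and relations by predicting the third element of
# (person, relation, person) triples. Data and hyperparameters are
# illustrative, not the original 1986 setup.
import torch
import torch.nn as nn

people = ["mary", "victoria", "james", "colin", "charlotte"]
relations = ["mother", "father", "son", "daughter"]
# Hypothetical triples: (person, relation) -> person
triples = [("mary", "mother", "victoria"),
           ("colin", "father", "james"),
           ("victoria", "daughter", "mary"),
           ("james", "son", "colin")]

p_idx = {p: i for i, p in enumerate(people)}
r_idx = {r: i for i, r in enumerate(relations)}

class TripleNet(nn.Module):
    def __init__(self, n_people, n_relations, dim=6):
        super().__init__()
        self.person = nn.Embedding(n_people, dim)     # learned "feature vectors"
        self.relation = nn.Embedding(n_relations, dim)
        self.out = nn.Linear(2 * dim, n_people)       # predict the third word

    def forward(self, p, r):
        x = torch.cat([self.person(p), self.relation(r)], dim=-1)
        return self.out(torch.relu(x))

model = TripleNet(len(people), len(relations))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

p = torch.tensor([p_idx[a] for a, _, _ in triples])
r = torch.tensor([r_idx[b] for _, b, _ in triples])
t = torch.tensor([p_idx[c] for _, _, c in triples])

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(p, r), t)
    loss.backward()
    opt.step()

# After training, model.person.weight holds one vector per person. In the
# original experiment, individual dimensions of such vectors came to encode
# interpretable features like nationality and generation.
print(model.person.weight.detach())
```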
Over the past several decades you've invented so many pieces of neural networks and deep learning. Of all the things you've invented, which are the ones you're still most excited about today?

I think the most beautiful one is the work I did with Terry Sejnowski on Boltzmann machines. We discovered there was this really, really simple learning algorithm that applied to great big, densely connected nets where you could only observe a few of the nodes, so it would learn hidden representations. It was a very simple algorithm, and it looked like the kind of thing you should be able to get in a brain, because each synapse only needed to know about the behavior of the two neurons it was directly connected to, and the information that was propagated was the same in the two different phases, which we called wake and sleep. In the two phases you're propagating information in just the same way, whereas in something like backpropagation there's a forward pass and a backward pass that work differently; they're sending different kinds of signals. So I think that's the most beautiful thing. For many years it looked like just a curiosity, because it seemed much too slow, but later on I got rid of a little bit of the beauty: instead of letting things settle down, I just used one iteration in a somewhat simpler net, and that gave restricted Boltzmann machines, which actually worked effectively in practice. In the Netflix competition, for example, restricted Boltzmann machines were one of the ingredients of the winning entry.

In fact, a lot of the recent resurgence of neural nets, starting around, I guess, 2007, was the restricted Boltzmann machine and deep belief net work that you and your lab did.

Yes, so that's another of the pieces of work I'm very happy with. The idea was that you could train a restricted Boltzmann machine, which has just one layer of hidden features, and learn one layer of features; then you could treat those features as data and do it again, and then treat the new features you'd learned as data and do it again, as many times as you liked. That was nice, and it worked in practice. Then Yee Whye Teh realized that the whole thing could be treated as a single model, but it was a weird kind of model: at the top you had a restricted Boltzmann machine, but below that you had a sigmoid belief net, which was something Radford Neal had invented many years earlier. So it was a directed model, and what we'd managed to come up with by training these restricted Boltzmann machines was an efficient way of doing inference in sigmoid belief nets. Around that time there were people doing neural nets who would use densely connected nets but didn't have any good way of doing probabilistic inference in them, and there were people doing graphical models, like Mike Jordan, who could do inference properly but only in sparsely connected nets. What we managed to show was a way of learning these deep belief nets so that there's an approximate form of inference that's very fast; it happens in a single forward pass. That was a very beautiful result: you could guarantee that each time you learned an extra layer of features there was a bound, and the new bound was always better than the previous one.
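The shortcut Hinton mentions, using one iteration instead of letting the network settle, is the one-step contrastive divergence (CD-1) update used to train restricted Boltzmann machines. Below is a minimal NumPy sketch of CD-1 for a binary RBM; the data, layer sizes, and learning rate are placeholders, and a practical implementation would add mini-batching, momentum, and monitoring of reconstruction error.

```python
# Minimal sketch of training a binary restricted Boltzmann machine with
# one-step contrastive divergence (CD-1). Sizes, data, and learning rate
# are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible biases
b_h = np.zeros(n_hidden)    # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy binary training data (each row is one visible vector).
data = rng.integers(0, 2, size=(20, n_visible)).astype(float)

for epoch in range(100):
    for v0 in data:
        # Positive phase: sample hidden units given the data.
        p_h0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(n_hidden) < p_h0).astype(float)

        # Negative phase: a single reconstruction step (the "one iteration"
        # shortcut, rather than running the chain to equilibrium).
        p_v1 = sigmoid(h0 @ W.T + b_v)
        v1 = (rng.random(n_visible) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ W + b_h)

        # CD-1 update: difference between data-driven and
        # reconstruction-driven pairwise statistics.
        W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b_v += lr * (v0 - v1)
        b_h += lr * (p_h0 - p_h1)

# The hidden probabilities of a trained RBM give the "layer of features"
# that can be treated as data for the next RBM in a stack (a deep belief net).
```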
Oh, the variational bound showing that. I see.

Yes. So that was the second thing I was really excited by. And I guess the third thing was the work I did with Radford Neal on variational methods. It turns out people in statistics had done similar work earlier, but we didn't know about that. We managed to make EM work a whole lot better by showing that you didn't need to do a perfect E-step; you could do an approximate E-step. EM was a big algorithm in statistics, and we'd shown a big generalization of it. In particular, in 1993 I guess, with Van Camp, I did a paper that was, I think, the first variational Bayes paper, where we showed that you could do a version of Bayesian learning that was far more tractable by approximating the true posterior with a Gaussian, and you could do that in a neural net. I was very excited by that.

Wow, I think I remember all of those papers; I spent many hours reading over the Neal and Hinton approximate EM paper. And I think some of the algorithms that lots of people use almost every day, things like dropout or, I guess, ReLU activations, also came from your group.

Yes and no. Other people had thought about rectified linear units, but we did some work with restricted Boltzmann machines showing that a ReLU was almost exactly equivalent to a whole stack of logistic units, and that's one of the things that helped ReLUs catch on.

I was really curious about that. The ReLU paper had a lot of math showing that this function can be approximated by this really complicated formula. Did you do that math so the paper would get accepted at an academic conference, or did all that math really influence the development of max(0, x)?

That was one of the cases where the math was actually important to the development of the idea. I knew about rectified linear units, obviously, and I knew about logistic units, and because of the work on Boltzmann machines all of the basic theory was done for logistic units. So the question was: could the learning algorithm work in something with rectified linear units? By showing that rectified linear units were almost exactly equivalent to a stack of logistic units, we showed that all the math would go through.

I see. So it provided the inspiration, but today tons of people use ReLUs and they just work, without necessarily needing to understand that same derivation.

Yeah. One thing I noticed later, when I went to Google, I guess in 2014: I gave a talk at Google about using ReLUs and initializing with the identity matrix, because the nice thing about ReLUs is that if you keep replicating the hidden layers and you initialize with the identity, each layer just copies the pattern in the layer below. I was showing that you could train networks with 300 hidden layers, and you could train them really efficiently if you initialized with the identity. But I didn't pursue it any further, and I really regret not pursuing it. We published one paper with Quoc Le showing that you could initialize recurrent nets like that, but I should have pursued it further, because later on these residual networks came out, and that's really that kind of thing.
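The equivalence Hinton refers to is that a set of logistic units sharing the same input, with biases offset by 0.5, 1.5, 2.5, and so on, sums to approximately softplus, log(1 + e^x), which away from zero behaves like max(0, x). A quick NumPy check of that approximation, with the number of copies chosen arbitrarily for illustration:

```python
# Numeric check: a stack of logistic (sigmoid) units with shared input x and
# biases shifted by -0.5, -1.5, -2.5, ... sums to roughly softplus(x),
# which is close to the rectified linear function max(0, x).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5.0, 5.0, 11)

n_copies = 50  # illustrative; the approximation assumes "enough" copies
stack_of_logistics = sum(sigmoid(x - i + 0.5) for i in range(1, n_copies + 1))
softplus = np.log1p(np.exp(x))
relu = np.maximum(0.0, x)

for xi, s, sp, r in zip(x, stack_of_logistics, softplus, relu):
    print(f"x={xi:+.1f}  sum-of-logistics={s:6.3f}  softplus={sp:6.3f}  relu={r:6.3f}")
```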
Over the years I've heard you talk a lot about the brain, and about the relationship between backprop and the brain. What are your current thoughts on that?

I'm actually working on a paper on that right now. I guess my main thought is this: if it turns out that backprop is a really good algorithm for doing learning, then for sure evolution could have figured out how to implement it. I mean, you have cells that can turn into either eyeballs or teeth; if cells can do that, they can for sure implement backpropagation, and presumably there's huge selective pressure for it. So I think the neuroscientists' idea that it doesn't look plausible is just silly. There may be some subtle implementation of it, and I think the brain probably has something that may not be exactly backpropagation but is quite close to it. Over the years I've come up with a number of ideas about how this might work. In 1987, working with Jay McClelland, I came up with the recirculation algorithm, where the idea is that you send information around a loop and you try to make it so that things don't change as the information goes around the loop. The simplest version would be: you have input units and hidden units, and you send information from the input to the hidden and then back to the input, and then back to the hidden, and then back to the input, and so on. You want to train an autoencoder, but you want to train it without having to do backpropagation, so you just train it to try to get rid of all variation in the activities. The idea is that the learning rule for a synapse is: change the weight in proportion to the presynaptic input and in proportion to the rate of change of the postsynaptic input, except that in recirculation you're trying to make the old postsynaptic activity be good and the new one be bad, so you change the weight in that direction. We invented this algorithm before neuroscientists came up with spike-timing-dependent plasticity. Spike-timing-dependent plasticity is actually the same algorithm but the other way round, where the new thing is good and the old thing is bad: you change the weight in proportion to the presynaptic activity times the new postsynaptic activity minus the old one.

Later on, in 2007, I realized that if you took a stack of restricted Boltzmann machines and trained it up, then after it was trained you had exactly the right conditions for implementing backpropagation by just trying to reconstruct: if you looked at the reconstruction error, that reconstruction error would actually tell you the derivative of the discriminative performance. At the first deep learning workshop at NIPS in 2007 I gave a talk about that, which was almost completely ignored. Later on Yoshua Bengio took up the idea and has done quite a lot more work on it, and I've been doing more work on it myself. I think this idea, that if you have a stack of autoencoders you can get derivatives by sending activity backwards and looking at reconstruction errors, is a really interesting idea, and it may well be how the brain does it.
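The learning rule Hinton states, changing each weight in proportion to the presynaptic activity times the difference between the old and the new postsynaptic activity, can be written down directly. The sketch below is a rough NumPy rendering of that rule for a tiny tied-weight autoencoder, intended only to make the update concrete; details of the original 1987 recirculation algorithm, such as regressing the reconstruction toward the input, are omitted or simplified.

```python
# Rough sketch of a recirculation-style update for a tiny tied-weight
# autoencoder: activity circulates input -> hidden -> input -> hidden, and
# each weight changes in proportion to (presynaptic activity) times
# (old postsynaptic activity - new postsynaptic activity).
# Simplified for illustration; not a faithful copy of the 1987 algorithm.
import numpy as np

rng = np.random.default_rng(1)
n_vis, n_hid, lr = 8, 3, 0.05
W = 0.1 * rng.standard_normal((n_vis, n_hid))  # tied weights, used both ways

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

data = rng.random((30, n_vis))

for epoch in range(200):
    for v0 in data:
        h0 = sigmoid(v0 @ W)      # first pass through the hidden units
        v1 = sigmoid(W @ h0)      # reconstruction of the input
        h1 = sigmoid(v1 @ W)      # second pass through the hidden units

        # Visible units: old activity v0 is "good", new activity v1 is "bad".
        # Hidden units:  old activity h0 is "good", new activity h1 is "bad".
        W += lr * np.outer(v0 - v1, h0)   # presynaptic h0, postsynaptic change at visibles
        W += lr * np.outer(v1, h0 - h1)   # presynaptic v1, postsynaptic change at hiddens

# If learning succeeds, activities change less as they circulate:
v0 = data[0]
h0 = sigmoid(v0 @ W)
print("reconstruction error:", np.mean((v0 - sigmoid(W @ h0)) ** 2))
```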
One other topic I know you've thought a lot about, and that I hear you're still working on, is how to deal with multiple time scales in deep learning. Can you share your thoughts on that?

Yes. That actually goes back to my first year as a graduate student. The first talk I ever gave was about using what I called fast weights: weights that adapt rapidly but also decay rapidly, and can therefore hold short-term memory. I showed in a very simple system in 1973 that you could do true recursion with those weights. What I mean by true recursion is that the neurons used for representing things get reused for representing things in the recursive call, and the weights used for representing knowledge get reused in the recursive call. That leaves the question of how, when you pop out of a recursive call, you remember what you were in the middle of doing. Where is that memory, given that you used the neurons for the recursive call? The answer is that you can put that memory into fast weights, and you can recover the activity states of the neurons from those fast weights. More recently, working with Jimmy Ba, we got a paper into NIPS about using fast weights for recursion like that. So it was quite a big gap: the first model was unpublished in 1973, and Jimmy Ba's model was in 2015 or 2016, I think, so about 40 years later.

And I guess one other idea you've talked about for quite a few years now, over five years I think, is capsules. Where are you with that?

Okay, so I'm back in the state I'm used to being in, which is that I have this idea I really believe in, nobody else believes it, I submit papers about it and they all get rejected, but I really believe in the idea and I'm just going to keep pushing it. It hinges on a couple of key ideas. One is about how you represent multi-dimensional entities: you can represent a multi-dimensional entity with just a little vector of activities, as long as you know there's only one of them. So the idea is that in each region of the image you assume there's at most one of a particular kind of feature, and then you use a bunch of neurons whose activities represent the different aspects of that feature: within that region, exactly what its x and y coordinates are, what orientation it's at, how fast it's moving, what color it is, how bright it is, and so on. You can use a whole bunch of neurons to represent different dimensions of the same thing, provided there's only one of it. That's a very different way of doing representation from what we're normally used to in neural nets, where you have a great big layer and all the units do whatever they do, but you don't think of bundling them up into little groups that represent different coordinates of the same thing. So I think there should be this extra structure. The other idea that goes with it is that you partition the representation into different subsets, and I call each of those subsets a capsule. The idea is that a capsule is able to represent an instance of a feature, but only one, and it represents all the different properties of that feature. It's a feature that has lots of properties, as opposed to a normal neuron in a normal neural net, which has just one scalar property.

Then, once you've got that, you can do something that normal neural nets are very bad at, which is what I call routing by agreement. Suppose you want to do segmentation, and you have something that might be a mouth and something else that might be a nose, and you want to know whether you should put them together to make one thing. The idea is that you have a capsule for the mouth that has the parameters of the mouth, and a capsule for the nose that has the parameters of the nose, and then, to decide whether they go together, you get each of them to vote for what the parameters should be for a face. If the mouth and the nose are in the right spatial relationship, they will agree. When you get capsules at one level voting for the same set of parameters at the next level up, you can assume they're probably right, because agreement in a high-dimensional space is very unlikely to happen by chance. That's a very different way of doing filtering from what we normally use in neural nets, and I think this routing by agreement is going to be crucial for getting neural nets to generalize much better from limited data. I think it will be very good at dealing with changes in viewpoint, very good at doing segmentation, and I'm hoping it will be much more efficient than what we currently do in neural nets, which is that if you want to deal with changes in viewpoint, you just give the net a whole bunch of different viewpoints and train it on them all.
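The nose-and-mouth example can be turned into a tiny numeric illustration of the agreement test. The sketch below is not any published capsule routing algorithm, just a toy: each "part capsule" holds a pose (x, y, orientation), predicts the pose of a face using an assumed fixed part-to-whole offset, and the face capsule is activated only if the predictions agree.

```python
# Toy illustration of "routing by agreement": part capsules (nose, mouth)
# each predict the pose of a whole (face); the face is accepted only when
# the predictions agree. Poses are (x, y, orientation); the part-to-whole
# offsets and the agreement threshold are made up for illustration.
import numpy as np

def predict_face_pose(part_pose, offset):
    """Predict the whole's pose from a part's pose and a fixed offset."""
    x, y, theta = part_pose
    dx, dy, dtheta = offset
    # Rotate the offset by the part's orientation, then translate.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([x + c * dx - s * dy,
                     y + s * dx + c * dy,
                     theta + dtheta])

# Assumed geometry (y increases upward): on an upright face the nose sits
# above the mouth, with the face center between them.
NOSE_TO_FACE = (0.0, -1.0, 0.0)
MOUTH_TO_FACE = (0.0, +1.0, 0.0)

def face_from_parts(nose_pose, mouth_pose, threshold=0.3):
    votes = np.stack([predict_face_pose(nose_pose, NOSE_TO_FACE),
                      predict_face_pose(mouth_pose, MOUTH_TO_FACE)])
    disagreement = np.linalg.norm(votes[0] - votes[1])
    if disagreement < threshold:          # votes agree: accept the face
        return votes.mean(axis=0)
    return None                           # votes disagree: no face here

# Consistent arrangement: nose directly above the mouth -> the votes agree.
print(face_from_parts(nose_pose=(0.0, 4.0, 0.0), mouth_pose=(0.0, 2.0, 0.0)))
# Jumbled arrangement: same parts, wrong relative pose -> rejected.
print(face_from_parts(nose_pose=(0.0, 4.0, 0.0), mouth_pose=(3.0, 0.0, 1.0)))
```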
I see. So rather than needing lots of supervised learning data, it can learn this in some different way?

Well, I still plan to do it with supervised learning, but the mechanics of the forward pass are very different. It's not a pure forward pass, in the sense that there are little bits of iteration going on: you think you've found a mouth, you think you've found a nose, and you do a little bit of iteration to decide whether they should really go together to make a face.

I see. And you can do backprop through all that iteration?

You can backprop through that iteration, so I suppose you could train it discriminatively. We're working on that now, my group in Toronto; I now have a little Google team in Toronto, part of the Brain team. That's what I'm excited about right now.

I look forward to that paper when it comes out.

Yeah, if it comes out.

You've worked in deep learning for several decades. I'm really curious: how has your thinking, your understanding of AI, changed over those years?

I guess a lot of my intellectual history has been around backpropagation: how to use backpropagation and how to make use of its power. To begin with, in the mid-80s, we were using it for discriminative learning, and it was working well. Then by the early 90s I decided that most human learning was going to be unsupervised learning, and I got much more interested in unsupervised learning, and that's when I worked on things like the wake-sleep algorithm.

And your comments at that time really influenced my thinking as well. When I was leading Google Brain, our first project did a lot of work on unsupervised learning because of your influence.

Right, and I may have misled you. In the long run I think unsupervised learning is going to be absolutely crucial, but you have to face reality, and what's worked over the last ten years or so is supervised learning, discriminative training, where you have labels or you're trying to predict the next thing in a series, so that acts as the label. That's worked incredibly well. I still believe that unsupervised learning is going to be crucial, and things will work incredibly much better than they do now when we get it working properly, but we haven't yet.

I think many of the senior people in deep learning, including myself, remain very excited about it; it's just that none of us really has much idea how to do it yet. Maybe you do.

Well, variational autoencoders, where you use the reparameterization trick, seem to me a really nice idea, and generative adversarial nets also seem to me a really nice idea. I think generative adversarial nets are one of the biggest ideas in deep learning that's really new. I'm hoping I can make capsules that successful, but right now I think generative adversarial nets have been a big breakthrough.
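The reparameterization trick Hinton mentions is the piece that makes variational autoencoders trainable by backprop: instead of sampling the latent code z directly from N(mu, sigma^2), you sample noise from N(0, I) and compute z = mu + sigma * eps, so gradients can flow into mu and sigma. A minimal sketch, with layer sizes chosen arbitrarily for illustration:

```python
# Minimal sketch of the reparameterization trick at the heart of a
# variational autoencoder: z = mu + sigma * eps keeps sampling
# differentiable with respect to mu and sigma. Layer sizes are arbitrary.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=20, z_dim=2):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs mu and log-variance
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)               # noise from N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # the reparameterization trick
        x_hat = self.dec(z)
        # Standard VAE objective: reconstruction term plus KL to the prior.
        recon = ((x_hat - x) ** 2).sum(dim=-1).mean()
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        return recon + kl

vae = TinyVAE()
x = torch.randn(16, 20)           # toy batch
loss = vae(x)
loss.backward()                   # gradients reach mu and sigma through z
```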
What happened to sparsity and slow features, two of the other principles for building unsupervised models?

I was never as big on sparsity as you were, but slow features I think is a mistake: you shouldn't say "slow". The basic idea is right, but you shouldn't go for features that don't change; you should go for features that change in predictable ways. Here's the basic principle for how you model anything: you take your measurements and you apply nonlinear transformations to them until you get to a representation, a state vector, in which the action is linear. So you don't just pretend it's linear, like you do with Kalman filters; you actually find a transformation from the observables to the underlying variables in which linear operations, like matrix multiplies on the underlying variables, will do the work. For example, if you want to change viewpoint, if you want to produce the image from another viewpoint, what you should do is go from the pixels to coordinates, and once you've got to the coordinate representation, which is the kind of thing I'm hoping capsules will find, you can do a matrix multiply to change the viewpoint and then map it back to pixels.

Right, that's a very general principle. That's why you did all that work on face synthesis, right, where you take a face, compress it to a very low-dimensional vector, and then you can fiddle with that and get back other faces?

I had a student who worked on that; I didn't do much work on that myself.
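Hinton's "go to coordinates, do a matrix multiply, come back" principle is easy to demonstrate on the coordinate side. The sketch below assumes the object's coordinate representation is already available (a hypothetical encoder would supply it from pixels and a decoder would render it back); given that, a viewpoint change is just one linear operation in homogeneous coordinates.

```python
# In a coordinate representation, a viewpoint change is linear: one matrix
# multiply rotates and translates all of an object's points at once. The
# pixel<->coordinate mappings (encoder/decoder) are assumed to exist and are
# not implemented here.
import numpy as np

# Hypothetical "coordinate representation" of an object: corner points of a
# square, in homogeneous coordinates (x, y, 1).
points = np.array([[0.0, 0.0, 1.0],
                   [1.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 1.0, 1.0]])

def viewpoint_change(angle, tx, ty):
    """Rotation by `angle` plus translation (tx, ty) as a single 3x3 matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c,  -s,  tx],
                     [s,   c,  ty],
                     [0.0, 0.0, 1.0]])

M = viewpoint_change(angle=np.pi / 6, tx=2.0, ty=-1.0)
new_points = points @ M.T   # the whole viewpoint change is one matrix multiply
print(new_points)
# On raw pixels the same viewpoint change would be a highly nonlinear
# rearrangement of intensities; in coordinates it is linear.
```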
I'm sure you still get asked all the time: if someone wants to break into deep learning, what should they do? You've given a lot of advice in one-on-one settings, but for the global audience watching this video, what advice would you have for getting into deep learning?

Okay, so my advice is: read the literature, but don't read too much of it. That's advice I got from my advisor, and it's very unlike what most people say. Most people say you should spend several years reading the literature and then start working on your own ideas, and that may be true for some researchers, but for creative researchers I think what you want to do is read a little bit of the literature and notice something that you think everybody is doing wrong. Be contrarian in that sense: you look at it and it just doesn't feel right, and then you figure out how to do it right. And when people tell you that's no good, just keep at it. I have a very good principle for helping people keep at it, which is: either your intuitions are good or they're not. If your intuitions are good, you should follow them and you will eventually be successful; if your intuitions are not good, it doesn't matter what you do.

Inspiring advice: you might as well go for it.

You might as well trust your intuitions; there's no point not trusting them.

I usually advise people to not just read but to replicate published papers, and maybe that naturally limits how many you can do, because replicating results is pretty time-consuming.

Yes, it's true that when you try to replicate a published paper you discover all the little tricks necessary to make it work. The other advice I have is: never stop programming. Because if you give a bad student something to do, they'll come back and say it didn't work, and the reason it didn't work will be some little decision they made that they didn't realize was crucial. Whereas if you give it to a good student, you can give them anything and they'll come back and say it worked. I remember doing this once and saying, "But wait a minute. Since we last talked, I realized it couldn't possibly work for the following reason," and the student said, "Oh yeah, I realized that right away, so I assumed you didn't mean that."

Any other advice for people who want to break into AI and deep learning?

I think that's basically it: read enough to start developing intuitions, and then trust your intuitions and go for it. Don't be too worried if everybody else says it's nonsense.

And I guess there's no way to know whether others are right or wrong when they say it's nonsense; you just have to go for it and find out.

Well, there is one thing: if you think it's a really good idea and other people tell you it's complete nonsense, then you know you're really onto something. One example of that is when Radford and I first came up with variational methods. I sent mail explaining it to a former student of mine called Peter Brown, who knew a lot about EM, and he showed it to the people who worked with him, the Della Pietra brothers, who were twins I think. He told me later what they said, which was, "Either this guy's drunk or he's just stupid." They really, really thought it was nonsense. Now, it could have been partly the way I explained it, because I explained it in intuitive terms, but when you have what you think is a good idea and other people think it's complete rubbish, that's the sign of a really good idea.

And on research topics: should new grad students work on capsules, maybe unsupervised learning?

One good piece of advice for new grad students is to find an advisor who has beliefs similar to yours, because if you work on stuff your advisor feels deeply about, you'll get a lot of good advice and time from your advisor. If you work on stuff your advisor isn't interested in, you'll get some advice, but it won't be nearly as useful.

And one last question on advice for learners: how do you feel about people entering a PhD program versus joining a top company or a top research group in a corporation?

It's complicated. I think right now there aren't enough academics trained in deep learning to educate all the people we need educated; there just isn't the faculty bandwidth in universities. But I think that's going to be temporary. What's happened is that most departments have been very slow to understand the kind of revolution that's going on. I kind of agree with you that it's not quite a second Industrial Revolution, but it's something on nearly that scale, and there's a huge sea change going on, basically because our relationship to computers has changed. Instead of programming them, we now show them, and they figure it out. That's a completely different way of using computers, and computer science departments are built around the idea of programming computers; they don't understand that this showing of computers is going to be as big as programming them, and that half the people in the department should be people who get computers to do things by showing them. My own department refuses to acknowledge that it should have lots and lots of people doing this; it thinks it's got a couple, and maybe needs a few more, but not too many.
And in that situation you have to rely on the big companies to do quite a lot of the training, so Google is now training people, the people we call Brain Residents. I suspect the universities will eventually catch up.

Yeah, right. In fact, maybe a lot of students have figured this out: at a lot of top PhD programs, over half the applicants actually want to work on showing rather than programming.

Yes.

In fact, you're probably aware that deeplearning.ai is creating a deep learning specialization, but as far as I know the first deep learning MOOC was actually yours, taught on Coursera back in 2012. That's also when you first published the RMSprop algorithm, right?

Yes. Well, as you know, that was because you invited me to do it, and then when I was very dubious about doing it you kept pushing me, so it was very good that I did it, although it was a lot of work.

Yes, and thank you for doing that. I remember you complaining to me about how much work it was, and you staying up late at night, but many, many learners have benefited from your first MOOC, and I feel very grateful to you for it.

That's good.

Over the years I've seen you embroiled in debates about paradigms for AI, and about whether there's been a paradigm shift in AI. Can you share your thoughts on that?

Yes, happily. I think in the early days, back in the 50s, people like von Neumann and Turing didn't believe in symbolic AI; they were far more inspired by the brain. Unfortunately they both died much too young, and their voice wasn't heard. In the early days of AI, people were completely convinced that the representations you needed for intelligence were symbolic expressions of some kind, sort of cleaned-up logic where you could do non-monotonic things, not quite logic but something like logic, and that the essence of intelligence was reasoning. What's happened now is that there's a completely different view, which is that a thought is just a great big vector of neural activity. Contrast that with a thought being a symbolic expression. I think the people who thought that thoughts were symbolic expressions just made a huge mistake. What comes in is a string of words and what comes out is a string of words, and because of that, strings of words are the obvious way to represent things, so they thought what must be in between was a string of words, or something like a string of words. I think what's in between is nothing like a string of words. I think the idea that thoughts must be in some kind of language is as silly as the idea that understanding the layout of a spatial scene must be in pixels: pixels come in, and if we had a dot-matrix printer attached to us then pixels would come out, but what's in between isn't pixels. So I think thoughts are just these great big vectors, and big vectors have causal powers: they cause other big vectors. That's utterly unlike the standard AI view that thoughts are symbolic expressions.

I see. I guess AI is certainly coming around to this new point of view these days.

Some of it is. I think a lot of the people in AI still think thoughts have to be symbolic expressions.

Thank you very much for doing this interview. It was fascinating to hear how deep learning has evolved over the years, as well as how you're still helping to drive it into the future. Thank you, Geoff.

Well, thank you for giving me this opportunity.
Info
Channel: Preserve Knowledge
Views: 125,508
Keywords: p vs np, probability, machine learning, ai, neural networks, data science, programming, statistics, math, mathematics, number theory, pi, terry tao, algebra, calculus, lecture, analysis, abstract algebra, computer science, professor, harvard, MIT, stanford, yale, prime, prime numbers, fields institute, hinton, deep learning, nips, CLVR, computer vision, AI, talk, LSTM, sutton, bengio, facebook, google, google brain, alpha go, ml, cousera, andrew ng, geoffrey hinton, toronto, goodfellow
Id: -eyhCTvrEtE
Length: 39min 45sec (2385 seconds)
Published: Tue Aug 08 2017