Season 2 Ep 22 Geoff Hinton on revolutionizing artificial intelligence... again

Captions
Over the past 10 years AI has experienced breakthrough after breakthrough after breakthrough: in computer vision, in speech recognition, in machine translation, in robotics, in medicine, in computational biology, in protein folding prediction, and the list goes on and on. The breakthroughs aren't showing any signs of stopping, and they are directly driving the business of trillion-dollar companies and many, many new startups. Underneath all of these breakthroughs is one single subfield of AI: deep learning. So when and where did deep learning originate, and when did it become the most prominent AI approach? Today's guest has everything to do with this. Today's guest is arguably the single most important person in AI history, and continues to lead the charge today. He has been awarded the Turing Award, the equivalent of the Nobel Prize for computer science. Today's guest has had his work cited over half a million times; that means there are half a million, and counting, other research papers out there that build on top of his work. Today's guest has worked on deep learning for about half a century, most of that time in relative obscurity, but that all changed in 2012, when he showed that deep learning is better at image recognition than any other approach to computer vision, and by a very large margin. That result, that moment, known as the ImageNet moment, changed the whole AI field: pretty much everyone dropped what they had been doing and switched to deep learning. Former students of today's guest include Volodymyr Mnih, who put DeepMind on the map with their first major result on learning to play Atari games, and our season one finale guest Ilya Sutskever, co-founder and research director of OpenAI. In fact, every single guest on our podcast has built on top of the work done by today's guest. I am, of course, talking about no one less than Geoff Hinton. Geoff, welcome to the show; so happy to have you here.

Well, thank you very much for inviting me.

So glad to get to talk with you on the show. I'd say let's dive right in with maybe the highest-level question I can ask you: what are neural nets, and why should we care?

Okay. If you already know a lot about neural nets, please forgive the simplifications. Here's how your brain works. It has lots of little processing elements called neurons, and every so often a neuron goes "ping", and what makes it go ping is that it's hearing pings from other neurons. Each time it hears a ping from another neuron, it adds a little weight to some store of input that it's got, and when it's got enough input, it goes ping. So if you want to know how the brain works, all you need to know is how the neurons decide to adjust those weights that they add when a ping arrives. That's all you need to know. There's got to be some procedure for adjusting those weights, and if we could figure it out, we'd know how the brain works.
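As a reader's aid, here is a minimal toy sketch of the "neuron goes ping" picture Hinton describes above: weighted pings are accumulated in a store of input, and the neuron emits its own ping once the store crosses a threshold. The class name, the reset rule, and the threshold value are illustrative assumptions, not something stated in the episode.

```python
# Toy threshold neuron: accumulate weighted "pings", fire when over threshold.
class ToyNeuron:
    def __init__(self, weights, threshold=1.0):
        self.weights = weights      # one weight per input neuron
        self.potential = 0.0        # the "store of input"
        self.threshold = threshold

    def receive_ping(self, source_index):
        # each incoming ping adds that connection's weight to the store
        self.potential += self.weights[source_index]
        if self.potential >= self.threshold:
            self.potential = 0.0    # reset after firing (an assumption)
            return True             # this neuron goes "ping"
        return False

neuron = ToyNeuron(weights=[0.4, 0.3, 0.5])
for src in [0, 1, 2]:               # pings arriving from three input neurons
    if neuron.receive_ping(src):
        print(f"ping! (after input from neuron {src})")
```

Learning, on this picture, is whatever procedure adjusts the entries of `weights`; that procedure is exactly what the rest of the conversation is about.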
And that's been your quest for a long time now, figuring out how the brain might work. What's the status? Do we, as a field, understand how the brain works?

Okay, I always think we're going to crack it in the next five years, since that's quite a productive thing to think, but I actually do: I think we're going to crack it in the next five years. I think we're getting closer. I'm fairly confident now that it's not backpropagation, so all of existing AI, I think, is built on something that's quite different from what the brain's doing. At a high level it's got to be the same, that is, you have a lot of parameters, these weights between neurons, and you adjust those parameters on the basis of lots of training examples, and that causes wonderful things to happen if you have billions of parameters. The brain's like that, and deep learning is like that. The question is how you get the gradient for adjusting those parameters: you want some measure of how well you're doing, and then you want to adjust the parameters so they improve that measure. But my belief currently is that backpropagation, which is the way deep learning works at present, is quite different from what the brain's doing. The brain's getting gradients in a different way.

Now, that's interesting coming from you, Geoff, because you wrote the paper on backpropagation for training neural networks, and it's powering everything everybody's doing today, and now here you are saying it's probably time for us to change it. Should it get closer to what the brain is doing, or do you think backpropagation could actually be better than what the brain is doing?

Let me first correct you. Yes, we did write the most cited paper on backpropagation, Rumelhart and Williams and me, but backpropagation was already known to a number of different authors. What we really did was show that it could learn interesting representations. So it wasn't that we invented backpropagation; Rumelhart reinvented it, and we showed that it could learn interesting representations, like for example word embeddings. I think backpropagation is probably much more efficient than what we have in the brain at squeezing a lot of information into a few connections, where by "a few connections" I mean only a few billion. The problem the brain has is that connections are very cheap, we've got hundreds of trillions of them, whereas experience is very expensive. So the brain is willing to throw lots and lots of parameters at a small amount of experience, whereas the neural nets we're using are basically the other way around: they have lots and lots of experience, and they're trying to get the information relating input to output into a limited number of parameters. I think backpropagation is much more efficient than whatever the brain's using at doing that, but maybe not as good at abstracting a lot of structure from not much data.

Which begs the question, of course: do you have any hypotheses about approaches that might get better performance in that regard?

I have a general view, which I've had for a long, long time, which is that we need unsupervised objective functions. I'm talking mainly about perceptual learning, which I think is the key: if you can learn a good model of the world by looking at it, then you can base your actions on that model rather than on the raw data, and that makes doing the right thing much easier. I'm convinced the brain is using lots of little local objective functions; rather than being an end-to-end system trained to optimize one objective function, I think it's using lots of little local ones. As an example, the kind of thing I think will make a good objective function, though it's hard to make it work, is this: if you look at a small patch of an image and try to extract some representation of what you think is there, you can then compare the representation you got from that small patch with a contextual bet that was got by taking the representations of other nearby patches and, based on those, predicting what that patch of the image should have in it. Obviously, once you're very familiar with the domain, those predictions from context and the locally extracted features will generally agree, and you'll be very surprised when they don't, and you can learn an awful lot in one trial if they disagree radically. So that's an example of where I think the brain could learn a lot, from this local disagreement. It's hard to get that to work, but I'm convinced something like that is going to be the objective function. And if you think of a big image and lots of little local patches in it, that means you get lots and lots of feedback, in terms of the agreement between what was extracted locally and what was predicted contextually, all over the image and at many different levels of representation. So you can get much, much richer feedback from these agreements with contextual predictions. Making all that work is difficult, but I think it's going to be along those lines.
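To make the shape of this idea concrete, here is a minimal numpy sketch, under my own assumptions, of the kind of local objective described above: each patch gets a locally extracted representation, its neighbours produce a contextual prediction ("bet") for it, and the per-patch disagreement is the learning signal. The linear encoders `W_local` and `W_context`, the patching scheme, and the squared-error measure are all illustrative choices, not the method from the episode.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))                    # toy 8x8 "image"
# cut into 16 non-overlapping 2x2 patches, flattened to 4 pixels each
patches = image.reshape(4, 2, 4, 2).transpose(0, 2, 1, 3).reshape(16, 4)

d = 3                                              # representation size
W_local = rng.normal(scale=0.1, size=(4, d))       # local extractor
W_context = rng.normal(scale=0.1, size=(d, d))     # maps context to a prediction

local_reps = patches @ W_local                     # (16, d) locally extracted reps

def context_prediction(i):
    # contextual "bet" for patch i: pool the other patches' representations
    # (here, crudely, all other patches) and map them through W_context
    neighbours = np.delete(local_reps, i, axis=0).mean(axis=0)
    return neighbours @ W_context

# one little local objective per patch: agreement between local rep and bet
disagreements = [np.sum((local_reps[i] - context_prediction(i)) ** 2)
                 for i in range(len(patches))]
print("mean local disagreement:", float(np.mean(disagreements)))
# a learner would nudge W_local and W_context to shrink these many local
# losses (with some extra mechanism, e.g. contrastive terms, to stop the
# representations from collapsing to a constant).
```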
Now, what you're describing strikes me as part of what people are trying to do in self-supervised and unsupervised learning, and in fact you wrote one of the breakthrough papers in this space, the SimCLR paper, with a couple of collaborators. What do you think about the SimCLR work and contrastive learning more generally, and what do you think about the recent masked autoencoders, and how does that relate to what you just described?

It relates quite closely; it's evidence that that kind of objective function is good. But I didn't write the SimCLR paper. Ting Chen wrote the SimCLR paper, with help from the other co-authors; my name was on the paper for general inspiration. I did write a paper a long time ago with Sue Becker on the idea of getting agreement between the representations you get from two different patches of an image, so I think of that as the origin of this idea of doing self-supervised learning by getting agreement between representations of two patches of the same image. The method Sue and I used didn't work very well, because of a subtle thing we didn't understand at the time but now do. I could explain it if you like, but I'd lose most of the audience.

Well, I'm curious, and I think it would be great to hear it, but maybe we can zoom out for a moment before zooming back in. You're saying current methods use end-to-end learning, with backpropagation powering it, and that switching to extracting more from less data is going to be key to making progress towards how the brain learns?

Yes. You get a much bigger bandwidth for learning by having many, many little local objective functions.

And when we look at these local objective functions, like filling in a blanked-out part of an image, or filling back in a word, this is in fact the current frontier of today's technology; a lot of people are working on exactly that problem of learning from unlabeled data, because it requires a lot less human labor. But they still use backpropagation, the same mechanism.

What I don't like about the masked autoencoder is that you have your input patches, then you go through many layers of representation, and at the output of the net you try to reconstruct the missing input patches. I think in the brain you have these levels of representation, but at each level you're trying to reconstruct what's at the level below. It's not that you go through many, many layers and then come back out again; you have all these levels, each of which is trying to reconstruct what's at the level below. I think that's much more brain-like, and the question is whether you can do it without using backpropagation. Obviously, if you go through many levels and then reconstruct the missing patches at the output, you need to get information back through all those levels, and since we have backpropagation, and it's built into all the simulators, you might as well do it that way. But I don't think that's how the brain is doing it.

And if you imagine the brain doing it with all these local objectives, do you think it will matter, for our engineered systems, which piece we get right? In some sense there are three choices: one is the objectives, what are those local objectives we want to optimize; a second is the algorithm used to optimize them; and a third is the architecture, how we wire together the neurons that are doing the learning. Among those three, which could be the missing piece?

If you're interested in perceptual learning, I think it's fairly clear you want retinotopic maps, a hierarchy of retinotopic maps, so the architecture is local connectivity. The point is that you can solve a lot of the credit assignment problem by just assuming that something in one locality in a retinotopic map is going to be determined by the corresponding locality in the retinotopic map that feeds into it. So low down in the system you're not trying to figure out how pixels determine what's going on a long distance away in the image; you just use local interactions. That gives you a lot of locality, and you'd be crazy not to use it. One thing neural nets do at present is assume you're going to use the same function at every locality: convolutional nets do that, and transformers do that too. I don't think the brain can do that, because it would involve weight sharing, doing exactly the same computation at each locality so you can use the same weights, and I think it's most unlikely the brain does that. But there's actually a way to achieve what weight sharing achieves, what convolutional nets do, in the brain, in a much more plausible way than I think people have suggested before. If you do have contextual predictions trying to agree with locally extracted things, then imagine a whole bunch of columns that are making local predictions and looking at nearby columns to get their contextual prediction. You can think of the context as a teacher for the local extractor, and vice versa: the information that's in the context is being distilled into the local extractor. But that's true for all the local extractors, so what you've got is mutual distillation, where they're all providing teaching signals for each other. What that means is that knowledge about what you should extract in one location gets transferred to other locations, if you're trying to get different locations to agree on something. If, for example, you find a nose and you find a mouth, and you want them both to agree that they're part of the same face, so they should both give rise to the same representation, then the fact that you're trying to get the same representation at different locations allows knowledge to be distilled from one location to another. And there's a big advantage of that over actual weight sharing. One advantage, obviously, is biological: the detailed architecture in the different locations doesn't need to be identical. The other advantage is that the front-end processing doesn't need to be the same. If you take your retina, different parts of the retina have different-sized receptive fields, and convolutional nets try to ignore that; they sometimes use multiple resolutions and do convolution at each resolution, but they just can't deal with different front-end processing. Whereas if you're distilling knowledge from one location to another, what you're trying to do is get the same function from the optic array to the representation in these different locations, and it's fine if you pre-process the optic array differently in the two locations: you can still distill across locations the function from the optic array to the representation, even though the front-end processing is different. So although distillation is less efficient than actually sharing the weights, it's much more flexible and much more neurally plausible. For me that was a big insight I had about a year ago: we have to have something like weight sharing to be efficient, but local distillation will work if you're trying to get neighboring things to agree on a representation, and that very attempt to get them to agree gives you the signal you need for knowledge in one location to supervise knowledge in another.

And Geoff, one way to think of what you're describing is to say, hey, weight sharing is clever, because it's something the brain kind of does too, it just does it differently, so we should continue to do weight sharing. Another way to think of it is that we shouldn't continue to do weight sharing, because the brain does it somewhat differently, and that might be a reason to do it differently. What's your thinking?

I think the brain doesn't do weight sharing because it's hard for it to ship synaptic strengths around; that's very easy if they're all sitting in RAM. So I think we should continue to do convolutional things; in convnets and in transformers we should share weights, we should share knowledge by sharing weights, but just bear in mind that the brain is going to share knowledge not by sharing weights but by sharing the function from input to output, and using distillation to transfer the knowledge.
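Here is a small numpy sketch, on my own assumptions, of the "mutual distillation instead of weight sharing" idea just described: two columns look at different views of the same thing with their own separate weights, and each treats the other's output as a fixed soft teaching signal for that step, so knowledge moves between locations without any weight being copied. The shapes, the learning rate, and the use of a reversed input to stand in for different front-end processing are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W_a = rng.normal(scale=0.1, size=(4, 3))   # column A: its own weights
W_b = rng.normal(scale=0.1, size=(4, 3))   # column B: different weights, same job

def mutual_distillation_step(x_a, x_b, lr=0.1):
    """One step: each column is pulled toward the other's representation."""
    global W_a, W_b
    r_a, r_b = x_a @ W_a, x_b @ W_b        # locally extracted representations
    # gradient of 0.5 * ||r_a - stop_grad(r_b)||^2 with respect to W_a, and
    # symmetrically for W_b; the other column's output acts as the teacher
    W_a -= lr * np.outer(x_a, r_a - r_b)
    W_b -= lr * np.outer(x_b, r_b - r_a)
    return float(np.sum((r_a - r_b) ** 2))

# two differently pre-processed views of the same underlying input
x = rng.normal(size=4)
for _ in range(50):
    disagreement = mutual_distillation_step(x, x[::-1])  # B even sees it reversed
print("final disagreement:", round(disagreement, 4))
```

The point of the toy is only that agreement, not weight copying, is what carries knowledge between the two columns; in a real system you would also need something to keep the shared representation from collapsing.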
Now, there's another topic that's talked about quite a bit where the brain is drastically different from our current neural nets, and that's the fact that neurons work with spiking signals, which is very different from the artificial neurons in our GPUs. I'm curious about your thinking on that. Is it just an engineering difference, or do you think there could be more to it that we need to understand better, and benefits to spiking?

I think it's not just an engineering difference. I think once we understand why that hardware is so good, why you can do so much in such an energy-efficient way with that kind of hardware, we'll see that it's sensible for the brain to use spiking units. The retina, for example, doesn't use spiking neurons; the retina does lots of processing with non-spiking neurons. So once we understand why cortex uses them, we'll see that it was the right thing for biology to do, and I think that's going to hinge on what the learning algorithm is, how you get gradients for networks of spiking neurons, and at present nobody really knows. What people do at present is say: the problem with a spiking neuron is that there are two quite different kinds of decision. One is exactly when it spikes, and the other is whether it spikes at all. So there's a discrete decision, should the neuron spike or not, and then a continuous variable, exactly when it should spike. People trying to optimize a system like that have come up with various kinds of surrogate functions that smooth things a bit so you can get continuous gradients, and they don't seem quite right. It would be really nice to have a proper learning algorithm; in fact at NIPS in about 2000 Andy Brown and I had a paper on trying to learn spiking Boltzmann machines. But it would be really nice to get a learning algorithm that's good for spiking neurons, and I think that's the main thing holding up spiking-neuron hardware. People like Steve Furber in Manchester have realized, as many others have, that you can make more energy-efficient hardware this way, and they've built great big systems. What they don't have is a good learning algorithm for them, and until we've got one, I don't think we'll really be able to exploit what we can do with spiking neurons. And there's one obvious thing you can do with them that isn't easy in conventional neural nets, and that's agreement. If you take a standard artificial neuron and ask whether it can tell if its two inputs have the same value, it can't; that's not an easy thing for a standard artificial neuron to do. If you use spiking neurons, it's very easy to build a system where, if two spikes arrive at the same time, they make the neuron fire, and if they arrive at different times they don't. So using the time of a spike seems like a very good way of measuring agreement. We know the biological system does that: you can hear the direction a sound is coming from by the time delay between the signals reaching your two ears. If you take a foot, that's about a nanosecond for light and about a millisecond for sound, and the point is that if I move something sideways in front of you by a few inches, the difference in the length of the path to the two ears is only a small fraction of an inch, so it's only a small fraction of a millisecond difference in when the signal gets to the two ears, and we can deal with that, and owls can deal with it even better. We're sensitive to timings of something like 30 microseconds in order to get stereo from sound; I can't remember what owls are sensitive to, but I think it's quite a lot better than that. And we do it by having two axons with spikes traveling in opposite directions, one from one ear and one from the other, and then cells that fire if the spikes arrive at the same time. That's a simplification, but roughly right. So we know spike timing can be used for exquisitely sensitive things like that, and it would be very surprising if the precise times of spikes weren't being used, but we really don't know how.
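As an illustration of the coincidence-detection point above, here is a toy sketch in the spirit of the sound-localisation example: spikes from the two ears travel along opposing delay lines, and a detector "fires" only when the two spikes reach it at nearly the same time, so different detectors end up tuned to different inter-aural delays. The particular delays, the coincidence window, and the function name are assumptions for illustration.

```python
def coincidence_detectors(t_left, t_right, delays_us=(0, 10, 20, 30, 40), window_us=5.0):
    """Return indices of detectors that fire, given spike arrival times
    (in microseconds) at the left and right ears.

    Detector i delays the left signal by delays_us[i] and the right signal by
    the mirrored amount, so each detector is tuned to a different
    inter-aural time difference.
    """
    firing = []
    for i, d in enumerate(delays_us):
        left_arrival = t_left + d
        right_arrival = t_right + (max(delays_us) - d)
        if abs(left_arrival - right_arrival) <= window_us:
            firing.append(i)   # spikes coincide: this detector fires
    return firing

# a source slightly to the right: sound reaches the right ear ~20 us earlier
print(coincidence_detectors(t_left=20.0, t_right=0.0))  # only the matching detector fires
```

The same trick, a neuron that fires only on near-simultaneous spikes, is what makes spike timing attractive as a cheap agreement detector.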
For a long time I've thought it would be really nice to use spike times to detect agreement, for things like self-supervised learning. For example, if I've extracted your mouth and I've extracted your nose, or representations of them, then from your mouth I can predict something about your whole face, and from your nose I can predict something about your whole face, and if your mouth and nose are in the right relationship to make a face, those predictions will agree. It would be really nice to use spike timing to see that those predictions agree. But it's hard to make that work, and one of the reasons is that we don't have a good algorithm for training networks of spiking neurons. So that's one of the things I'm focused on now: how can we get a good training algorithm for networks of spiking neurons? I think that will have a big impact on hardware.

That's a really interesting question you're putting forward, because I doubt many people are working on it, compared to, say, the number of people working on large language models or other problems that have shown much more visible progress recently.

I think it's always a good idea to figure out what huge numbers of very smart people are working on, and to work on something else.

I think the challenge, of course, for most people, including myself, and I definitely hear this question from many students too, is that it's easy to work on something other than what everybody else is working on, but it's hard to make sure that something else is actually relevant, because there are many other things out there that are not very relevant that you could spend time on.

Yes, that involves having good intuitions.

Yes. Listening to you, for example, could help. So, a follow-up on something you just said, Geoff, that the retina doesn't use spiking neurons: are you saying the brain has two types of neurons, some that are more like our artificial neurons and some that are spiking?

I'm not sure the retina's neurons are just like artificial neurons, but certainly the neocortex has spiking neurons, and its primary mode of communication is sending spikes from one pyramidal cell to another. I don't think we're going to understand the brain until we understand why it chooses to send spikes. For a while I thought I had a good argument that didn't involve the precise times of spikes, and the argument went like this. The brain is in the regime where it's got lots and lots of parameters and not much data, relative to the typical neural nets we use, and in that regime you can overfit badly unless you use very strong regularization. A good regularization technique is dropout, where each time you use the neural net you ignore a whole bunch of the units. So maybe, when neurons send spikes, what they're really communicating is an underlying Poisson rate. Let's assume it's Poisson; it's close enough for this argument. There's a Poisson process that sends spikes stochastically, and the rate of that process varies, determined by the input to the neuron. You might think you'd like to send the real-valued rate from one neuron to another, but if you want to do lots and lots of regularization, you could send the real-valued rate with some noise added, and one way to add noise is to use spikes, which adds lots of noise. This is like the motivation for dropout: most of the time, most of the neurons aren't involved, if you look at any fine time window. So you can think of spikes as a representation of an underlying Poisson rate, just a very, very noisy representation, which sounds like a very bad idea because it's so noisy. But once you understand about regularization, and the fact that we have too many parameters, it's actually a very good idea. So I still have a lingering fondness for the idea that we're not using spike timing at all; spikes are just a very noisy representation of a Poisson rate, and that acts as a good regularizer.
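The following short numpy sketch, under my own assumptions about window length and rates, illustrates the argument above: instead of transmitting real-valued rates, each neuron transmits a spike count drawn from a Poisson process over a short window, which is a mostly-zero, dropout-like, very noisy version of the same signal.

```python
import numpy as np

rng = np.random.default_rng(2)
rates_hz = np.array([5.0, 20.0, 50.0, 0.5])   # underlying firing rates
window_s = 0.02                               # a fine 20 ms time window

expected = rates_hz * window_s                # the "real-valued" message
spikes = rng.poisson(expected)                # what actually gets sent

print("expected counts:", expected)           # e.g. [0.1 0.4 1.  0.01]
print("spike counts:   ", spikes)             # mostly zeros: most neurons silent
print("fraction silent:", float(np.mean(spikes == 0)))
# averaged over many windows the counts recover the rate, but within any one
# window the signal looks like heavy, dropout-style noise on the rate.
```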
And I flip between ideas. I think it's very important, when you do science, not to totally commit to one idea and ignore all the evidence for other ideas, although if you do that you end up flipping between ideas every few years; for me it's about a five-year cycle. Some years I think neural nets should be deterministic, which is what backprop is using; other years I think, no, it's very important that they be stochastic, and that changes everything. Boltzmann machines were intrinsically stochastic, and that was very important to them. But the main thing is not to fully commit to either, and to be open to both.

Now, thinking more about what you just said on the importance of spiking neurons and figuring out how to train a spiking network effectively: what if, for now, we just say let's not worry about the training part? Given that spiking hardware seems far more power efficient, wouldn't people want to distribute pure inference chips, where you pre-train effectively separately and then compile the result onto a spiking-neuron chip to get very low-power inference capabilities? What about that?

Lots of people have thought of that, and it's a very sensible idea. It's probably on the evolutionary path to spiking neural nets, because once you're using them for inference and it works, and people are already doing that, it's already been shown to be more power efficient and various companies have produced these big spiking systems, then you'll get more and more interested in how you could learn in a way that makes more use of what's available in those spike times. So you can imagine a system where you learn using backprop, but not on the analog, low-energy hardware, and then you transfer it to the low-energy hardware. That's fine, but we'd really like to learn directly in the hardware.
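To make the "train off-line, then run low-power spiking inference" idea concrete, here is a numpy sketch on my own assumptions: weights that were trained elsewhere (random here, standing in for a trained net) are kept fixed, and the forward pass is approximated by rate-coding each hidden activation as Poisson spike counts, so downstream layers only ever see spikes. The network shape, time steps, and coding scheme are illustrative, not any particular chip's scheme.

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.normal(scale=0.5, size=(4, 8))   # "trained" weights, layer 1
W2 = rng.normal(scale=0.5, size=(8, 2))   # "trained" weights, layer 2
x = rng.uniform(size=4)

def forward_exact(x):
    h = np.maximum(x @ W1, 0.0)           # ReLU hidden layer
    return h @ W2

def forward_spiking(x, steps=2000):
    # hidden activations become spike trains whose mean rate matches the ReLU
    # activation; the output layer only sees the (averaged) spike counts
    h = np.maximum(x @ W1, 0.0)
    spike_rates = rng.poisson(h, size=(steps, h.size)).mean(axis=0)
    return spike_rates @ W2

print("exact   :", forward_exact(x))
print("spiking :", forward_spiking(x))    # close, and closer with more steps
```

The learning still happens with backprop somewhere else; the open problem Hinton is pointing at is how to get the learning itself to happen in hardware like this.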
Now, one thing that really strikes me, Geoff, is that when I think about your talks from around 2005, '06, '07, '08, when I was a PhD student, essentially the pre-AlexNet talks, those talks topically have a lot of resemblance to what you're excited about now, and it almost feels like AlexNet is an outlier on your path. Maybe you can first explain what AlexNet was, but also how it came about: what was the path from working on restricted Boltzmann machines, trying to see how the brain works, to what I'd call the more traditional approach to neural nets, which you all of a sudden showed could actually work really well?

If you're an academic you have to raise grant money, and it's convenient to have things that actually work, even if they don't work the way you're most interested in, so part of it was just going with the flow. Back in about 2005, 2006 I got fascinated by the idea that you could use stacks of restricted Boltzmann machines to pre-train feature detectors, and then it would be much easier to get backprop to work. It turned out that with enough data, which is what you had in speech recognition, and later on, because of Fei-Fei Li and her team, in image recognition, you don't need the pre-training. Although pre-training is coming back: GPT-3 has pre-training, and pre-training is a thoroughly good idea. But once we discovered that you could pre-train and that it would make backprop work better, and that did great things for speech, which George Dahl and Abdel-rahman Mohamed did in 2009, then Alex Krizhevsky, who was a graduate student in my group, started applying the same ideas to vision, and pretty soon we discovered that you didn't actually need the pre-training, especially if you had the ImageNet data. In fact, that project was partly due to Ilya's persistence. I remember Ilya coming into the lab one day and saying, look, now that we've got speech recognition working, this stuff really works, we've got to do ImageNet before anybody else does. And retrospectively I learned that Yann LeCun was going into his lab and saying, look, we've got to do ImageNet with convnets before anybody else does, and Yann's students and postdocs said, oh, but I'm busy doing something else, so he couldn't actually get anyone to commit to it. Ilya initially couldn't get people to commit to it either, so he persuaded Alex to commit by pre-processing the data for him, so Alex didn't have to pre-process the data; the data was all pre-processed to be just what he needed. And then Alex really went to town. Alex is just a superb programmer, and he was able to make a couple of GPUs really sing; he made them work together, in his bedroom at home. I don't think his parents realized that they were paying most of the cost, because that was the electricity. But he did a superb job of programming convolutional nets on them. So Ilya said we've got to do this and helped Alex with the design and so on, Alex did the really intricate programming, and I provided support and a few ideas, like using dropout. I also did some good management. I'm not often very good at management, but I'm very proud of one management idea. Alex Krizhevsky had to write a depth oral, to show that he was capable of understanding the research literature, which is what you have to do after a couple of years to stay in the PhD program, and he doesn't really like writing, and he didn't really want to do it, and it was way past the deadline and the department was hassling us. So I said to him: each time you can improve the performance by one percent on ImageNet, you can delay your depth oral by another week. And Alex delayed his depth oral by a whole lot of weeks.

And just for context, a lot of researchers know this of course, but maybe not everybody: Alex's result with you and Ilya cut the error rate in half compared to prior work on the ImageNet image recognition competition.

I used to be a professor, so I have to be precise: it wasn't quite in half, but it was close.

Which is why everybody switched from what they were doing, which was hand-engineered approaches to computer vision, trying to program directly how a computer can understand what's in an image, to deep learning.

I should say one thing that's important here. Yann LeCun spent many years developing convolutional neural nets, and it really should have been him, his lab, that developed that system. We had a few little extra tricks, but they weren't the important thing; the important thing was to apply convolutional nets, using GPUs, to a big data set. So Yann was kind of unlucky in that he didn't get the win on that, but it was using many of the techniques that he developed.

He didn't have the Russian immigrants that Toronto had been able to attract to make it happen.

Well, one's Russian and one's Ukrainian, and it's important not to confuse those, even if he's a Russian-speaking Ukrainian.

Absolutely, it's a different country. So, Geoff, that moment also marked a big change in your career, because as far as I understand you had never been involved in corporate work, but it marked a transition for you, soon thereafter, from being a pure academic to ending up at Google. Can you say a bit about that? How was it for you? Did you have any internal resistance?

I can say why that transition happened, what triggered it. I have a disabled son who needs future provision, so I needed to get a lump of money, and I thought one way I might get a lump of money was by teaching a Coursera course. So I did a Coursera course on neural networks in 2012. It was one of the early Coursera courses, so their software wasn't very good, and it was extremely irritating to do; it really was very irritating, and I'm not very good with software, so I didn't like that. From my point of view it amounted to agreeing to supply a chapter of a textbook every week: you had to give them these videos, and then a whole bunch of people were going to watch them, and sometimes the next day Yoshua Bengio would say, "why did you say that?" So you know it's going to be watched by people who know very little but also by people who know a whole lot, and it's stressful; you know that if you make mistakes they're going to be caught. It's not like a normal lecture, where you can press on the sustaining pedal and blow your way through it if you're slightly confused about something; here you have to get it straight. The deal with the University of Toronto, originally, was that if any money was made from these courses, which I was hoping there would be, the money that came to the university would be split with the professor. They didn't specify exactly what the split would be, but one assumed it would be something like fifty-fifty, and I was okay with that. The university didn't provide any support in preparing the videos, and then after I'd started the course, when I could no longer back out of it, the provost made a unilateral decision, without consulting me or anybody else, that if money came from Coursera, the university would take all of it and the professor would get zero, which is exactly the opposite of what happens with textbooks, and the process was very like writing a textbook. I'd actually asked the university to help me prepare the videos, and the AV people came back and said, "do you have any idea how expensive it is to make videos?", and I did have an idea, because I'd been doing it. So I got really pissed off with my university, because they unilaterally canceled the idea that I'd get any remuneration for this. They said it was part of my teaching. Well, it wasn't part of my teaching: it was clearly based on lectures I'd given as part of my teaching, but I was doing my teaching as well as that, and I wasn't using that course for my teaching. That got me pissed off enough that I was willing to consider alternatives to being a professor. And at that time we suddenly got interest from all sorts of companies in recruiting us, either in funding, giving big grants, or in funding a startup.
It was clear that a number of big companies were very interested in getting in on the act. Normally I would have just said no: I get paid by the state, we're doing research, I don't want to try to make extra money from my research, I'd rather get on with the research. But because of that particular experience of the university cheating me out of the money, well, it turned out they didn't cheat me out of anything, because no money ever came from Coursera anyway, it pushed me over the edge into thinking, okay, I'm going to find some other way to make some money. That was the end of my principles.

Oh no. Well, the result is well known, and in fact if you read the Genius Makers book by Cade Metz, which I reread last week in preparation for this conversation, the book starts off with you running an auction for these companies to try to acquire your company, which is quite the start for a book. Very intriguing. But how was it for you?

When it was happening, it was at NIPS, and Terry Sejnowski had organized NIPS in a casino at Lake Tahoe. In the basement of the hotel there were these smoke-filled rooms full of people playing one-armed bandits, with big lights flashing saying "you won $25,000" and all that, and people gambling in other ways, and upstairs we were running this auction. We felt like we were in a movie; it felt like being in The Social Network. It was great. The reason we did it was that we had absolutely no idea how much we were worth. I consulted an IP lawyer, who said there are two ways to go about this: you could hire a professional negotiator, in which case you'll end up working for a company that's pissed off with you, or you could just run an auction. As far as I know this was the first time a small group like that had just run an auction. We ran it on Gmail. I'd worked at Google over a summer, so I knew enough about Google to know that they wouldn't read our Gmail, and I'm still pretty confident they didn't read our Gmail; Microsoft wasn't so confident. We ran this auction where people had to gmail me their bids, and we then immediately mailed them out to everybody else with the timestamp of the Gmail, and it just kept going up, by half a million dollars a bid to begin with, and then a million dollars after that. It was pretty exciting, and we discovered we were worth a lot more than we thought. Retrospectively we could probably have got more, but we got to an amount that we thought was astronomical, and basically we wanted to work for Google, so we stopped the auction so we could be sure of working for them.

And as I understand it, you're still at Google today.

I'm still at Google today, nine years later; I'm in my tenth year there. I think I'll get some kind of award when I've been there ten years, because that's so rare, although people tend to stay at Google longer than at other companies. I like it there. The main reason is that the Brain team is a very nice team and I get along very well with Jeff Dean: he's very smart but very straightforward to deal with, and what he wants me to do is what I want to do, which is basic research. He thinks what I should be doing is trying to come up with radically new algorithms, and that's what I want to do anyway, so it's a very nice fit. I'd be no good at managing a big team to improve speech recognition by one percent.

Well, it's better to just revolutionize the field again, right?

Yeah, I'd like to do that one more time.

I'm looking forward to it; I wouldn't be surprised at all. Now, when I look at your career, and some of this actually comes from the book, I didn't know it before I read it the first time: you are, or were, a computer science professor at the University of Toronto, emeritus now I believe, but you never got a computer science degree; you got a psychology degree, and at some point you were actually a carpenter. How did that come about? How do you go from studying psychology to becoming a carpenter to getting into AI? What was that path for you?

In my last year at Cambridge I had a very difficult time and got very unhappy, and just after the exams I dropped out and became a carpenter. I'd always enjoyed carpentry more than anything else: at high school you'd sit through all the classes, and then in the evenings you could stay and do carpentry, and that's what I really looked forward to. So I became a carpenter. After I'd been a carpenter for about six months, well, you couldn't actually make a living as a carpenter, so I was a carpenter and decorator: I made the money doing decorating and had the fun doing carpentry. The point is that carpentry is more work than it looks and decorating is less work than it looks, so you can charge more per hour for decorating, unless you're a very good carpenter. Then I met a real carpenter and realized I was completely hopeless at carpentry. He was making a door for a coal cellar under the sidewalk that was very damp, and he was taking pieces of wood and arranging them so that they would warp in opposite directions and it would cancel out, and that was a level of understanding and thought about the process that had never occurred to me. He could also take a piece of wood and cut it exactly square with a hand saw, and he explained something useful to me: if you want to cut a piece of wood square, you have to line the saw bench up with the room, and you have to line the piece of wood up with the room; you can't cut it square if it's not aligned with the room, which is very interesting in terms of coordinate frames. Anyway, because I was so hopeless compared with him, I decided I might as well go back into AI.

And when you say you got back into AI, as I understand it that was at the University of Edinburgh, where you went for your PhD?

Yes, I went to do a PhD there, on neural networks, with an eminent professor called Christopher Longuet-Higgins, who was really very brilliant. He almost got a Nobel Prize when he was in his thirties, for figuring out something about the structure of boron hydride. I still don't fully understand what it was, because it's to do with quantum mechanics, but it hinged on the fact that a 360-degree rotation is not the identity operator; you need 720 degrees. You can find whole books about that. Anyway, he was interested in neural nets and their relation to holograms, and about the day I arrived in Edinburgh he lost interest in neural nets, because he read Winograd's thesis and became completely converted: he thought neural nets were the wrong way to think about it and we should do symbolic AI. He was very impressed by Winograd's thesis. But he had a lot of integrity, so even though he completely disagreed with what I was doing, he didn't stop me doing it.
He kept trying to get me to do things more like Winograd's thesis, but he let me carry on doing what I was doing. And I was a bit of a loner. Everybody else back then, in the early seventies, was saying Minsky and Papert have shown that neural nets are nonsense; why are you doing this stuff? It's crazy. In fact, the first talk I ever gave to that group was about how to do true recursion with neural networks. That was a talk in 1973, so 49 years ago. One of my first projects, and I discovered a write-up of it recently, was this: you want a neural network that can draw a shape, and you want it to parse the shape into parts, and you want it to be possible for a part of the shape to be drawn by the same neural hardware that's drawing the whole shape. The neural hardware that's storing the whole shape has to remember where it's got to in the whole shape, and what the orientation, position and size of the whole shape are, but now it has to go off and use the very same neurons for drawing a part of the shape. So you need somewhere to remember what the whole shape was and how far you'd got in it, so you can pop back to that once you've finished doing this subroutine, this part of the shape, and the question is how the neural network is going to remember that, because obviously you can't just copy the neurons. I managed to get a system working where the network remembered it by having fast Hebbian weights that were adapting all the time, adapting so that any state it had been in recently could be retrieved by giving it part of that state and saying "fill in the rest". So I had a neural net that was doing true recursion, reusing the same neurons and the same weights for the recursive call as it used for the high-level call, and that was in 1973. I think people didn't understand the talk, because I wasn't very good at giving talks, but they also said, why would you want to do recursion with neural nets? You can do recursion with Lisp. They didn't understand the point, which is that unless we get neural nets to do something like recursion, we're never going to be able to explain a whole bunch of things. And now that's become an interesting question again, so I'm going to wait one more year, until that idea is a genuine antique, it'll be 50 years old, and then I'm going to write up the research I did then. It was all about using fast weights as a memory.
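As a reader's aid, here is a minimal numpy sketch, under my own assumptions, of the fast-weight memory idea just described: fast Hebbian weights adapt all the time, with decay, so that any recently visited state can be filled back in from a partial cue, which is what lets the same neurons be reused for a subroutine and then "pop back". The decay rate, pattern size, and sign-based retrieval are illustrative choices, not the 1973 system.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 16
fast_W = np.zeros((n, n))                     # fast Hebbian weights

def store(state, decay=0.6):
    """Hebbian update: recent states dominate because older ones decay away."""
    global fast_W
    fast_W = decay * fast_W + np.outer(state, state) / n

def retrieve(partial, steps=5):
    """Fill in the rest of a recent state from a partial cue (zeros = unknown)."""
    s = partial.copy()
    for _ in range(steps):
        s = np.sign(fast_W @ s)
    return s

old_state = np.sign(rng.normal(size=n))       # state of the higher-level "call"
store(old_state)
# ... the same neurons then go off and execute the subroutine ...
cue = old_state.copy()
cue[n // 2:] = 0.0                            # only part of the old state survives
recovered = retrieve(cue)
print("recovered higher-level state:", bool(np.all(recovered == old_state)))
```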
So I have many questions here, Geoff. The first one is: you're standing in this room, a PhD student, or maybe fresh out of a PhD, with essentially everybody telling you that what you're working on is a waste of time, and you were somehow convinced it was not. Where did that conviction come from?

I think a large part of it was my schooling. My father was a communist, but he sent me to an expensive private school, because they had good science education, and I was there from the age of seven. It was a Christian school, and all the other kids believed in God, while at home I was taught that that was nonsense, and it did seem to me that it was nonsense. So I was used to everybody else just being wrong, and obviously wrong, and I think that's important. You need, I was about to say you need faith, which is funny in this situation, you need faith in science, to be willing to work on something because it's obviously right, even though everybody else says it's nonsense. And in fact it wasn't everybody else; it was nearly everybody doing AI in the early seventies who said it was nonsense. If you look a bit earlier, in the fifties, both von Neumann and Turing believed in neural nets; Turing in particular believed in neural nets trained with reinforcement. I still believe that if they hadn't both died early, the whole history of AI might have been very different, because they were powerful enough intellects to have swayed the field, and they were very interested in how the brain works. So I think it was just bad luck that they both died early. Well, British intelligence might have come into it.

So you went from believing in this at a time when many people didn't, to getting the big breakthroughs that power almost everything being done today. And now, in some sense, the next question is: it's not just that deep learning works, and works great; the question becomes, is it all we need, or will we need other things? You've said things, maybe I'm not quoting you literally, to the effect that deep learning will do everything.

What I really meant by that, and I sometimes say things without being accurate enough, and then people quote me, like saying we won't need radiologists, what I really meant was: using stochastic gradient descent to adjust a whole bunch of parameters. That's what I had in mind when I said deep learning. The way you get the gradient might not be backpropagation, and the thing you get the gradient of might not be some final performance measure but rather lots of these local objective functions, but I think that's how the brain works, and I think that's going to explain everything.

Well, nice to see it confirmed.

One other thing I want to say: the kind of computers we have now are very good for doing banking, because they can remember exactly how much you have in your account. It wouldn't be so good if you went in and they said, "well, you've got roughly this much; we're not really sure, because we don't do it to that precision." We don't want that in a computer doing banking, or in a computer guiding the space shuttle; we'd really rather it got the answer exactly right. And those computers are very different from us. I think people aren't sufficiently aware that we made a decision about how computing would be, which is that our knowledge would be immortal. If you look at existing computers, you have a computer program, or maybe you just have a lot of weights for a neural net, which is a different kind of program, and if your hardware dies, you can run the same program on another piece of hardware. That makes the knowledge immortal: it doesn't hinge on that particular piece of hardware surviving. But the cost of that immortality is huge, because it means two different bits of hardware have to do exactly the same thing, obviously after error correction and all that, but once you've done the error correction they have to do exactly the same thing, which means they'd better be digital, or mostly digital. And they're probably going to do things like multiplying numbers together, which involves using lots and lots of energy to make things very discrete, which is not what hardware really wants to be. So as soon as you commit to the immortality of your program, or your neural net, you're committed to very expensive computation, and also to very expensive manufacturing processes: you need to manufacture these things accurately, probably in 2D, and then put lots of 2D things together. If you're just willing to give up on immortality: in fiction, what you normally get in return is love, but if we're willing to give up immortality, what we'll get in return is very low-energy computation and very cheap manufacturing. So instead of manufacturing computers, we should grow them: use nanotechnology to just grow the things, in 3D, and each one will be slightly different. The image I have is this: if you take a pot plant and pull it out of its pot, there's a root ball, and it's the shape of the pot. All the different pot plants have the same-shaped root ball, but the details of the roots are all different, yet they're all doing the same thing, extracting nutrients from the soil; they have the same function, and they're pretty much the same, but the details are all very different. That's what real brains are like, and I think that's what what I call mortal computers will be like. These are computers that are grown rather than manufactured. You can't program them; they just learn, and they obviously have to have a learning algorithm built into them. They can do most of their computation in analog, because analog is very good for things like taking a voltage times a resistance and turning it into a charge, and then adding up the charges, and there are already chips that do things like that. The problem is what you do next, and how you learn in those chips. At present people have suggested backpropagation, or various versions of Boltzmann machines; I think we're going to need something else. But sometime in the not-too-distant future I think we're going to see mortal computers, which are very cheap to create, have to get all their knowledge in there by learning, and are very low energy. And when these mortal computers die, they die, and their knowledge dies with them. It's no use looking at the weights, because those weights only work for that hardware. So what you have to do is distill the knowledge into other computers: when these mortal computers get old, they're going to have to do lots of podcasts to try and get the knowledge into younger mortal computers.

The first one you build, I'll happily have it on the show; let me know.
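Here is a small numpy sketch of distillation, the mechanism Hinton says mortal computers would have to rely on: since one machine's weights are useless on another, the old "teacher" provides soft outputs for shared inputs and the "student" adjusts its own, different parameters to match them. This is a sketch in the spirit of Hinton's published distillation work, not a quote from the episode; the temperature, sizes, and gradient step are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

W_teacher = rng.normal(size=(10, 4))            # the "old" computer's weights
W_student = 0.01 * rng.normal(size=(10, 4))     # the "young" one: different hardware, different init
T, lr = 2.0, 0.5

for step in range(500):
    x = rng.normal(size=10)                     # a shared input (a "podcast question")
    p_teacher = softmax(x @ W_teacher, T)       # soft targets from the teacher
    p_student = softmax(x @ W_student, T)
    # gradient of cross-entropy(p_teacher, p_student) w.r.t. the student's
    # logits (up to a 1/T factor folded into the learning rate)
    W_student -= lr * np.outer(x, p_student - p_teacher)

x = rng.normal(size=10)
print("teacher:", np.round(softmax(x @ W_teacher, T), 2))
print("student:", np.round(softmax(x @ W_student, T), 2))  # should now broadly agree
```

The key point for the mortal-computer argument is that only the input-output behaviour is transferred; at no point does `W_teacher` get copied into the student.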
sensibly using some kind of gradient descent in some kind of sensible objective function then you'll get wonderful properties out of it and you'll get all these emerging properties um like you do with gpg3 and also the the the google equivalents that i've talked about so much that doesn't sort of settle the issue of whether they're doing the same way as us and i think um we're doing a lot more things like recursion which i think we do in neural nets and i tried to address some of these issues in a paper i put on the web last year called glom um well i call it glom it's how you do part hole hierarchies in neural nets so you definitely have to have structure and if what you mean by symbolic computation is just that you have part whole structure then we do symbolic computation that's not normally what people meant by symbolic computation the sort of hardline symbolic computation means you're using symbols and you're operating on symbols using rules that just depend on the form of the symbol string you're processing and that a symbol the only property a symbol has is that it's either identical or not identical to some other symbol and perhaps that it points to something it can be used as a pointer to get something um the neural nets are very different from that so the sort of hard-line symbol processing i don't think we do that but we certainly deal with pothole hierarchies but i think we do it in great big neural nets and i'm sort of up in the air at present as to to what extent does gpt3 really understand what it's saying i think it's fairly clear it's not just like the old eliza program which just rearranges strings of symbols and had no clue what it was talking about um and the reason for believing that is you say you say in english show me a picture of a hamster wearing a red hat and it draws a picture of a hamster wearing a red hat um and you're fairly sure it never got that pair before so it has to understand the relationship between the english string and the picture and before it had done that if you'd asked any of these um doubters these neural net skeptics um neural net deniers let's call them neural net deniers um if you'd ask them well how would you show that it understands i think they'd have accepted that well if you asked to draw a picture something draws a picture of that thing then it understood just as with winograd's thesis you ask it to put the blue the blue block in the green box and it puts the blue block in the green box and so that's pretty good evidence it understood what you said um but now that it does it of course the skeptics then say well you know that doesn't really count there's nothing that was satisfied basically yeah the goal line's always moving uh for true skeptics yeah now there's the recent one um the google won the paul model that uh in in the paper showed how it was explaining effectively how jokes work that was extraordinary that just seemed a very deep understanding of language no it was just rearranging the words it had in its training you think so no no it had i i didn't see how it could generate those explanations without sort of understanding what's going on now i'm still open to the idea that because it was framed with back propagation it's going to end up with a very different sort of understanding from us and obviously adversarial images um tell you a lot that you can recognize objects by using their textures and you can be correct about it in the sense that it'll generalize to other instances of those objects but it's a completely different 
I like to think of the example of insects and flowers. Insects can see in the ultraviolet, so two flowers that look the same to us can look completely different to an insect. Now, because the flowers look the same to us, do we say the insects are getting it wrong? These flowers evolved with the insects to give signals to them in the ultraviolet to tell them which flower it is, so it's clear the insects are getting it right and we just can't see the difference. That's another way of thinking about adversarial examples: this thing the network says is an ostrich looks like a school bus to us, but if you look in the texture domain, it actually is an ostrich. So the question is who's right. In the case of the insects, just because two flowers look identical to us doesn't mean they're really the same — the insects are right about them being very different. In that case it's different parts of the electromagnetic spectrum that are indicating a difference we can't pick up on, but something similar could be going on in image recognition for our current neural nets.

You could argue, maybe, that since we build them and we want them to do things for us in our world, we really don't want to just say, okay, they got it right and we got it wrong — they need to recognize the car and the pedestrian.

Yeah, I agree. I just wanted to show it's not as simple as you might think to say who's right and who's wrong. And part of the point of my GLOM paper was to try and build perceptual systems that work more like us, so they're much more likely to make the same kinds of mistakes as us and not make very different kinds of mistakes. Obviously, if you've got a self-driving car, for example, and it makes a mistake that any normal human driver would have made, that seems much more acceptable than making a really dumb mistake.

So Geoff, as I understand it, sleep is something you also think about. Can you say a bit more?

Yes — I often think about it when I'm not sleeping at night. There's something funny about sleep. Animals do it; fruit flies sleep, and it may just be to stop them flying around in the dark. But if you deprive people of sleep, they go really weird: deprive someone of sleep for three days and they'll start hallucinating; deprive someone for a week and they'll go psychotic, and they never recover. These are nice experiments done by the CIA, I think. So the question is: what is the computational function of sleep? There's presumably some pretty important function for it if depriving you of it makes you completely fall apart. Current theories are things like: it's for consolidating memories, or maybe for downloading things from hippocampus into cortex — which is a bit odd, since it had to come through cortex to get to the hippocampus in the first place.

A long time ago, in the early 80s, Terry Sejnowski and I had this theory called Boltzmann machines, and it was partly based on an insight of Francis Crick's when he was thinking about Hopfield nets. Crick and Graeme Mitchison had a paper about sleep, and the idea was that you would hit the net with random things and tell it not to be happy with random things. In a Hopfield net, you give it something you want it to memorize and it changes the weights so the energy of that vector is lower; the idea is that if you also give it random vectors and say "make the energy of these higher," the whole thing works better. And that led to Boltzmann machines.
With Boltzmann machines, we figured out that if, instead of giving it random things, you give it things generated from a Markov chain — the model's own Markov chain — and you say make those less likely and make the data more likely, that is actually maximum likelihood learning. So we got very excited about that, because we thought, okay, that's what sleep is for: sleep is this negative phase of learning.

It comes up again now in contrastive learning, where you take two patches from the same image and try to get them to have similar representations, and two patches from different images and try to get them to have representations that are sufficiently different — once they're different enough you don't make them any more different, but you stop them being too similar. That's how contrastive learning works. Now, with Boltzmann machines you couldn't actually separate the positive phase from the negative phase: you had to interleave positive examples and negative examples, otherwise the whole thing would go wrong, and I tried a lot not to interleave them — it's quite hard to do a lot of positive examples followed by a lot of negative examples. What I discovered a couple of years ago — which got me very excited and caused me to agree to give lots of talks that I then cancelled when I couldn't make it work better — was that with contrastive learning you can actually separate the positive and negative phases: you can do lots of examples of positive pairs followed by lots of examples of negative pairs. And that's great, because it means you could have something like a video pipeline where you're just trying to make things similar while you're awake and trying to make things dissimilar while you're asleep — if you can figure out how sleep can generate video for you. So it makes the contrastive-learning story much more plausible if you can separate the positive and negative phases and do them at different times: a whole bunch of positive updates followed by a whole bunch of negative updates. Even for standard contrastive learning you can do that moderately well — you have to use lots of momentum and stuff like that, there are all sorts of little tricks — but you can make it work.

So I think it's quite likely that the function of sleep is to do unlearning on negative examples, and that's why you don't remember your dreams: you don't want to remember them, you're unlearning them. Crick pointed this out. You will remember the ones that are in the fast weights when you wake up, because the fast weights are a temporary store — that's not unlearning, that still works the same way — but for the long-term memory, the whole point is to get rid of those things. That's why you dream for many hours a night, but when you wake up you can only remember the last minute of the dream you were having when you woke up. And I think this is a much more plausible theory of sleep than any other I've seen, because it explains why, if you got rid of it, the whole system would just fall apart: you'd go disastrously wrong and start hallucinating and doing all sorts of weird things.
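To pin down the positive-phase / negative-phase idea, here is a minimal sketch of maximum-likelihood learning in a small, fully visible Boltzmann machine with ±1 units. All the names, sizes, and learning rates are my own illustrative assumptions, not anything from the conversation: the "wake" phase raises the probability of data patterns, and the "sleep" phase lowers the probability of the model's own fantasies, generated by its own Markov chain.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 16
W = 0.01 * rng.standard_normal((n_units, n_units))
W = (W + W.T) / 2                     # symmetric weights,
np.fill_diagonal(W, 0.0)              # no self-connections

def gibbs_step(v, W):
    """One sweep of Gibbs sampling in a fully visible Boltzmann machine."""
    for i in range(len(v)):
        p_on = 1.0 / (1.0 + np.exp(-2.0 * (W[i] @ v)))   # P(s_i = +1 | rest)
        v[i] = 1.0 if rng.random() < p_on else -1.0
    return v

data = rng.choice([-1.0, 1.0], size=(20, n_units))   # toy "positive" patterns
chain = rng.choice([-1.0, 1.0], size=n_units)        # persistent "fantasy" state
lr = 0.005

for step in range(500):
    # positive (wake) phase: make the data more probable (lower its energy)
    v = data[rng.integers(len(data))]
    pos = np.outer(v, v)

    # negative (sleep / unlearning) phase: make the model's own samples,
    # drawn from its own Markov chain, less probable (raise their energy)
    chain = gibbs_step(chain, W)
    neg = np.outer(chain, chain)

    W += lr * (pos - neg)
    np.fill_diagonal(W, 0.0)
```

The update is the classic delta w_ij proportional to <s_i s_j>_data minus <s_i s_j>_model; the interleaving Hinton describes as necessary for Boltzmann machines shows up here as the fact that the positive and negative statistics are collected inside the same loop iteration, whereas the contrastive-learning result he mentions is that the two phases can be run in long separate blocks.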
Let me say a little bit more about the need for negative examples when you're doing this kind of contrastive learning. If you've got a neural net and it's trying to optimize some internal objective function — something about the kinds of representations it has, or something about the agreement between contextual predictions and local predictions — it wants that agreement to be a property of the real data. The problem inside a neural net is that you might get all sorts of correlations in your inputs. I'm a neuron, right? I get all sorts of correlations in my inputs, and those correlations have nothing to do with the real data — they're caused by the wiring of the network and the way it happens to be connected. If two neurons are both looking at the same pixel, they'll have a correlation, but that doesn't tell you anything about the data. So the question is: how do you learn to extract structure that's about the real data and not about the wiring of your network? The way to do that is to feed it positive examples and say, find structure in the positive examples that isn't in the negative examples — because the negative examples go through exactly the same wiring, and if the structure is in the positive examples but not in the negative examples, then the structure is about the difference between the positive and negative examples, not about your wiring. People don't think about this much, but if you have powerful learning algorithms, you'd better not let them learn about the neural network's own wiring — that's not what's interesting.

Now, when you think about people who don't get sleep and start hallucinating — is hallucinating effectively trying to do the same thing, just while you're awake?

Obviously you can have little naps, and that's very helpful, and maybe hallucinating when you're awake is serving the same function as sleep. I mean, all the experiments I've done say it's better not to have sixteen hours awake and eight hours of sleep — it's better to have a few hours awake and a few hours of sleep. And a lot of people have discovered that little naps help. Einstein used to take little naps all the time, and he did okay.

Yeah, he did very well, for sure. Now there's this other thing you've brought up, this notion of student beats teacher. What does that refer to?

Okay. A long time ago I did an experiment on MNIST, which is a standard database of handwritten digits, where you take the training data and corrupt it: you corrupt a label by substituting one of the other nine labels, eighty percent of the time. So now you've got a data set in which the labels are correct twenty percent of the time and wrong eighty percent of the time, and the question is: can you learn from that, and how well? The answer is that you can learn to get about 95 percent correct. So you've got a teacher who's wrong 80 percent of the time, and the student ends up right 95 percent of the time — the student is much, much better than the teacher. And this isn't a case where the corruption is redone each time you see an example: you corrupt the training examples once and for all, so you can't average away the corruption over repeated presentations of the same case. You might be able to average it away over different training cases that happen to have similar images.

And if you ask how many training cases you need when they're corrupted — this was of great interest because of the Tiny Images data set some time ago, which had 80 million tiny images with a lot of wrong labels — the question is: would you rather have a million things that are flakily labeled, or ten thousand things with accurate labels? I had a hypothesis that what counts is the amount of mutual information between the label and the truth. If the labels are corrupted ninety percent of the time, there's no mutual information between the labels and the truth; if they're corrupted eighty percent of the time, there's only a small amount of mutual information — my memory is that it's about 0.06 bits per case, whereas if the labels are uncorrupted it's about 3.3 bits per case.
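Those numbers are easy to check. Here is a small, self-contained calculation (my own sketch — the function name and the uniform-corruption assumption are mine, though they match the setup described above): with ten classes, a label kept correct 20 percent of the time and otherwise replaced by one of the other nine labels uniformly at random carries about 0.06 bits of mutual information with the truth, a clean label carries about 3.32 bits, and 90 percent corruption carries exactly zero.

```python
import numpy as np

def label_mutual_information(n_classes=10, p_correct=0.2):
    """I(assigned label; true label) in bits, assuming a uniform prior over the
    true classes and corruption that swaps in one of the other n_classes - 1
    labels uniformly at random."""
    p_wrong = (1.0 - p_correct) / (n_classes - 1)
    # The assigned label is still uniform over classes, so H(label) = log2(K).
    h_label = np.log2(n_classes)
    # H(label | truth) is the entropy of one row of the corruption matrix.
    row = np.array([p_correct] + [p_wrong] * (n_classes - 1))
    row = row[row > 0]                       # avoid log2(0) when labels are clean
    h_label_given_truth = -np.sum(row * np.log2(row))
    return h_label - h_label_given_truth

print(label_mutual_information(p_correct=1.0))   # ~3.32 bits: clean labels
print(label_mutual_information(p_correct=0.2))   # ~0.06 bits: 80% corrupted
print(label_mutual_information(p_correct=0.1))   # 0 bits: 90% corrupted is pure chance
```

By this accounting, one clean label carries roughly fifty times as much information as one 80-percent-corrupted label, which is the ratio behind the "within a factor of two" remark about balancing data-set sizes that follows.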
So it's only a tiny amount. And then the question is: suppose I balance the size of the training set by putting in the same amount of mutual information — so if each example has a fiftieth of the mutual information, I use fifty times as many examples — do I get the same performance? The answer is yes, you do, to within a factor of two; the noisy training set actually needs to be about twice that big. But roughly speaking, you can see how useful a training example is from the amount of mutual information between the label and the truth.

And I noticed recently that you have something for doing sim-to-real, where you're labeling real data using a neural net, and those labels aren't perfect, and then the student that learns from those labels is better than the teacher it learned from. People are always puzzled by how the student could be better than the teacher, but in neural nets it's very easy: the student will be better than the teacher if there's enough training data, even if the teacher is very flaky. I have a paper from a few years ago with Melody Guan about this, on some medical data — the first part of the paper talks about it — but the rule of thumb is basically that what counts is the mutual information between the assigned label and the truth, and that tells you how valuable a training example is. So you can make do with lots of flaky ones.

That's so interesting. Now, in the work we did that you just referenced, and in the work I've seen become quite popular recently, usually the teacher provides noisy labels but not all of the noisy labels are used — there's a notion of only looking at the ones where the teacher is more confident. In your description, that doesn't matter.

That's obviously a good hack, but you don't need to do that. It's a good hack, and it probably helps to only look at the ones where you have reason to believe the teacher got it right, but it'll work even if you just look at them all. And there's a phase transition. With MNIST, Melody plotted a graph, and as long as about 20 percent of the labels are right, your student will get about 95 percent correct.

Wow.

But as you get down to about 15 percent right, you suddenly hit a phase transition where you don't do any better than chance — because the student has to, in some sense, understand which cases are right and which are wrong, and see the relationship between the labels and the inputs. Once the student has seen that relationship, a wrongly labeled thing is just very obviously wrong — it's fine if it's randomly wrongly labeled — but there is a phase transition where the labels have to be good enough for the student to get the idea.

That explains how our students are all smarter than us: we only need to get it right a small fraction of the time.

Right. And I'm sure the students do some of this data curation, where you say something and the student thinks, "oh, that's rubbish, I'm not going to listen to that." Those are the very best students.

Yeah, those are the ones that can surprise us.
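As a quick illustration of the student-beats-teacher effect, here is a rough sketch under my own assumptions — scikit-learn's small 8x8 digits set rather than full MNIST, a plain logistic-regression student, and my own variable names — so the exact accuracies will differ from the numbers quoted above. The labels are corrupted once and for all, 80 percent of the time, and the student is still typically far more accurate than the labels it was taught with.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Corrupt 80% of the training labels once and for all: the wrong label is
# drawn uniformly from the other nine classes, and it never changes.
y_noisy = y_train.copy()
corrupt = rng.random(len(y_noisy)) < 0.8
for i in np.where(corrupt)[0]:
    y_noisy[i] = rng.choice([c for c in range(10) if c != y_train[i]])

teacher_acc = np.mean(y_noisy == y_train)            # ~0.2 by construction
student = LogisticRegression(max_iter=5000).fit(X_train, y_noisy)
student_acc = student.score(X_test, y_test)

print(f"teacher label accuracy: {teacher_acc:.2f}")
print(f"student test accuracy:  {student_acc:.2f}")  # usually well above 0.2
```

The reason this works is the one given in the conversation: the corruption is unbiased, so the correct class is still the most probable label for each image, and with enough examples the student averages the noise away across similar images.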
Now, one of the things that's really important in neural net learning, especially when you're building models, is to get an understanding of what it is learning, and often people try to somehow visualize what's happening during learning. One of the most prevalent visualization techniques is called t-SNE, which is something you invented, Geoff. So I'm curious how you came up with that — maybe first describe what it does, and then the story behind it.

So, if you have some high-dimensional data and you try to draw a 2D or 3D map of it, you could take the first two principal components and just plot those. But what principal components analysis cares about is getting the big distances right: if two things are very different, it's very concerned to keep them very different in the 2D space, and it doesn't care at all about the small differences, because it's operating on the squares of the big differences. So it won't preserve high-dimensional similarity very well, and you're often interested in just the opposite: you've got some data, you're interested in what's very similar to what, and you don't care if it gets the big distances a bit wrong as long as it gets the small distances right.

So I had the idea, a long time ago, of taking the distances and turning them into probabilities of pairs. There are various versions of SNE, but suppose we turn the distances into the probability of a pair, such that pairs with a small distance are probable and pairs with a big distance are improbable. We convert distances into probabilities in such a way that small distances correspond to big probabilities, and we do that by putting a Gaussian around a data point and computing the density of the other data point under this Gaussian — that's an unnormalized probability — and then you normalize these things. Then you try to lay the points out in 2D so as to preserve those probabilities. It won't care much if two points are far apart — they'll have a very low pairwise probability, and it doesn't care about the relative positions of those two points — but it cares about the relative positions of the ones with high pairwise probabilities. That produced quite nice maps, and it was called stochastic neighbour embedding because we thought of it as putting a Gaussian around a point and stochastically picking a neighbour according to the density under the Gaussian. I did that work with Sam Roweis, and it had very nice, simple derivatives, which convinced me we were onto something, and we got nice maps — but they tended to crowd things together.

There's an obvious basic problem in converting high-dimensional data into low-dimensional data, so SNE — stochastic neighbour embedding — tends to crowd things together, and that's because of the nature of high-dimensional and low-dimensional spaces. In a high-dimensional space, a data point can be close to lots of other points without them all being close to each other; in a low-dimensional space, if they're all close to this data point, they all have to be close to each other. So you've got a problem in embedding closenesses from high dimensions into low dimensions.
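Here is the distance-to-probability step in code, just to make it concrete — a minimal sketch with my own names, and with one real-world detail left out: an actual SNE implementation tunes the Gaussian width per point to hit a target perplexity, whereas this uses a single fixed width. Each point puts a Gaussian on itself, measures the density of every other point under that Gaussian, and normalizes to get the probability of picking that point as a neighbour.

```python
import numpy as np

def neighbour_probabilities(X, sigma=1.0):
    """P[i, j]: probability that point i picks point j as its neighbour, using a
    Gaussian of width sigma centred on point i (SNE's high-dimensional P)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)        # a point is not its own neighbour
    return affinities / affinities.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))                 # six points in 50 dimensions
P = neighbour_probabilities(X)
print(P.round(3))                            # rows sum to 1; nearby points get big entries
```

The low-dimensional map is then chosen so that its own neighbour probabilities match these as closely as possible, which is exactly why small distances are preserved and big ones are allowed to drift.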
And I had the idea, when I was doing SNE, that since I was using probabilities as this kind of intermediate currency, there should be a mixture version. You'd say: in high dimensions, the probability of a pair is proportional to e to the minus the squared distance under my Gaussian, and in low dimensions, suppose you have two different maps, the probability of a pair is the sum of e to the minus the squared distance in the first 2D map and e to the minus the squared distance in the second 2D map. That way, if we have a word like "bank" and we're trying to put similar words near one another, bank can be close to "greed" in one map and close to "river" in the other map without river ever being close to greed. I really pushed that idea, because I thought it was a really neat idea — you could have a mixture of maps. Ilya was one of the first people to work on that, and James Cook worked on it a lot, and several other students worked on it, and we never really got it to work well. I was very disappointed that we hadn't been able to make use of the mixture idea.

Then I went to a simpler version, which I called UNI-SNE, which was a mixture of a Gaussian and a uniform, and that worked much better. The idea is that in one map all pairs are equally probable, which gives you a small background probability that takes care of the big distances, and in the other map you contribute a probability proportional to e to the minus your squared distance in that map. But it means that in this other map things can be very far apart if they want to be, because the fact that they need some probability is taken care of by the uniform.

And then I got a paper from a researcher called Laurens van der Maaten, which I thought was actually a published paper because of the form it arrived in, but it wasn't. He wanted to come do research with me, and because I thought he had this published paper I invited him to come. It turned out he was extremely good, and it's lucky I was mistaken in thinking it was a published paper. We started on UNI-SNE, and then I realized that UNI-SNE is actually a special case of using a mixture of a Gaussian and a very, very broad Gaussian, which is like a uniform. So what if we used a whole hierarchy of Gaussians — many, many Gaussians with different widths? That's a t-distribution, and that led to t-SNE. t-SNE works much better, and it has a very nice property: it can show you things at multiple scales, because it's got a kind of one-over-d-squared property, so once distances get big it behaves just like gravity. Like clusters of galaxies and galaxies and clusters of stars, you get structure at many different levels — the coarse structure and the fine structure all show up.

Now, the objective function used for all this — the sort of relative densities under a Gaussian — came from other work I did earlier with Alberto Paccanaro that we found hard to get published. I got a review of that work, when it was rejected by some conference, saying "Hinton's been working on this idea for seven years and nobody's interested." I take reviews like that as telling me I'm onto something very original. That work actually had the function in it that's now used in these contrastive methods — I think it's called NCE — and t-SNE is actually a version of that function, but being used for making maps. So there's a very long history to t-SNE: getting the original SNE, then trying to make a mixture version and it just not working and not working, then eventually figuring out that a t-distribution was the kind of mixture you wanted — and Laurens arriving, and Laurens was very smart and a very good programmer, and really made it all work beautifully.
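Putting the pieces of that story together, a bare-bones t-SNE looks something like the sketch below. This is my own simplified version, not the reference implementation: it uses a fixed Gaussian width instead of per-point perplexity calibration and plain gradient descent without the momentum and early-exaggeration tricks real implementations rely on. The two ingredients from the story are there, though: Gaussian similarities in high dimensions, and the heavy-tailed Student-t kernel 1/(1 + d^2) in the 2D map, which is what gives the gravity-like behaviour at long range.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_sq_dists(Z):
    return np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)

def tsne(X, n_steps=500, lr=10.0, sigma=1.0):
    """Bare-bones t-SNE: Gaussian affinities in high dimensions, Student-t
    (1 / (1 + d^2)) affinities in 2D, gradient descent on KL(P || Q)."""
    n = len(X)
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))   # high-dim affinities
    np.fill_diagonal(P, 0.0)
    P = P / P.sum()                                          # joint probabilities

    Y = 1e-2 * rng.normal(size=(n, 2))                       # random initial 2D map
    for _ in range(n_steps):
        num = 1.0 / (1.0 + pairwise_sq_dists(Y))             # Student-t kernel
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()
        # Exact t-SNE gradient: 4 * sum_j (P_ij - Q_ij) * num_ij * (y_i - y_j)
        PQ = (P - Q) * num
        grad = 4.0 * ((np.diag(PQ.sum(axis=1)) - PQ) @ Y)
        Y -= lr * grad
    return Y

# Three well-separated 10-D clusters should come out as three 2-D clusters.
X = np.vstack([rng.normal(loc=c, size=(20, 10)) for c in (0.0, 5.0, 10.0)])
print(tsne(X)[:5])
```

The attract-repel structure of that gradient is also where the NCE connection shows up: each pair is pulled together in proportion to how probable it should be and pushed apart in proportion to how probable the map currently says it is.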
This is really interesting, because it seems that in a lot of the progress these days the big idea plays the big role, but here it seems it was really getting the details right that was the only way to get it to fully work.

You typically need both. You have to have a big idea for it to be interesting, original stuff, but you also have to get the details right — and that's what graduate students are for.

So Geoff, thank you, thank you for such a wonderful conversation, for part one of our season finale.
Info
Channel: The Robot Brains Podcast
Views: 45,472
Keywords: The Robot Brains Podcast, Podcast, AI, Robots, Robotics, Artificial Intelligence, Deep Learning, Geoff Hinton, backpropogation, neural networks, Google, Google Brain, Geoffry Hinton
Id: 2EDP4v-9TUA
Length: 88min 20sec (5300 seconds)
Published: Wed Jun 01 2022