Geoff Hinton speaks about his latest research and the future of AI

Video Statistics and Information

Captions
CRAIG: Hi, I'm Craig Smith, and this is Eye on AI. This week I speak to Geoff Hinton, who has lived at the outer reaches of machine learning research since an aborted attempt at a carpentry career a half century ago. After that brief dogleg he came back into line with his illustrious ancestors: George Boole, the father of Boolean logic, and George Everest, British Surveyor General of India and eponym of the world's tallest mountain. Geoff is one of the pioneers of deep learning and shared the 2018 Turing Award with colleagues Yoshua Bengio and Yann LeCun. A year earlier he had introduced capsule networks, an alternative to convolutional neural networks that takes into account the pose of objects in a 3D world, solving the problem in computer vision in which elements of an object change their position when viewed from different angles. He has been largely silent since then, and I'm delighted to have him on the podcast. We began, like so many of us do today, trying to get the teleconferencing system to work. I hope you find the conversation as engrossing as I did.

CRAIG: I don't think I need to introduce you, or that you need to introduce yourself, but I do want to recap what's gone on in the last year. It's been quite a year. Capsule networks had faded from view, at least from the layman's point of view, and resurfaced at NeurIPS last December with your introduction of stacked capsule autoencoders. Then in February, at the AAAI conference, you talked about capsule networks as key to unsupervised learning, and in April you revived the idea of back propagation as a learning function in the brain with the introduction of neural gradient representation by activity differences, or NGRADs.

GEOFF: I think it would be better if we started with capsules, and we do three different topics: we do capsules, we do SimCLR, and then we do the NGRAD stuff.

CRAIG: Can you talk about your new capsule idea, not new, a year old or longer now, and how it has influenced your research?

GEOFF: Okay, so several things have changed, and more things are changing right now. Originally with capsules we used supervised learning; we thought it would be easier to get things working like that, even though I don't really believe in supervised learning. Last year we switched to unsupervised learning, and we also switched to using set transformers.

What capsules are trying to do is recognize whole objects by recognizing their parts and the relationships between the parts. So if you see something that might be an eye and you see something that might be a nose, the possible eye can say where the face should be and the possible nose can say where the face should be, and if they agree on where the face should be, then you say, hey, they're in the right relation to make a face, so we'll instantiate a face; we'll activate the face capsule.

There are various problems with that. One is the issue of whether you try to train it supervised or unsupervised, and it's going to be much better to use unsupervised, because then you don't need labels. The other problem, which we overcame with stacked capsule autoencoders, is that if you've seen, say, a circle in a line drawing, you don't know whether it's a left eye or a right eye or the front wheel of a car or the back wheel of a car, and so it has to vote for all sorts of objects it might be a part of. If it's the back wheel of a car, it knows roughly where the car should be, and it can vote for there being a car there. But of course it might not be that; it might be a doorknob, or it might be a left eye, and so it makes lots and lots of votes. What happens then is that every higher-level capsule gets a huge cloud of votes, nearly all of which are wrong.

One way to try to rectify that is to say: if any other capsule can make use of that vote, to be part of its object, then route the vote there and don't route it to me. That was the idea of dynamic routing: you try to get all the bad votes to go to the places where they're good votes. It's complicated to make that work.

The alternative, which we use in stacked capsule autoencoders, is to say: if you discover parts, suppose you discover a circle and a triangle and a rectangle, you don't really know what they're parts of; there are many, many things they could be parts of. So what you ought to do is have them interact with each other a bit and use the spatial relations between them to allow each part to become more confident about what kind of part it is. If you're a circle and there's a triangle at the right relative position to be a nose if you're a left eye, then you get more confident that you're a left eye. And that's what transformers are very good at.

Transformers have a representation of, in the case of language, a word fragment. It might be the fragment "may", which happens to be a whole word, and they don't know whether that's a modal like "would" and "should" or whether it's a month like June and July. What they do is have the representation of that fragment interact with the representations of other fragments, and if there's another fragment in the sentence, for example "June", then the representation of "may" gets more month-like, whereas if there's another fragment that's "would" or "should", it gets more modal-like. After a few layers of that, the fragments have been disambiguated: you know much better what each fragment is meant to be. In language, that means you have a contextually sensitive representation of the word that has been disambiguated between its different meanings.

In vision, if you have something like a circle, you'd like to know whether that circle is an eye or the wheel of a car, and you can do that, without yet creating a face or a car, by this interaction between parts. In stacked capsule autoencoders that's what we do: we take the first-level parts and they all interact with each other, so they become more confident about what kind of a part they are. Once they're more confident about what kind of a part they are, they vote for what whole they might be a part of, and that way they can make far more specific, confident votes; they don't make lots of crazy votes. Once you've become convinced that a circle is probably a left eye, it doesn't vote for being the back wheel of a car. That means you've got far fewer votes to deal with, and so it's far easier to find the clusters.

We made it so that instead of trying to learn by supervision, by giving it labels, it learned to create a whole that was good at reconstructing the parts, and so that's unsupervised learning.

CRAIG: So at some point you need to connect it to language?

GEOFF: In stacked capsule autoencoders, almost all the learning is unsupervised. You recognize these parts, which are sort of templates that occur a lot (it's a bit more complicated than that), and then you recognize wholes that are combinations of these parts, and the objective function is to find wholes that are good at reconstructing the parts. In particular, find wholes such that if I tell you the pose of the whole thing, you can tell me the pose of the part. If I tell you there's a small face at 45 degrees in the bottom right-hand corner of the image, you can tell me that there should be a nose at 45 degrees, even smaller, in the bottom right-hand corner of the image. So the whole can predict the parts, and I didn't need any labels for that.
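To make the parts-vote-for-wholes agreement idea concrete, here is a minimal numpy sketch. The part-to-whole offsets, the detected poses, and the agreement threshold are all invented for illustration; this is not the stacked capsule autoencoder implementation, just the simplest form of the agreement test described above.

```python
# Minimal sketch of "parts vote for wholes": each detected part predicts the pose
# of the whole it might belong to, and the whole is instantiated only if the votes agree.
# Poses are simplified to (x, y, angle); the offsets and threshold are invented.
import numpy as np

# Hypothetical learned part-to-whole relations: where the face centre should be,
# expressed in each part's own coordinate frame.
PART_TO_WHOLE = {
    "left_eye": np.array([+1.0, -1.5, 0.0]),   # (dx, dy, relative angle)
    "nose":     np.array([ 0.0, -0.5, 0.0]),
}

def vote_for_whole(part_name, part_pose):
    """A detected part predicts the pose of the whole it could be part of."""
    dx, dy, dtheta = PART_TO_WHOLE[part_name]
    x, y, theta = part_pose
    c, s = np.cos(theta), np.sin(theta)
    # Rotate the stored offset into the image frame and add it to the part's position.
    return np.array([x + c * dx - s * dy, y + s * dx + c * dy, theta + dtheta])

def votes_agree(votes, tol=0.3):
    """Activate the whole's capsule only if the votes form a tight cluster."""
    votes = np.stack(votes)
    return bool(np.all(votes.std(axis=0) < tol))

detected = {"left_eye": (2.0, 3.0, 0.1), "nose": (3.0, 2.1, 0.1)}
votes = [vote_for_whole(name, np.array(pose)) for name, pose in detected.items()]
print("face capsule active:", votes_agree(votes))
```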
GEOFF: Now once you've done that, once you've got these wholes, you can then learn what they're called. Then supervised learning consists of taking wholes and learning their names. But you're not learning to recognize them when you're doing that; you're just learning what things you can already recognize are called, much like a little child learns to recognize cows and sheep. It doesn't know that cows are called cows and sheep are called sheep, and that's why it needs its mother to tell it. But its mother is not the one who teaches it how to tell the difference between a cow and a sheep.

CRAIG: With this kind of unsupervised learning in a larger system, would it also be able to make assumptions or inferences about relationships between objects, or the laws of physics, for example?

GEOFF: Those are two somewhat different questions. In the long run we'd like it to do that, but let's return to the laws of physics later on, when we talk about SimCLR. For now, it recognizes objects by seeing parts in the correct relationships, and you recognize scenes by seeing objects in the correct relationships. In a scene the relationships between objects are typically somewhat looser, but yes, it can do that: it can recognize that objects are related in the right way to make a particular kind of scene.

CRAIG: SimCLR came up later in the year. Can you talk about SimCLR and how that relates?

GEOFF: That's a different learning algorithm. It's different in many ways. It's not, for example, focusing on the problem of dealing with viewpoint equivariance, that is, as the viewpoint changes you get a representation that changes, so that you can cope with viewpoint easily. That's not the primary goal of SimCLR. What SimCLR is doing is saying: I want to learn to represent a patch of an image in such a way that other patches of the same image have similar representations.

So what you do is you take a crop of an image, and then you take another crop of the same image, and you say we're going to have a neural net that converts those crops into a vector representation, a pattern of neural activities, and we want those patterns to be similar if the crops came from the same image and different if they came from different images. If you just say "make them similar", that's easy: you just make all of the vectors identical. The trick is you have to make them similar if they came from the same image and different if they came from different images, and that's called contrastive learning.

Ting Chen, in the Google lab in Toronto, with some help from others of us, made that work extremely well. He wasn't the originator of the idea; in fact the first idea comes from work I did with Sue Becker in 1993 or '92, and then later work I did in 2002, but we never really made it work well for images. Other people revived the idea in 2018 and got contrastive learning working for crops of images, and then Ting Chen made it work considerably better, and that made people sit up.
GEOFF: So what happens is, once you've got this neural net that can convert a patch of an image into a representation, such that you get similar representations if two patches come from the same image, then you can use those representations to try to recognize what the objects are in the image, and that stage is supervised learning. But it doesn't require a deep net. The idea is you do unsupervised learning by using this deep net to try to get the same representation, or very similar representations, for two different patches of the same image, and different representations for patches of different images. After you've used the deep net to do that (Ting uses a ResNet, which is a standard kind of deep net), you then just directly learn to turn those representations into class labels, with no extra hidden layers. That's called a linear classifier; it doesn't have hidden layers in it, and it does remarkably well.

So a linear classifier based on those representations, which we got by purely unsupervised learning with no knowledge of the labels, can now do as well on ImageNet as a supervised method, provided that for the unsupervised learning we use a bigger ResNet. If you use a standard-size ResNet on ImageNet you get a certain error rate, and we can get pretty much the same error rate by using a bigger ResNet trained entirely unsupervised, with no knowledge of labels, and then training a linear classifier on top of the representations we extract.

CRAIG: And in that training, in one of the things I read, you talked about using augmented data.

GEOFF: Yes, it's very important when you do this. You can think of the two different crops as different ways of getting representations of the same image. That's the major thing you do, but you also have to do things like mess with the color balance. For example, if I give you two different crops from the same image, you can often recognize that they're from the same image by looking at the relative distribution of red, green and blue, the color histogram, and we don't want it doing that. So to stop it cheating like that, you take two different crops of the same image, and for one of the crops you change the color balance, and now it can't recognize they're the same just by using the color distribution. Those are the two most important ones: taking different crops and changing the color balance.

CRAIG: Is that augmentation something that the data scientist does in the data prep? It's not part of the model, so the model doesn't automatically augment the data?

GEOFF: Well, it's not part of the data prep, really. As you're training on the data, you'll get an image, you'll take two different crops of the image, and then you will augment those crops: you'll change the color balance. So you can't really think of it as modifying the data so much as, given an image, you then get these crops with modified color balance, and you can modify all sorts of other things, like orientation and so on.

CRAIG: That sounds, from a layman's point of view, very similar to what Yann LeCun is doing with video, where he takes a video and tries to predict what the next frame will be in an unsupervised manner. Am I wrong in that?

GEOFF: Well, it's not the same as trying to predict the next frame of a video. It is the same, however, as trying to extract a representation from the next frame that's easily predicted by the representation you extracted from the current frame. That's contrastive learning. You can do contrastive learning for videos, and you can say you're really asking the question: did these two frames come from the same video? That's a bit like asking: did these two crops come from the same image? You can use the same contrastive learning technique for that.
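Here is a compact PyTorch sketch of the contrastive recipe described above: take two augmented crops of each image (including color distortion), encode them, and train so that matching crops agree and non-matching crops disagree. The tiny encoder, augmentation parameters, and temperature are illustrative placeholders rather than SimCLR's published configuration.

```python
# Contrastive learning on two augmented crops of the same image, in the spirit of SimCLR.
# The small encoder stands in for a ResNet; augmentations and temperature are placeholders.
import torch
import torch.nn.functional as F
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(96),              # a random crop of the image
    transforms.ColorJitter(0.8, 0.8, 0.8, 0.2),    # mess with the color balance
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

encoder = torch.nn.Sequential(                     # stand-in for a big ResNet
    torch.nn.Conv2d(3, 32, 3, stride=2), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(32, 64),
)

def nt_xent(z1, z2, temperature=0.5):
    """Contrastive loss: the other crop of the same image is the positive,
    every crop of a different image in the batch is a negative."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # (2N, d), unit length
    sim = z @ z.t() / temperature                      # cosine similarities
    n = z1.size(0)
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage, given a list of PIL images `batch`:
#   z1 = encoder(torch.stack([augment(im) for im in batch]))
#   z2 = encoder(torch.stack([augment(im) for im in batch]))
#   loss = nt_xent(z1, z2)
# After training, freeze the encoder and fit a plain linear classifier on its
# outputs: the "linear classifier with no hidden layers" evaluation mentioned above.
```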
CRAIG: And then at AAAI, when you were talking on stage with Yann and Yoshua Bengio, you talked about capsule networks as a form of unsupervised learning that has promise going forward. SimCLR is another method. Are they related? Can they be blended to make unsupervised methods more powerful?

GEOFF: They're somewhat different approaches at present. You could clearly try to combine them; we're not doing that at present.

CRAIG: In Nature, I believe it was, there was a paper about reviving the idea of back propagation as a function in learning in the brain, and you introduced this idea of NGRADs, neural gradient representation by activity differences. Can you talk about that?

GEOFF: Neuroscientists have been very skeptical about whether the brain can do anything like back propagation, and one of the big problems has been how the brain communicates gradients, because in back propagation you need to change a weight in proportion to the gradient of the error with respect to that weight, whatever your error function is. The idea is that you represent an error by the rate of change of neural activity. That's nice because it can have both signs: neural activity can be going up or it can be going down, so you can represent both signs of error. It also implies that the learning rule which uses a gradient is going to be something like spike-timing-dependent plasticity. That is, when you change your synapse strength, you're going to change it in proportion to the error derivative, which means in proportion to the rate of change of the postsynaptic activity. It's going to be the presynaptic activity times the rate of change of the postsynaptic activity, and that's called spike-timing-dependent plasticity, which they found in the brain.

In fact, I've been suggesting for a long time that we use activity differences. I had a paper with Jay McClelland in 1987 suggesting that temporal differences in activity are used as error derivatives, and that was actually before spike-timing-dependent plasticity had been discovered. By 2005 I got interested in activity differences again, and much more recently people have managed to make that work quite well.

I'm still somewhat skeptical. I think the brain could do backprop that way if it wanted to, though it's a little clumsy, and I'm now skeptical because I think backprop is too good an algorithm for the brain. The brain is actually dealing with a very different problem from what most neural nets are dealing with. Most neural nets want to get a lot of knowledge represented in a modest number of parameters, only a billion parameters, for example. For a brain that's a tiny number of parameters; that's roughly the number of parameters you have in a cubic millimeter of brain. We have trillions and trillions of parameters, but we don't have many training examples: we only live for a billion seconds, or two billion seconds, so we don't get much experience, and we've got a huge number of parameters. Neural nets are mostly in the other regime: they get lots of training data and they don't have many parameters. If you've got lots and lots of parameters and not much training data, what you want to do is somewhat different from back propagation, I think.
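As a toy illustration of the learning rule just described, where the temporal difference in a neuron's activity stands in for an error derivative, here is a small numpy sketch. The activities, time step, and learning rate are invented for illustration; this is not a model of any specific NGRAD algorithm.

```python
# Toy illustration: a synapse changes in proportion to its presynaptic activity
# times the rate of change of the postsynaptic activity (an activity difference
# standing in for an error derivative). All numbers are invented.
import numpy as np

learning_rate = 0.1
pre = np.array([0.2, 0.9, 0.5])    # presynaptic activities at three synapses
post_before = 0.4                  # postsynaptic activity before top-down influence
post_after = 0.7                   # postsynaptic activity after top-down influence
d_post = post_after - post_before  # activity difference, interpreted as an error derivative

delta_w = learning_rate * pre * d_post   # STDP-like weight update
print(delta_w)                           # each synapse moves in proportion to its input
```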
GEOFF: So I got very interested in the idea that one way of making this activity-difference method work nicely is to try to generate agreement between a top-down representation and a bottom-up representation. The idea is you have, say, some hierarchy of parts. You look at an image, you instantiate parts at different levels, and then from the high-level parts you predict the low-level parts top-down. What you'd like to see is agreement between the top-down prediction, which depends on a larger context, and the bottom-up extraction of a part, which depends on a smaller context. From some local region of the image you extract a part; from many of those parts you predict a whole; from the whole you now predict the individual parts top-down. Those predictions of the parts have used more information, because they're based on the whole, which got to see more. What you want is agreement between the top-down prediction and the bottom-up extraction of the part representation, and you want it to be significant agreement. What you really want is that on the same image they agree, but on different images they disagree. If you take the parts from one image and the top-down predictions from another image, they should disagree. That's contrastive learning, as in SimCLR, but it also suggests a learning algorithm for the brain that is somewhat different from backprop, and I got very excited.

It's not quite as efficient as backprop, but it's much easier to put into a brain, because you don't need to go backwards through many layers; you just need to compare a top-down prediction with a bottom-up prediction. I call it back relaxation. Over many time steps it will get information backwards, but it won't get information backwards in one trial. Back propagation sends information all the way backwards through a multi-layer net on a single presentation of an image; back relaxation just gets it back one layer each time, and it needs multiple presentations of the same image to get it back all the way. So I got really interested in back relaxation and whether that might explain how the brain was doing this learning of multi-layer nets. But then I discovered that pure greedy bottom-up learning did just about as well; I hadn't done the controls carefully enough, and the greedy bottom-up algorithm I introduced in 2006, stacking restricted Boltzmann machines, actually worked as well as this back relaxation. That was a huge disappointment to me. I still want to go back and see if I can make back relaxation work better than greedy bottom-up.

CRAIG: I see, and that's the June tweet?

GEOFF: That's when I discovered that back relaxation doesn't work any better than greedy bottom-up learning.

CRAIG: Is the assumption that the brain is so efficient that, even if greedy bottom-up can do it on its own, there wouldn't be this top-down function? Or is it possible that the top-down function exists as an optimizer or something?

GEOFF: Well, you'd like this top-down prediction, and making it agree with the bottom-up extraction, to be better than just training a stack of autoencoders one layer at a time; otherwise it's not worth doing. And training a stack of autoencoders one hidden layer at a time turns out to be pretty good. Deep learning really got going in about 2006, when we discovered that if you train stacks of autoencoders, or restricted Boltzmann machines, one hidden layer at a time, and then you fine-tune it, it works very well, and that got neural nets going again. People then did things like speech, and vision on ImageNet, where they said you don't need the pre-training, you don't need to train these stacks of autoencoders, you can just train the whole thing supervised, and that was fine for a while. But then, when they got even bigger data sets and even bigger networks, people went back to this unsupervised pre-training. That's what BERT is doing, BERT is unsupervised pre-training, and GPT-3 uses unsupervised pre-training, and that is important now. So there was this on-again, off-again pattern: there was supervised learning, then I introduced unsupervised pre-training, then people said, ah, but we don't need that, we'll just use supervised learning, and now they're back to saying, oh, but we do need some unsupervised learning. The unsupervised learning algorithms are now getting more sophisticated.
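A minimal PyTorch sketch of the greedy layer-wise pre-training referred to above: train one autoencoder hidden layer at a time on the previous layer's codes, then stack the encoders and fine-tune. The layer sizes, optimizer settings, and random stand-in data are assumptions for illustration, not the 2006 recipe.

```python
# Greedy layer-wise pre-training: each autoencoder learns to reconstruct the codes
# produced by the layer below it; the pretrained encoders are then stacked.
import torch
import torch.nn.functional as F

def pretrain_layer(inputs, hidden_dim, epochs=10, lr=1e-3):
    in_dim = inputs.size(1)
    enc = torch.nn.Linear(in_dim, hidden_dim)
    dec = torch.nn.Linear(hidden_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        codes = torch.sigmoid(enc(inputs))
        loss = F.mse_loss(dec(codes), inputs)      # reconstruct this layer's input
        loss.backward()
        opt.step()
    return enc, torch.sigmoid(enc(inputs)).detach()  # frozen codes feed the next layer

data = torch.rand(256, 784)                        # stand-in for unlabeled images
enc1, h1 = pretrain_layer(data, 256)
enc2, h2 = pretrain_layer(h1, 64)

# Stack the pretrained encoders; a later supervised stage can fine-tune them end to end.
stack = torch.nn.Sequential(enc1, torch.nn.Sigmoid(), enc2, torch.nn.Sigmoid())
```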
CRAIG: And again, SimCLR, at least as it relates to computer vision, is one method, stacked capsule autoencoders is another, and there may be others still. On learning in the brain: I had a long conversation about a year ago with Rich Sutton about temporal difference learning, and there is a view that that algorithm describes what's happening in lower brain function, whereas what you're talking about is cortical learning. At what point do they meet, or are they completely different systems?

GEOFF: Different systems, yes. One of the big successes of computational neuroscience has been taking the work that Rich Sutton and others did on temporal differences and relating it to experimental studies on the brain and dopamine. Peter Dayan in particular was very important in showing the relationship between this theoretical learning algorithm and what's actually going on in the brain. But that's for reinforcement learning, and I think reinforcement learning is kind of the icing on the cake. Most of the learning is going to be unsupervised learning. You have to learn how the world works, and you don't want to learn how the world works by using reinforcement signals; you don't want to learn to do vision by stubbing your toe all the time. You want to learn to do vision some other way.

CRAIG: Is this giving you further insight into learning in the brain? I remember that was really your initial impulse in getting involved in all this.

GEOFF: My main goal in life has been to understand how the brain works, and all of this technology that's come out of attempts to understand how the brain works isn't really how the brain works. It's useful spin-off, but it's not what I was really after.

CRAIG: So it's not all part of one general stream that you're pursuing, headed toward a particular goal?

GEOFF: It's like this: if you've been around research for a while, you have a number of deep intuitions about how things should be, and then you have particular projects that are like particular instances combining those intuitions. Often projects that seem quite separate eventually merge, but for now the work on capsules is somewhat different, although all three of them can merge together. That is, if we can get the idea of top-down predictions and bottom-up predictions agreeing in a contrastive sense, so that they agree well for the same image and are very different for different images, that will fit in with stacked capsule autoencoders, but it's also an example of contrastive learning as in SimCLR, and it may also explain how the brain can learn multi-layer nets. So obviously I'd like to have one solution to everything, which is what everybody always wants; it's just that you have to be more realistic and get parts of this. You can't get the whole thing all at once.
CRAIG: With the rise of transformers in models like GPT-3, and now in capsule networks, which are primarily computer vision, there's kind of a convergence between computer vision and natural language processing. How do you see that convergence progressing? Those are the two principal components of consciousness, if I'm not wrong. Are we working toward an AI model that can perceive the world in a way that's closer to human perception, in that it blends the two?

GEOFF: Yes. One of the big motivations for capsules was that it would have representations more like the representations we use. A classic example: if you see a square rotated through 45 degrees, you have two completely different ways of perceiving it. One is that it's a tilted square, and the other is that it's an upright diamond, and what you know about it is totally different depending on which representation you use. Convolutional nets don't have two different representations of that; they just have one. To get two different representations you need something that imposes a frame of reference, and a very strong feature of our perception is that we impose frames of reference on things and understand them relative to those imposed frames. If you get someone to impose a different frame, they'll understand things quite differently. That was one of the big motivations for capsules.

It's also what happens in computer graphics. In computer graphics you represent a house with a particular coordinate frame, and then, relative to that coordinate frame, you know where the windows and the door are. Again, that's the kind of representation we need to get into neural nets if neural nets are going to get more like us at representing objects. At present, deep neural nets are very good at doing classification, but they do it in a completely different way from people. They rely much more on things like texture, and they can see all sorts of complex textures that we aren't sensitive to, and that's why you get these adversarial examples where two things look totally different to us but very similar to a neural net, and vice versa.

CRAIG: Google has just filed for a patent on capsule networks. Is that because of the successes of stacked capsule autoencoders?

GEOFF: I don't know all the motivations for filing the patent, but I think the main motivation, which is true for most of the patent filing Google does, is protective, in the sense that they don't want other people to sue them for using stuff they developed. Google really isn't interested in making its money out of patents; it's interested in making its money out of having great products, and it doesn't want to be prevented from using its own research in its great products. And the patent laws have changed in such a way that what matters is not who was first to invent something but who was first to file, so you have to file patents just protectively.

CRAIG: There's a paper that's under review right now, "Transformers for Image Recognition at Scale". You said that you're not one of the authors. Are you one of the reviewers?

GEOFF: I have read the paper, but I'm not answering that.

CRAIG: Does that relate at all to this use of transformers in capsule networks?

GEOFF: Yes, it does a bit. What it's showing is that the kind of interaction between parts that works so well in things like BERT for words, where you're getting word fragments to interact, also works when you're getting representations of patches of images to interact. It's also what's happening in stacked capsule autoencoders, where we have a set transformer that gets the representations of parts to interact with one another and become refined. But in stacked capsule autoencoders we then jump to high-level representations, and we're doing it all unsupervised, whereas in the paper with the 16-by-16 patches they're training it supervised, to perform classification; they're not training it unsupervised. So it's somewhat different, but this general trend of extracting pieces and then getting them to interact, so you get clearer about what the pieces are, which is what transformers do, seems to be a very good way to go about building better representations.
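For readers who want a picture of the 16-by-16-patch idea, here is a rough PyTorch sketch: cut the image into patches, embed each patch, and let the patch representations interact in a transformer, much as word fragments interact in BERT. The dimensions, the stock TransformerEncoder, and the omission of positional embeddings and a class token are simplifications, not the paper's exact architecture.

```python
# Patches of an image interacting in a transformer, then a supervised classification head.
import torch

patch, dim = 16, 128
to_patches = torch.nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
classifier = torch.nn.Linear(dim, 1000)

images = torch.rand(8, 3, 224, 224)                       # a toy batch
tokens = to_patches(images).flatten(2).transpose(1, 2)    # (batch, 196 patches, dim)
tokens = encoder(tokens)                                  # patches disambiguate each other
logits = classifier(tokens.mean(dim=1))                   # pool and classify (supervised stage)
# Positional embeddings and the class token of the real model are omitted for brevity.
```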
CRAIG: I'm going to ask a question that I'll probably cut out, because it's going to sound ignorant. Transformers, both in capsule networks and in natural language processing models like BERT or GPT-3, rely on massive numbers of parameters, right? Billions of parameters.

GEOFF: There are actually fewer parameters than convolutional neural nets, but okay.

CRAIG: But it goes out into the world of the internet, in GPT-3's case, and ingests all of this data, and then it looks to me like a kind of search: it's going out and finding something that matches a representation and then filling it in with what's already out there. Is that wrong?

GEOFF: That's wrong, yeah. You could go and find the closest thing; if you give it a story to complete, you can find the closest match on the web, and it'll do completions that are nothing like the closest match on the web. Basically it's taking all this data it's observed and boiling it down into these parameters that allow it to produce similar stuff, but not by matching to particular instances it's seen.

CRAIG: And in the same way, capsule networks are creating new representations?

GEOFF: And capsule networks should be able to deal with a new view of the same object.

CRAIG: So where is your research going now? On these three streams, or are there other streams?

GEOFF: My main interest has always been unsupervised learning, because I think that's what most human learning is. I'm interested in developing capsules further, and in things like SimCLR. I'm also interested in making distillation work better.

The idea of distillation is that you have a great big model, you've trained it on data, and it's extracted the regularities in the data and got them into its parameters, and now you want to train a much smaller model that will be as good as the big model, or almost as good, but that you couldn't have trained directly on the data. We see this all over the place. Insects are like this. Roughly speaking, the way most insects work is they have one stage that's just about extracting nutrients from the environment, and that's called the larva. It's just an eating machine: this great fat ugly grub, or a caterpillar in the case of a butterfly, just gets fat; that's its role in life. Then it basically gets turned into a soup, and out of that soup you build the adult, which may look nothing like the larva. A caterpillar and a butterfly are very different things, and they're optimized for different things: the larva is optimized for sucking nutrients out of the environment, and the butterfly is optimized for traveling around and mating, and those are very different activities from sucking nutrients out of the environment. Butterflies also get nutrients out of the environment, but they're not machines for doing that the way the larva is.

You also see it in mining, a nice Canadian example. If you want gold, first you take a chunk of earth, then you convert it to pay dirt (you have one way of doing that), and then you take the pay dirt and heat it up very hot to try to get the gold out. I think that's how it works. And it's the same for data mining. You've got a big set of data, and you'd like to end up with a small, agile model that can look at a new example and tell you what class it is, for example. But the kinds of models that are good at sucking structure out of the data are not necessarily the same as the models that are going to be small and agile and easy to use on your cell phone for making the right decisions. So the idea is you use one kind of model, a great big model, for sucking structure out of the data, and once you've sucked the structure out of the data, you get the great big model to train a small model. It turns out the big model is much better at training a small model than the raw data is.

It's like an apprenticeship, or like the way science works. Once scientists have done their research and figured out how things work, they can then teach school kids how things work. Any smart school kid can learn Newtonian mechanics, but no smart school kid could have invented Newtonian mechanics. Once Newton had invented it, which is the tricky part, you can explain it quite well and you can instill a model of it in a school kid. So the idea of distillation is that we use great big neural networks for getting structure out of the data, and then much smaller, more agile networks for actually using what we discovered. It's now being quite widely used; it's used with BERT, for example, to get more agile networks. But I think there are probably ways to make it much better, and that's another thing I'm working on.
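A short PyTorch sketch of the distillation objective described above: the small, agile model is trained to match the big model's softened output distribution as well as the true labels. The temperature and loss weighting here are common choices, not a specific published recipe.

```python
# Distillation loss: the student matches the teacher's softened predictions plus the labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft_targets = F.softmax(teacher_logits / T, dim=1)       # teacher's softened distribution
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, labels)       # ordinary supervised term
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Usage: run a batch through the frozen big model and the small model, then
#   loss = distillation_loss(small_model(x), big_model(x).detach(), y)
```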
CRAIG: On capsule networks: in the AAAI talk, you, Yann LeCun and Yoshua all agreed on the direction of your research and unsupervised learning, and a lot of what Yann LeCun has been doing with video sounds similar to what you talk about, both with capsule networks and SimCLR.

GEOFF: Yann and I share a lot of intuitions. We worked together for a while, and we have a very similar view of the world.

CRAIG: So can you talk about how, at some point, all of these ideas will converge, and how his research relates to your research, particularly his research on video?

GEOFF: Our goals are very similar and the methods are quite similar, and as we start applying SimCLR-like methods to video, they're going to get even more similar. This idea of contrastive representation learning seems to be very powerful, and Yann's exploiting it. Ting Chen made it work really well for static images, and we're now trying to extend that to video. But we're trying to extend it using attention, which is going to be very important for video, because you can't possibly process everything in a video at high resolution.
CRAIG: That's interesting, when you relate this machine learning to learning in the brain, and certainly attention is critical. Can you talk a little bit about how these models, even if they're not using the algorithms that you think are operating in the brain, are analogous to human learning? I mean, there's this huge amount of unsupervised learning that goes on, and, as you were saying with capsules, at the end there's a little bit of supervised learning that puts labels to representations.

GEOFF: Let me just clarify that the first few versions of capsules we did were all using supervised learning, because we thought that would make life easier, even though that's not really what we believed in. But now we've got it doing unsupervised learning, and it works better, and it's ideologically far more satisfactory.

CRAIG: But in the unsupervised capsule networks, at the end you connect it to language through supervision?

GEOFF: That's just to show that it's learned something sensible. Obviously you want to connect it to language. There's very nice work going on at Google now in robotics, where they're using deep learning for getting robot arms to manipulate things, but they're also interfacing it with language. Pierre Sermanet and Vincent Vanhoucke and others now have systems where you can tell a robot what to do, and the robot can also tell you what it's doing, and that seems very important. It also speaks to the objections that people like Gary Marcus have to natural language processing, saying it doesn't really understand what's going on. Well, if the robot says "I'm opening the drawer and I'm taking out a block", and it opens the drawer and takes out a block, it's very hard to say it doesn't understand what's going on.

CRAIG: The learning that's going on: for example, we mentioned at the beginning the laws of physics. You don't need language to learn the laws of physics, although you do need a linguistic interface to look at a tree and a car and identify them as a tree and a car. Can you talk about that kind of learning, something like the laws of physics, that doesn't require language to be attached to it but where learning nonetheless takes place?

GEOFF: At high school you may learn the laws of physics; what we learn before that is a sort of common-sense physics. We learn that if you throw something up it comes down again, and if we're good we learn how to throw a basketball so it goes through the hoop. That's a very impressive skill, because you're throwing it from twenty feet away and you have to get it right to within a few inches; that's an amazing thing to be able to do. We don't learn that by being told how to do it; we don't learn it using language at all. We learn it by what people would call trial and error, but really we're understanding how the world works just by observing the world, and also by acting in the world. Passively observing the world will allow you to understand it, but it's not nearly as good as acting in the world.

In fact, if you think about perception for robots that wander around and act in the world, it changes your view of how perception should work. If you're just taking images or videos and passively processing them, it doesn't make you think about attention, but as soon as you have a robot that's moving around in the world, it's got to decide what to look at. The primary question in vision is: where should I look next? That's been widely ignored by people who just process static images. Attention is crucial, and it's central to how human vision works.
CRAIG: Can you sum up a little bit? Everyone likes to hear about the convergence of all of these things: convergence of computer vision with natural language processing, convergence of unsupervised learning with supervised learning and reinforcement learning. Is that beyond what you're really focused on, because you're focused on the basic research rather than on building models that pull it all together?

GEOFF: Let me just say something about supervised learning versus unsupervised learning, because it sounds like a very simple distinction, but actually it's very confusing. When a kid's mother says "that's a cow", we tend to think of it in machine learning as the mother having supplied a label. But what's really happening is this: the child has some sensory input, and the child is getting a correlation between the visual sensory input and the auditory sensory input. The auditory input gives you the word "cow", the visual input gives you whatever your visual representation of a cow is, and then you learn that they go together. So supervision, when you actually get it in reality, is just another correlation; it's all about complex correlations in the sensory input. That covers supervised and unsupervised learning. And then there are correlations with payoffs, and that's reinforcement learning, but I think the correlations with payoffs don't have enough structure in them for you to do most of the learning. So most of the learning is unsupervised.

CRAIG: Okay, well, let's leave it there. I really appreciate it; it's been a fascinating conversation, and I'll edit it down to be coherent on my side. Bye for now.

GEOFF: Yeah, bye-bye.

CRAIG: That's it for this week's podcast. I want to thank Geoff for his time. If you want to learn more about today's episode, you can find a scrolling transcript on our website, www.eye-on.ai. And remember, the singularity may not be near, but AI is about to change your world, so pay attention.
Info
Channel: Eye on AI
Views: 18,978
Keywords: Geoff Hinton, Artificial intelligence, podcast
Id: N0ER1MC9cqM
Length: 49min 4sec (2944 seconds)
Published: Thu Dec 17 2020