Geoffrey Hinton talk "What is wrong with convolutional neural nets ?"

Video Statistics and Information

Captions
[Host:] Among his many titles, he was the founding director of the Gatsby Unit at University College London, he is University Professor at the University of Toronto, and he was recently appointed Distinguished Researcher at Google. Geoff is, of course, one of the founders and the main name in the field of artificial neural networks, as well as in the field of machine learning. His coming to MIT has been a bit of an event: at the talk at CSAIL yesterday there were 500-plus people and basically people couldn't get in, so today we're starting at 4:00 because I'm sure there's going to be a similar effect. There's also a panel discussion that Tommy is organizing at 6:00, with Geoff, in this same room, about the path to intelligence. I won't say much more, because we all came to hear him, except that the ransom of success might be what you see on his homepage, which says: "Information for prospective students: I will not be taking any new graduate students, visiting students, summer students or visitors, so please do not apply to work with me." Geoff, thank you so much.

[Hinton:] If, however, you'd like a job at Google, that's something else.

So, there are a lot of things wrong with the neural nets we're using. They've been quite successful at speech recognition and object recognition, particularly recently, but there are other things about them that are very unlike the brain, and that I believe are making them work not as well as they could. One thing that's wrong: any complex engineered system should have various levels of structure, and these neural nets have very few levels of structure. There are neurons, there are layers of neurons (which aren't at all like the layers in cortex), and there are whole nets, and that's it for most of these neural nets. Another thing that's missing is that they have no explicit notion of an entity. So, to the nativists in the audience: I am going to admit that it might be worth building into a neural net something to do with the idea that there are entities, and I want to build that into the architecture. That's what this talk is going to be about.

What I want to do is take the neurons in what we call a layer, group them into subsets, and have the activities of the neurons in those subsets represent different properties of the same entity. I want the neural net to decide what the entities are and how they interact with each other, but I want the built-in property that there are going to be entities. That's all I'm going to concede to the nativists. I also want to push the idea that a mini-column is the place where you represent an entity: one entity per mini-column. I'm going to call this thing, in the artificial nets, a "capsule". The idea is that a capsule has two kinds of instantiation parameters: one that says whether its entity is present in the current input (I'll mainly be talking about images), and others that describe the properties of this entity. If the entity is not present, you can say whatever you like about the properties and it doesn't matter; if the entity is present, you want to know its properties: things like its orientation, its size, its velocity, its color, and so on. What a capsule outputs, and what goes to higher-level capsules (because I want this in a hierarchy), is the probability that the entity is present, and the generalized pose of the entity, which in vision is going to be an object or a part of an object, and that includes all sorts of parameters.

What capsules do that does not go on in normal neural nets — the basic computation they are meant to do — is take predictions from lower-level capsules about what their generalized pose should be, each a multi-dimensional vector, and look for predictions that agree tightly. They don't care if there are lots of predictions that are outliers; all they're concerned with is whether there is a small subset of predictions that agree well. If you did computer vision many, many years ago, before it got silly, this is things like RANSAC and the Hough transform. So a capsule has a high-dimensional pose space, maybe 20 or 50 dimensions. Picture that space — a 20-dimensional space, say — with predictions coming in from capsules below: these are the vectors predicted for the pose of this capsule. What we want is something that will find the tight cluster among them and output a probability that says "I really am present, because that cluster didn't happen by chance; I've got lots of votes, I exist", along with the center of gravity of the cluster, ignoring all the outliers. Normal neural nets are not good at doing that; they may somehow be able to fake it, but they're not built to do that kind of thing.
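To make that computation concrete, here is a minimal Python sketch (my illustration, not code from the talk; the naive cluster search, the `radius`, and the presence formula are all assumptions):

    import numpy as np

    def coincidence_filter(votes, radius=0.5):
        """Capsule-style agreement finding: among high-dimensional pose votes
        from lower-level capsules, find a tightly agreeing subset, ignore the
        outliers, and report (presence probability, pose)."""
        best = np.zeros(len(votes), dtype=bool)
        for v in votes:                                   # naive cluster search
            close = np.linalg.norm(votes - v, axis=1) < radius
            if close.sum() > best.sum():
                best = close
        presence = 1.0 - np.exp(-(best.sum() - 1.0))      # more agreeing votes -> nearer 1
        pose = votes[best].mean(axis=0)                   # centre of gravity of the cluster
        return presence, pose

    rng = np.random.default_rng(0)
    votes = np.vstack([rng.normal(0, 3, size=(5, 6)),     # outliers: broadly scattered
                       np.array([1.0, 2, 0, 1, 0, 2]) + rng.normal(0, 0.05, size=(4, 6))])
    print(coincidence_filter(votes))                      # high presence, pose near the cluster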
Now, the point about high-dimensional coincidences is that they don't happen by chance. If I have two six-dimensional things that agree on every dimension to within ten percent of the normal variation, the chance of that is about one in a million. So if you see a few things that agree tightly, it can't just be a coincidence; something really must have caused it. A model for this is filtering intelligence information: if you see all sorts of conversations that say "New York", or that say "September", that doesn't mean much. But if you see a whole bunch of conversations — maybe just four of them — that say "New York" and "September the 11th", you should get extremely suspicious, because now it's a high-dimensional coincidence. Even if there are lots of other conversations that say "Chicago" and so on, the fact that it's a high-dimensional coincidence should make you think something real is going on.

I'm going to take a Marrian perspective and say that in order to understand the brain, we need to figure out what it is computing. I can give you a reason why the brain needs to do this computation, and my claim is that this is what mini-columns are doing. It's wild speculation, but at least it's Marrian wild speculation, because it starts from the computation that needs to be done.

OK, so here's what currently does object recognition, and does it very well — compared with methods that are even worse. This is largely due to Yann LeCun, who since 1987 has been getting backprop to do this and has developed the technology a lot; he calls them convnets. They have multiple layers of learned feature detectors: that's good. The feature detectors are local, and the spatial domains get bigger as you go up: that's good. They're replicated across space, because you believe that if a feature is worth having here, it's worth having there: that's good. And the feature-extraction layers are interleaved with max-pooling layers (or average pooling, or some other kind of pooling). What a pooling neuron does is look at nearby neurons in the layer below and attend to the most active one — it may be probabilistic, but let's say it just attends to the most active one — and report the activity level of that most active neuron, forgetting where it was. That gives you a small amount of translational invariance, and it gives you fewer active neurons, so you can afford more feature types at the next layer.
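A tiny illustrative sketch (mine, not the talk's) of what such a pooling unit does and the small invariance it buys:

    import numpy as np

    def max_pool(acts, size=2):
        """Non-overlapping 1-D max pooling: report how active the most
        active unit in each pool was, and forget where it was."""
        acts = acts[: len(acts) // size * size].reshape(-1, size)
        return acts.max(axis=1)

    a = np.array([0.1, 0.9, 0.2, 0.0])
    b = np.array([0.9, 0.1, 0.0, 0.2])   # same features, shifted by one position
    print(max_pool(a), max_pool(b))      # identical outputs: the position is discarded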
This is the part I don't believe in, and it's an integral part of the systems that work really well. The fact that it works so well is extremely unfortunate, because that's going to make it harder to get rid of; but that's just the way it is. As I was saying, pooling gives you some translational invariance because it throws away where the most active feature was. Actually, it doesn't have to lose positional information if you have lots of pools that overlap, but if you do that, you get less advantage from reducing the number of units.

Now, if you take these convolutional nets — I showed this yesterday, so this is a repeat, just briefly — you can take the activities in the last hidden layer of a net that's been trained to recognize lots of different categories of object, and train a recurrent neural net on top of them to actually give you a caption. I'll just show you what it does. You show it this image and ask what it sees, and the recurrent neural net says "two pizzas sitting on top of a stove top oven". If you run it again (it's stochastic), it'll say "a pizza on top of a pan on top of an oven". Whether it really understands these relations, or whether that's just in the language model, we don't yet know, because it has learned a model of language as well; but those two components together can do impressive things. There's another caption there.

So here's why I don't believe in the pooling — I believe in the convolution, I just don't believe in the pooling. First, it's a really bad fit to the psychology of shape perception; there are facts about the psychology of shape perception that argue very strongly that we're not using convolutional neural nets. Second, it's solving the wrong problem. We do not want the neural activities to be invariant to viewpoint, at least not until the very top; what we want is for the knowledge to be invariant to viewpoint, so that the same knowledge can be applied to something seen from a new viewpoint. That doesn't mean the activities have to be invariant. I'll go over each of these in more detail later. Third — and this is the worst property of convolutional nets — they fail to use the underlying linear manifold that makes it very easy to deal with the effects of viewpoint, the linear manifold that computer graphics uses. Here's a piece of natural reasoning: there are two things that have no problem with viewpoint; one is us (and other brains), the other is computer graphics; therefore they work the same way. And the last thing is that pooling is a very bad, very primitive way of trying to do routing.

In vision, what happens when you change the viewpoint is that the same thing shows up on different pixels. That's not usual in machine learning — change the illumination, for example, and it doesn't happen — but when you change viewpoint, you've moved information from one set of pixels to another set of pixels. That's what I call "dimension-hopping", and for machine learning you had better unpick it, otherwise you're not going to be able to make sense of things. Imagine two hospitals, where one codes patients by age, weight and financial status (this is America), and the second codes patients by weight, financial status and age. If you just took those records and applied machine learning without sorting that out, you wouldn't expect it to work very well. But in vision, that's exactly what viewpoint does, and we had better sort it out. That's a routing problem: we need to take the information in the pixels and route it correctly to the neurons that know how to deal with that kind of information. People doing parts-based models have faced up to that, but the convolutional-net people haven't really faced up to the fact that there is a routing problem to solve.
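A toy illustration (my own, with hypothetical numbers) of why dimension-hopping defeats a model that was never told about the routing problem:

    import numpy as np

    # One hospital codes a record as (age, weight, wealth); another codes the
    # same kind of record as (weight, wealth, age). Same information,
    # different dimensions -- "dimension-hopping".
    record_a = np.array([60.0, 80.0, 3.0])
    record_b = record_a[[1, 2, 0]]

    w = np.array([0.05, -0.02, 0.3])     # a linear model fit to the first coding
    print(w @ record_a, w @ record_b)    # very different outputs for the same patient
    # Viewpoint does this to images: the same feature lands on different
    # pixels, so information must be routed before knowledge can be reused.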
So now I'm going to go over these four arguments. The talk is mainly going to be me trying to convince you that convolutional nets are no good even though they work very well — or rather, that the max-pooling aspect of convolutional nets is no good. I'll start with the claim that they're a bad fit to the psychology of shape perception.

When people do shape perception, they do it by imposing rectangular coordinate frames on things. If you take the same object and impose a different rectangular coordinate frame on it, you don't even realize it's the same object; that's how much effect it has. I can show you an object, then show you exactly the same object again, with exactly the same retinal image, and if you impose a different coordinate frame, you won't realize you've seen it before. It's a huge effect, and convolutional nets just can't explain it: they can't explain how the same pixels can be processed completely differently depending on the coordinate frame, because they have no notion of imposing a coordinate frame. Now, that isn't necessarily quite true — it may be that neural nets with these multiple layers are so powerful that they are somehow imposing coordinate frames, and we don't know whether that's so — but let's suppose it isn't, and let me give you a demonstration that illustrates the power of coordinate frames.

Here's something you won't believe. I take a simple object like a tetrahedron — a pyramid with a triangular base — and I slice it with a plane, so I get two pieces. Then I take an intelligent person, give them the two pieces, and say "OK, make the tetrahedron", making sure they know what a tetrahedron is. And they can't do it. Now, presumably you don't believe that: it's just two pieces, surely you can put the bits together edge to edge. Well, I've been doing an experiment on MIT professors. I got one sample thirty years ago, a professor called Carl Hewitt. He was very smart. I gave him the two pieces, he looked at them for a long time, and after ten minutes he wrote down a proof that it was impossible — so his time to solve the puzzle was infinite. Today I've been doing the experiment on MIT professors again, and the number of minutes they take to solve the problem is roughly the number of years they've been at MIT; the length of time is definitely very positively correlated with how long they've been at MIT. So now I'm going to show you this puzzle, because it's extraordinary that it can be so hard: it's completely trivial, and there's a very obvious way to solve it that people don't figure out until after a few minutes. OK, here are the two pieces.
All right, I can do my little conjuring trick now; these two pieces really are the same. Here's what people do. They say "OK, tetrahedron" — well, that's not a tetrahedron. That's not a tetrahedron either. Then they try sticking the ends together, and then they do this — they really do this, I've witnessed them doing it — and they say, well, that's not a tetrahedron; what about that? That's not a tetrahedron. And after a few minutes, if they're young, they go "oh, that's a tetrahedron". So why is this puzzle almost impossible? Why is it so hard, when it's a two-piece jigsaw puzzle and they're MIT professors? Incidentally, I tried it on a Google vice president, just to reassure the MIT professors. I gave him the two pieces — "this is really hard; can you make a tetrahedron?" — and he did it straight away. Honestly, they didn't pay me to say that; it's true.

So why could this be so hard? Look: each piece has a couple of square faces, and you've got to get rid of those squares somehow, so you have to put the squares together. Everybody does that, and when it doesn't work, they try the other way of putting the squares together, and that doesn't work either. And why don't they do the thing that works? The answer is that when I show you this piece, there's a natural rectangular coordinate frame: it has a long axis, an axis across, and an axis up and down. And that coordinate frame does not line up at all with the natural coordinate frame you use for a tetrahedron — unless you're over 50 and went to an American public school. If you're over 50 and went to an American public school, you got milk in tetrahedral cartons, and they were stacked — they were stacked like that. So you have a model of the tetrahedron with a line there and a line there; and if you also have that model — which I'll call the "quadrahedron", because it's so different — then it's obvious: if you slice it here, you get a long thin rectangle facing one way; slice it there, you get a long thin rectangle facing the other way; and a mathematician will tell you that if you slice it halfway, it had better be in between, so you'd better get a square somewhere. If you have the quadrahedron model, this puzzle is completely obvious, and a crystallographer will do it instantly; but a normal person just can't. And that's because the frame of reference they use for the tetrahedron doesn't line up with the frame of reference they naturally impose on one of the pieces. There are other subsidiary things — the pieces have a different relationship to the whole than you expect — but anyway, that's the end of my evidence that convolutional nets are not psychologically correct. It's sufficient to convince me.

Irvin Rock pointed out all this stuff a long time ago. If you show people a map like this and ask what country it is, most people say "well, it's a little bit like Australia, but it's not Australia". But if you ask "what country do you think Sarah Palin might think this is?", people who are politically savvy say "Africa" (Sarah Palin actually thought it was a country). If you see this point as the top, then it's immediately obvious that it's Africa; but if you impose the other top, you don't recognize it.
A very familiar example is the tilted square. A tilted square and an upright diamond are two different mental objects. If you see this figure as a diamond, you're totally unaware of whether its corners are right angles: I can make a corner 85 degrees or 95 degrees or 90 degrees, and none of them looks better than the others. If you see it as a tilted square, you're accurate to within about one degree about whether that's a right angle. So there's lots of evidence for rectangular frames, and convolutional nets don't use them.

One more bit of evidence. What I've said so far tells me that people's visual systems recognize things by imposing rectangular coordinate frames — probably a hierarchy of them, just like computer graphics. Computer graphics has to say what the relation is between a part and the whole, so it imposes a frame on the whole, imposes a frame on the part, and then gives you the matrix that maps points relative to the frame of the whole to points relative to the frame of the part. It's all linear — that's the linear manifold — but you can only express it by imposing these coordinate frames. (Luckily there's a chance, in the discussion later, for Tommy to point out how wrong all of this is; I can read his mind.)

Now let me give you some more evidence, and this is evidence that not only do we impose coordinate frames, but we represent an imposed coordinate frame by a bunch of separate neural activities, not by one neuron that says "I'm imposing this frame". Suppose I show you this. You know within about 250 milliseconds or less that it's a capital letter R and that the top is here. What you don't know is whether it's a correct R or a mirror-image R, and to figure that out you mentally rotate it, step by step, to the upright orientation; when it's upright, you say "it faces the wrong way, so it's a mirror-image R". A more impressive demonstration of the same thing: if you go with a woman to a fancy shoe shop and show her a shoe, she can tell you within about 250 milliseconds how much it costs, who made it, and which other stores it's available in (it's a bit sexist, this example), but she can't tell you whether it's a right shoe or a left shoe. That's extraordinary: she knows all these properties of the shoe, but to tell whether it's a right or a left shoe, she has to do that laborious rotation into a canonical orientation.

How could that be? My argument is that you know the relationship between the object and the viewer — you have the pose of the object in viewer-centered coordinates — and you need to know whether the matrix relating the intrinsic frame of the object to the viewer's frame is a left-handed one or a right-handed one. If you have that matrix represented as a bunch of separate numbers, as in computer graphics, then to know whether it's left-handed or right-handed you need to know the sign of the determinant of the matrix. In other words, you need to solve a high-order parity problem: reversing the sign of any one of the numbers can flip the answer. Neural nets aren't good at high-order parity problems, and that's why we can't do handedness directly. The fact that we can't do handedness I take as strong evidence that our representation of the relationship between the imposed coordinate frame and the viewer is spread over many numbers, just as it is in computer graphics.
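As a concrete reading of that claim (my illustration, not from the talk): handedness is the sign of the determinant of the pose matrix, and flipping the sign of a single number can flip it —

    import numpy as np

    def handedness(pose_matrix):
        """+1 for a right-handed pose, -1 for its mirror image."""
        return np.sign(np.linalg.det(pose_matrix))

    M = np.diag([1.0, 1.0, 1.0])
    print(handedness(M))      # +1.0
    M[0, 0] = -1.0            # mirror one axis: change a single number...
    print(handedness(M))      # -1.0: ...and the answer flips, a parity-like problem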
So the way we solve it is: we do a continuous transformation that preserves handedness until we've got the object down to just checking one direction and seeing which way it goes, and then we can answer. That's what all those mental-rotation experiments are about: mental rotation isn't for recognizing objects, it's for dealing with the problem that you don't know the handedness. So the conclusion of argument one: people use rectangular coordinate frames embedded in objects and in the parts of objects, and when they represent the pose of an object — the relation between its embedded coordinate frame and the viewer — they represent it spread over a bunch of numbers, not in one neuron.

Argument two: equivariance. What convolutional nets try to do is make the representation not change with viewpoint. Of course, that's what you want the label to do; but when you look at a face, for example, you don't just say "face" without knowing where it is, what its orientation is, or whose it is. You look at a face and you know exactly what its orientation is and exactly where it is; you haven't lost that information, the way convolutional nets try to. You have it all, very precisely. Now, many neuroscientists think that big receptive fields mean low accuracy; that's exactly the opposite of the truth. If you want very high accuracy for where something is and you have a limited number of neurons, you should make the neurons have very big receptive fields that overlap a whole lot. You get much higher accuracy that way than by dividing space into many tiny regions, because the number of distinct regions created by overlapping fields is proportional to the amount of field boundary, and to get a lot of boundary you make the fields bigger. So the signature of wanting really high accuracy for these instantiation parameters is big fields. Of course, you lose resolution, so you can only afford to do that for things of which there aren't many: you can't do it for edges, because there are too many of them around and you don't want to lose the resolution, but for high-level things like faces there are only a few in view, so you can have big fields.

So what we want is a representation where, as you change the viewpoint, the neural activities change, and they change the same way the viewpoint does. Convolutional nets without max pooling are like that. People say convolutional nets give you invariance, but not at all: here's a 2, and here are the neurons that get active; translate the 2, and different neurons get active with the same pattern of activity. That's equivariance, not invariance: change this, and that changes, and it changes in the same way. Now, there are two types of equivariance — to use pseudo-neuroscience terminology. If we translate the image by a whole number of pixels, the pattern of activity moves across the neurons: as the pixels change, which neurons do the representing changes. I call that place-coded equivariance. There's a different kind, rate-coded equivariance, where as I move the object around, the same neurons keep encoding it, but their activities change: a rate code rather than a place code. What's happening in the visual system, I believe, is that at low levels we have very small domains, so tiny changes change the rates, and if you change by more than a tiny amount, you switch to another bunch of neurons — another capsule — and that's the place coding.
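A minimal sketch (mine, on made-up numbers) of the two kinds of equivariance:

    import numpy as np

    # Place-coded equivariance: a big translation makes a *different* neuron
    # (capsule) fire, with the same pattern of activity.
    acts = np.array([0.0, 1.0, 0.0, 0.0])
    acts_big_shift = np.roll(acts, 1)              # now a neighbouring unit fires

    # Rate-coded equivariance: a small translation keeps the *same* capsule
    # active, but its real-valued pose outputs change with the input.
    pose = np.array([2.0, 5.0])                    # (x, y) reported by one capsule
    pose_small_shift = pose + np.array([0.3, 0.0])
    print(acts_big_shift, pose_small_shift)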
It's like cell phones with their cells: move around a little and you stay in the same cell; move further and you jump to another cell and move around within that one. As you go up the hierarchy, the domains get bigger, so at a high level you can move a long way without changing which neurons are doing the coding, but the activities of those neurons change to tell you where the thing is.

Now argument three, which I think is the most powerful one. If you ask how current neural networks deal with invariance, the answer is that they just train on lots of different viewpoints. That's quite sensible; it just requires a lot of training data, and going through it takes a lot of time. They don't have a built-in bias to generalize in just the right way across viewpoint. So here's another bit of innateness — I'm getting soft in my old age; you allow the nativists to get even a foothold in your model and pretty soon they take it over. A much better approach is to use the linear manifold, which is what computer graphics uses. If we get from pixels to a coordinate representation of objects or pieces of objects — "it's this kind of thing, and here is its pose as a bunch of numbers" — then everything after that is linear, and you can do massive extrapolation. You can train on small upright faces, and then I can show you a huge upside-down face and you'll correctly recognize it, because on the linear manifold you can extrapolate. That's what current neural nets can't do, and if they could do it, we could train them on a whole lot less data.

So here's the idea. For many years, people have said you could think of vision as inverse graphics, but they didn't mean it literally, and I want to do it literally: inside my computer vision system, I want to be doing graphics backwards. Graphics takes the pose of a whole and multiplies it by a matrix to get the pose of a part; I want my computer vision system to take the pose of a part and multiply it by the inverse matrix to get the pose of the whole. That's what I call inverse graphics, and when I say "literally doing inverse graphics", I mean literally. If you do that, you can represent the relationship between a whole and a part as just a matrix of weights, and that matrix of weights is completely viewpoint-invariant: however much I change the pose, it's the same matrix that takes the pose of the whole and gives the pose of the part, or goes the other way and takes the pose of the part and gives the pose of the whole. So I get complete independence of viewpoint in the weights. That's where you want invariance — not in the neural activities, but in the weights — and this way you can get perfect invariance in the weights.
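A small sketch (my illustration; the transform values are made up) of graphics and inverse graphics sharing one viewpoint-invariant matrix:

    import numpy as np

    # Viewpoint-invariant relation between a whole (face) and a part (nose),
    # as a 2-D homogeneous transform.
    FACE_TO_NOSE = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, -0.4],
                             [0.0, 0.0, 1.0]])

    theta = 0.7                                    # an arbitrary viewpoint
    face_pose = np.array([[np.cos(theta), -np.sin(theta), 3.0],
                          [np.sin(theta),  np.cos(theta), 5.0],
                          [0.0,            0.0,           1.0]])

    nose_pose = face_pose @ FACE_TO_NOSE                       # graphics: whole -> part
    face_again = nose_pose @ np.linalg.inv(FACE_TO_NOSE)       # inverse graphics: part -> whole
    assert np.allclose(face_again, face_pose)
    # However the viewpoint (face_pose) changes, FACE_TO_NOSE never does:
    # the knowledge is in the weights, and it is perfectly viewpoint-invariant.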
Now, this is complicated by the fact that if you have multiple instances of the same entity, you have to have smallish domains, because each capsule can only deal with one thing at a time: it's using simultaneity to do the binding. It's saying "within my capsule there's a bunch of neurons, and their activities represent different properties of the same thing", so there had better be only one thing. When things are far apart, you can use big fields; when they're close together, you need small fields. You can make the fields overlap a bit and fudge it a bit, but if you violate this, your perception goes wrong — that's called crowding. If you put things too close together, you're very bad at seeing them; you're better at seeing them when they're further apart. But I'm not going to push that.

So, given this view of the world, here's how to do shape recognition — and this is computer vision from the 1980s. You identify some familiar part, like a nose, and you have a logistic unit that gives the probability that the nose is there; it varies between 0 and 1, and it's the one viewpoint-invariant bit. This other unit holds the pose parameters of the nose — for 2-D affine geometry, that's six numbers telling you the pose of the nose relative to the viewer. You then have a matrix that operates on this pose and gives you a prediction for the pose of the face. Over here you've identified a mouth; you take its pose parameters, operate on them with another completely viewpoint-invariant matrix, and get another prediction for the pose of the face. If those predictions are roughly the same, that's very good evidence for a face, because they will only be roughly the same if the nose and the mouth are related in the right way to make a face. So by checking the agreement, you're really checking that the parts are related correctly. If they agree, you say "hey, a face is here", you send up the average of what they predict, and you carry on. This is coincidence filtering used to recognize larger parts from smaller parts, and it's very robust: other capsules can be making different predictions for the face, and you just ignore those, provided you have a way of finding the little cluster that agrees among a lot of noise.

Old-fashioned vision people will say "that's just a Hough transform", and that is what capsules are all about: it is a Hough transform, but a modern one, in the following sense. In the old days, Hough transforms used fairly low-dimensional features to predict high-dimensional poses, and if the dimensionality of the feature is lower than that of the thing you're predicting, each feature predicts a whole subspace, so you need bins and each feature casts lots of votes into lots of bins. If you want each feature to cast a single point vote, so you don't need all those bins, the features must have as many dimensions as the high-dimensional thing being predicted. To get features like that, which you can reliably extract from pixels, you need machine learning — good machine learning. The reason people couldn't make Hough transforms work before is that they couldn't get low-level features that had enough dimensions and were reliable enough to make point predictions. That's my claim about how this differs from old-fashioned Hough transforms.
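Here is a minimal sketch (mine; the part-to-whole transforms and poses are made-up numbers) of two parts voting for a face and the agreement being checked:

    import numpy as np

    # Viewpoint-invariant part -> whole transforms (2-D homogeneous matrices).
    NOSE_TO_FACE  = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
    MOUTH_TO_FACE = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 1.0]])

    nose_pose  = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]])
    mouth_pose = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [0.0, 0.0, 1.0]])

    vote_1 = nose_pose @ NOSE_TO_FACE       # each identified part predicts
    vote_2 = mouth_pose @ MOUTH_TO_FACE     # the pose of the whole face

    if np.linalg.norm(vote_1 - vote_2) < 0.1:   # tight agreement = strong evidence
        face_pose = (vote_1 + vote_2) / 2       # send up the average prediction
        print("face present at\n", face_pose)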
The last argument I want to make is about routing. What viewpoint does is change where things show up in the image: a nose is here, on these pixels, and — supposing there's just one big face capsule at the top — you need to route these pixels to the face capsule; in a different image, you need to route those other pixels to the face capsule. I want to show you that there's a nice way to do routing. The zeroth-order way to do routing is to make an eye movement: you put whatever you're interested in at the middle of your retina, and that's a very powerful routing algorithm. But even if I force you to fixate at one point, you can still do routing, and here's how you might do it.

Convnets, as I already mentioned, do routing with a pooling unit that just attends to the most active input, at every level; that will route things, but purely on the basis of how loud they are. I want a routing principle where you route the information to the capsule that knows how to deal with it. So the idea is this. We assume the world is opaque and can be modeled by a parse tree: each part we discover has one parent (or possibly no parent), never multiple parents — a single-parent constraint. We want to find what each part is a part of, and there's going to be exactly one answer. Suppose I discover a circle, in a limited world that contains only cars and faces. The circle might be the left eye of a face, the right eye of a face, the front wheel of a car, or the back wheel of a car, and from the circle alone I don't know which. So I take the circle's pose and send it to all those places, but weighted: my bet is a quarter that it's this, a quarter that it's that, and so on. You send lots of weak bets to these higher-level capsules, and what a high-level capsule does is look at all the incoming weak bets and find a bunch that agree. Initially the low-level capsule weights its bets by priors: a circle might be part of a face, it might be part of a car, but it's probably not part of a fridge, so you don't send it to the fridge — or you send it with very low weight to the fridge and with high weight to the face and the car.

In these high-level capsules you then find the clusters. This is a magic computation: I'm not yet telling you how to do it. That's the Marrian approach — we reason about what computation needs to be done, from the nature of the task, and then it's somebody else's problem how to do it. Well, actually it's my problem, and I can't yet make it work very well, but I'll show you how I currently do it later. So: you sent your pose here, with this blue prediction, and it agreed with the cluster; you sent it there, with this other prediction, and it didn't agree with the cluster. Now what you want is top-down feedback, or lateral interactions. The top-down feedback from this capsule says "I can account for your prediction; please send me more of your output", and that capsule says "I can't account for your prediction; please don't send me your output". After a few iterations of that, you'll be sending all of this capsule's output here and none of it there, and you'll have a parse tree: you'll have established that this belongs to that. (There's obviously competition involved; I'm just trying to get the basic idea across.) So that's my proposal: either a lateral interaction between the two destinations, saying that the weights on the bets must add up to one — this one fits nicely and should get more weight, that one doesn't fit and should get less — or you do the same thing top-down.
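A rough sketch (my own; the Gaussian-style `fit` term and `var` are assumptions, not the talk's procedure) of such a routing-by-agreement loop:

    import numpy as np

    def route_by_agreement(votes, priors, n_iters=3, var=4.0):
        """votes[i, j]: pose vote of low-level capsule i for high-level capsule j.
        priors[i, j]: initial routing weights; each row sums to 1 (single parent)."""
        w = priors.copy()
        for _ in range(n_iters):
            # Each high-level capsule forms a weighted consensus pose...
            consensus = (w[..., None] * votes).sum(0) / w.sum(0)[:, None]
            # ...and a vote that fits a consensus gets its routing weight raised.
            fit = np.exp(-((votes - consensus) ** 2).sum(-1) / var)
            w = priors * fit
            w /= w.sum(1, keepdims=True)        # bets per low-level capsule sum to 1
        return w

    votes = np.zeros((2, 2, 2))
    votes[0, 0] = [1.0, 1.0]; votes[0, 1] = [9.0, 0.0]
    votes[1, 0] = [1.1, 1.0]; votes[1, 1] = [0.0, 9.0]
    priors = np.full((2, 2), 0.5)
    print(route_by_agreement(votes, priors))
    # Both low-level capsules end up routed almost entirely to capsule 0,
    # where their two votes agree; capsule 1 receives almost nothing.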
But notice that the top-down feedback is not like loopy belief propagation: it's not saying "I'm going to use this to revise my opinion about what's down here". It's only used for routing; it says "this capsule can account for me very nicely, so I want my information routed to this capsule". So it's a routing-by-agreement algorithm, using consistency to do the routing, as opposed to just ignoring everybody but the loudest unit, which is what max pooling does.

Now I'm going to show you two computer programs. Unfortunately, one of them was written by me, so it's not a very good program and it runs very slowly. The idea is that if we can build this basic module that checks for high-dimensional agreement and throws out outliers, then we can have a deep system. We have a front-end problem — how do you get to the first things that have poses? — which I call "de-rendering" the image: getting from pixel intensities, which are about light, to poses, which are about geometry. Then the higher levels all work the same way, putting pieces together into larger and larger pieces. I'm going to use a not-very-deep system to begin with, but what we want is to use the power of stochastic gradient descent and lots of data to learn a parts-based hierarchy, with no hand engineering other than what I've already said: building in the idea that there's going to be coincidence filtering and routing based on agreement.

So here's my proof of concept. Take MNIST digits. It's going to be convolutional in the following sense: in the first stage, I take a patch — the black patch — which has some weights connecting it to some hidden units; there's also a blue patch, which has exactly the same weights connecting it to its own hidden units. The way I process a patch is identical wherever the patch is; that's the convolutional part. These are nonlinear units, and what I ask them to give me is the poses of several different types of capsule — actually seven, but I've shown three here — capsules that detect different kinds of entity: maybe vertical edges, horizontal edges, corners, things like that. I'm not going to specify what these entities should be; everything is learned by gradient descent from the right answers. So the first stage of the system converts pixel intensities into vectors of pose parameters — instantiation parameters — with different pose parameters for the different kinds of entity; the pose outputs themselves are linear units. In addition, for each entity a logistic unit says whether the entity is there or not. Everything is learned, and everything is shared across the image by all the patches. Those units form what I call the primary capsules: the first level at which you have explicit pose coordinates. I'm going to use a very flat system: it's just got two levels, going from the first level of capsules to a second level. I've built deeper systems, but this is the easy one to understand.
In the second level there's a bunch of capsules — there should be the same number of them here as there; I was just lazy with the PowerPoint, sorry about that. So: you take a type-A primary capsule that's looking at the black patch. You've extracted some numbers from it (you learned how to extract them, as I'll show in a minute), and you apply a coordinate transform to that vector of pose parameters to make a prediction for the pose of the zero — this is a zero capsule. Also, from whether or not the type-A capsule is active, you make a prediction for whether the zero is present. That's what I call a "Picasso weight": it looks only at the type of the capsule and predicts whether the higher-level thing is present, ignoring all geometric constraints. If you see an eye, there might be a face, even if the eye is in completely the wrong place; if I put enough eyes and noses around, you'll see a face — or think it's sort of a face — even if the geometry is all wrong, just because eyes go with faces. So this weight says "because it's an eye, there might be a face", while the transform is doing the real work: it says "I take the pose of this high-dimensional feature I've identified, and I predict the pose of the zero". You do the same from all the different primary capsules, and you also put a weight on each prediction, so each one is a weighted bet. For the same black patch I'll have a type-B capsule detecting a different kind of feature, and it makes its own predictions for the pose of the zero, the pose of the one, and so on. And as well as the predictions coming from the black patch, I have predictions coming from all the other patches, for all the types of feature.

So I get a whole lot of predictions here, and — in particular — the weight matrix relating the pose of a type-A feature in the black patch to its prediction is going to be the same as the coordinate transform used for the red patch. That's not quite right, because the red patch is in a different place, so it should predict a different pose; we have to fix up the translational part. We do that as follows: when I make my prediction from a primary capsule about the pose of a digit, I take the first two coordinates, which I declare to be position, and I add the location of the patch in the image to them — I just add it to whatever the transform computes. That's the offset in position you get from the offset of the patch; everything else about the patch is treated the same, but as you offset the patch, you offset where it says the thing is. It also means those first two coordinates are going to be interpretable.

So that's the whole system. What we then want is: if you get a whole bunch of agreement here, it's a zero; if you get a whole bunch of agreement there, it's a one. So we need a measure of agreement, and we need to be able to backpropagate through it: "you're not getting much agreement here — please get more; you're getting too much agreement there — please get less". Then we can learn all the weights: the weights that map pixels to the poses of the primary capsules (we could also learn the coordinate transforms, but they're fixed for now, so we don't), plus various biases and things.
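A minimal sketch (mine; `W_pose`, `T` and the sizes are placeholders, not the talk's values) of one primary capsule's vote, with shared weights and the added patch offset:

    import numpy as np

    rng = np.random.default_rng(0)
    W_pose = rng.normal(size=(64, 6), scale=0.1)  # shared pixel -> pose weights (learned)
    T = rng.normal(size=(6, 6), scale=0.5)        # coordinate transform to the digit's pose

    def capsule_vote(patch_pixels, patch_xy):
        pose = np.tanh(patch_pixels @ W_pose)     # nonlinear: pixels -> pose is the hard part
        vote = pose @ T                           # linear: part pose -> whole pose
        vote[:2] += patch_xy                      # add the patch's image location to the
        return vote                               #   first two (position) coordinates

    # The same W_pose and T are used for every patch (convolutional sharing);
    # only the added offset differs from patch to patch.
    print(capsule_vote(rng.random(64), np.array([3.0, 7.0])))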
(In answer to a question:) Yes and no. If you want to get from pixel intensities to the poses of features, that's highly nonlinear — this part, getting the coordinates of a feature from pixel intensities, is nonlinear. But if you want to get from the pose of a feature to the pose of a larger thing containing it, that's completely linear, so there you really, really don't want any nonlinearity. It's a very good question.

In this little system I implemented, each higher-level capsule sees seven feature types and 181 patches in the image, so it gets that many predictions. A whole lot of them are very weak — it's organized so that blank patches make very weak predictions — and each prediction comes with a bet attached: for any one primary capsule, the bets on its predictions add up to one, and each bet is treated as a fractional observation of the prediction it's attached to. The high-level capsule is looking for agreement, and with over a thousand bets coming in, we need some way to find a subset that agree nicely. So we compute, for each high-level capsule, how good its agreement is, like this: in the pose space of the high-level capsule (a 6-D space, say), we fit a mixture of a Gaussian and a uniform to the predictions, and we ask how good a model that is of the predictions compared with fitting the uniform alone. It's a tiny unsupervised learning problem: I've got all these predictions, I want to model them, and I have two alternative models — one is just a uniform, the other is a uniform plus a Gaussian somewhere, where the Gaussian is allowed to float around. So as this neural network runs, the mean of the Gaussian is floating around, the variance of the Gaussian is floating around, and the mixing proportion of the Gaussian is floating around. What's not floating around — what's fixed, at least to begin with — are the coordinate transforms and the weights on the bets.

So we get a score for how good the cluster is. Suppose that for this cluster, these red dots have very high posterior probability under the Gaussian, these ones are explained about equally well by the Gaussian and the uniform, and these ones fall under the uniform. We compute the sum, over all the data points — each weighted by the fraction of an observation it is — of the log probability of that point under the Gaussian-plus-uniform mixture; we also compute the log probability of the points under the uniform alone; and our score is the difference between these two log probabilities. It's important to use something like this difference, because you want a stray bet over here not to affect the score much — and it won't, because it's covered by the uniform in both cases — whereas if you find a cluster, the score gets big, and if it's a tight cluster, the score gets much bigger.
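A one-dimensional sketch (mine; the talk's version is higher-dimensional and differentiated through) of scoring agreement by fitting a Gaussian-plus-uniform mixture with a few weighted EM steps:

    import numpy as np

    def agreement_score(votes, bets, n_iters=4, span=10.0):
        """Fit a mixture of one Gaussian and a uniform to 1-D votes (each
        weighted by its fractional bet) with a few EM steps, and return how
        much better it explains them than the uniform alone."""
        mu, var, pi = votes.mean(), votes.var() + 1e-3, 0.5
        uni = 1.0 / span                            # uniform density over the pose range
        for _ in range(n_iters):
            g = np.exp(-(votes - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            r = pi * g / (pi * g + (1 - pi) * uni)  # E-step: responsibility of the Gaussian
            wr = bets * r
            mu = (wr * votes).sum() / wr.sum()      # M-step, weighted by fractional bets
            var = (wr * (votes - mu) ** 2).sum() / wr.sum() + 1e-6
            pi = wr.sum() / bets.sum()
        g = np.exp(-(votes - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        mix = pi * g + (1 - pi) * uni
        return (bets * (np.log(mix) - np.log(uni))).sum()

    votes = np.array([2.0, 2.02, 1.98, 2.01, 7.3, 0.4])   # a tight cluster plus outliers
    print(agreement_score(votes, np.ones(6)))             # big positive score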
Now let's look at what it does. You take this score and treat it as the logit that goes into a softmax, to make a decision about which class it is. You scale the score first — that's a hack; if everything were probabilistically correct, you'd need a scale of exactly 1, but you don't get that. You look at the decision; if it's the wrong decision, you push up the score for the right class, push down the score for the wrong class, and backpropagate the derivatives of that through the whole system. So there's an inner loop that involves doing EM, which you run for about four iterations to find the cluster. I don't believe that's how the brain does it; it's just how this model does it. I believe the brain has some other way of doing this computation, but for now this is the best I've got.

So I do that computation and learn the whole system. After it's learned, you can show it a digit and look at the scores and clusters you get for the various high-level capsules (I'm not showing you all of them; there are others over here). I'm showing just the first two coordinates, which I know correspond to position. The zero capsule gets lots of very weak bets, coming from patches with nothing in them. The size of a circle here is the posterior probability of that bet being accounted for by the Gaussian, and the ellipse is about two standard deviations. The votes for a zero look like this; the votes for a five look like this — much more tightly clustered — so the five's Gaussian is sharper, which means it can give higher probability to the points. This is the log of the variance: when the log variance is negative, the Gaussian is sharp. This is the score — the difference between the log probability under the mixture and under the uniform — and this capsule gets a high score while the others get low scores, so it says "it's a five". One more example, showing more of the digits: this one gets a high score for the six, and it's a tight cluster — this cluster has a log variance of minus one, that one minus a half — so the image is a bit zero-like, but the system knows it's a six.

A system like this does about as well as a convolutional neural network on MNIST. The difference is that you can train a convolutional net on MNIST in somewhere between ten minutes and half an hour, whereas if you take my MacBook Air and train this, it takes two days — the inner loop of running EM to get the scores is very slow compared with just doing forward passes through a neural net. If we could find a fast way to do that, this would be fine. But it actually works: it manages to recognize digits by finding the agreement in the pose predictions. That's one demo; there's another demo that's a bit more convincing.

There are all sorts of things that need to be done to make this better. We need multiple simultaneous digits (I can do that, and it works fine). In the example I showed you, we didn't redistribute the votes — it was a purely bottom-up vote, with no reweighting based on routing by agreement; someone is currently adding routing by agreement to this, and it will work better. We didn't have deeper hierarchies. We didn't have real images — it's just MNIST. There's lots to be done. One thing we have done is try to get the primary capsules not by backpropagating the errors you make in digit classification, but by unsupervised learning.
You'd like to use unsupervised learning to get from pixels to entities that have poses, for that first stage; if you can do it with unsupervised learning, you'll need much less labeled data, for example. So we want to de-render the image — get from the image to entities that have poses — unsupervised, and we have several ways of doing that. Here's what we'd really like to do. (To simplify the PowerPoint I've assumed we're getting a two-dimensional pose, just position, but really we do this for the full pose, a full affine transformation.) You have an image, and you'd like to go through some nonlinear units and get out, in each of these different capsules, the position of an entity and, in this case, the intensity of the entity — I've changed from a probability of presence to an intensity, which is a slight change that I need for what comes later. The work I'm describing now was done in Tijmen Tieleman's thesis, which recently came out.

So that's what we'd like to achieve: we'd like these units to learn to be entities — they decide what kinds of entities they should be — and we'd like them to give us their poses and how strongly they're present. The way we do it is by having a simple graphics model. (You see? I told you: if you let the nativists in, they take over. We're going to have an innate graphics model now — but just a tiny one.) We're going to learn to reconstruct the image, by first extracting the capsules and then reconstructing an image from the capsules using built-in graphics. It works like this. You saw this part already: all of these weights are learned, and this part learns to extract some kind of entity — the pose of the entity and the intensity of the entity; it knows which entity it is because of which neurons are involved. Then these instantiation parameters are fed to the graphics system, and the graphics system has to reconstruct the image from them. The graphics system learns a little template for each entity, and — unlike a normal neural net — it is able to translate the template according to the x and y, scale the template's intensity according to the i, and then just add it into the image. So you've wired in some graphics, and you're learning to invert it; but the system also learns what the entities should be. All you wire in innately is the ability to take some entity you've learned and translate it — and, more generally, apply a full affine transformation to it.

OK, so if you ask what templates it learns when modeling digits, these are the templates: you can see they're little pieces of stroke with feathered ends, so you can add them together to make a nice digit. (This is now the version that can apply full affine transformations to the templates.) Let's see how it decomposes digits. If you show it this digit, this is its reconstruction, and for each of its ten capsules, this is the contribution from that capsule. If you look at one capsule, it makes different contributions for different inputs, but you'll see it finds the corresponding pieces: for this six, that part of the loop is coded here; for this two, this piece of the curve; for this eight, this bottom piece of the curve. These contributions, when you add them up, do a jolly good job of reconstructing the digits; it works very well for reconstruction.
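A minimal sketch (mine; in the real system the encoder and the templates are learned by minimizing reconstruction error) of that innate decoder — translate, scale, add:

    import numpy as np
    from scipy.ndimage import shift

    def render(templates, xys, intensities):
        """Innate graphics: translate each learned template by its capsule's
        (x, y) output, scale it by the capsule's intensity, and sum."""
        canvas = np.zeros_like(templates[0])
        for t, (x, y), i in zip(templates, xys, intensities):
            canvas += i * shift(t, (y, x), order=1)   # bilinear translation
        return canvas

    templates = [np.random.rand(28, 28) for _ in range(3)]   # stand-ins for learned strokes
    image = render(templates, [(2.0, 1.0), (0.0, -3.5), (4.2, 0.3)], [1.0, 0.7, 0.0])
    # Training would adjust the encoder and the templates so that `image`
    # matches the input; only the decoder is sketched here.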
So far, then, we've learned bottom-up what the pieces should be, and we've learned how to look at an image and extract the poses of those pieces. Having done that unsupervised, let's now do some supervised learning on top — and remember, once you've got to poses, everything is linear, so life is going to be easy.

Here's the idea, supposing for the moment we just had position as the pose parameters. You take each capsule's pose parameters, extracted from the image, and concatenate them all into a big vector. If the image is a particular shape, there's lots of mutual information among these numbers: they're all related in the right way, and they all change together as you change your view of the shape. So if you do factor analysis on this vector, it will find underlying factors — six for the affine transformation plus a few for deformation — that model it very nicely, because as you change your viewpoint, all of this stuff changes, and by changing the affine part represented by the factors, you can model that perfectly. The factor loadings will actually be modeling the relationship between the whole and the parts. If I had just one digit class, I could do that by fitting a single factor analyzer; since I've actually got a mixture of digits, I fit a mixture of factor analyzers. It would be nice to use just ten of them, but that doesn't work so well, so we fitted a mixture of 25 factor analyzers.

Now, you can fit a mixture of factor analyzers to the raw pixels, and if you do, these are the means of the factor analyzers. If instead you fit the factor analyzers to the capsules, and then reconstruct from what they tell the capsules to do, these are the means you get. This is all unsupervised, and you can tell it's done a pretty good job: it's a bit confused about fours and nines, but it has really nailed the ten classes. In fact, if you look at one of these factor analyzers, look at its factors, and look at what changing a factor does — take the mean value of a factor and subtract two standard deviations, you get this; add two standard deviations, you get that — then this is a factor dealing with italic-ness, this one with loopiness, and so on. (At this stage we didn't actually apply much affine variation, so most of the factors do deformation.)

Now suppose you want to recognize digits from very few labeled examples. Standard backprop on MNIST with 60,000 labeled examples gets about 1.6 percent errors, and with all sorts of tricks you can get that down to about 1 percent. If you use the very clever scattering transforms developed by Bruna and Mallat, you can cut the number of labels by a factor of about 30: with 2,000 labeled examples they do just as well, so they're getting much more statistical efficiency. Now do the unsupervised learning of the kind I just showed you, fitting a mixture of 25 factor analyzers to the concatenated instantiation parameters of the primary capsules.
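A rough sketch (mine) of that pipeline; note scikit-learn has no mixture of factor analyzers, so a Gaussian mixture fitted in factor space stands in for it here:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis
    from sklearn.mixture import GaussianMixture

    # capsule_poses: one row per image, the concatenated instantiation
    # parameters of all primary capsules (placeholder random data here).
    capsule_poses = np.random.rand(1000, 7 * 9)

    # Because pose-to-pose relations are linear, a linear latent model fits:
    # a handful of factors (affine + deformation) explain the covariation.
    fa = FactorAnalysis(n_components=10).fit(capsule_poses)
    factors = fa.transform(capsule_poses)     # fa.components_ hold the loadings

    # Crude stand-in for a mixture of 25 factor analyzers: cluster the factors.
    labels = GaussianMixture(n_components=25).fit_predict(factors)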
After you've done that, you take each factor analyzer, pick the example it is most confident about, and ask "tell me what its class is". So you ask 25 questions, and having asked 25 questions, you get about 1.75 percent errors on average. That's much closer to what humans can do: you show them a lot of digits, they ask you a few questions, and then they know how to classify new digits. So this is a case where, because we grabbed the linear manifold, we made unsupervised learning work: it found the natural classes, and having found them, you can learn with very few labels.

(Q:) How does that compare with unsupervised pre-training followed by supervised training? (A:) This is much, much better. If you do standard unsupervised-then-supervised — say, stacked autoencoders — you'll do about the same as Bruna and Mallat; you won't do much better than that. Maybe you can get down to a thousand labels, but this is almost two orders of magnitude better, because you grabbed the linear manifold.

(Q:) Could you look at the purity of the mixture components instead? (A:) That just won't get you down to 1.7 percent errors. I think to get 1.7 percent that way you'd probably need something like a thousand components — I'd have to check; I've tried these models with lots of components, and with 100 components I'd be very surprised if you could get down to that.

(Q:) Are you suggesting that low resolution is actually helpful, by suppressing details so you just see the essence and can learn more easily? (A:) Possibly. I don't want to commit; I haven't thought about that issue, and I'd want to think about it before saying anything. I'm not disagreeing — it might be helpful.

(Q:) A critical aspect of this is that the relational thing is a linear transformation, so why are humans so bad at rotational transformations of objects? (A:) There are two things you might mean by "bad". Humans are actually pretty good at recognizing objects in funny orientations. You're not going to do very fine discrimination that way, but if I show you an upside-down face, you know right away that it's an upside-down face; you don't know whose it is, but you know instantly that it's a face. There's probably only about a 10-millisecond penalty for that, whereas mental rotation takes more like hundreds of milliseconds to get the thing upright — a different order of magnitude. The things where you really have to do slow mental rotation are handedness judgments, or things like "will this grand piano fit through the studio door?", where you have to do physical reasoning. For recognition you're pretty good at dealing with different orientations: with that capital letter R I showed you, you instantly — within about 250 milliseconds — knew it was a capital R, and only then took hundreds of milliseconds to rotate it to check its handedness. That was two-dimensional; in 3D you do mental rotation too, but it's quite a lot slower.
(Q:) What about the learning time? (A:) I'm sure the learning can be brought down by a whole lot. You could start by not programming it in MATLAB — or, even better, by not having me program it; that would give you about an order of magnitude. But the real essence is this: what's different from normal neural nets here is that the core operation looks for agreement between activity vectors. It's not looking for agreement between a weight vector and an activity vector — that's what a filter does. We want agreement between activity vectors, which is going after covariance structure. And for that computation — finding high-dimensional agreements among a whole bunch of random stuff — there must be ways to make it efficient. I have some ideas about how, and I'll talk about them if they work.

Let's take speech. I have a long-suffering graduate student whom I got to do a thesis applying these ideas to speech. He was a very good student, and these ideas are quite difficult to get working on speech, but, for example, you can say there are changes in frequency, changes in onset time, and changes in amplitude, and you can try to get entities that tell you "this complicated acoustic event is happening, here's when it's happening, here's what frequency it's happening at, and here's its amplitude". He managed, in the end, to get things that worked as well as standard neural networks using capsule ideas in speech; but the ideas are more natural for vision.

[Host:] Let me remind you about the panel in this room at six, on the path to intelligence; in between, there's a reception outside. [Applause]
Info
Channel: trwappers
Views: 184,030
Keywords: geoffrey hinton, machine learning, ml, capsules, mit
Id: rTawFwUvnLE
Length: 71min 35sec (4295 seconds)
Published: Mon Apr 03 2017