Geoffrey Hinton: What is wrong with convolutional neural nets?

Captions
So it's about an alternative to convolutional neural nets. I should say the version I'm going to talk about doesn't really work very well; since then Sara Sabour at Google has got it working quite a lot better, but I'll talk about the old version because that's the one I have the slides for.

Most people who do neural nets now, the ten thousand graduate students in China who are doing neural nets, believe that the way you do neural networks is that you have layers of ReLUs, maybe you have an LSTM, and that's what a neural network is. What they don't realize is we just made that up. It doesn't have much to do with the brain; I mean, it's inspired by the brain, but because we just made it up, it's perfectly possible there are very different architectures that work better.

I'll give you a little lesson in history. For 30 years people tried making neural networks work with sigmoid units. Some brave people, like Yann LeCun, said: let's not use a sigmoid unit, let's shift everything down a bit and multiply by a factor of two, and the new unit is a tanh unit. And there was a big debate about whether you should use a sigmoid or a tanh, because tanh is better if you've got ill-conditioned problems. And that was pretty much it for the exploration of the kind of nonlinearity you should use, for about 30 years. It turns out that if you use the max of the input and zero, that actually works quite a lot better: it's easier to backpropagate through many layers, and there are all sorts of reasons why it generally works better. But people weren't looking; they just weren't looking. So it took 30 years to discover the rectified linear unit, which people in neuroscience were recommending ages ago. I was actually using them by 2004, because a guy called Hugh Wilson at York said it's much more like a neuron than these sigmoid units, because neurons almost never saturate in the normal regime, so you shouldn't be using sigmoid units. So I tried them, and they sort of worked, but I never did the experiments to show they were actually better; later on we did.
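The nonlinearities he compares are easy to put side by side. The sketch below is an editorial illustration, not from the talk: it evaluates each unit's gradient at a strongly driven input, which is one way to see why ReLUs are easier to backpropagate through than sigmoid or tanh.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # The rectified linear unit: the max of the input and zero.
    return np.maximum(x, 0.0)

# Gradients, which is what matters when backpropagating through many layers.
def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def d_relu(x):
    return float(np.asarray(x) > 0)

x = 10.0  # a unit driven hard, well within a neuron's normal operating range
print(d_sigmoid(x))  # ~4.5e-05: saturated, the gradient has all but vanished
print(d_tanh(x))     # ~8.2e-09: saturated even harder
print(d_relu(x))     # 1.0: the gradient passes straight through
```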
So what I want you to understand is that our field is in a situation where suddenly a particular recipe is working extremely well. It's working well enough to be useful, and it beats conventional computer vision techniques. It's pretty much the recipe Yann LeCun has been advocating for many years, except with rectified linear units. And there's no reason for believing there aren't very different things that are much better. OK, so here's a very different thing that isn't much better; so I've solved half the problem. Maybe in the end it will be.

If you go right back to the beginning of neural nets (I'm going to do a lot of ad-libbing in this section), at the beginning of neural nets people said: you can't do XOR. And other people said: it's true, you can't do XOR with the kind of unit we're using, which takes a sum and puts it through a logistic or a hard threshold. That is, you can't make (1,0) and (0,1) as inputs give a 1, and (0,0) and (1,1) give a 0; just in case you're too young to remember XOR. Now, it's very unfortunate that they chose XOR. They could have chosen something that's logically equivalent, which is the function "same": you can't make one of these units give a 1 if it gets two ones or two zeros as input, and give a 0 if it gets one of each. It's just the same function with a negation. So you can't make one of our standard units do "same". What's the standard solution? The solution is to put in a hidden layer and basically subdivide the cases: there's the one-one case, and you have something recognize that, and there's the zero-zero case, and you have something else recognize that, and if it's either of those, it's "same". So that's one way to handle it. There's a completely different way to handle the same problem, which is to give yourself a neuron that can do "same".
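The hidden-layer solution he sketches can be written out with hand-set weights; an illustrative snippet, not anything from the talk. One hidden unit recognizes the (1,1) case, another the (0,0) case, and the output unit ORs them to compute "same" (XOR is its negation):

```python
def step(x):
    # Hard-threshold unit: fires iff its total input is positive.
    return 1 if x > 0 else 0

def same(x1, x2):
    h11 = step(x1 + x2 - 1.5)     # hidden unit that fires only for (1, 1)
    h00 = step(0.5 - x1 - x2)     # hidden unit that fires only for (0, 0)
    return step(h11 + h00 - 0.5)  # "same" if either case was recognized

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "same:", same(a, b), "xor:", 1 - same(a, b))
```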
Now you might ask whether there's any neural inspiration for this. Well, there is actually neural inspiration for it, although it's not very relevant to this talk, which is that neurons send spikes. We know that in some places, for example dealing with the tiny delay between the two ears that tells you whether a sound came from this direction or that direction, a delay of much less than a millisecond, we can deal with that, and we do it by sending spikes both ways. (I'm aware that Doug Tweed is here and probably knows how we actually do it.) We send spikes both ways, and we have things arranged to look for the coincidence, and then you can tell; you do it by the coincidence of the spikes. So we do have coincidence detectors, and we suspect they're used in other places too, so it's not totally without neural precedent.

But forgetting the brain: we could have a different kind of unit that can do "same", and that's got very different properties from the neurons we have at present. The neurons we have at present can do the following: they can take an input vector and a weight vector and tell you how well they correlate. They take the scalar product (they could normalize if they liked); they can tell you whether this input vector and this weight vector correlate. What they can't do conveniently is take two input vectors and tell you whether those correlate; two activity vectors. They measure correlations between weight vectors and activity vectors. If you want correlations between activity vectors, you could have a whole bunch of different units, each asking: does this input correlate with this weight vector, and does that one too? You can imagine making a circuit out of what are called normal neurons. But you might just say: why don't we have neurons that can detect when two high-dimensional vectors are the same?
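The contrast he's drawing, between a unit that correlates its input with a fixed weight vector and a hypothetical unit that compares two activity vectors, can be sketched like this. Everything here is illustrative: the tolerance, the data, and the "same" unit itself are made up, not a real neural model.

```python
import numpy as np

def ordinary_unit(w, x):
    # What a standard neuron computes: how well the input vector
    # correlates with a fixed, learned weight vector.
    return float(w @ x)

def same_unit(x1, x2, tol=0.1):
    # The hypothetical "same" unit: fires iff two ACTIVITY vectors
    # agree to within `tol` in every dimension. No weights involved.
    return bool(np.all(np.abs(np.asarray(x1) - np.asarray(x2)) <= tol))

x = np.array([0.9, 0.2, 0.4])
y = np.array([0.95, 0.18, 0.38])  # nearly the same activity vector
z = np.array([0.1, 0.8, 0.7])     # an unrelated activity vector

print(same_unit(x, y))  # True: a high-dimensional coincidence
print(same_unit(x, z))  # False
```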
Maybe if we could do that, certain computations that were difficult would get a whole lot easier, and certain computations would obviously become more difficult. It might change the way you do business if, as a basic unit, you had something that can detect coincidences. OK, so that's one of the motivations.

Another motivation: why should we stick with neurons that do scalar nonlinearities? At present, neurons apply a linear filter, they get one number, and then there's this debate about what scalar nonlinearity to apply. There's one place where we use neurons with a vector nonlinearity, which is a softmax. In a softmax, the output of every unit depends on all the inputs, so it intrinsically takes a vector and gives you a vector, and everything depends on everything else. It's not a bunch of scalar nonlinearities (you can quibble, but basically not). What I want to say is: let's think about the space of vector nonlinearities. It took us 30 years to go from this guy to this guy, and you can imagine there are a lot more vector nonlinearities than there are scalar nonlinearities. We should be exploring that space in search of good ones, and we'd better have some motivation, some insight into what we're after. I think being able to deal directly with covariance in activities is what we're after. If you take an image and measure the covariance structure, then change all the intensities but keep the same covariance structure, and show it to someone, they'll say: oh, that's just a different rendering of the same thing. If you keep the intensities pretty much the same, change them a bit, but change the covariance structure, they'll say: that's something quite different. It's the covariance structure that really matters in an image, so why not have neurons that pick up on covariance structure directly? Actually, I think my ad-libbing is much more convincing than the slides, because with the slides you get to see hard facts you can disagree with.
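The softmax he points to as the one familiar vector nonlinearity is a one-liner, and the coupling he describes (every output depends on every input) is easy to see numerically. A small illustrative sketch, not from the talk:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract the max for numerical stability
    return e / e.sum()

v = np.array([1.0, 2.0, 3.0])
p = softmax(v)
print(p, p.sum())  # a distribution over the whole vector; sums to 1

# Perturb ONE input and EVERY output changes. That coupling is what
# makes softmax a vector nonlinearity rather than a bunch of scalar ones.
q = softmax(v + np.array([0.0, 0.0, 5.0]))
print(q)

# An elementwise nonlinearity like ReLU has no such coupling:
print(np.maximum(v, 0.0))
```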
So, the idea of a capsule. A capsule is a vector thing: it's got a vector activity, and capsules in one layer send information to capsules in the next layer, and a capsule in the next layer gets active if it sees a bunch of incoming vectors that agree (if you're a computer vision person, think Hough transform). Of course, the capsule in the next layer doesn't see just the output of the capsules below; it sees the output multiplied by a weight matrix. So this capsule sees the output of this guy multiplied by a weight matrix, and the output of this guy multiplied by a weight matrix, and maybe many more, and if those products of activity vector and weight matrix agree, if it gets good agreement even if there are some outliers, it will say: hey, I found something. And a high-dimensional coincidence is really significant. If you get six-dimensional things to agree, even if each dimension only agrees to within ten percent, the chance of a six-dimensional thing agreeing at random is like one in a million; I mean, if it's a tenth of the normal disparity on each dimension, then it's a millionth of the disparity of two random things. So a high-dimensional agreement is a really, really significant thing, and it's a much better filter than what we do at present, which is to apply some weights and see if you get above a threshold. Obviously we know that if you stack that up and train it by stochastic gradient descent, it can do amazing things, but if you go back to basics, it's not in its nature to automatically be looking at covariances between vectors, and I want units where that's part of their nature.

OK, so the idea of a capsule is it'll have a vector, and what that vector will do is represent the different properties of a thing, of an entity. The entity might be a little fragment, or it might be something much bigger, and if we weren't doing vision, it might be something else altogether.
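His "one in a million" is an order-of-magnitude claim. For two random 6-D votes with coordinates uniform in [0, 1), the exact chance of agreeing to within 0.1 in every dimension is (2*0.1 - 0.1**2)**6, about 5e-5: still vanishingly small, which is the point. A quick Monte Carlo check (an editorial illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
dims, trials, tol = 6, 2_000_000, 0.1

# Two independent random "votes" per trial, uniform in [0, 1) per coordinate.
a = rng.random((trials, dims))
b = rng.random((trials, dims))

# A single coordinate agrees to within tol with probability 2*tol - tol**2;
# full 6-D agreement needs all six coordinates to agree at once.
empirical = np.all(np.abs(a - b) <= tol, axis=1).mean()
analytic = (2 * tol - tol**2) ** dims
print(empirical, analytic)  # both around 5e-5
```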
But the other aspect of capsules is that we're going to try, inside the net, to get entities. If you take your typical neural net, it's not committed to entities: it has a big layer of ReLUs and then another big layer of ReLUs, and there's no more structure than that. There might be local connectivity and convolution and so on, but it's not committed to finding multi-dimensional entities, whereas if you look at how people deal with the world, they deal with the world in terms of objects that have properties.

Going back in time: Hebb published a book in 1949 called The Organization of Behavior, where he said something very interesting. Everybody here knows about Hebb because he talked about the coincidence of postsynaptic and presynaptic activity having something to do with how synapses change. (He didn't say it was the product; he said both of them were involved in how synapses change.) But he said something else, which people ignore, which is: we're never going to understand anything about vision until we understand why, when you do vision, you get objects, why it always happens in terms of objects. We don't really understand that, and capsules are also addressing that concern. They're saying: we're going to understand the world in terms of entities, these entities are going to have properties, and what a capsule is going to do is make a fundamental commitment.

If you go back a long time in NIPS, there was something the NIPS community called the connectionist binding problem. Some people thought that meant there were too many papers submitted to NIPS to be able to hold them in one volume without the binding breaking, but that wasn't it. The connectionist binding problem was: if you've got, say, a blue circle and a red square, and you try to represent them in terms of one-dimensional things like blueness and shape, you'll activate blue and red, and you'll activate square and circle.
The binding problem is that you've got blue and red, and square and circle, on different coordinates, and you don't know whether there's a blue circle and a red square, or a blue square and a red circle; you don't know what goes with what. As long as you have multiple things, you can't represent them by just representing their coordinates; you have to sort of put brackets around them, if you see what I mean, to hold coordinates together.

So there's another idea in capsules, which works nicely for vision, which is that we can assume that in one region of an image you only have one thing of a particular sort. Obviously, when we're dealing with things like edges, that means we'd better use pretty small regions: if we have something that detects an edge, we'd better say that in one tiny region of the image there's only one edge, and I just have to decide what its orientation is and whether it's there or not. This works when you're dealing with vision of opaque objects, because in any direction you only see one thing. (When you look through windows, that's violated, but let's not worry about that.) So if you want to use the activities of a bunch of different units to represent different properties of the same thing, which is what we do in a capsule, you'd better believe there's never more than one of those things in your receptive field. In fact, you'd better make that the fundamental commitment, and that's what enables you to use simultaneity to solve binding. If I'm a capsule and I've got a bunch of units and several of them are active at once, then just the fact that they're active together means they apply to the same thing, and so I've solved the binding problem. I sort of cheated: we actually solved the binding problem at a higher level by going sequential. If you go sequential, that is, you only do one thing at a time, then you can bind together arbitrary things just by simultaneity. I want to do that at a low level too, in each of these capsules.
OK, so there's a big commitment here, which says you can't have two of the same thing in the same place, in the same receptive field. Now, you might think that's a very, very strong commitment and not really a good idea, but there's lots of psychological evidence that I take to corroborate that position, or that's at least compatible with it, which is called crowding. When you put several different features of the same kind in the same receptive field, it gets much, much harder to see things. A psychologist, Denis Pelli at NYU, has done a lot of research on this, and it turns out the difficulty of seeing things isn't really how big they are (you can make them very small and still see them very clearly); it's to do with whether there are similar features in their immediate surroundings. If there are, you can't see them very well, and that's why text gets hard to read. You've probably all had this experience: if you take a page with a nice font and you enlarge the letters but keep the spacing the same, it gets much harder to read, and that's because you get crowding; and if you've got something like that and you shrink the letters, it gets easier to read. That's kind of amazing. So that's the evidence that we really may have this commitment to only one thing of a kind in a position: low down in your visual system it's simple things, and as you go higher up it's complex things. So when I get up to face capsules, I can make the commitment that (obviously I fixate somewhere) in any region I might have multiple capsules that can detect faces, but each one of those capsules can only see one face.

Associated with each of these capsules, I'm also going to have something that says whether it's active or not. So I'll have a logistic unit, because logistic units are good at probabilities, that says: is this thing present? So let's suppose I'm a capsule and I'm getting votes. Typically we think of these as high-dimensional votes, but I'll just use two dimensions because they're easier to draw.
So all those things are votes. If I see a pattern of votes where there are all these stray votes, and then there's a number of votes (not necessarily a big number) that all agree tightly, that's evidence that my entity is present, because there's a whole bunch of lower-level capsules all saying: I think the kind of entity I might be part of is present, and these are its coordinates, these are its properties. And the properties agree. Think about filtering terrorist chatter: you intercept one message and it says "September", you intercept another and it says "New York", and another that says "2001". You might see lots of messages that say September, and lots of messages that say 2001, and lots that say New York. But suppose you get a small number of messages, and each of them says "September the 11th, in New York". That's just a coincidence of a place and a time, but if that same coincidence keeps cropping up, that's much stronger evidence than the individual terms popping up. You can think of that as looking at the covariance of things in messages: seeing the same covariance pattern is really good evidence, as opposed to the individual things.

So now I'm going to trash one of my favorite researchers, Yann. Actually, I'm a bit late: the field has already trashed this. When I wrote this talk, everybody was doing max pooling; now people are quite happy not to do max pooling. You can stay, if you like, with the pixel grid and go all the way up with the pixel grid, and these convnets still work just fine. When Yann was doing convnets, he thought max pooling was a really crucial part of the convnet, that that's how you get invariances: by ignoring exactly where a feature occurred. I don't think that's the most interesting aspect of convnets, and I don't think that's the right way to get invariances.
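The "tight cluster among stray votes" test can be caricatured in a few lines. This is only a toy (a neighbour count, not the iterative routing-by-agreement of Sabour et al.; the radius, vote counts, and data are all made up for illustration):

```python
import numpy as np

def capsule_agreement(votes, radius=0.05, min_votes=4):
    # A capsule fires if enough votes agree tightly, even among outliers.
    votes = np.asarray(votes)
    d = np.linalg.norm(votes[:, None, :] - votes[None, :, :], axis=-1)
    support = (d <= radius).sum(axis=1)  # neighbours within radius, incl. self
    best = support.argmax()
    if support[best] >= min_votes:
        cluster = votes[d[best] <= radius]
        return cluster.mean(axis=0)      # the agreed-on pose
    return None                          # no coincidence: stay quiet

rng = np.random.default_rng(1)
strays = rng.random((20, 2))                                       # scattered votes
agreed = np.array([0.42, 0.61]) + 0.005 * rng.standard_normal((6, 2))  # tight cluster
pose = capsule_agreement(np.vstack([strays, agreed]))
print(pose)  # close to (0.42, 0.61): the entity is present with these properties
```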
When I see a face and the face moves or changes orientation, I'm aware of the change of orientation. When I see a face, I see not just that it's a face: I see all the relationships of the face to my retina, the things to do with viewpoint; I'm aware of all those. So it's not that my full representation of the face is independent of viewpoint. Obviously the bit that says the face capsule got excited will stay excited even as I change the viewpoint, so that's an invariant bit, but the representation of the face includes all these properties, and these properties all change when the face changes. So what we really want, at least in capsules, is equivariance rather than invariance: we want it so that when the properties of the object change, the properties of the representation change in the same way. And convnets do that for you. Do I have a picture?

An important property of convnets was pooling, and capsules are designed to be an alternative to pooling. If you ask how one of these convnets sees things: after these layers of pooling, it's lost information about the pose; it's lost explicit information, like the pose of the object. Obviously the information is in there implicitly, in the relative activations of all sorts of filters. And one particular thing a convnet can't do is have very different percepts of the same input.

So here's a bit of psychology now. (This talk is going to be very rambling; I have a wonderful excuse for a rambling talk, and eventually the clock will reach one o'clock and I'll stop.) If you take a square and rotate it 45 degrees, so here's our square rotated 45 degrees, you can perceive it as a square at a tilt of 45 degrees, or you can see it as an upright diamond, and those are utterly different percepts. That is, if I show it to you and then take it away and ask you questions, the questions you can answer are completely different depending on which way you perceived it.
If you perceived it as a tilted square and I ask you which was higher up in the image, this corner or that corner, you'll have no idea; if you perceived it as an upright diamond, you'll be exquisitely accurate about which corner was higher. If I show you an upright diamond where the angles aren't quite right angles and you see it as an upright diamond, you'll be completely insensitive to whether the angles are right angles; you just won't know. If you see it as a tilted square that's slightly off a right angle, you'll be exquisitely sensitive. So you know different things about the object depending on the frame you impose on it; our knowledge of objects is all relative to the frame we impose.

I've got lots of demonstrations like this, and I can't resist giving you my favorite demonstration, which some of you will have seen already, but hopefully most of you are young enough not to have. I take a wireframe cube on a tabletop. How many people know this demo? Not very many, great. It's a big wireframe cube, made of matte black rods about this thick. So here it is, a big cube, and you can imagine it, and if I ask you where the corners are, you can say: they're here, and here, and so on. Now, from your point of view, I take the top front right-hand corner and the bottom back left-hand corner; there are the two corners of the cube. And what I'm going to do is rotate the cube (a cube, right, what could be simpler than a cube?) so that this corner is vertically above that corner. So here we go. And now here's what I want you to do. Take the fingertip of your left hand and put it at a fixed position in space to represent the bottom corner, and take your right hand and put it above it, for the top corner. Now I want you to point in space to where the other corners of the cube are. Obviously not the top one, we know about that; where are the other corners of the cube? Off you go.
You have to do it to be convinced. The modal response (and this is a bad audience for it, because you guys are used to spatial things; take a normal audience) is they go: well, here, here, here, and here. They point out four corners, and they're sort of completely unaware that they've lost some corners somewhere. A cube, if you think of it in the normal way, has got eight corners, and some of you (I won't call you out) will have pointed out four. You did something very interesting: you pointed out a shape with four corners around the middle, which is basically two square-based pyramids stuck base to base. That's the modal shape; not most people, but a large number of people point out that shape. And that shape has a very interesting property: it is the dual of a cube. It's got exactly the same symmetries if you replace corners by faces and faces by corners, but it's got eight faces, whereas a cube only has six faces, and these faces come in fours. So what you know about a cube, which is strange given its name, is that things come in fours.

Now, there's a completely different view of a cube, which I call the hexahedron; it's such a different view that it needs a different name. A hexahedron consists of a tripod like this, with right angles in the tripod, and another tripod like this, rotated 60 degrees. Where all these fingers join is the top corner and the bottom corner, and there's a benzene ring there, which is the other corners: they form a zigzag ring of six corners. Now, you just don't see that. If you're a crystallographer you see it, but if you're not a crystallographer, most people don't see that at all. The tilted square versus diamond thing, people are kind of used to, and it doesn't really impress them, but this thing, you just know nothing about.
And yet it's a cube. The answer is that I forced you to use a coordinate system where the body diagonal was one of the axes, by saying: there's the bottom corner, there's the top corner. I'm forcing you to do it relative to that coordinate system, and relative to that coordinate system there's a nice simple shape called a hexahedron, which is this guy, but it's utterly different from a cube. The reason I bang on about this is that neural nets doing vision do not have these two completely different views of a cube. You put in the cube; they're doing stereo, say, they're doing 3-D vision, they can see where all the edges are; but they don't have, internally, two utterly different representations of the same thing. And that tells us something: it tells us they're not doing it the way people do it.

For those of you who like these kinds of puzzles, I'll give you one more that you can do in your spare time. Take a tetrahedron: a triangular base and three triangular faces. Now imagine slicing it with a plane so you get a square cross-section. Most people think: wait a minute, it's all triangles. It turns out you can. Now, something more impressive: I take a tetrahedron, I slice it so I get a square cross-section, I give you the two pieces, and I tell you these are two halves of a tetrahedron; all you have to do is put them together to make a tetrahedron. I did this on MIT professors, and it turns out the number of minutes it takes you is roughly proportional to the number of years you've had tenure. (That's slightly dodgy, because someone who'd had tenure a long time just gave up, and Carl Hewitt, who invented things called actors, after ten minutes had a proof that it was impossible.) Now, this is a two-piece jigsaw puzzle. What's more, all the faces are different shapes, so you can solve it by just saying which two faces have the same shape: they're the ones that have to go together, because if I put two faces that don't have the same shape together, I'm going to get a ledge where it doesn't fit.
Right? Intrinsically, it doesn't seem like a difficult puzzle. But what happens is your perceptual system sees these pieces and imposes a frame of reference on the pieces, and then you just can't see the answer, because it does the equivalent of the hexahedron-versus-cube thing: it sees the pieces in a way where it's using coordinate frames that don't align with the coordinate frames used for the tetrahedron, and you can't do it; it just blocks you. We have to take this kind of evidence into account if we want to know how people see things. Convnets can't do that; convnets can't fail that way. My capsules can. OK, that's lots of motivation; you probably want to know how the thing works. Well, we can skip all this stuff.

OK, here's another argument. People doing neural nets will keep telling you it's very important to take this high-dimensional data and find the underlying manifold, because once you've got the underlying manifold you can do interpolation and all sorts of things. They won't often say that once you've got the underlying manifold you can extrapolate for huge distances, because after all, the manifold is curved: you can use a Taylor expansion to get some distance, but you can't go huge distances. But what if the data they were dealing with was well known to have an underlying manifold that's completely linear? What would you think of researchers who preach that you should find the underlying manifold because that will make life easy, while there was a well-known linear manifold that they weren't finding? That would be silly, right? So what's that linear manifold in vision? Well, we use it all the time. If you want to take two faces and morph between them, you could train a big neural net to do that, but an easier way is to find the coordinates of identified feature points in the faces, like the centers of the eyes and the tip of the nose and the corners of the mouth.
Get those coordinates, and figure out how to redraw the face from the coordinates, so you can go the other way too. Then if you want to interpolate between two faces (there's a texture bit, which we will ignore), if you want to interpolate the shape, what you do is just interpolate coordinates. And in fact you can now extrapolate coordinates, to make caricatures and things, because in the coordinate representation things are linear. There's this wonderful linear manifold where, as I change coordinates linearly, I get more faces; I don't get things that aren't faces, provided I apply the same linear transformation to all the coordinates. If I apply one transformation to where the eyes are and a different transformation to the nose, it messes up the face; but if I take the coordinates of the eyes and the tip of the nose and the mouth and apply the same linear transformation to all of them, hey presto, I get another perfect face; it's just a very different size or position or shear or whatever. So there's this very prominent linear manifold that underlies shape. Graphics people know all about it, because that's the representation they use; graphics has used this representation for decades (I read the first few pages of a graphics textbook, and that's how they do it). They take these coordinates and they transform them: from the coordinates of a big thing, like a house, they compute the coordinates of a window, and the coordinates of a door, and the corners of a roof. They do a whole bunch of that until they get down to triangles, and then they do rendering, which is something else entirely; it's not until they get there that they start dealing with the properties of light, before that it's pure geometry. And the way they deal with all this geometry of parts and wholes is by using the coordinate representation. So the idea of capsules is that we're going to do the same thing.
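The linear-manifold trick he describes (interpolating, extrapolating, or applying one shared linear transform to landmark coordinates always yields another valid shape) looks like this in code. The landmark positions below are made-up numbers, purely for illustration:

```python
import numpy as np

# Hypothetical 2-D landmarks (two eye centres, nose tip, two mouth
# corners) for two faces; the coordinates are invented for this sketch.
face_a = np.array([[-1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [-0.7, -1.0], [0.7, -1.0]])
face_b = np.array([[-1.2, 0.9], [1.2, 0.9], [0.0, -0.1], [-0.8, -1.2], [0.8, -1.2]])

def morph(a, b, t):
    # In coordinate space the face manifold is linear: any blend of two
    # faces is a face, and t outside [0, 1] extrapolates to a caricature.
    return (1 - t) * a + t * b

def transform(face, matrix, shift):
    # Applying the SAME affine transform to every landmark changes the
    # pose (size, rotation, shear, position) but it is still a face.
    return face @ matrix.T + shift

halfway = morph(face_a, face_b, 0.5)
caricature = morph(face_a, face_b, 1.5)  # extrapolate past face_b
bigger = transform(face_a, 2.0 * np.eye(2), np.array([5.0, 0.0]))
print(halfway[2], caricature[0], bigger[2])
```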
We're going to try to get those properties of an object. Some properties will be things like the albedo or the velocity or whatever, but we want a bunch of properties that represent the pose: a representation that captures position and scale and orientation and shear as a bunch of separate numbers. A long time ago I was really interested in how you deal with viewpoint, which is six-dimensional, and I proposed that you have neurons with great big receptive fields to represent viewpoint, and you overlap a bunch of them. It turns out that in 6-D, if you have big overlapping fields, you can be much more efficient than with small fields: if you tile the space with small fields you need a huge number, but if you make the fields 10 times as big, you need 10^5 times fewer neurons; the exponent is the dimensionality minus one, so it doesn't work in 1-D. That's called coarse coding. But it's still not efficient enough, and the reason I went that way was that I wanted to be able to represent several things at once. If you make the commitment that anywhere you attend to one thing at a time, you can get away with just having coordinates to do it, so you can go back to the graphics representation. And of course, once you've got to that representation, you can do massive extrapolation. Think about how neural nets deal with viewpoint: the way they currently deal with it is viewpoint variation in the training data. If you ask whether a neural net is going to be able to see a small amount of viewpoint variation in the training data and then deal with a massive viewpoint variation: no. Capsules, if they ever worked, would be able to do that. (I'll talk about dynamic routing at the end. Yeah, I talked about that.)
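The coarse-coding arithmetic he quotes (fields 10 times bigger in 6-D viewpoint space need about 10^5 fewer units, with the gain vanishing in 1-D) is just scale to the power dims minus one. A trivial check, added for illustration rather than taken from the talk:

```python
def coarse_coding_gain(scale, dims):
    # Making receptive fields `scale` times bigger cuts the number of
    # units needed to cover the space by scale**(dims - 1): no gain at
    # all in 1-D, a factor of 10**5 in 6-D viewpoint space.
    return scale ** (dims - 1)

print(coarse_coding_gain(10, 6))  # 100000
print(coarse_coding_gain(10, 1))  # 1
```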
people say, well, it's a bit like Australia but it's sort of the wrong way around: Sydney is sort of here, but Perth is over there. But as soon as you realize that the top might be up here, you realize it's Africa. Now, you can get away in the States with calling Africa a country... here you realize Africa's not actually a country, but they had to do the mental rotation. So, I mentioned this already: what I want in these neural nets is not that when you change viewpoint the representation is invariant; I want the representation to be equivariant. So here's a basic convolutional neural net: you take an image and you convolve it with a filter and you get those hits. Now you move the image over and you convolve it with the same filter and you get those hits, and what you'll notice is that it's not the same hits, it's the same pattern of hits. So what's happened is that's an equivariant representation: the activation moved when the image moved, and you can imagine the same for all the other degrees of freedom. So convolution is intrinsically equivariant; there's nothing about convolution that makes things invariant, it was the max pooling that made things invariant in convnets. And I wanted to say which two types of equivariance we're going to try and get into capsules. If you move something slightly within the receptive field of a capsule, what will happen is this: let's suppose we were coding its position by two activities that represented its x location relative to the center of the receptive field and its y location relative to the center of the receptive field. If you move it slightly, it'll stay represented by the same capsule, so the logistic unit that says this thing is present won't change, but the numbers that say where it is will change. I call that rate coding, because the values of the numbers changed but which neurons were involved didn't change. Move it a lot and it'll move to another receptive field, and now the fine position is going to be coded by
where it is relative to the center of that other receptive field, but the coarse position is going to be coded by which receptive field it's in, which capsule is active. Good, so we've got two different ways of coding position. We've got what a neuroscientist would call place coding, which is that which capsule is active tells you where it is, and there's rate coding, which is that how active things are tells you where it is. And of course in the brain, when you're dealing with only one thing, like if you've only got one right eyeball and you want to specify where it's pointing, you can use rate coding for that, just saying what angle it should be at; when you want multiple things, you have to use place coding. So low down in the visual image there's lots of edges, and so we'd better use a whole bunch of capsules, each small. Each capsule, by being active, will say we've got an edge round about here, but it will have several dimensions, and the relative values of those dimensions will tell you where the edge is within that receptive field. That's what a neuroscientist would call the phase of a quadrature pair of simple cells: if the edge moves a lot it'll go to different simple cells, but if you have two simple cells in the capsule, the fine position will be carried by the phase. And the idea is that what we can do with capsules is try to turn place coding into rate coding. The ultimate place coding is pixels, which just have places (well, they have an RGB rate, a rate for R, G and B, but that's all), and they're very, very fine places. What the visual system is going to be doing in the inferotemporal pathway is turning place coding into rate coding. When you get high up, I don't want a whole bunch of different face capsules. The reason I don't want that is that I want to associate properties with the fact that this one logistic unit that says a face is there is active; I don't want a whole bunch of different ones. What I want is a sort of face capsule that covers the big area that I'm looking at in central vision, and
I want the different properties of the face, like exactly where it is and what orientation it's at; I want that in rate coding. So the inferotemporal pathway is a device for turning place coding into rate coding; I said that already. And if you can get coordinates, then you should be able to do huge extrapolations. So here's probably the most important picture; this is the basic idea of the talk. Here's a capsule, and let's just concentrate on geometry for now. It's got some active units that represent the coordinates of a mouth; these are the things you would extract if you wanted to render the mouth in a particular place. And because the capsule is doing vision, it's got this logistic unit that's trying to decide whether there's actually a mouth there. Over here we've got a capsule that detects noses: it's got the coordinates of the nose and something that says whether it exists. Then up there we've got a capsule that's going to detect a face. Now, you could just say a mouth is evidence for a face and a nose is evidence for a face, so let's just add them up and say it's a face, and if you're looking at a Picasso that might be a good bet. But notice that even in a Picasso, where Picasso will jumble up the relationship between the nose and the mouth, he still has to keep the relationship between the pieces within the mouth, because otherwise it wouldn't be a mouth. So you can cheat to a certain extent, but to identify any part at all, you do it by the spatial relations of its pieces. So now the idea is this: if I find a nose and I know its pose, I can multiply that pose by a matrix and get a prediction for the pose of the face (let's suppose faces are rigid things), and if I find a mouth, I can predict the pose of the face. And the way I'm going to see whether this nose and this mouth are correctly related, so that they should be segmented together, is by getting them each to make a prediction for the face; if they agree, hey presto, it's a face. Then ask what
happens now if I change viewpoint. Notice what I end up with here is an ability to look at this vector coming in (actually it's a little matrix, really), this little matrix coming in that's a prediction for the relation of the face to the retina, and look at this other matrix that comes in, which is this guy times this guy, and I need to take those two matrices and say: are they roughly the same? That's why I need this thing that looks at covariance, covariance of activities with activities, rather than covariance of activities with a weight vector. And the thing to notice is that if I now change viewpoint, suppose I've learned this and I now take a radically new viewpoint, which by sheer good luck still leaves the same parts visible (we're not dealing with occlusion yet), then I get a very different pose for the nose and a very different pose for the mouth. They get multiplied by the same matrices here, because this matrix represents the relationship between a mouth and a face, and that is something that is genuinely viewpoint invariant: how your mouth is related to your face doesn't depend on viewpoint, it's just an intrinsic property of the face. Of course, in order to represent this intrinsic property by a matrix like this, you have to have a coordinate system for the mouth and a coordinate system for the face, and if you choose different coordinate systems you'll get very different matrices here; that is, you'll get different decompositions, and that's what all the hexahedron-versus-cube stuff is about. But while this pose matrix will vary as you vary the viewpoint, the matrix you multiply by is fixed, and it really does capture the intrinsic spatial structure, completely independent of viewpoint. That's much better than what pooling does, which says: over a little region, we'll try and make things independent of viewpoint. This says: at least for this capsule, which I agree only works over some small region, we'll get perfect independence of the knowledge from the viewpoint. And when you make a prediction, this varies and this
varies, but the fact that they're the same doesn't change. Now, you should immediately start worrying about scale: how similar these are depends a bit on the scale and so on, so you're going to have to do some normalization too, but you get the basic point. This system, if you can extract these poses, is a viewpoint-invariant way of recognizing the face. OK, so now the question is simply: can you make that work? Can you design this thing that detects agreement so you can make a visual system like this? Notice this is much more like good old-fashioned geometric vision, the way vision was back in the 80s. I talked to David Lowe in about 1981, and this is David Lowe's view of vision from about 1981: it's geometric vision. This is a Hough transform, essentially a high-dimensional Hough transform. It's a nonparametric Hough transform, because in Hough transforms what they would do is take this space up here, grid it, and look for hits in the same cell of the grid. You can't do that in a six-dimensional space; it's not efficient enough. What we need to do is just take a bunch of votes and have some way of seeing if the votes agree. You can't do it by gridding, but it's still the same idea. What happened in vision is that David Lowe designed SIFT features, and SIFT features were designed so you can get these pose parameters, but SIFT features also had a whole bunch of descriptors. Then machine learning came along, and those dumb people in machine learning threw away the pose and just kept the 128-dimensional descriptor of what's going on in the image patch and said: look, we can do machine learning on this. And then vision turned into doing machine learning on SIFT features or their derivatives, which was just crazy, and eventually people started thinking about how to deal with geometry properly. So you could think of me as saying I want to take vision back about 30 years. When you're old, the only way you can stay current is to
drag the field back to 1986; if you can drag the field back to 1986, you're in good shape, because that's when you could still think. So I'm going to try and do this for neural nets one more time. What we really ought to be doing is asking: what's the real lesson of neural nets? Is the lesson of neural nets that ReLUs are the way to do everything? Well, maybe it is; I mean, ReLUs aren't the way to do everything, not quite, not recurrently, sorry, but that's a very interesting lesson. But they're not qualitatively better than tanh and things like that; they're just better. Is the lesson that a big flat layer of neurons, and lots of layers, is the way to do everything? Well, it's true it works depressingly well, but it's not very satisfying. If you're a scientist, you believe that the way to deal with structure is with modularity, so I have some sympathy for crazy people like Steve who think everything is modular, and this big flat stuff is very unsatisfying; this capsule idea is much more modular, we have these little modules. Is the lesson of neural nets that stochastic gradient descent will really fit models well if you have a lot of data and you train them discriminatively? I think that's the real lesson of neural nets. I don't think it's these particular architectures we apply it in; sure, the architectures have been selected so that stochastic gradient descent on mini-batches, on huge data sets, with a discriminative objective function, works well. But once you've seen that, you ought to believe that you can take any old combination of functions (maybe they need to be smooth, maybe they don't even need to be smooth, but they shouldn't be gratuitously discontinuous), take any old combination of nonlinearities, maybe vector nonlinearities, why not, stack it up, go from pixels to categories, and train it with stochastic gradient descent, and if Ilya were here he would say: how could it fail? It's going to work, right? And some will work better than others, and would you believe that
the ones with the right structure will work best. So convnets definitely work better than non-convnets; the evidence for that is overwhelming now. If you're going to put in any structure, put in replication of feature detectors across positions; that's just a no-brainer, because it's advance knowledge about the task: you want the knowledge in the network to be the same everywhere, you want the knowledge to be translation invariant. It's not that you want the representation to be translation invariant: when you translate, the representation can change, both the rate coding of it and the place coding of it, but the knowledge should be the same. And that's what you put in when you make a convolutional net: you're saying the knowledge should be the same everywhere, and if I learn something here, I want that knowledge to transfer over there, and I do that with the weight sharing. But once you've seen that there's a huge win from putting in a basic observation about the nature of the task, which is that the knowledge should be the same everywhere, why not get other huge wins by putting in a slightly more detailed observation about the nature of vision, which is that the thing that causes real problems in vision is viewpoint variation; that's one of the two biggest things. And viewpoint causes real problems because it takes the information about a part of an object and moves it to different pixels. If you think about it, that's crazy for machine learning: you have an input vector, and what you're doing is taking some information and moving it to different components of the input vector. That's like a medical record where one hospital codes things as blood pressure, age, financial status, and another hospital, more in tune with American medicine, codes things as financial status, age, blood pressure, and you try to do learning without unscrambling that. It would be insane; the machine learning would actually work in the end, maybe, but you'd be making life very hard for yourself. That's what viewpoint does to images, and you have to unscramble it sensibly.
This can deal with viewpoint in a more principled way. Convolutional nets deal with the translational aspect of it, and people are extending that to deal with rotation and scale, but then you end up gridding a high-dimensional space, and that's hopeless; convolutional nets are essentially using gridding. The proposal here is: we'll deal with the translational aspect, which is the biggest aspect, by gridding again. We'll have capsules, and if you move something over a lot, it will be a different capsule (this will be another nose capsule that captures noses in a very different position), but within a certain region we'll deal with it by varying the activities of things that code the position relative to the center of the capsule. So this is going to be convolutional: it's going to have a 2D grid of these capsules to deal with the big translations, but the other things it's going to deal with in a totally different way. It's going to deal with them by multiplying by these matrices that are independent of viewpoint, and by having devices that can detect the agreement of two incoming things. I've almost run out of time, so I'll show you something like this sort of working, OK, just to show you. This was done in 2014 at Google. What we're going to do is have MNIST digits (because on anything harder it won't work), and we're going to have ten top-level capsules that are meant to cover the whole image, and lots of other capsules, which are all going to be trained by stochastic gradient descent using a discriminative cost function. And I'm showing you the votes that the top-level capsules get from all these other capsules, and there's a lot of votes. The strength of a vote is the size of a circle, so the little dots are tiny votes and the circles are big, strong votes. The colors tell you where these votes come from, that is, which other capsules, which type of capsule, these votes are coming from. You show it that, and you look at the votes you get for
the top ten capsules, the ten digit capsules; I'm just showing you a few of them, and there are other ones over here that don't fit on the screen. What you see is that the big votes cluster tightly for the five; they cluster a bit for the one, but there are fewer of the votes. And if you look at the variance here, the log of the variance, this is tight because that's a big negative number; if you look at the one, it's not such a tight variance, it's a larger variance. The reason you can get big posterior weights on these votes when they're more than a standard deviation away from the mean is that these are votes in a 12-dimensional space and I'm just showing you the first two dimensions, so this guy is actually quite close to the 12-dimensional mean; he's just a bit far in the first two dimensions, but I can't show you 12 dimensions. So this is an example of a system like this basically working: one version got down to about 70 errors on MNIST, and a variant on it got down to, like, 35, no, 37 errors on MNIST. Sara Sabour has a more recent version of this that got down to 25 errors on MNIST, and 25 errors on MNIST is as good as any system does, unless it's done by Jimmy Ba, in which case you might get 22 errors, but you have to add three because it was done by Jimmy Ba. OK, so it can actually be made to work; I didn't tell you all the details. Recently Sara Sabour has a version of this that's got the record on basic NORB. The NORB database has objects that are all the same color, taken from many viewpoints, so it's kind of ideal for this approach; people don't use the basic version much anymore because it's fairly straightforward, but the record on that was 5% and she's beaten it. And we still don't understand exactly what's going on inside this capsule system, so we're going to be able to make it work much better. So it's one o'clock, and, yeah, I have a meeting at 1:30, so I need to leave soon. Do you need to know the
viewpoint? No; you do actually know that information, but it's not used in training. These parameters, you're seeing the first two parameters here, but the other ones too, these are the activation parameters of the five capsule, and they do code viewpoint information. So here's something crazy that people do at present: when they're training on shape recognition, they take the images and they transform them so they get a bigger dataset, and they train their net to get the right class for the transformed image. So they apply transformations that they believe will not change what shape it is, like a shift or a rotation, but they don't tell the neural net what the transformation was. And they can't really tell a standard neural net what the transformation was, because it doesn't have a canonical frame of reference, but they could easily give a neural net pairs and say what the transformation difference was. If you gave it pairs of images and told it the transformation difference, you'd be giving it much more information, much more helpful information. So for these kinds of nets, the right thing to do, if you're going to give them transformed data, is to give them the transformed data and tell them the delta; you're giving much more helpful information. Yes, I think that's just the nature of vision. Now, it's not quite true, because we fixate on interesting places, and so the statistics of fixation-centered images are not uniform, but let's suppose they are, because to first order they are. You want the same knowledge everywhere: your low-level knowledge about edges is the same everywhere, and your high-level knowledge about faces is the same everywhere. So I want the knowledge at every layer of this net to be the same, but what I want is to get that knowledge into a capsule that works over a big range at a high level and deals with the variation using rate coding, by changing the relative activities of neurons rather than by using place coding, by changing
which neurons are active, because once you've got it into rate coding, it's much easier to learn properties of the object. I don't really get the point of the question; it's obvious that if you have a bigger image you'll need more capsules, but I think that just grows linearly with the number of pixels, not quadratically, so I think the number of capsules you need grows with the number of pixels. But if you wanted it to do more things, to have more kinds of things it knows about, you'd need more capsules, so I think it's a good bet that if you want a capsule system that works really well, you're going to need about a billion neurons, because that's what we have. Last question. If this is the right thing to do, then when you train a net full of ReLUs, which has huge room for doing this kind of thing, it's going to do it; ReLUs, of course, are kind of linear, at least in their working range. So the thing that keeps me up at night is this: it's all very well to have this kind of theory, but maybe ReLUs can just fake it. Maybe ReLU nets are doing this, but because you can put everything in and it still works the same, we just can't see that they're doing it. But I think that's a really good question. Thank you.
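Hinton's coarse-coding argument (big overlapping receptive fields beat small tiled ones, with the gain growing as the field size raised to the dimensionality minus one) reduces to a one-line counting formula. This is just a back-of-the-envelope sketch of the argument from the talk, not anything from his actual models:

```python
def coarse_coding_gain(d, k):
    """Factor by which receptive fields that are k times bigger reduce the
    number of neurons needed to localize one point, at fixed accuracy, in
    a d-dimensional space.

    Each field is k times less precise per dimension, but a point now falls
    inside many overlapping fields whose intersection pins it down; the net
    gain works out to k^(d-1), so bigger fields buy nothing in 1D and win
    increasingly fast in higher dimensions.
    """
    return k ** (d - 1)

# The 6D viewpoint example: fields 10x bigger need 10^5 fewer neurons.
print(coarse_coding_gain(6, 10))  # 100000
# And, as he says, it doesn't work in 1D: no gain at all.
print(coarse_coding_gain(1, 10))  # 1
```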
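The claim that convolution is intrinsically equivariant (move the image and the same pattern of hits moves with it) is easy to check numerically. Here is a minimal sketch with a hand-rolled "valid" convolution; the image and filter are random placeholders:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' cross-correlation, enough to show equivariance."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.random((3, 3))

shifted = np.roll(image, shift=2, axis=1)  # move the image 2 pixels right

a = conv2d_valid(image, kernel)
b = conv2d_valid(shifted, kernel)

# Not the same hits, but the same *pattern* of hits: away from the
# wrap-around columns, the feature map shifted by exactly 2 pixels.
assert np.allclose(a[:, :4], b[:, 2:6])
```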
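The place-coding versus rate-coding distinction can be sketched as a toy 1D encoder: a small move changes only the real-valued offset inside one capsule (rate coding), while a big move changes which capsule fires (place coding). The field width and positions below are arbitrary choices for illustration:

```python
def encode_position(x, field_width=10.0):
    """Place + rate code for a scalar position x.

    place: index of the capsule whose receptive field contains x
    rate:  x's offset from that field's centre, carried by real-valued
           activities inside the capsule (the rate coding)
    """
    place = int(x // field_width)           # which capsule is active
    rate = x - (place + 0.5) * field_width  # offset from field centre
    return place, rate

# Small move: the same capsule fires; only the rate-coded offset changes.
p1, r1 = encode_position(12.0)
p2, r2 = encode_position(13.5)
assert p1 == p2 == 1 and r1 != r2

# Big move: the activity jumps to a different capsule (place coding).
p3, _ = encode_position(47.0)
assert p3 == 4
```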
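The "most important picture" of the talk, parts voting for the pose of a whole by multiplying their own pose by a fixed part-whole matrix, can be sketched with 2D homogeneous matrices. All the poses and offsets below are invented for illustration; in a real capsule network the part-whole matrices would be learned:

```python
import numpy as np

def pose(theta, tx, ty):
    """A 2D rigid pose as a 3x3 homogeneous matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])

# Intrinsic, viewpoint-invariant poses of the parts in face coordinates
# (made-up numbers; only the relationships matter).
NOSE_IN_FACE = pose(0.0, 0.0, -1.0)
MOUTH_IN_FACE = pose(0.0, 0.0, -3.0)

def face_vote(part_pose, part_in_face):
    """A part's prediction for the face pose: its own pose times the
    fixed (inverted) part-in-face matrix."""
    return part_pose @ np.linalg.inv(part_in_face)

# Some viewpoint puts the face here relative to the retina ...
face = pose(0.3, 5.0, 2.0)
nose, mouth = face @ NOSE_IN_FACE, face @ MOUTH_IN_FACE

# ... and both parts vote for the same face pose: hey presto, a face.
assert np.allclose(face_vote(nose, NOSE_IN_FACE),
                   face_vote(mouth, MOUTH_IN_FACE))

# A radically new viewpoint changes both part poses a lot, but the
# part-in-face matrices are fixed, so the votes still agree.
V = pose(2.1, 40.0, -7.0)
assert np.allclose(face_vote(V @ nose, NOSE_IN_FACE),
                   face_vote(V @ mouth, MOUTH_IN_FACE))

# Jumble the parts Picasso-style and the votes stop agreeing.
assert not np.allclose(face_vote(mouth, NOSE_IN_FACE),
                       face_vote(nose, MOUTH_IN_FACE))
```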
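The nonparametric Hough-transform idea, checking whether votes agree without gridding pose space, can be approximated by fitting a Gaussian to the weighted votes and scoring agreement by how small the variance is, loosely in the spirit of the log-variance plots described in the talk. The data here is synthetic and the scoring rule is a simplification of mine, not the system's actual routing:

```python
import numpy as np

def vote_agreement(votes, weights):
    """Fit a weighted Gaussian to votes in pose space and return its mean
    and a tightness score (summed log-variance over dimensions); smaller
    score = tighter cluster = stronger evidence the parts form a whole."""
    w = np.asarray(weights, float)
    w = w / w.sum()
    mean = (w[:, None] * votes).sum(axis=0)
    var = (w[:, None] * (votes - mean) ** 2).sum(axis=0)
    return mean, float(np.log(var).sum())

rng = np.random.default_rng(1)
true_pose = np.array([2.0, -1.0, 0.5])

tight = true_pose + 0.01 * rng.standard_normal((20, 3))  # parts agree
loose = true_pose + 2.0 * rng.standard_normal((20, 3))   # clutter votes

_, score_tight = vote_agreement(tight, np.ones(20))
_, score_loose = vote_agreement(loose, np.ones(20))
assert score_tight < score_loose  # tight cluster: strong face evidence
```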
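Hinton's remark that a vote more than a standard deviation off in the two plotted dimensions can still be close to the mean in the full 12-dimensional space is just geometry. A tiny sketch with an isotropic unit-variance vote model (the numbers are invented):

```python
import numpy as np

# Unit-variance Gaussian cluster of votes in 12-D pose space.
dim = 12
mean = np.zeros(dim)

vote_a = np.zeros(dim)
vote_a[:2] = 1.5               # >1 sigma off in the two plotted dims only
vote_b = np.full(dim, 1.0)     # exactly 1 sigma off in every dimension

def sq_mahalanobis(v):
    """Squared Mahalanobis distance under an identity covariance."""
    return float(np.sum((v - mean) ** 2))

# vote_a looks like an outlier in a 2-D plot, but over all 12 dimensions
# it is much closer to the mean than vote_b, so it gets the bigger
# posterior weight under an isotropic Gaussian vote model.
assert sq_mahalanobis(vote_a) < sq_mahalanobis(vote_b)
print(sq_mahalanobis(vote_a), sq_mahalanobis(vote_b))  # 4.5 12.0
```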
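The suggestion near the end, give the net pairs of images plus the transformation delta rather than plain augmented images, might look like this as a data-generation sketch (shifts only, with made-up ranges; the function name is mine):

```python
import numpy as np

def make_delta_pair(image, rng):
    """Instead of plain augmentation (transform the image, keep the label),
    emit the pair *plus* the transformation delta, so a pose-aware model
    can be told exactly how the two views are related."""
    dx, dy = rng.integers(-3, 4, size=2)  # shifts in [-3, 3]
    shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    delta = np.array([dx, dy])            # the extra supervision signal
    return image, shifted, delta

rng = np.random.default_rng(0)
img = rng.random((28, 28))
a, b, delta = make_delta_pair(img, rng)

# Sanity check: b really is a copy of a shifted by exactly `delta`.
dx, dy = delta
assert np.allclose(np.roll(np.roll(a, dy, axis=0), dx, axis=1), b)
```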
Info
Channel: Fields Institute
Views: 17,809
Keywords: Hinton, Geoffrey, machine learning, deep, learning, 2017, fields institute, fields, toronto, canada, neural, network, google brain, google, brain, professor, convolutional, nets
Id: Jv1VDdI4vy4
Length: 62min 29sec (3749 seconds)
Published: Tue Sep 26 2017