Geoffrey Hinton – Capsule Networks

Video Statistics and Information

Captions
Thank you, John, for that nice introduction. Before I start I'd like to say something about Ian Howard. I used to love to visit York and talk to Ian Howard; he was a really great scientist. He always struck me as the closest I'll ever get to meeting a true Victorian scientist, and he loved making things.

Today I'm going to talk about some work that's already published, on capsule networks. The current way to do object recognition is to use convolutional neural nets. They've got multiple layers of feature detectors which you learn discriminatively — that's good. The features are local, and each type of feature detector is replicated across space, so they can deal with objects when they move or are in a different position — that's good. As you go up the hierarchy of features, the feature detectors respond to things in a larger region of space — that's good. The feature layers are interleaved with subsampling layers which pool the information from a bunch of feature detectors, and the aim of that is to get invariance and to throw away positional information — and that's bad. I'll try to tell you why that's bad.

Current neural nets that do object recognition — convnets — cannot generalize well to novel orientations or scales or shears, and they're not dealing in a principled way with the effect that a change of viewpoint has on an image. The biggest source of variation in images is from changing viewpoint. There are other sources of variation, like lighting, but changing viewpoint is a particularly nasty source of variation for machine learning, and I'll try to give you insight into why it's so nasty. The same part of an object shows up on different pixels when you change your viewpoint. For machine learning, that's as if you had two hospitals that coded their data differently: in hospital one you code the data as age, weight, blood type, financial status, and in hospital two you code the data as weight, blood type, age, financial status — you have to get financial status in. Now, if you knew that the same information was going to show up on different input dimensions, you'd be crazy to do machine learning by saying "we'll just ignore that, we'll get lots and lots of hospitals and hope it all averages out somehow." You'd obviously want to unscramble it so that the same dimensions mean the same things. But that's what viewpoint does to images: it puts the same information on different pixels, and the way convnets try to deal with it is by getting lots and lots of images from different viewpoints and gently unscrambling it by pooling. The best they've got for unscrambling is pooling, which says that if a feature occurred anywhere nearby it will eventually activate the same feature detector, because we're pooling. That doesn't seem like a very principled way of dealing with viewpoint, and this whole talk is going to be about how we might deal better with viewpoint.

The one good thing about a convnet is that it does deal with translation. Because you replicate features across space, if I translate the image there will be a feature detector of the same kind in a different position that will respond to whatever feature I saw before. That's not actually invariance — that's equivariance: when I translate the image, I actually translate which feature detectors respond, so the translation is carried through the system.
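To make the invariance-versus-equivariance point concrete, here is a minimal numpy sketch of my own (not from the talk): a one-dimensional "image", a toy edge detector, and whole-map max pooling. The feature map shifts with the input (equivariance), while the pooled maximum is identical for both inputs, so the position has been thrown away (invariance).

```python
import numpy as np

def feature_map(image, kernel):
    """Correlate a 1-D 'image' with a 1-D kernel (stride 1, valid)."""
    n = len(image) - len(kernel) + 1
    return np.array([np.dot(image[i:i + len(kernel)], kernel) for i in range(n)])

edge = np.array([1.0, -1.0])            # a toy 'edge' detector
a = np.array([0, 0, 5, 0, 0, 0, 0, 0])  # feature near the left
b = np.array([0, 0, 0, 0, 0, 5, 0, 0])  # same feature, shifted right

fa, fb = feature_map(a, edge), feature_map(b, edge)
print(fa)   # equivariance: the response pattern shifts with the input
print(fb)

# Max-pooling over the whole map: invariance, the position is thrown away.
print(fa.max(), fb.max())   # identical outputs; no way to tell where the feature was
```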
One problem with convnets is that they don't produce a parse tree for an image, and it seems psychologically realistic to think that when I see a scene I do something like parsing it: I know which parts are which and which parts belong to which wholes. That's one thing convnets don't do. Another thing they don't do is assign intrinsic frames of reference to objects. Many of you are psychologists, so you will know these examples. If you look at the object on the left and I just tell you it's a country, you initially see a kind of reflected Australia; as soon as I tell you it's at a diagonal orientation, you can see that it's Africa. It's very familiar once you get the right orientation, but if you don't get the right orientation it's not familiar at all. If you look at the object on the right, you have two completely different ways of seeing it: you can see it as an upright diamond or as a tilted square, and those are completely different internal representations. A convnet doesn't have anything like that. In a convnet you put the data in, it goes through, you recognize the object, but it's not the case that the very same pixels can lead to two completely different interpretations, because the convnet doesn't do anything like explicitly imposing a frame of reference. I'm going to argue that explicitly imposing frames of reference is what you need to do to deal with viewpoint properly.

Let me try to make this problem even more evident by showing you a little puzzle. I found it in a Christmas cracker many years ago. We're going to take a solid tetrahedron and we're going to cut it in two with a plane, so this is a jigsaw puzzle, but a 3-D jigsaw puzzle, and it doesn't sound that hard. Then I'm going to do an experiment on MIT professors: I give MIT professors these two pieces of a tetrahedron and I say "these make a tetrahedron" — and I check that they know what a tetrahedron is — "can you put these two solid pieces together to make a tetrahedron?" It turns out they can't. It takes them a long time, and the number of minutes it takes them is about equal to the number of years they've been at MIT: the young ones get it in a few minutes and the old ones just don't get it; some of them give up. One MIT professor called Carl Hewitt, who I tried this on a long time ago, after ten minutes proved that it was impossible.

Now you may wonder why this is so hard, so I have a demonstration kit — I think Ian would have appreciated this, because it involved making something. Part of the point of this demonstration is that these pieces are large and clumsy, so they're hard to manipulate, and that makes it much harder. So here we are: these two pieces make a tetrahedron, and they're identical. Here's what people do. They say "well, I've got to put two faces that are the same together, so let's see, I'll put those together — no, that's not a tetrahedron; well, that's clearly not a tetrahedron; that's not a tetrahedron either." And then people do this — I watch them do it — "that's not a tetrahedron, and what about this? That's not a tetrahedron. I don't know — how do you make a tetrahedron? What is a tetrahedron?" And people will spend a long, long time. Just occasionally people get it almost immediately, maybe because they'd seen it before, or maybe because they went to a U.S. public school a long time ago where milk was delivered in tetrahedral cartons and they stacked them in a particular way. So let me show you how you make a tetrahedron, in case some of you haven't figured it out yet: you do it like that.

The question is why presumably intelligent people like MIT professors cannot see this. It's not that hard if you think about it: there are only three different types of face here — the little triangle, the big trapezoid, and the rectangle. Part of it, since you're psychologists, is that it's a real-life Müller-Lyer illusion: people don't actually perceive that face as a rectangle. If you show people this, one side has arrows coming out and the other side has arrows going in, so people perceive it as not quite square, and that helps to explain why it's so hard — there's an illusion involved. But nevertheless, you'd have thought they'd be able to get it in less than ten minutes. Now I'm going to try to explain, apart from the illusion aspect, why it's so hard. The reason is that when you see this object, you impose an intrinsic frame of reference on it: it has three axes, one going this way, one this way, and one that way. It's a rectangular coordinate frame you impose on the piece, and you perceive the piece relative to that frame. You have two of them and they're the same, so you want to align their coordinate frames, which is what you do with reflection, and that doesn't work. If you look at how this piece actually fits into a tetrahedron, the rectangular coordinate frame of the piece is nothing like the coordinate frame you impose on the whole tetrahedron — there's a vertical axis that goes down through the vertex, and then you have to break symmetry on the base triangle somehow to get the other axes — it just doesn't align with the coordinate frame of the tetrahedron.

Now, there's a completely different psychological representation of a tetrahedron that you can have in your brain. You can represent a tetrahedron this way: there's a horizontal edge at the top, there's a horizontal edge at right angles at the bottom, and you put in all the lines that join those two edges, and they make a tetrahedron — it's the union of all the lines joining those two edges. If you perceive a tetrahedron like that, it's a completely different psychological object, and it's then very easy to see that you can cut it with a plane that makes a square cross-section: near the top the cross-section is a long thin rectangle one way, near the bottom it's a long thin rectangle the other way, so by a simple continuity argument, in the middle it had better be a square. The point of this long demo is that the way you see things depends on the rectangular frames of reference you impose on them, and convnets don't explain that at all, because convnets don't do that.

So what I'm going to do is take standard feed-forward neural nets and change them in quite a few ways, in order to try to get something that deals in a principled way with viewpoint. If you think about the current neural network technology we have, it's obviously not exactly like the brain. It's inspired by the brain, but the way the neurons work isn't just like the brain — a lot of it was just made up.
People who came to the field lately somehow think there's something magic about neural nets. We just made it up. For thirty years we used logistic units, and then after thirty years we decided, hey, why not use rectified linear units, and rectified linear units are much easier to optimize. Neither of them is exactly what neurons do, but the important thing to remember, if you're new to neural nets, is that it's all just made up — it's just engineering, and people try things that work. What people doing neural nets have really discovered empirically — and it's a big surprise even to some of them — is that if you take a system with a whole lot of parameters and you optimize it by stochastic gradient descent — that is, you give it a small subset of all the training cases, on that small subset you figure out how to change the parameters to get better answers, you take a small step in that direction or a related direction, and then you do it again for another small set of examples — that works. It sort of doesn't have any right to work, but it really works, and you can actually show it's asymptotically about as efficient as you can be, so even this dumb optimization algorithm is asymptotically efficient and you're never going to beat it by a whole lot — not polynomially. That's what we've really discovered, and now the question is what kinds of systems we're going to apply this stochastic gradient descent to, because the systems we currently use have a lot of problems.

Here's one thing that's wrong with standard neurons: they can't tell if two inputs are the same. A standard neuron is a linear filter followed by a nonlinearity, and you can't make one neuron that, given (1, 1), says yes, given (0, 0), says yes, and given (1, 0) or (0, 1), says no. That's the XOR problem, and the standard solution is to put in a hidden layer that separates out the cases and then recombines things. That works, and it's been accepted as the way you deal with this, but it's not the only way. We could deal with it by having a different kind of neuron — something that can recognize whether two things are the same — and notice that has a very different flavour from the neurons we use at present. A neuron that can recognize whether two things are the same can deal with covariance structure directly. Neurons that are just linear filters are always multiplying an activity vector in the layer below by a weight vector to see if they should be active; that's very different from looking at two activity vectors and seeing if they agree. This talk is about a system that looks at activity vectors to see if they agree, and so, as its primitive operation, can tell whether two things are the same.
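Here is a small, purely illustrative numpy sketch (my own invented names, not anything from the talk or a real library) of the difference in flavour between a standard linear-filter neuron and a unit whose output depends on whether two activity vectors agree.

```python
import numpy as np

def linear_filter_neuron(x, w, b=0.0):
    """A standard neuron: weighted sum of one activity vector, then a nonlinearity."""
    return max(0.0, np.dot(w, x) + b)          # ReLU(w.x + b)

def agreement_unit(u, v, beta=4.0, bias=3.0):
    """A hypothetical unit whose output is high only when two activity vectors agree.
    This is the flavour of operation capsules rely on, not a standard neuron."""
    return 1.0 / (1.0 + np.exp(-(bias - beta * np.sum((u - v) ** 2))))

u = np.array([0.9, 0.1, 0.5])
print(agreement_unit(u, u))                              # near 1: the vectors agree
print(agreement_unit(u, np.array([0.1, 0.9, 0.2])))      # near 0: they disagree
```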
That's one thing wrong with our neurons. They also have, if you're an engineer, surprisingly little structure: you've got weights, you've got neurons, you've got layers, and you've got whole networks, and that's about it, plus some architecture for how the layers are connected. I'm going to suggest that we should actually have something between a layer and a neuron, which is a small group of neurons that I'm going to call a capsule. The idea of capsules is that you do a bunch of internal computation, the capsule has some neurons in it, and the activities of the neurons in a capsule represent different dimensions of the same thing. So if you want to represent a multi-dimensional entity, you have a little group of neurons: if I want to be able to represent ten degrees of freedom, I'll have ten neurons in there that represent the values on the different dimensions, and the way we know those ten values go together, as properties of the same thing, is that they're in the same capsule. I can't just have one big layer of values, because then I don't know which values are values of what — I want to be able to represent several things at once, so I need to package the values, just like you do in a computer.

In the visual pathway, what a capsule will do is learn to represent objects or object parts of a particular kind. It will have associated with it a single logistic unit that says whether this object or object part is present, so it needs to decide whether the thing it represents is there or not — and almost all the time, for almost all things, they're not there; that's how vision is. Then, in addition, it's going to have two other things, and I'm only going to talk about one of them: it will have something that represents the viewpoint you have on this part, and as you change your viewpoint those parameters change. In computer graphics, the standard way to represent viewpoint in 3-D is with a 4x4 matrix, so the capsule will have a little matrix associated with it which says what the viewpoint is, which amounts to saying what the relation is between the camera and the intrinsic frame of reference you've imposed on this part. Notice that you can't say what the viewpoint is — you can't put numbers into this matrix — unless you have an intrinsic frame of reference, and if you put a different intrinsic frame of reference on the same thing, like turning a diamond into a square, you'll get a different representation of the relation between that thing and the camera: the position will be the same, but the orientation in that case will be different. So the fact that we always impose frames of reference suggests that we really are using some way of representing the relationship between the camera and the object part that relies on the frame of reference embedded in the object.

Now, once you go for this idea of a little group of neurons whose different dimensions represent different properties of some object or object part, you have the problem of what happens if two instances of that kind of thing occur in the same part of the visual field. One way to solve that is to make the capsules only respond to a small region of the visual field: low down, when they're representing simple things like edges, they have small receptive fields, but as you get higher up, and you have a capsule responding to a face, you get a serious problem if there are two faces in the same part of the receptive field — if you take two transparent faces and superimpose them, you'll have great difficulty seeing what's going on. So we're going to make a very strong representational assumption: for each part of the visual field, you only have one capsule of each type. If it's representing low-level features it covers a small area; if it's representing high-level features it covers a big area; but you only have one of them. That's the price of being able to use the activities of different neurons to represent where you are on different coordinates: you're binding them together by saying these are different properties of the one thing that's here, and there had better only be one thing there.
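As a minimal sketch of what one capsule carries — a logistic presence unit plus a 4x4 pose matrix, as just described — here is a tiny Python structure. The class name and fields are hypothetical, just to make the bookkeeping concrete.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Capsule:
    """One capsule of one type at one location (illustrative only).

    presence: output of the capsule's logistic unit, P(the entity is there).
    pose:     4x4 matrix giving the relation between the camera and the
              intrinsic frame of reference imposed on the entity.
    """
    presence: float
    pose: np.ndarray          # shape (4, 4)

# A 'nose' capsule that is fairly sure a nose is present at some pose.
nose = Capsule(presence=0.92, pose=np.eye(4))
```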
A consequence of the one-of-each-kind-per-place assumption is that if you ever do get things of the same kind very close together, relative to the size of your receptive fields, your system is going to get very confused. And there's a lot of evidence, called crowding, that in perception you do get very confused when you have things of the same kind close together and you can't put your fovea on one of them: in slightly peripheral vision, if you put two things of the same kind close together, they get all mixed up and are very hard to see. So there's some psychological evidence that this crazy assumption — only one thing of each kind at each place — may actually be true.

Now, how are my capsules going to work? How are we going to recognize objects by recognizing their parts? Let's ignore the front end of the system to begin with; assume you've already got the low-level capsules, and ask how you get to the high-level capsules. The idea is that to detect whether an object is there, you get votes from smaller pieces saying what the pose of the object should be — when I say pose I mean viewpoint, what viewpoint you have on this thing — and these votes come from smaller things whose viewpoints you already have. There will be smaller things that aren't part of the object casting irrelevant votes, voting for crazy things, and there will be the actual parts of the object, which cast votes that agree. Agreement is crucial here: when you see agreement between these little voting vectors, you say there's something really there. What you're doing is looking for a high-dimensional coincidence: think of all the votes as blue dots and the votes that agree as red dots — you forget the outliers because you've got a whole bunch of votes that agree on something, and that's significant. In a cluttered or noisy environment, high-dimensional agreement is a very good filter. If you're listening to radio traffic and you see "New York" a few times, and "September" a few times, and "the 9th" a few times, that should maybe alert you; but if you see "New York, September the 9th" in three different messages, you have a high-dimensional coincidence — a coincidence of the month and the day and the place, with several votes for it — and because it's a high-dimensional space, that's not likely to happen by chance, so you have much stronger evidence than the evidence for "September" or "the 9th" or "New York" alone.

So the idea is that capsules work by doing high-dimensional coincidence filtering, which involves comparing predictions, and this is the central slide of the talk: if you understand this slide, you've understood the main point. We have one layer of capsules where we've already decided whether each is there or not and, if it is there, what its pose is — that is, the relationship between the camera and the intrinsic frame of reference imposed on that part. Suppose we have two parts, a mouth and a nose, we know their relationship to the camera, and the face is just a rigid object. Then if you know the relationship of the mouth to the camera, and you know the relationship of the mouth to the face, you can predict the relationship of the face to the camera — and that's just a matrix multiply: you take the numbers that represent the pose of the mouth and multiply them by a weight matrix. Those pose numbers are activities: as you change the camera position or move the mouth around, all those activities change; there's nothing invariant about them, they're highly variant. But the relationship between the mouth and the face is completely invariant — it doesn't depend on viewpoint. (At this point the microphone cut out briefly.)

So the relationship between the mouth and the face does not change as you change your viewpoint. If you want to get invariant knowledge about objects into a neural net, the place to put it is in the relationship between parts and wholes — that's what's invariant, at least for rigid objects. What's variant is where the pieces and the whole object are and what orientation they're in; you don't want to try to make that invariant. Of course the identity of the part — that little p, the probability that the part is there — you do want to be invariant across some reasonable range. But you've also got all this stuff that's equivariant: as you change the camera or move the object, it all varies, but it all varies together. What you'd like is that the pose of the mouth, combined with the relationship between the mouth and the face, predicts the pose of the face, and the same for the nose; and if those two predictions agree — and there are all sorts of issues about how well they need to agree — you believe there's a face there. Really you'd like to see several predictions agreeing tightly, and you're going to trade off: a few that agree really well is good; a lot that agree not so well is also good. That's how we're going to recognize wholes from their parts. Of course, we're not going to hand-wire any of this: we hope the system can learn all of it by stochastic gradient descent, but we're going to wire up a system in which it should be possible to learn it. The point is that as viewpoint changes, all of these poses change, but the relationship between the pose of the face and the pose of the mouth doesn't change — that's fixed, that's where the invariant knowledge is going to be, and that's in the weights.
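To make the mouth-and-nose example concrete, here is a minimal numpy sketch of how a part's pose times a learned part-whole matrix yields a vote for the whole's pose, and how agreement between votes signals that the whole is present. The specific poses and transforms are made up for illustration; they are not from the published model.

```python
import numpy as np

# Poses are 4x4 matrices: relation between the camera and the part's intrinsic frame.
mouth_pose = np.eye(4)                      # mouth at the origin of the camera frame
nose_pose  = np.eye(4)
nose_pose[1, 3] = 0.4                       # the nose sits a bit above the mouth

# Learned, viewpoint-invariant part-whole relationships (these live in the weights).
mouth_to_face = np.eye(4); mouth_to_face[1, 3] = 0.3   # face frame is 0.3 above the mouth
nose_to_face  = np.eye(4); nose_to_face[1, 3] = -0.1   # and 0.1 below the nose

# Each part votes for the pose of the face: just a matrix multiply.
vote_from_mouth = mouth_pose @ mouth_to_face
vote_from_nose  = nose_pose  @ nose_to_face

# If the votes agree (a high-dimensional coincidence), believe a face is there.
disagreement = np.linalg.norm(vote_from_mouth - vote_from_nose)
print(disagreement)     # ~0 here: the two predictions coincide, so a face is detected
```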
What motivated all this is the question of how you recognize things from novel viewpoints. One approach is to have a lot more training data — the sledgehammer approach. A better approach is to notice that if you take images of the same rigid shape and change viewpoint, the things being imaged jump to different pixels, and that's a big mess if you're going to do machine learning on pixels. If I take an object, image it from one viewpoint, then change my viewpoint and image it again, I can't just average those two images to get an image of the object from an intermediate viewpoint — it's nonlinear; if I average the two images I get two separate ghost images. What we'd like is a space in which things are linear, and you get that linear space by transforming to the space of identified features with coordinates. If I want to blend two people's faces — for example Michael Cohen and John Dean — I wouldn't try blending the images; I'd recognize the relevant pieces and then blend the coordinates, and that will work.

So here's a piece of logic. Graphics programs have no problem dealing with viewpoint: if a graphics guy is showing you an image and you say "show it to me from a different viewpoint", he won't say "oh, I didn't train on that viewpoint, you can't really expect me to do that" — he'll just show it to you from the different viewpoint. And we have no problem recognizing things from novel viewpoints, pretty much. Therefore, by infallible logic, we have graphics programs in our heads — okay, that's the best logic I'll use today. The point is that if you ask what viewpoint does to images in the pixel domain, it really messes them up; but if you can get to coordinates that represent the relation between the camera and a part of an object, then in that domain viewpoint does very simple things to the representation — everything is linear. I'm going to wave away perspective here — it's too complicated for me, so think of it as more like orthographic projection — but assuming everything's nice and linear (and you can make perspective linear if you do some work), we're crazy not to be using that linear structure. It's a simple structure that completely deals with viewpoint. It doesn't deal with other things like non-rigid deformations, but viewpoint, which is the main source of variation, is completely dealt with in this linear way by going to those coordinates. So that's what we ought to be doing.

Now, once we've decided to go to coordinates, how are we going to do segmentation? If you find a bunch of parts, you might find a part that is, for example, a circle, and that circle could be the left eye of a face, or the right eye of a face, or the front wheel of a car, or the back wheel of a car — you don't know for sure what it is. So you're going to have to cast a bunch of votes about what it might be and then look for coincidences to decide what it is. We assume that every part you find belongs to at most one whole — it might be an orphan, but if it belongs to a whole, it's at most one of them. What a part has to do is vote for various wholes and then look to see whether its vote landed inside a cluster of other votes or whether it was an outlier. Initially it won't really know: you've got this circle and it doesn't know whether it's part of a face or part of a car, so it scatters votes over several possibilities; then it looks to see whether those votes agreed with votes coming from other places, and if they agreed it says "I'm okay, I should focus my votes there." So we have an iterative process that does segmentation: initially you have rather vague votes — votes that propose a pose for each candidate whole but aren't very confident — and each part spreads its votes over many things; then you get some top-down feedback from the high-level capsules that says "please send more of your vote to me, because you agree with my cluster" or "please send less of your vote to me, because you're an outlier."

We do a few iterations of that — we actually do three — and the system quickly settles down to sending the vote from a part to a capsule that's receiving other votes that agree with it. It settles down surprisingly fast. Here's the picture: for each part we have what we call a routing softmax — think of it as "with what probability should I belong to each of these higher-level capsules?" — and the sum of those probabilities should be one, assuming you're not an orphan. Initially the probabilities will all be small, and for the particular part we're looking at on this slide, it might send a vote to capsule K that agrees with the cluster of other votes arriving at capsule K, while at capsule J its vote is an outlier. The reason you can be an outlier for one capsule and an inlier for the other is that when you send your vote, you multiply your pose by the part-whole relationship between the part and the whole, and that's a different part-whole relationship for the two capsules: the two higher-level capsules have different views of this part, because they see it filtered through different part-whole relationships. So a bunch of parts that are seen as all agreeing by capsule K might be seen as disagreeing by capsule J. That's why this converges much faster than other clustering algorithms. You can think of the high-level capsules as being like the cluster means — like the clusters — and the lower-level capsules you've already discovered as being like the data, and you're trying to explain this data in terms of those clusters. You're actually going to run the clustering algorithm during perception, which sounds crazy, but it settles very fast, because each of the higher-level clusters has a different view of the data — it sees the data filtered through a different part-whole relationship — and that's why it converges in about three iterations instead of fifty.

The objective function for the routing is to get the active capsules in one layer to explain the capsules in the layer below, and what we're doing is actually a version of the EM algorithm for fitting a mixture of Gaussians to data, but it converges much faster because each component of the mixture has a different view of the data, which breaks symmetry very strongly. I'm not going to go into the details; they're all in a published paper that was at ICLR in 2018, and I'll give the reference at the end — it's mainly for a machine-learning audience. The important thing is that there's an inner loop that happens during perception: you find these clusters and send information back saying "please send me more" or "please send me less", and it settles down to nice clusters, and that's what does the segmentation — finding the wholes to combine these parts into; it solves the assignment problem of assigning parts to wholes. That's not the learning algorithm; that's just an inner-loop process that decides which parts get combined into which wholes. Then, on the outside, there's a learning algorithm that's trying to make the clusters tighter, or trying to make them less tight. It's important that sometimes the learning is trying to make them less tight: if the thing you found is part of some larger thing that you don't want to see in that image — because you're told it's not there — you want the votes for it to cluster less well.
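Here is a toy, heavily simplified numpy sketch of the routing loop just described. The published system uses EM routing with Gaussian clusters and per-capsule activation logic; this hypothetical version just reweights each part's routing softmax by how close its vote lies to each higher-level capsule's current mean, to show the shape of the three-iteration inner loop.

```python
import numpy as np

def route(votes, n_iters=3):
    """Toy routing-by-agreement (simplified; the ICLR 2018 paper uses EM routing).

    votes: array of shape (n_parts, n_wholes, d) -- each part's predicted pose
           (flattened) for each higher-level capsule, already multiplied by the
           learned part-whole transformation matrices.
    Returns routing probabilities (n_parts, n_wholes) and whole poses (n_wholes, d).
    """
    n_parts, n_wholes, d = votes.shape
    logits = np.zeros((n_parts, n_wholes))               # routing logits, start uniform
    for _ in range(n_iters):
        # Routing softmax: each part distributes its vote over the wholes.
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Each whole's pose is the probability-weighted mean of the votes it receives.
        means = (probs[:, :, None] * votes).sum(axis=0) / probs.sum(axis=0)[:, None]
        # Top-down feedback: votes close to a whole's mean get a larger share next time.
        dist2 = ((votes - means[None, :, :]) ** 2).sum(axis=2)
        logits = -dist2
    return probs, means

# Two parts agree about whole 0 and disagree about whole 1.
votes = np.array([[[1.0, 1.0], [0.0, 5.0]],
                  [[1.1, 0.9], [4.0, 0.0]]])
probs, means = route(votes)
print(np.round(probs, 2))    # both parts end up routing most of their vote to whole 0
```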
And so we're going to do discriminative learning for all of this. The high-level decision about what the object is sends back information saying "you're claiming it's a cat, but actually it's a dog, so please don't see this cat's ear here", and information comes back that says make the votes for the cat's ear be less well clustered and make the votes for the dog's ear be better clustered. Similarly, the relationships between parts and wholes are learned discriminatively. If we did unsupervised learning in this version of capsules it wouldn't work, because the system would say "I know how to get really good clusters: I'll just make everything vote for zero" — if everything uses transformation matrices that throw the pose away and vote for zero, you get really tight clusters, but you haven't really explained anything; it all collapses. That's the problem you get with unsupervised learning as soon as you try to minimize squared distances in a domain you completely control. If you minimize squared distances when you reconstruct the input, that's fine, because you don't control the data — there's real data and you're trying to make predictions that fit it. But if you've got two different things making predictions that you want to agree, that's really easy to game if you train it just to fit, so you've either got to train it discriminatively so it doesn't collapse, or you've got to have a better objective function.

So what we do is this: after we've established the low-level capsules, we do the three iterations to do the assignment; after those three iterations we've established the next layer of capsules; then that's finished and we go to the layer above, and so on. We do it greedily, in the sense that once we've established the parts we don't change their poses — we don't use top-down information to revise our opinion about where the parts are — so we're doing it greedily, a layer at a time, but for each new layer there are three iterations of routing. Then, to do the gradient descent, we just unroll that routing — we've been doing it in something like TensorFlow — and back-propagate through everything, changing everything so as to get the right answer.

Here's a little proof of concept. We thought that once we got this working we could scale it up to something bigger, and I'll come to why we haven't done that yet. It's a task created by Yann LeCun: you take little plastic toys that you buy in a toy store, and for each object class you have five examples that are the training instances — five little plastic cars are the training cars — and five different ones that are the test cars; different physical objects. You paint everything green, you put it on a turntable, you have lots of lights, and you image it from many viewpoints, both in azimuth and in elevation, and with different lighting conditions. So you make yourself a big data set of these five instances of each of the five kinds of object, viewed from many viewpoints, but what you have to do at test time is recognize a new instance of that kind of object: you might be trained on elephants and crocodiles and have to recognize a hippopotamus. Here are some examples of what the images look like when they've been downsampled a bit — images from many different viewpoints of cars and trucks and animals and airplanes and people. This was done in the United States, so the concept of a person includes that they're holding a weapon: every single instance of a person is holding a weapon; that's how you tell people from animals.

Now, I've talked about how you deal with geometric relations, but think about computer graphics: you take an object, from the pose of the whole you figure out the poses of the parts, and you go down a hierarchy like that until you get to little triangles that describe the surface of the object, and then you have to render it. Up to that point it's all geometry and has nothing to do with light or reflectance, but then light comes into it. You're not really interested in light — it's just a way of seeing things; you're interested in what's there, unless you're an artist — and then you render. We need to invert the rendering process, and that's very different from inverting the geometric aspects, which is what I've been talking about. So we need the bottom level to de-render, and here's how we do it: we take an image and apply a bunch of filters, so at each point in the image we have a stack of 128 different 5x5 filters; then at each point we take that big vector of activities, together with the nearby activities of this stack of 128 filters, and centred at that point we have 32 different types of primary capsule. For each of these primary capsules, we decide from the filter outputs whether it's there or not — with what probability it's there — and what its pose is: a little 4x4 matrix that describes its pose. This is a very primitive way of doing de-rendering; we're working on much better ways, by actually taking a renderer and back-propagating through it to train a de-renderer, but for now we just use this, and it works well enough.

Once we've got the primary capsules, we have some more layers of capsules, but the whole thing is convolutional: each capsule has pose parameters that give the precise position and orientation and scale and so on, but the capsule is replicated across space so that you can see two eyes at once — two eyes right on top of each other would confuse it, but two eyes that are separate you can see both at once. Making the whole thing convolutional means it deals with translation just by replication — that handles coarse translation — while fine-scale translation within a receptive field is handled in the pose matrix. So it has a completely different way of dealing with fine changes in position and with coarse changes. It's a bit like a mobile phone: as you move around, the same cell tower deals with you — that's like your capsule — and after a while you get handed off to another cell tower — that's when you get represented by a different capsule. After the primary capsules there are some more layers of capsules, and at the top there are capsules that represent individual classes, so there are five of those: animal, person, truck, plane and — yes — car.
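A rough sketch of what the front end just described might look like, written in PyTorch for concreteness (the talk mentions TensorFlow). The numbers — a stack of 128 5x5 filters, 32 primary capsule types, a 4x4 pose each — come from the talk; the layer structure and everything else here is my assumption, not the published architecture.

```python
import torch
import torch.nn as nn

class PrimaryCapsules(nn.Module):
    """Sketch of a primary-capsule layer in the spirit of the talk.

    A stack of 128 5x5 filters is applied across the image; from those
    activities, each of 32 capsule types gets a presence probability and a
    4x4 pose matrix (16 numbers) at every position.
    """
    def __init__(self, n_types=32):
        super().__init__()
        self.features = nn.Conv2d(1, 128, kernel_size=5, stride=2)    # the filter stack
        self.presence = nn.Conv2d(128, n_types, kernel_size=1)        # one logit per type
        self.pose     = nn.Conv2d(128, n_types * 16, kernel_size=1)   # 4x4 pose per type
        self.n_types = n_types

    def forward(self, image):
        h = torch.relu(self.features(image))
        p = torch.sigmoid(self.presence(h))                           # presence probabilities
        b, _, height, width = p.shape
        m = self.pose(h).view(b, self.n_types, 4, 4, height, width)   # pose matrices
        return p, m

# Push one fake 32x32 grey-scale image through the layer.
p, m = PrimaryCapsules()(torch.randn(1, 1, 32, 32))
print(p.shape, m.shape)
```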
This is a well-studied benchmark, so you can look at how well various systems do. A standard CNN, without a lot of work put into it, gets 5.2 percent error on the test data. The best CNN in the literature that we could find gets 2.56 percent; that was done by Cireşan, working with Schmidhuber, with lots and lots of preprocessing to get the best performance they possibly could. A previous version of capsules that we published at NIPS got 3.6 percent, and with these capsules that have the little pose matrices and use this clustering algorithm we get 1.8 percent error. Now, this data set is kind of ideal for testing this idea, and we had better be able to beat the opposition on this ideal data. If you add cluttered backgrounds, that confuses our system; we still do slightly better than the best CNN, but not a lot better. If you look at extrapolation to new viewpoints, you can train on a limited range of azimuths and then test on azimuths outside that range, or train on a limited range of elevations and test on elevations outside that range. Because capsules do better than CNNs anyway, what we do is take a CNN, train it to completion on the limited range of data, and then take our capsule system and train it only until it gets the same performance as the CNN. With a limited range of azimuths, the CNN trained on the familiar viewpoints got 3.7 percent error; we trained the capsules to get the same; then we tested on the extrapolated viewpoints, and the capsules generalized a lot better than the CNN — showing not just that capsules work, but that they generalize better. We did the same with elevation, and again they generalized a lot better. It's not as much better as I'd like, but it's definitely much better.

Now, if you're an old-fashioned computer vision person, you might ask why this isn't just a Hough transform, because in computer vision they have Hough transforms for finding parts that are related. Well, it is just a Hough transform, but it's what you might call a nonparametric Hough transform. In a Hough transform you make a big array and each part votes in this array; if it's a degenerate part — if it doesn't have all the degrees of freedom — it puts a big streak of votes in the array, and then you look for intersections. The problem is that if you want six degrees of freedom you'd need a six-dimensional array, and that would be hopeless, so it's normally only done for two degrees of freedom, or sometimes three. What we're saying is: suppose you could learn parts that have all the degrees of freedom of viewpoint; then a fully specified part can cast a point vote in the space of poses for the whole. You don't need an array where you scatter lots of votes — the part makes an unambiguous point vote — and instead of a big gridded array for the viewpoint of the whole, we just have the votes that come from the parts we happen to have, and we look for clusters among those votes. That's how we get over the problem of having to tile a six-dimensional space. Of course, we convolutionally tile this business of looking for clusters of votes over the image — for large things it's a coarse tiling — so we're dealing with the basic problem of Hough transforms in two ways: one is by not gridding the space, just using convolution for two of the degrees of freedom, and the other is by making sure our parts have enough degrees of freedom to make a point vote. It's very hard to do that by hand unless you're David Lowe — that was the point of SIFT features, to let you do Hough transforms — but it got forgotten when people learned about machine learning: they threw away the poses of the SIFT features and just did dumb machine learning on bags of SIFT features. In that sense machine learning was kind of a disaster for vision, because it threw away all the good old-fashioned geometric vision, which is what we need to deal with viewpoint properly — I enjoy saying that.

We've tried to scale this up to bigger images, and there's a hardware and software problem: all the hardware and software designed for neural nets is optimized for big matrix multiplies, and when we try to scale this up we run out of memory right away, because the software keeps too many copies of things. We need software that does automatic differentiation properly, because back-propagating through these unrolled loops and computing it all by hand is a pain. So at present there's a technical difficulty in making it scale up. But there are also some much more serious problems than those practical ones — all sorts of things are wrong with what I just told you, and I'll go over the main ones. One I've already mentioned: you'd like to do unsupervised learning to learn all this structure and just use a little bit of supervised signal when you get labelled examples, but if you try that, the transformation matrices all collapse and all the votes just predict the origin and say "hey, I agree really nicely" — because you're trying to optimize something in a space you control, as opposed to a space where the data is fixed. The second problem is under-specified poses: if you have a circle, you don't know what orientation it's in, so you can't predict the orientation of the whole. It's easy to predict a part from a whole, but the part might be degenerate: if you imagine a constellation of stars, from the constellation you can predict where the stars are, but from an individual star you can't predict much about the constellation, because you don't know the orientation. The third major problem is that Sara Sabour, who made this all work, had to put a lot of work into tuning the whole system — the learning rates of various things — because to say that a cluster is there, you need to have found a cluster with multiple points close together, and obviously there's a trade-off between how many points and how close, and that trade-off has to vary during learning. At the beginning of learning the clusters are poor, because you haven't learned the transformation matrices properly yet, and you can't just say there's nothing there — you have to allow it to say the cluster might be there even though it's a very poor cluster — and as learning goes on you need to let all that change, so it gets much more demanding about how tight a cluster has to be. Just getting all that to work was painful — like, months painful.

Before I finish I want to show you capsules doing segmentation. This is an earlier version of capsules that works in a slightly different way, but I want to show you that they can do quite impressive segmentation using the same general ideas. These are overlapping MNIST digits: what you're seeing in white is what the computer sees, and it's made by superimposing two digits with a small offset. What you're seeing in colour underneath is the best two digits found by the capsule system — its top two bets — with one drawn in red and the other in green, so the overlap is yellow. You can see that it did actually manage to see the two digits, and in cases like the one in the bottom row, second from the right, where it's quite hard to tell what's going on, it correctly perceives the digits that are really there. Of course there are cases where it gets it wrong, which I'm not showing you, but it gets about half the number of errors that the best convnet can get. I should say the systems are trained on overlapping digits, so it's not that it was trained on single digits and, the first time you showed it an overlapping pair, said "hey, there are two digits there" — these are trained to see two overlapping digits. If you don't do that, a convnet can't do it at all.

And now I'm done. There are three published papers on capsules: an early one called "transforming autoencoders" in 2011; one at NIPS that does the segmentation of digits I just showed you; and the thing I talked about today, which is in ICLR 2018 — that's matrix capsules. There's a new version of capsules we're working on that I'd hoped to talk about today, but it doesn't work yet, and I don't want to talk about something until it works; we're hoping it will be at NIPS 2019. Okay, I'm done. [Applause]
Info
Channel: Tsotsos Lab
Views: 16,913
Keywords: geoffrey hinton, york university, ic@l, instut, centre for innovation in computing at Lassonde, lassonde school of engineering, capsule networks, capsule nets, google, vector institute
Id: x5Vxk9twXlE
Length: 54min 44sec (3284 seconds)
Published: Thu Mar 07 2019