GLOM: How to represent part-whole hierarchies in a neural network (Geoff Hinton's Paper Explained)

Captions
Hi there! Today we'll look at "How to represent part-whole hierarchies in a neural network" by the legend himself, Geoffrey Hinton. He describes a system, known as GLOM, that is a new approach to processing visual information using neural networks. Interestingly, the paper starts off by saying "this paper does not describe a working system," so this is an idea paper: Geoffrey Hinton's suggestion for how we should go about solving vision, or furthering vision, in the AI community. He says openly that these are just ideas; please prove him right, prove him wrong, try them out, and so on. I absolutely welcome this. Idea papers are something I think we have lost as a community, because everything needs to be state of the art and so on. This is super cool, and I encourage more people to do it. I'm not saying you're going to have the same kind of success with an idea paper as Geoff Hinton, who is banking on his name in large part here, but nevertheless it's just an arXiv paper. I see people complaining that this would never be possible if it weren't Hinton, and yes, people wouldn't pay as much attention, but you're welcome to write up your ideas and post them on arXiv, or write a blog post, or make a YouTube video. Everyone has opinions, so go ahead.

So, to the paper itself. GLOM (the name stems from "agglomeration") is a system that, as the abstract says, "presents a single idea about representation which allows advances made by several different groups to be combined into an imaginary system called GLOM. The advances include transformers, neural fields, contrastive representation learning, distillation and capsules. GLOM answers the question: how can a neural network with a fixed architecture parse an image into a part-whole hierarchy which has a different structure for each image? The idea is simply to use islands of identical vectors to represent the nodes in the parse tree. If GLOM can be made to work, it should significantly improve the interpretability of the representations produced by transformer-like systems when applied to vision or language." That's the abstract. We'll dive into the system and see what it's about; I think I can actually make a suggestion to improve it, though maybe other folks are way ahead of me.

So what is the GLOM system, what are these parse trees about, and why does it combine all of these things? Hinton has two core diagrams in the paper, and at first sight they have little to do with each other, so let me try to approach it like this. Hinton looks at vision very much in these terms: you have an image or a video, and you want to parse the image into a tree, where the tree is a tree of objects and their parts. Let's say it's an image of a car; the whole notion is very object-centric. A parse tree for this image would look something like this: the whole thing is a car, so that's the top node of the parse tree. The car has different parts, namely a cabin, a motor, and wheels, and those sit one level down in the parse tree. The cabin itself has segments, say windows and a door area, and so on. So what we want to do is look at an image and create this parse tree over it. This is very much in the territory of
GOFAI: good old-fashioned AI, the people who want to understand the world in terms of symbolic representations and the relations of these symbols to each other. However, what Hinton is saying is that you can't really do this directly with neural networks; neural networks are continuous, and so on. In addition, we know that the brain doesn't reconfigure itself every single time you get a new input. Even though the brain has some neural plasticity, while you look at the world and do inference, the connections stay the same. So we need a system that, when we input one image, gives us one parse tree, but when we input another image, gives us some other parse tree; maybe now there are two objects in the image, and one has a single descendant, which in turn has two descendants, and so on. You see the point: the tree structure needs to be different each time.

This was addressed in part by Hinton's capsule networks. In capsule networks, Hinton's idea was roughly: I'm going to have lots of capsules arranged in layers. (I'm going over capsules because they're kind of important here.) The first layer of capsules would recognize the smallest parts, so there would be a wheel capsule, a window capsule, and so on: a single capsule for every part that could possibly appear in an image. You can already see the limitation, because if you want to recognize the whole world, you need very many capsules; but nevertheless, this was the idea. A capsule would be active if its object was present in the image. The next layer would have, say, a motor capsule and a cabin capsule, so the window capsule would activate the cabin capsule, the door capsule would also activate the cabin capsule, the wheel capsule at its level would activate its appropriate parent, and then all of these would activate the car capsule. So this parse tree is generated dynamically: the routing between capsules is computed anew every time, so in the next image different capsules are activated, different things are routed together, and the parse tree is different. However, you need very many capsules for that, one capsule per possible part in the image, which was just infeasible, and the routing in capsule networks was also very cumbersome.

So here we go with a new approach, which Hinton describes as follows: "The GLOM architecture is composed of a large number of columns which all use exactly the same weights. Each column is a stack of spatially local autoencoders that learn multiple levels of representation for what is happening in a small image patch." So let's build up some imagination here. At the bottom we have our image; imagine it lying flat on the ground. It is divided into pixels, or small patches, whatever you want; these are called locations. (I am not good at perspective drawing, in any case.)
Above each location there is one of these columns, and each column is divided into multiple levels: a bottom level, a middle level, higher levels, and so on. Hinton suggests that about five levels should do. Every level of a column tries to represent the location in the image below it, but at a different resolution. Let's say the image is actually of a cat, and at this location there is part of an ear. The very bottom level might represent the fine structure of the fur: what's going on at the micro level, really at the location level. The next layer represents what's going on at this location in a broader sense; it might recognize that this is actually part of an ear, so it goes beyond the single location. If you're thinking of convolutional neural networks, you're in the right ballpark, but we're going to implement this differently. The next layer recognizes that this location is part of a cat's head, and the next one that this location is part of a cat. There is a cat at other places too, but at this location there is a cat, and so on.

Now consider a different column, say one over the cat's chin. In its top layer it might also say "there's a cat", but below that it would say "this is part of a cat's neck" or "this is a chin", and at the bottom "this is the fine fur structure of the chin". You get the idea: every column builds up these representations, and they are embedding vectors. At the bottom of one column you'd have the fur vector, above it the ear vector, whereas over in the other column the chin vector at the same layer would be a very different vector. The only thing that agrees between the two columns is the cat vector: the cat vector in the top layer would agree between both of these columns. So you have a column above each location, and every level in the column represents that particular location, but at a different level of abstraction, considering more and more of its neighbors as you go up.

The question is: how does a column consider its neighbors, and how do you learn these different abstractions? That's where the columns communicate with each other. Hinton imagines this as a process over time, where the columns iteratively communicate with each other, and within a column the layers communicate with each other. That is what the first of the two diagrams shows: one single column over time, with, say, the fur at the bottom, the cat's ear in the middle, and the cat at the top.
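Before going on, here is a minimal sketch of how one might store this structure, assuming one D-dimensional embedding vector per (level, location) pair. The sizes and names are my own inventions for illustration, and the positional side input shown here is something Hinton only introduces a bit further below.

```python
import numpy as np

# Hypothetical sizes: five levels (as Hinton suggests), a 16x16 grid of
# locations, and 64-dimensional embedding vectors.
L, H, W, D = 5, 16, 16, 64

# One embedding vector per level per image location: the "columns".
embeddings = np.random.randn(L, H, W, D).astype(np.float32)

# Side input: a fixed positional encoding per location, so that columns
# sharing the same weights can still tell where they are in the image.
ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
positions = np.stack([ys, xs], axis=-1)  # shape (H, W, 2)

# The column above location (i, j) is the stack embeddings[:, i, j, :],
# running from the most local level 0 up to the most abstract level L-1.
column = embeddings[:, 3, 7, :]  # shape (5, 64)
```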
The embeddings are updated by sending information around. Every single vector, at every layer of every column, is updated as (roughly) an average of four contributions. The embedding at layer l, location x, at time step t+1 is a combination of the following parts. First, the embedding at the same layer and location at the last time step, so this is sort of a recurrent neural network: the new embedding starts from the old embedding. Second, a top-down contribution: the output of what Hinton calls the top-down function applied to the embedding at the same location, one layer above (l+1), at time step t. Third, a bottom-up contribution: the output of the bottom-up function applied to the embedding at layer l-1 at the same location at time step t. (The fourth part, attention, comes in a moment.)

That's what you can see in the diagram. The green arrows mean each layer simply passes its embedding on to the next time step: if nothing else happens, you just keep your embedding. The blue arrows mean every embedding also sends itself through a neural network to the layer above itself. Every arrow here is a neural network, except possibly the green ones (though those could be too). This is intuitive: the ear embedding sends information about itself upward, saying "hey, I'm a cat ear", and it goes through a neural network because it needs to be transformed; the network has to learn that if there's a cat ear at this level, there might be a cat at the top level. And lastly, every layer sends information down: those are the red arrows, also neural networks. The cat ear says "I'm a cat ear, so downstream of me there might be some fur structure". All of these embeddings try to predict each other, they try to predict their neighbors, and Hinton's idea is that by aggregating over time they will reach a consensus about what is in these columns.

A few things are missing here. One, which Hinton points out himself, is that all of the different columns use the same weights; he discusses at the end of the paper that this is not really biologically plausible, but that there's an ensemble effect (we won't go into that). So the blue arrows are the same for each time step, though not necessarily the same between different layers: the function from layer l to layer l+1 might differ from the function one level further up. However, the function passing information from layer l to layer l+1 is the same in every single column across the image. It's a bit like a convolutional network in terms of weight sharing; you can imagine it as a 1x1 convolution in that sense, except the information does not only go up the layers, it also goes down the layers over time. As I said, this is an iterative procedure that goes up, down, and laterally.

The second missing thing: you might ask, if every single column has the same weights, how can you localize any information? The answer is that you have a side input, like in a neural field: a side input annotating each location, basically a positional encoding. So in addition to what the image patch looks like, each column also gets its x-y coordinates, or its coordinates relative to some other coordinate frame, so the network knows where it is. That is going to be important.
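To make the update concrete, here is a minimal sketch of one such step for a single location, assuming simple one-layer networks for the top-down and bottom-up functions and leaving the attention contribution aside for now (it comes next). The weights, the tanh nonlinearity, and the equal averaging weights are all my own stand-ins, since the paper leaves these open.

```python
import numpy as np

D = 64
rng = np.random.default_rng(0)

# Stand-ins for the learned networks; in GLOM these would be deeper nets,
# shared across all columns, and the top-down net would also receive the
# positional side input described above.
W_td = rng.normal(size=(D, D)) / np.sqrt(D)  # top-down: level l+1 -> l
W_bu = rng.normal(size=(D, D)) / np.sqrt(D)  # bottom-up: level l-1 -> l

def update_embedding(x_prev, x_above, x_below):
    """One time step for one level at one location: average the previous
    state with the top-down and bottom-up predictions."""
    top_down = np.tanh(x_above @ W_td)   # red arrow: prediction from above
    bottom_up = np.tanh(x_below @ W_bu)  # blue arrow: prediction from below
    return (x_prev + top_down + bottom_up) / 3.0  # green arrow keeps x_prev

x_new = update_embedding(rng.normal(size=D), rng.normal(size=D), rng.normal(size=D))
```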
It's important because what Hinton wants to build are these islands. His imagination is that somewhere along the way, say after time step 10 when you want to run a hundred, islands will emerge. Imagine the image is a 1D line of locations down at the bottom (or imagine the columns in 2D, whatever fits your brain better). The bottom vectors will happily describe whatever is at the very bottom level at each location. But at the next level, at a higher abstraction (or lower resolution), there must necessarily be vectors that are the same across neighboring columns if the system works, because they now describe objects that are larger than one location. The cat's head is larger than a single location, so at the layer that represents the cat's head, you expect the embedding of the cat's head to be the same in the different columns; after all, the up and down functions at a given layer have the same weights everywhere. If the system works, this must be the case, and as you go up you expect more and more of these "islands", as Hinton calls them, to emerge: regions of columns that agree.

The idea behind all this message passing is that over time these things reinforce each other. We looked at a column before: a vector gets information from the top saying "hey, there's a cat here, so you might be a cat ear or a cat eye or something like this"; it gets information from the bottom saying "there's a bit of fur here and some cartilage showing"; and it has already sort of figured out that it might be an ear. These pieces of information reinforce each other: "You're saying I'm part of a head, you're saying there's fur and cartilage, and I already noticed I'm a bit like an ear, so I'm probably an ear." Over time, you get a consensus algorithm.

One thing is still missing: how do the different columns communicate with each other? The missing fourth part of the update, call it A, is an attention mechanism across the other columns at the same layer. So a cell receives information from above, from below, from itself, and, via attention, from the other embeddings at the same layer. You can see Hinton puts in everything we've got. He simplifies the attention, though. Here are the four parts: "At each discrete time and in each column separately, the embedding at a level is updated to be the weighted average of four contributions: the prediction produced by the bottom-up neural net acting on the embedding at the level below at the previous time; the prediction produced by the top-down neural net acting on the embedding at the level above at the previous time; the embedding vector at the previous time step" (these three we have
got and then the attention weighted average of the embeddings at the same level right at the same level in nearby columns at the previous time so nearby he oh sorry he later backpedals a bit i think on nearby and what nearby exactly means and he at some parts so this this is idea i think this is still up for debate and this is i think where i can help um but what he wants to do is he wants to aggregate he wants to attention aggregate and he wants to simplify attention so instead what we usually have is we're going to produce queries and keys and values queries keys and values and they're all going to be different functions of our input and then we're going to do query times key transposed softmax of that times value and that is going to be our tension mechanism that allows you know arbitrary information to be routed around and so on hinton says nope what i want is simply that all the queries the keys and the values they're all just equal to the embeddings themselves so the attention mechanism would work out to be the soft max of x times x transposed times x and what that does is if you yourself are the query and every vector also itself is the key what do you attend to you attend two vectors that are very similar to yourself and you can see that in hinton's diagram the one we circled dark blue what would it attend to well it would probably attend to its left-hand neighbor the one you can see circled i'm gonna circle it this one it will probably attend a lot too this one it might not attend so much and the ones over here it might not attend at all what does this give us especially since the values are also these vectors this is a consensus algorithm it is not meant as a way to pass information around it is not meant like in a transformer as a way to do computation because we have no trainable weights in this process it is simply meant as a consensus algorithm so it imagines that by doing this by sort of attending to things that are similar to you and then integrating their values there will be these islands forming and that's what you see right here you can imagine if two vectors are already close at the same layer this mechanism will make them even closer so this is a sort of a clustering algorithm and so the my question is that these drawings you look at them they are very specifically constructed they are constructed such that a parse tree is emerging so when you look at this you have a clear sense i can probably i can probably move all of that crap out of the way you can see the parse tree right because the black thing is going to be the top node right here let's leave away the scene level embedding for now the black thing is going to be the top note and then it has two child notes this one and this one and then it has four every one of those has two child notes but it's not it doesn't have to be in this case so this dynamically and every one of them you know the black ones are individual this is dynamically constructing a parse tree right the parse tree here is something like this and then [Music] so this is pretty cool but it is also drawn deliberately such that a core problem does not arise and the core problem would be something like well um what if this vector here was actually also pointing like this okay so it is not in it is not in the same it is not in the same area of the parse tree right if you go down the parse tree it is actually here now if we do what hinton says and if for this vector here we do this aggregation via attention on the same layer what we will attend to is 
Now, my issue is that these drawings are very deliberately constructed so that a parse tree emerges. When you look at the figure, you can see the parse tree clearly (let's leave the scene-level embedding aside): the black island is the top node, it has two child nodes, and each of those has two child nodes, though of course it doesn't have to be binary like this; the islands dynamically construct a parse tree. This is pretty cool, but the figure is also drawn deliberately such that a core problem does not arise. The core problem would be: what if a vector at some lower level actually pointed in the same direction as a vector that is not in the same branch of the parse tree? If we do what Hinton says and aggregate via attention on the same layer, we will attend to that vector over there, even though it is not on the same path of the parse tree. It can represent the same thing, but it's in a different branch. He mentions this a little bit throughout, but not very clearly, and the drawing makes it seem like there is no problem. But I hope you can see how this is a problem: the attention would pull in information from over there, while the parse tree and the islands at the top layer say that these two things should be parsed, and therefore processed, independently of each other.

So here is my suggestion to extend this (maybe Hinton has already thought of it): modulate the same-layer attention by how close two things are in the parse tree. For a given vector, how much should it attend to another vector? A lot, if that vector agrees with it (the softmax of the inner product is high) and it is in the same branch of the parse tree; that's perfect. A vector that doesn't agree but is in the same branch could potentially come to agree later through the consensus algorithm. But a vector that points in the same direction while sitting in a different branch of the parse tree should probably not be attended to very much. Not zero, because the branches on top could still change, and by sending information there you might make the top structure agree more with your branch; but much less.

Concretely: don't just take the softmax of the inner products at the current layer; sum the inner products up the hierarchy. (Hi, it's future Yannick. I made some mistakes coming up with this on the spot, so let me write it down properly.) The sum belongs inside the softmax, so that you aggregate the distributions in log space and the softmax remains a valid distribution, and each term is weighted by lambda^k, where k runs from 0 at the current layer all the way up the stack. If L is the total number of layers and l the layer you're currently at, the attention becomes softmax( sum_{k=0}^{L-l} lambda^k x_{l+k} x_{l+k}^T ), still applied to the values at the current layer. With lambda smaller than one, the agreement at the current layer contributes most strongly, one layer up a bit more weakly, the next one even more weakly, and so on. You'd still have essentially the same mechanism as Hinton is suggesting, but controlling for the fact that things can be in different branches of the parse tree.
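In code, that suggestion might look like the following sketch. To be clear, this is my speculative extension, not anything from Hinton's paper, and the decay factor lambda and the temperature are free parameters I made up.

```python
import numpy as np

def tree_modulated_attention(levels, l, lam=0.5, temperature=1.0):
    """levels: list of (N, D) arrays, one per layer, bottom to top.
    Attention at layer l is modulated by agreement at all higher layers,
    so locations in other branches of the emerging parse tree (which
    disagree higher up) receive less attention."""
    N = levels[l].shape[0]
    scores = np.zeros((N, N))
    # k = 0 is the current layer; layer l+k contributes with weight
    # lam**k, so higher layers matter less and less.
    for k in range(len(levels) - l):
        Xk = levels[l + k]
        scores += (lam ** k) * (Xk @ Xk.T) / temperature
    scores -= scores.max(axis=1, keepdims=True)  # sum stays inside the softmax
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    return A @ levels[l]  # values are still the current layer's embeddings

levels = [np.random.randn(10, 64) for _ in range(5)]
out = tree_modulated_attention(levels, l=1)
```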
Alright, back to classic Yannick, who is thoroughly confused by these things. I'm not good at coming up with math on the spot, but I hope you can see what this is doing. If you only take the k = 0 term, you stay at the current layer and you get exactly what Hinton said. What I'm saying is that you should also consider how much the layer above you agrees with the layer above the thing you want to attend to: you compute that inner product between the embeddings too, and you add it into the softmax distribution. Initially the softmax distribution might say "attend to this thing and this thing a lot", but one layer up the hierarchy might say "these agree, but that one not so much", and you add those together with a factor of lambda; one more layer up gets lambda squared, and so on. As you go up the layers it becomes less and less important, but you still consider it. We'll see if this works out, inside the channel.

Now back to what Hinton says. This is actually the system in a nutshell: you input the image at the bottom (Hinton says you could use something like a ConvNet at the very bottom to get the image into the columns), and then at every time step you pass information up the columns, down the columns, and between the same layers of different columns. At some point this stabilizes; I don't think the procedure has cycles, it probably does not. When it comes to an end, the object-level embeddings should agree on an object, the part-level embeddings should agree on what parts there are, the sub-parts agree, and so on, and they form these islands. The islands give rise to a parse tree, and the parse tree can tell you what object is there, what it is made of, and where its parts are in the image.
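To tie the pieces together, here is a toy sketch of the whole settling procedure over a grid of locations, combining the four contributions with a plain average. The averaging weights, the linear nets, and the fixed number of steps are all my own simplifications of what the paper leaves unspecified.

```python
import numpy as np

def settle(x, W_td, W_bu, steps=10):
    """x: (L, N, D) array: L levels, N locations, D dimensions.
    Repeatedly average the previous state, the top-down and bottom-up
    predictions, and the lateral attention consensus at each level."""
    L = x.shape[0]
    for _ in range(steps):
        new = np.empty_like(x)
        for l in range(L):
            parts = [x[l]]                                # previous state
            if l + 1 < L:
                parts.append(np.tanh(x[l + 1] @ W_td))    # top-down
            if l - 1 >= 0:
                parts.append(np.tanh(x[l - 1] @ W_bu))    # bottom-up
            scores = x[l] @ x[l].T
            scores -= scores.max(axis=1, keepdims=True)
            A = np.exp(scores)
            A /= A.sum(axis=1, keepdims=True)
            parts.append(A @ x[l])                        # lateral consensus
            new[l] = sum(parts) / len(parts)
        x = new
    return x

rng = np.random.default_rng(0)
L, N, D = 5, 16, 32
final = settle(rng.normal(size=(L, N, D)),
               rng.normal(size=(D, D)) / np.sqrt(D),
               rng.normal(size=(D, D)) / np.sqrt(D))
```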
Now let's look at what Hinton calls some design decisions. How many levels are there? About five; we can skip that. How fine-grained are the locations? Hinton says they could be as fine-grained as pixels, or they could correspond to larger image patches, and you could use a convolutional neural network to get the image in there. Does the bottom-up net look at nearby locations? He says yes, it could (note this is the bottom-up network, not the attention network), but Hinton imagines that if you have bottom-up, top-down, and attention drawing in information, and you maybe limit that attention to a neighborhood, then the attention will do the job. Instead of looking at neighboring locations in the bottom-up network, you can aggregate that information over two time steps: bottom-up here, bottom-up there, and then the lateral attention mechanism passes it around. That also biases the network less toward the immediate neighborhood, so the attention mechanism can look farther, which conflicts a bit with the statement above that the attention mechanism might only look at the neighbors. I think there are different possibilities here, and only looking at neighbors is actually one solution to the problem of having similar vectors at very distant locations down the levels. But I think it's not as good a solution as mine, because looking at how close things are in pixel space misses that things close in pixel space can be far apart in parse-tree space.

How does the attention work? We've already looked at this: the way one location attends to another is the softmax of the inner product between the embeddings, and the values are just the embeddings at that layer. The visual input: a convolutional net could be used. Color and texture: he gives this example that if an object is entirely pale or entirely green, the color of a part is straightforward, but what color is the whole object? By the way, Hinton imagines, as with capsules, that these embeddings represent properties of the object: the cat-ear embedding represents not only the fact that it is a cat ear, but also different properties of the ear, even its location in the image. And we know transformers must be doing something like this, because we feed in positional embeddings at the very bottom and they can still compute things in terms of positions; there's an intrinsic connection between capsules and the transformer architecture. He says: "One of the motivations of GLOM was the idea that the whole object has a compound color which might be called 'pale green' or 'mauve', and at the object level every location belonging to the object has exactly the same compound color. When deciding which other locations at the object level to attend to, preference would be given to locations with a similar compound color." So you could give preference to similar-color locations when deciding what to attend to, but the color isn't as simple as the color at the location you are at. If one location is green and another is blue, the bottom layers would say "I'm green" and "I'm blue", but they could also be saying "I am part of a green-blue object", and the higher layer, caring about a bigger region, would have the compound color green-blue; the consensus could settle on "we are a green-blue object", even though the object isn't pure green or pure blue anywhere. I think this is a side suggestion, or maybe it was a core motivation of the system for him, but it's interesting to see how he thinks of things. He extends this from color to textures and even shapes: "The individual texture elements have their own shapes and poses and spatial relationships, but an object with a textured surface has exactly the same texture everywhere at the object level. GLOM extends these ideas to shapes. An object may have parts that are very different from one another, but at the object level it has exactly the same compound shape in all of the locations that it occupies." Basically: every location that is part of a cat head has the shape of a cat
head, even though the individual locations might not recognize that on their own; that information can be passed around through the consensus mechanism over time. Then there's cluster discovery versus cluster formation, which we've essentially seen (he makes a lot of analogies to face recognition here): "The islands of similar embedding vectors at a level can be viewed as clusters, but these clusters are not discovered in immutable data. They are formed by the interaction between the intra-level process that favors islands of similarity and dynamically changing suggestions coming from the location's embedding at adjacent levels." The core is really this consensus algorithm that creates the clusters: the clustering does not work by looking at fixed embeddings and deciding which go together; the embeddings update themselves in order to form clusters.

Then, replicating embedding vectors. This is a response to a criticism he apparently got: if at the bottom it makes sense to have all the different vectors, why, as you go up, do you replicate the same vector across all locations? It's the same object; in a database you would store it once. Hinton basically says that in order to reach the consensus it's important to have separate vectors: they might be slightly different, they might carry some nuance, because they get pulled in different directions by the bottom-up signal than by the consensus algorithm on the same layer. I believe that is important; I think this is just a criticism he got and decided to address in the paper.

Learning islands. What we haven't discussed yet is how this system is trained, and Hinton says it is trained as a denoising autoencoder: "Let us assume that GLOM is trained to reconstruct at its output the uncorrupted version of an image from which some regions have been removed." So he goes into self-supervised learning with this system. "This objective should ensure that information about the input is preserved during the forward pass, and if the regions are sufficiently large, it should also ensure that identifying familiar objects will be helpful for filling in the missing regions. To encourage islands of near identity, we need to add a regularizer, and experience shows that a regularizer that simply encourages similarity between the embeddings of nearby locations can cause representations to collapse: all the embedding vectors may become very small so that they are all very similar, and the reconstruction will then use very large weights to deal with the very small scale. To prevent collapse," he says, contrastive learning is the answer. So how do you regularize the model such that this consensus forms? Contrastive learning might be useful, but you can't apply it straight out of the box: "It learns to make representations of two different crops of the same image agree, and the representations of two crops from different images disagree. But this is not a sensible thing to do if our aim is to recognize objects. If crop 1 contains objects A and B and crop 2 from the same image contains objects B and C, it does not make sense to demand that the representations of the two crops are the same at the object level."
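As a sketch of that denoising objective, here is roughly what "remove a region, reconstruct, compare" could look like. The mask shape, the mean-squared-error loss, and the reconstruct function are all placeholders I invented, since the paper does not specify them.

```python
import numpy as np

def denoising_loss(image, reconstruct, mask_size=8, seed=None):
    """Remove a random square region, run a (hypothetical) GLOM-based
    reconstruction function, and score it against the uncorrupted image."""
    H, W = image.shape[:2]
    rng = np.random.default_rng(seed)
    i = rng.integers(0, H - mask_size)
    j = rng.integers(0, W - mask_size)
    corrupted = image.copy()
    corrupted[i:i + mask_size, j:j + mask_size] = 0.0  # the removed region
    recon = reconstruct(corrupted)
    return np.mean((recon - image) ** 2)  # reconstruction error

# Toy usage with an identity "reconstructor" standing in for the model:
loss = denoising_loss(np.random.rand(32, 32, 3), lambda x: x)
```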
So he says contrastive learning is good, but you have to pay very careful attention to which layer you employ it at. If you go down far enough, this type of contrastive learning, where you crop the image into different parts and demand that the representations agree because it's the same image, stops making sense: at the top layer, yes, but at the bottom layer certainly not, because the crops display different things. So you have to be careful where you apply contrastive learning, and he gives a bunch of suggestions for how to solve this, including that explicit negative examples might not even be needed. "The obvious solution is to regularize the bottom-up and top-down neural networks by encouraging each of them to predict the consensus opinion. This is the weighted geometric mean of the predictions coming from the top-down and bottom-up networks, the attention-weighted average of the embeddings at nearby locations at the previous time step, and the previous state of the embedding. Training the intra-level predictions to agree with the consensus will clearly make the islands found during feed-forward inference be more coherent." So you regularize the model to regress to the consensus opinion, a sort of self-regression. He then asks whether this will lead to collapse, because without negative examples such training can simply collapse: "An important question is whether this type of training will necessarily cause collapse if it is not accompanied by training the inter-level predictions to be different for negative examples that use the consensus opinions for unrelated spatial contexts." So that's the problem: using the consensus opinion for unrelated spatial contexts might be an issue. He says using layer or batch norm should reduce the tendency to collapse, but "a more important consideration may be the achievability of the goal", and he goes into why regularization could help: "If, however, an embedding at one location is free to choose which embeddings at other locations it should resemble, the goal can be achieved almost perfectly by learning to form islands of identical vectors and attending almost entirely to other locations that are in the same island." This is a convoluted paragraph, and I had to read it multiple times and still don't know exactly what he's trying to say, but I think he's touching on the problem I mentioned before: we want to regularize the network toward the consensus, and if a location attends to things that have nothing to do with it, it might never reach that consensus.
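Here is a small sketch of that consensus-regularization idea: each source of prediction is pulled toward the combined opinion. Note that I use a plain arithmetic mean as the consensus stand-in where the paper specifies a weighted geometric mean, and the squared-error pull is my own choice.

```python
import numpy as np

def consensus_regularizer(top_down, bottom_up, lateral, previous):
    """All inputs: (N, D) predictions for one level. Pull every source
    toward the shared consensus so that the islands found during
    feed-forward inference become more coherent."""
    consensus = (top_down + bottom_up + lateral + previous) / 4.0
    loss = 0.0
    for pred in (top_down, bottom_up, lateral, previous):
        # A real implementation would stop gradients through `consensus`.
        loss += np.mean((pred - consensus) ** 2)
    return loss

N, D = 16, 32
reg = consensus_regularizer(*[np.random.randn(N, D) for _ in range(4)])
```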
I think what he's doing here is making the case for the attention mechanism itself. If we simply drew in information from everything at the same layer, any old information might come in, and we might collapse or never reach consensus. But if we introduce the attention mechanism and only draw in information from the selected neighbors that are already in the same group, the same island, as me, then the consensus algorithm works: if we regularize toward this consensus, the network is forced to learn to build these islands of similar things in order to make the consensus achievable. So I believe he makes the case for the attention mechanism. I don't think he considers the islands of the next layer up here; what I would say is that you need to consider island membership all the way up the columns when deciding which embeddings at other locations a location is free to resemble. I hope you're still half with me; if not, I'm a bit confused too. But I think what he's saying is: contrastive learning is good, you can use it, but you have to be careful at which layer you do it; another regularizer to form these islands is to regularize the network toward the consensus opinion; simply aggregating information from the whole layer wouldn't work, because different things in the same layer can correspond to completely different parts of the image, and drawing in information from there would not help; and you solve this with the very attention mechanism he introduced, which only draws in information from parts of the same layer that are actually related to you.

The next consideration is representing coordinate transformations. There was a capsule-network paper where he explicitly represented coordinate transformations, and here he says that is probably not needed. You could represent them by 4x4 matrices; however, if you simply allocate 16 numbers in each embedding vector in order to represent the part-whole coordinate transformation (the transformation that relates the part to the whole), "that does not make it easy to represent uncertainty about some aspects of pose and certainty about others." The problem is this: we know that when humans watch a scene, say a chair with a very tiny person on it, we don't necessarily see the coordinate frame of the world; we see the coordinate frame of the chair, maybe with its center as origin, and we see the person in relation to the chair. Our brain seems to do this intuitively, and Hinton thinks a system like this should do it intuitively as well. So the coordinate transformations involved in going from the eye to the reference frame of the chair, and from the chair to the person, should somehow be encoded in the network. However, he also says it's probably not necessary to encode them as explicit coordinate transformations, because not
only does that probably make it harder to learn, but you also can't represent uncertainty that way. In fact, you can represent uncertainty much better by having a higher-dimensional thing that you're trying to guess. If you're trying to guess a quantity with three components and you only have a three-dimensional vector, you have no way of representing uncertainty; but if you have a nine-dimensional vector, you can hold three opinions about the quantity, and you can say "I'm pretty sure about these two components, because all my opinions are close, but about this one I'm not so sure, because my individual opinions say different things." (This video is too long already.) So that's his argument: we don't need explicit representations of uncertainty, because by over-parameterizing we can already represent uncertainty. And we also don't need disentangled position information, because, again, the network can take care of that. He gives a good example of why you wouldn't want a disentangled coordinate frame: if a picture shows a rhomboid shape, how do you know whether it's a rhomboid object or a rectangular piece of paper viewed from the side? (I suck at drawing this, but you probably get what I mean.) The object identity and the coordinate transformation depend on each other, so it makes sense for the neural network to entangle the two. In essence, he's saying: don't worry about explicitly representing all of these different things; the neural network can handle uncertainty, position, pose transformations, and so on.

Then he compares GLOM to other architectures: comparison to CNNs, to transformers, to capsule models. And at the end, he goes into video. At the very beginning he says the paper is actually about a video system, and you can kind of see that, because we go through this algorithm in multiple time steps. You analyze an image with these columns, which gives you sort of a 3D tensor with the image at the bottom, and at the next time step you have a new 3D tensor as you pass the information around, with the image at the bottom again. And he asks: why does that need to be the same image? It could be a different image, so you could use the system to analyze video. At the same time as you do the time steps to find agreement, you could swap out the video frame at the bottom for a slightly different one, and you'd even get a kind of ensemble regularizing effect. As the whole system of columns comes to a consensus over time, you feed in different information at the bottom, and if the video is slow enough, the top layers could still reach an agreement while the bottom layers change rapidly. So he intrinsically connects these two time dimensions.
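A sketch of that video idea: the column state persists across frames while only the bottom-level input is swapped out, and you choose how many settling steps to run per frame. The encode and settle_step functions here are stand-ins, and the five-steps-per-frame ratio is exactly the kind of free choice discussed next.

```python
import numpy as np

def process_video(frames, encode, settle_step, steps_per_frame=5):
    """frames: iterable of images. The higher levels keep settling toward
    agreement while the bottom level tracks the changing input."""
    state = None
    for frame in frames:
        bottom = encode(frame)           # e.g. a small CNN, per the paper
        if state is None:
            state = bottom
        else:
            state[0] = bottom[0]         # swap out only the bottom level
        for _ in range(steps_per_frame):
            state = settle_step(state)   # up / down / lateral consensus
    return state

# Toy usage with stand-in functions:
L, N, D = 5, 16, 32
result = process_video(
    [np.random.rand(8, 8) for _ in range(3)],
    encode=lambda f: np.random.randn(L, N, D),
    settle_step=lambda s: 0.9 * s + 0.1 * s.mean(axis=1, keepdims=True),
)
```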
They would otherwise be separate: you could input a video and run the consensus-finding algorithm within each frame, but he says it's actually cool to consider them together and do the consensus finding while you watch the video. It's just not clear that you always need the same number of consensus-finding steps as you have video frames; maybe you want five consensus steps per video frame, or the other way around. In any case, I think that's a pretty cool idea. He writes: "If the changes are rapid, there is no time available to iteratively settle on a good set of embedding vectors for interpreting a specific frame. This means that the GLOM architecture cannot correctly interpret complicated shapes if the images are changing rapidly. Try taking an irregularly shaped potato and throwing it up in the air in such a way that it rotates at one or two cycles per second. Even if you smoothly track the potato, you cannot see what shape it is." I don't have a potato, but I can give you an avocado: can you track the shape? I don't know; Hinton is probably correct.

Then: is this biologically plausible? I don't want to go too much into this; he discusses some restrictions, like the fact that we still use backprop, and whether that is plausible, and so on. I love this sentence: "In the long run, however, we are all dead", with the footnote "there are alternative facts". He discusses whether it's biologically plausible and how you could modify the system to make it more so. For example, for contrastive learning, there is evidence of something like it during sleep: you produce the negative examples while dreaming and collect the positive examples during the day. This is the more speculative part of the paper, but it's pretty cool to read.

Lastly, he goes into the discussion. He says the paper is too long already, so I'll also be brief: he trashes the neurosymbolic people a bit, the people who say neural networks can never represent this or that, and says pretty clearly: look, neural networks can represent trees, I've given you a system, and BERT can output parse trees too, so, I guess, shut up. He even coins the name GLOM-BERT, so if you wanted to build a GLOM-BERT, that name is already taken. (By the way, I also just coined a name; if you want to use it, it had better be a pretty cool machine learning system, and based on GLOM.)

That was the paper. I think it's a cool system. It has a bunch of parts that are maybe not super friendly to current hardware, like the iterative procedure, but honestly it is not much more than a recurrent neural network with very complicated recurrence functions. The video extension might be a bit tricky, and the regularization might be a bit tricky. The exact objective isn't super detailed in the paper: the denoising autoencoder objective simply says to reconstruct the corrupted version of the input, but how exactly the input comes in, whether there's a CNN, whether the CNN feeds information into multiple layers, none of that is specified. So there's lots to figure out. I do think the ideas are very cool, and I love idea papers, and therefore I recommend that, if you're interested, you give this thing a read,
give this video a like, share it out, and I'll see you next time. Bye!
Info
Channel: Yannic Kilcher
Views: 34,885
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, geoff hinton, geoff hinton capsule networks, geoff hinton neural networks, geoffrey hinton, geoffrey hinton deep learning, geoffrey hinton glom, hinton glom, glom model, deep learning tutorial, introduction to deep learning, capsule networks, computer vision, capsule networks explained, google brain, google ai, schmidhuber, transformer, attention mechanism, consensus algorithm, column
Id: cllFzkvrYmE
Length: 63min 25sec (3805 seconds)
Published: Sat Feb 27 2021