Graph Node Embedding Algorithms (Stanford - Fall 2019)

Captions
Good, okay, welcome to the class. We are discussing embeddings, the topic we started on Tuesday. What we will do today is talk more about embeddings, and in particular about how to generalize neural networks to graphs. Then next week on Tuesday Michaela will do a hands-on session, so we will make the things I talk about today more concrete: we will see how to implement them in the popular deep learning frameworks and how to make everything I describe today at a high level work in practice for various kinds of interesting applications. So here is what we would like to do. We would like to learn a function that takes our network and maps, or embeds, it into some low-dimensional embedding space. The idea is that we want to do representation learning: rather than trying to come up with a feature representation for the nodes by hand, the way we did with graphlets, or with motifs for entire networks, and so on, we would like the neural network to capture the structure of the network and the dependencies between the nodes. So the idea is that we want to map the nodes into a d-dimensional representation space using embeddings, in such a way that nodes that are similar in the network get mapped close together, and the goal is to learn this mapping function. The mapping function is not just given to us; we want to use some procedure to learn it. One way to think about this is to say: I have my complicated network here, and I want to learn an encoding function that takes the nodes in the network and maps them into this low-dimensional space, so that the position of a node in the space somehow captures the position of the node in the network. That is what we are trying to do, and the idea, as we discussed on Tuesday, is that similarity, whatever that really means, in the network gets mapped to the similarity of the two points in the embedding space, which here is just a dot product, essentially a cosine similarity, so nothing too fancy. The fancy part is the mapping function. What do we need to define to make this work? We need to define the notion of similarity between nodes, and we need to define how the encoder function takes a node and maps it. We said there are two key components: the encoder function, which takes a node and learns how to map it, and the similarity function, which tells us the relationship in the input network and how to map that relationship into the embedding space. To make this more concrete, the encoder function takes a node and gives us the embedding, the coordinates, the feature representation of the node; and the similarity we define in the network is what we try to match, so that similarity in the network equals similarity in the embedding space. Similarity in the embedding space could be Euclidean distance; here it is a dot product, so essentially the cosine of the angle between the two vectors.
And what did we talk about on Tuesday? We talked about what we call shallow encoding methods, where the idea is that we try to learn an embedding matrix Z in which the number of rows is the number of dimensions, the number of columns is the number of nodes, and for every node we just learn its coordinates directly. The way to think of this is as a one-layer data transformation: for every node I learn a set of coordinates, and I want the coordinates to be such that when I take their dot product, the similarity of the coordinates equals the similarity in the network. So the question becomes: how do I best set these coordinates so that their dot products match the similarities in the network? And then of course the other question is how to define the notion of similarity, and we explored using random walks to define it. Now, what are some limitations of these types of approaches? The first, and it is very important, is that the number of parameters we are trying to estimate grows with the number of nodes, because for every node we estimate a set of coordinates. So if I am mapping into, say, a hundred-dimensional space and I have a network with a million nodes, I need 100 million parameters; if the network has 2 million nodes, I need 200 million parameters, because for every node I need to estimate a hundred numbers. That makes this hard. Another thing that makes it hard is that these methods are inherently what is called transductive: if I learn how to embed one network and then another network comes along, I have to redo everything. Similarly, if the network evolves over time, when a new node arrives I have to estimate its coordinates from scratch. So there is no way for the embeddings to generalize to new graphs or to nodes I have not seen during training. And the last thing, which is very important, is that these methods directly learn the coordinates of the nodes and do not incorporate node features; so if you have attributes or features on the nodes, the question is how you would use them. Another thing worth realizing, which I did not say before, is that when we talked about spectral clustering and about finding the eigenvectors of the graph Laplacian, that is also an embedding method. But those embeddings are computed by optimizing a very particular objective function, the one about finding clusters. So we already touched on embeddings when we talked about spectral clustering, only there the embeddings were specific to identifying clusters, while here we want to learn embeddings that are specific to whatever task we want to solve later: if the task is clustering, the embedding should obey the clustering structure; if the task is link prediction, the embedding should obey link prediction.
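(To make the shallow-encoding setup concrete, here is a minimal Python sketch; it is only an illustration of the idea above, and all names and sizes are made up rather than taken from the lecture.)

    import numpy as np

    # Illustrative only: a 1,000-node toy graph embedded in 64 dimensions.
    n_nodes, dim = 1000, 64
    Z = np.random.randn(dim, n_nodes)   # embedding matrix: one learnable column per node

    def encode(v):
        # Shallow encoder: just look up node v's column of Z.
        return Z[:, v]

    def similarity(u, v):
        # Similarity in the embedding space: dot product of the two embeddings.
        return encode(u) @ encode(v)

    # Training would adjust the entries of Z directly so that similarity(u, v) matches
    # the network similarity (e.g. random-walk co-occurrence). Note the parameter count:
    print(Z.size)   # dim * n_nodes = 64,000 numbers just for this toy graph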
So today we are going to talk about what we call deep graph encoders, where the idea is that we want to take neural networks and generalize them to graphs, and be able to say how to encode a node through multiple layers of nonlinear transformations of the graph structure; whatever that means right now, we will make it precise. The important point is that the deep encoders we will be using can be combined with essentially any node similarity function, including those defined in the last lecture, so what we talk about today generalizes what we talked about last time. Pictorially, what we would like to do is take our graph, send it through some deep neural network, and have good predictions come out at the other end. Basically we will get good node embeddings that can then be used to make various kinds of predictions about the nodes; and we would like not only to embed nodes but also entire graphs, subgraphs, things like that. There is a lot we would like to get out of this picture. However, this is amazingly challenging to do, and the reason is that the present machine learning and deep learning toolbox is specialized to simple data types. Essentially all we know how to process are fixed-size grid graphs: if you think of images, all images have the same structure and the same size, so you can think of them as fixed-size grid graphs; and if you think about text and speech, what kind of graphs are they? They are sequences, so they are line graphs. From the perspective of what we care about in this class, these are extremely boring graphs, extremely simple data: a fixed-size grid or a line graph. Here you can have a notion of a sliding window, and all the graphs are the same, so you know how to pipe them through the network. So the claim is that the modern deep learning toolbox is designed for these simple data types, sequences and fixed-size grids. But graphs are not fixed-size grids: they have arbitrary size, a variable number of nodes, and much more complex topological structure. What is also important is that images have this nice two-dimensional structure where it is clear what is top left and what is bottom right, and text or speech has a linear structure where it is clear how to go left and how to go right. If you think about networks, networks have no reference point; it is not clear what is top left or bottom right, there is no single reference point. So it is very hard, and unclear, how to do deep learning on top of these types of data; it is fundamentally different, and we do not know how to resize graphs or do things like that. Okay, so what we would like to do, at least conceptually, at the intuition or philosophical level: if you think about what drove the advances in computer vision, it is this notion of a convolutional operator, where you have a convolutional window that you slide across your image. It is very nice: you have the two-dimensional image, you slide the operator line by line, or however you decide to slide it, and compute some function over that sliding window; you can do this, pass it through many layers, and you get good results. So the goal is to generalize this beyond simple two-dimensional lattices, and if we are able to do that, then we will be able to leverage node features and attributes as well.
In a sense, if in convolutions you have three channels, red, green and blue, here you can think of node features as different channels. Of course, our graphs are not these two-dimensional, fixed-size, well-oriented lattices; they are much more complex. But here is the insight that will allow us to generalize the notion of convolution to these complex data types. The way you can think of a convolution is that it takes a little sub-patch of the image, a little rectangular part, applies some function to it, and produces a new pixel. If I take a 3 by 3 patch, I can think of it as nine pixels, and what happens is that the center pixel aggregates information from its neighbors, as well as from itself, to produce a new value. So you can think of the convolutional operator as a way to transform and aggregate information: you collect messages from your neighboring pixels, combine them with your own message, and produce your new value. Now the question is, if you have graphs that are more complex, how would you define the notion of this sliding window over them? What would we do for the networks we care about? Here is one poor man's attempt at this, and it will quickly become clear that it does not work. Imagine I have a graph on five nodes; I can represent it as an adjacency matrix, and imagine that each node also has some attributes, some features. What I could do is say: this is now a nice fixed-size input, so let me just take each row and feed it as an input to my deep neural network, and let's go and learn this. There are many reasons why this is a terrible idea. The first is that for a graph on n nodes you have n rows of data, so you have n data points to train on, but this network has many more parameters than you have data points, because the number of inputs is the number of nodes plus the number of features; even in the first layer you have already used more parameters than you have training examples, and as you make the network deeper it gets even worse. Another important problem is that this will not work for a graph on six nodes: if you give me a graph on six nodes, the matrix would be six by six and I do not know where to add that additional node, so I do not know what to do. And then there is another issue: right now we labeled the nodes A, B, C, D, E, but I could relabel them in a different order, and then the adjacency matrix would shuffle, the inputs to the network would change, and the network would get completely confused. The point is that graphs are invariant to node ordering: you should get the same result regardless of how you order or number the nodes. So these are the three reasons why this will not work. Yes? Good.
You could come up with some heuristic, and one heuristic would be to say: let me order the nodes somehow. But the question is always what you mean by that. You could order them by degree, but then some nodes will have the same degree, so what do you do? You could keep adding other properties, and perhaps that would work, but you still would not be able to learn anything, because you have too many parameters. So I would say this is a terrible idea, but if you are excited about it, you can do a project and prove me wrong. It should be clear, though, why you do not want to do this. So let me tell you what you do want to do, and what I want to talk to you about today: core deep learning for graphs, graph convolutional neural networks, and then how to add attention, so graph attention networks; I will define what I mean by all of this and then give you some ideas and demos. All right, so let's talk about the core of how to think about deep learning for graphs. The key concept will be understanding local network neighborhoods; we will talk about different aggregation strategies and define this notion of computation graphs. Once we have that at the single-layer level, we will talk about how to go deeper, how to define the model parameters, how to train, and then how to train these things in an unsupervised as well as a supervised or semi-supervised way. Here is the setup for the rest of the lecture. We assume we have a graph with some vertex set and an adjacency matrix, and the vertices have a set of features. If we are in a social network, the features could be a user profile, or the feature vector could simply be the image the user has; in biological networks it could be gene expression profiles. If we have no features, things still work: we can, for example, use indicator vectors, one-hot encodings of the nodes, or even a vector of all constants as the feature vector. That is what people use when they do not have any features. The key insight today is that the underlying network, say the social network, defines the computation graph; it means the social network defines the neural network, that is the key. The idea is that a node's network neighborhood defines its computation graph. So given a node for which we want to make a prediction, we first figure out what its computation graph should be, and then we use that computation graph to propagate and aggregate information from the neighbors, and the neighbors of neighbors, to make a prediction at node i. This means what we learn is how to propagate information across the graph to compute a representation for the node, and the reason this is so powerful is that the propagation captures both the structure of the network around node i and lets us borrow information from nearby nodes to enrich the representation of node i. We are doing two things at the same time: capturing the structure and borrowing feature information. This will be strictly better than trying to classify node i by itself using only its own features, because we will be able to borrow the signal from the nearby nodes.
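(As a rough illustration of "the neighborhood defines the computation graph", here is a small sketch that unrolls a node's neighborhood to a fixed depth; the toy graph and the helper name are invented for the example.)

    # Toy undirected graph as an adjacency list (invented for the example).
    graph = {
        'A': ['B', 'C', 'D'],
        'B': ['A', 'C'],
        'C': ['A', 'B', 'E', 'F'],
        'D': ['A'],
        'E': ['C'],
        'F': ['C'],
    }

    def computation_graph(node, depth):
        # Layer 0 is the target node itself; layer k lists the (possibly repeated)
        # neighbors whose messages feed into layer k-1.
        layers = [[node]]
        for _ in range(depth):
            layers.append([nbr for v in layers[-1] for nbr in graph[v]])
        return layers

    # Two-layer computation graph for node A: A aggregates from B, C and D,
    # which in turn aggregate from their own neighbors.
    print(computation_graph('A', 2))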
Okay, so here is the key idea: to compute a node embedding, we use the local network neighborhood to define the neural network structure. If I have this tiny graph as an example and I want to create an embedding for node A, then node A has to aggregate information from B, C and D, so A gets information from nodes B, C and D. This node here has only one neighbor, A itself, so it collects information just from A, while node C, for example, gets information from its neighbors A, B, E and F. So you see how I took a node in the network, the node collects information from its neighbors, and that is one layer; and then I unrolled this one more layer and said that each of these nodes in turn collects information from its own neighbors in order to compute its representation. The point is that now we have a two-layer neural network that makes a prediction for node A, and you can see how the prediction we make for A will depend on the feature representations of all these other nodes. In some sense we will learn how to combine information from the neighbors and the neighbors of neighbors to make a more accurate prediction for node A. That is what we will do. Yes? Great question: when do we know when to stop unrolling? In practice people do not go too far, maybe three or four steps. Why would you only go that deep? You can think about it yourself, but I already gave you the answer earlier in the class. Would anyone guess why you do not want to go deeper than four or five? What is the diameter of networks, what was the diameter of the MSN network, do you remember? About six, right? So if I go six levels deep, I have visited every node in the graph, and I do not need to do that. So the depth here is very different from the depth in computer vision: here, if you go three or four hops away, you have essentially visited a huge chunk of the network, and you do not need to visit more. Depth here is a different concept than the depth we usually like to think about. Okay, so that is the intuition part. Of course, what will these boxes be? These boxes will be some kind of neural networks that aggregate information, and there is one important property these boxes need to have: the aggregation needs to be order invariant. Why do we care about order invariance? Because if I renumber the nodes in some different way, or vary the order in which I aggregate these inputs, permute them, I should still get the same result. So the aggregations have to be order invariant: an average is an order-invariant function, a maximum is an order-invariant function, but something that takes the second element is not. These aggregation functions, as I will show you later, have the important property that they are order invariant. Yes? I will come to that, so let me not answer yet, but I will definitely answer your question; it is a great question and we will spend some time on it. All right, good. So this is what we are trying to do, and what is super cool here, and fundamentally different from other neural network learning techniques, is that every node now has its own neural network architecture.
The yellow node has a neural network architecture like this, the red one there has a different one, the green node has a very fanned-out architecture, the blue node a very narrow one, and so on. Every node has a different neural network, because the neural network structure captures the local neighborhood, the structure of the network around that node: the green node has a lot of neighbors, so its network is very wide; the blue node has very few neighbors, so its network is very narrow. The way we will think about how to initialize these neural networks and how to compute over them is that at the bottom we use the feature vectors as inputs: every node has a feature vector, so x_A is the feature vector of node A, and it goes in at the input. Then this box takes the two feature vectors, aggregates them, and passes the result on; here the aggregated feature vector from the two neighbors is combined with B's own feature vector and again passed on; and here these representations are aggregated and passed on again. So we will think of this as a fixed-depth neural network where, at the bottom, at layer zero, the features of the nodes are the input, and then through these nonlinear transformations some kind of embedding, a feature representation of the node, comes out. What is important to notice is that, for example, the input at node C is the features of node C, but the representation of node C at the next level is some hidden, latent representation of the node, and the one above that is another latent representation. So the input is raw node features, and everything from there on happens in the embedding space. Also notice that if this is two layers deep, it means that in the graph we go two hops away; if node E had another neighbor over here, that neighbor would not be part of our neural network, because we go one step, two steps, but not three. So the deeper the layers, the farther out in the graph we go, and that is the argument for why we do not want to go twenty steps deep: essentially we would just aggregate over the entire network. So now, the key distinction between different architectures is how these boxes, these little neural networks, these aggregation steps that take information from the neighbors, summarize it and pass it on, are defined; this is where different architectures differ. The basic approach, what people tend to do, is to use the average as the aggregator: you simply average the feature vectors coming from the neighbors, or here the representations coming from the children. The reason people use the average is that it nicely keeps things at the same order of magnitude; it turns out that theoretically the most powerful choice is actually a sum, and if there is time I can give a lecture on the theoretical reasons why, but that is essentially the idea. And again notice that summation, maximum and average are permutation invariant: you can take a sum in whatever order you want and you always get the same number. This is important because we do not want the result to depend on the order in which we process the nodes.
The order is arbitrary anyway; what is important is that node C has these four neighbors, and in what order we enumerate them we do not care, there is nothing magical about the order, so we want order invariance. So the point is that this box here, in the simplest case, averages the messages coming from the neighbors, and once we have the averaged messages we apply the neural network, some kind of nonlinearity. Here is now the description of how this works. For every node I will have a representation, and the superscript tells me at what layer of the network I am operating. At layer zero, the representation of node v is simply its feature vector. At the inner layers of the network I have a representation for every node at every layer, and the representation at the final layer is my embedding of the node. So this is the zeroth-layer representation, which equals the features of the node; this is the last-layer representation of the node, the K-th layer; and this is the embedding after K layers of the neural network. Now we need to explain what is happening in the update. Sigma is a nonlinearity, usually a rectified linear unit; that is what people like to use the most. And what are we doing here? To compute the representation of node v at layer k, we go over the neighbors u of node v, take the representation of each neighbor from the previous layer, and average those together, dividing by the number of neighbors; that is the aggregation. Essentially I am saying: I am node v, I check who my neighbors are, and if I am at level k, I look at their representations at level k minus 1, take their messages, average them, and transform them a bit with a parameter matrix W_k; and then I combine these messages from the neighbors with my own message from the previous level, node v's own representation, transformed by another matrix B_k. So we have two transformation matrices and a nonlinearity, so this is a proper neural network: we aggregate information from our neighbors, combine it with our own information, and pass it on. If this happens at level k, everything on the right-hand side is from level k minus 1; at level zero it is just the feature vector, and at the final level, that is my final embedding.
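(A minimal numpy sketch of the averaging update just described: the layer-k representation of a node is a nonlinearity applied to W_k times the mean of its neighbors' layer k-1 representations, plus B_k times its own layer k-1 representation. All names, sizes and the choice of ReLU are illustrative, not prescribed by the lecture.)

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def gnn_layer(H_prev, neighbors, W, B):
        # One mean-aggregation layer.
        #   H_prev    : (n_nodes, d_in) previous-layer representations (layer 0 = raw features)
        #   neighbors : dict mapping node index -> list of neighbor indices
        #   W, B      : (d_out, d_in) transformation matrices shared by all nodes in this layer
        H_new = np.zeros((H_prev.shape[0], W.shape[0]))
        for v, nbrs in neighbors.items():
            agg = H_prev[nbrs].mean(axis=0)             # average of the neighbors' messages
            H_new[v] = relu(W @ agg + B @ H_prev[v])    # combine with the node's own message
        return H_new

    # Toy usage: two layers deep, so the embeddings look two hops out into the graph.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 3))                          # 4 nodes, 3 raw features each
    nbrs = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}        # invented adjacency lists
    W1, B1 = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))
    W2, B2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
    Z = gnn_layer(gnn_layer(X, nbrs, W1, B1), nbrs, W2, B2)   # final embeddings, one row per node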
If anyone has a question, this is a perfect time to ask me something. Yes? You are asking whether deeper networks are more powerful. What I am trying to say is that if you make this deeper, yes, it will be more powerful, but you can only make it so deep, because you can only go so deep in the graph. This notion of depth is different from what you do in computer vision, where you just add hidden layers: here each additional layer captures much more information, because you go farther out in the network, aggregating from neighbors, neighbors of neighbors, neighbors of neighbors of neighbors, and so on. So here you aggregate more information as you go farther out; you are not transforming the same information more, you are bringing in more information. This is very different, because you are learning from your neighbors: you are not learning from one data point anymore, you are also learning from the representations of your neighbors, so you are learning how to aggregate information from them. Yes? The way I compute this is that for every node I compute the embeddings at all layers, so everything is always computed. If you unroll this, it is essentially this type of tree: these are the h-zeros, these are the h-ones, this is h-two, and for every node I have something like this. The same h-one here also appears inside some other node's neural network, so this message and that message are the same; these parts are shared, so there is some redundancy, and I need to compute embeddings for all the nodes at all levels. Good, anything else? Yes, you can stack them, but it works the same way as I explained, it just becomes deeper; that is possible. So now the question is how we train this: how do I get the final embedding, and how would I train it? This was the question asked before; we now need to define a loss on these embeddings to train them. What are we trying to learn? We are really trying to learn these two parameter matrices, W and B. These matrices do the transformations, and the trade-off between them is the following: if, for example, W is all zeros, then I am not learning from my neighbors at all, I am only taking my own features and transforming them, so this becomes just a multi-layer transformation of my own features, a deep model on my own features. If instead I give W a very high weight relative to B, it essentially says: ignore your own features, just borrow the data from your neighbors. So this allows us to learn what the optimal way is to enrich the information a node has about itself: how much nonlinear transformation to do over the node's own feature vector versus over the neighbors' feature vectors. This becomes very important in many applications, where you can say: I want to predict something about node A, but if I know who its neighbors are and I know their features, then I can borrow that information to be more accurate about predicting for node A. Think about predicting someone's property in a social network: if I know who their friends are and I know the properties of those friends, those properties come in here, and I can learn how to combine the friends' properties with the person's own properties to make the prediction about the person. So I am learning how to share, or aggregate, information in the network. Yes? There are two things happening here: we are trading off between W and B, but both of these are transformations and they are different, so you want to be able to learn, for example, that certain components of the features from the neighbors are important, and certain components of your own features are important.
important and certain components of the features or certain subsets of the features from yourself are important and the approach you suggested would not allow to do that so yes you could try it would be simpler but it wouldn't have enough expressive power right so this is the this is the idea so now right like I can finish embeddings into any loss function right anything here and run stochastic gradient to train these weight parameters notice that I have one weight parameter per layer of the network so if I have a layer Network I will have three 3w matrices and three B matrices so in total I have six matrices to trade of course I can train this in a supervised or I can train in an unsupervised way so I could train this in an same unsupervised way as we did it last week on Tuesday when we talked about oh I can use the random walks or I can use any other type of similarity metric in the between the nodes and now I can say how do i best set this parameter matrices such that this particular similarities are are obeyed so this would be basically saying let's try to train things in an unsupervised way what we can also do is we can try to train things in a task specific supervised way right like trying to predict type or color of the node where I can say if I want to make a prediction about no day I will now based on this embedding make a prediction and then I can do back propagation into this network to learn the parameters W W N and B and what I can also do if I want to do for example link prediction then I would take pairs of nodes and I would have pairs of neural networks here that are then connected to the final output node so I can play with these architectures in many different ways right so for example if I would have a drug interaction network then here the label could be is a is a drug safe or toxic right so I'm trying to make a prediction about whether this drug is safe or toxic right so the way this would look like let's say in this node classification way I'm trying now to write out the entire loss function that we try to optimize right and basically what do we say we say let's go over all the nodes of the network and then this is called the cross entropy loss is just a classification loss well basically what I'm saying is this is my embedding for the node V of interest this is some parameter may parameter parameter classification weights think of this simply as some vector and I'm doing the dot product between the embedding and the tractor I'm sending that through non-linearity so this is simply a logistic regression so this is now the probability that V belongs to the positive class and one minus that is the probability that my thing belongs to the negative class I'm taking the the logarithm so this is now LOC probability and then I either mult and I multiply that with the ground truth label if I'm doing binary classification I can think of Y Twitter take value of 0 and 1 and what is then interesting if y equals 1 then this term survives and this theorem gets canceled out because this will be 0 and the other way around if Y is of type 0 then this will be 0 and this term will survive so essentially what this will try to do is it will try to maximize the probability of predicting the correct class it's the same type of loss that we were using when we were talking about likelihood of a given graph right we said that positive for the positive nodes I want this to be big and for negative nodes I want this to be big ok and this is now our loss function where we can take the derivative of 
So let me now give an overview. Given a graph, we want to define the embedding of a given node A. What do we do? We first create a computation graph for that node based on the network neighborhood structure around it, and then we need to define a loss function, some objective over these embeddings, so that we can optimize the parameters to minimize that loss. What is interesting is that we can train our model on a subset of nodes or graphs and apply it to nodes or graphs it has never seen. What do I mean by this? For example, we take three nodes, create their computation graphs, and this is our training set; later some new nodes arrive, say nodes D, E and F, and we can still apply the model learned on the training subgraphs to these new subgraphs. The reason we can do this is parameter sharing: we assume the processing boxes are the same everywhere, that W and B are shared across all these different architectures. So when a new architecture comes along, I can take W and B learned from these graphs, transfer them, do the forward pass, and I have a prediction. The idea is that we can apply our model even to graphs, or parts of a graph, that we have never trained on or never seen before, because the parameters governing these aggregations and transformations, the W and B matrices, are shared across the different computation graphs. This means the same aggregation parameters are shared across all the nodes, which is cool because the total number of parameters is independent of the network size; it is constant. And what is also cool is that we can generalize to unseen nodes or unseen parts of the network. What we talked about on Tuesday, node2vec and DeepWalk, cannot generalize to new nodes, and those methods cannot generalize across graphs; our method can do that. We can train our W and B on some training graph, and when a new graph arrives we can apply the model with those parameters to the new graph, which is super useful when you are working, say, with molecules. Or, for example, in social networks you could imagine you are given some snapshot of the training graph, then a new node arrives that you have never seen before; to create an embedding for that node, all you have to do is create its computation graph and you can make the prediction. So we can transfer the parameters to unseen parts of the graph. Another reason this is useful is that you can have a huge graph that is too big to train on; you can train on sub-parts of it and then apply the model to all the other parts of the graph as well, and that works really well. These are some important benefits of graph neural networks as we have defined them so far. So, to recap, we talked about how to generate node embeddings by aggregating neighborhood information from the network.
We just saw the basic variant of this method, where we simply average the messages coming from the neighbors, and the key distinction between different methods is how the different approaches aggregate information across the layers. So far the aggregation was a simple average, followed by a simple weighted combination of my own feature information with the average feature information coming from the neighbors; that is the basic variant, and next I am going to explain the next iteration of this idea, which is called GraphSAGE. Before I do that, are there any questions? Yes, go ahead. The question is about this claim that you can generalize to new graphs, for example to a snapshot of the graph taken a few years apart, where the structural properties are a bit different. Great question. Of course, you can generalize in the sense that if the statistical properties of the data are the same, if the nodes are sampled from the same distribution, then you can generalize across that distribution. If the data generation process has shifted dramatically, then that generalization ability will not be there. I am not saying you can train your model on a social network and then apply it to predict whether a drug is toxic; that will not work. But if you think about building a recommender system on Facebook today, and tomorrow a new user joins, you can still apply your model to this new user, because Facebook did not change too much from one day to the next. And the point is not even so much about generalization; the claim is that you can apply the model at all. With the methods we talked about on Tuesday, you would not even be able to apply them: if a new user joins the Facebook network, you would have to recompute the entire embedding of the Facebook graph, whereas here you do not need to recompute anything, you just create a computation graph and there is your embedding of that new node you have never seen before. That is what is elegant here. Yes, it is five layers because we expanded by five steps. Yes? Good question; again, the question is what exactly we mean by depth here. In some sense, the more nonlinear you want your model to be, you can make it as deep as you like, but I think we have two notions of depth here. One notion is how many nonlinear transformations you apply; the other, if I go back, is how deep in the graph you go, how much information you need to aggregate, and that is a different notion of depth than asking how deep this neural network is. If you think about traditional neural networks, they are, in a sense, one input, a lot of magic happening in between, and one output; here, as you add layers, you are also increasing the number of inputs you get. So this type of depth is very different from a ResNet depth or a VGG depth or whatever your favorite model is; those types of depth you can add here as much as you like.
But here the fundamental difference is that you are increasing the amount of knowledge you bring into the model. You could make this per-step part more complex, more nonlinear, several layers deep if you like, but the depth we are talking about here is different: it says how far in the social network I want to go and how much information I want to borrow from neighbors, or neighbors of neighbors. If you think about social networks and making a prediction about myself, you do not need to aggregate data from somebody on another continent to make a prediction about me; you probably need to aggregate data from everyone here, maybe from your friends as well, but somebody seven hops away in the Amazonian forest, their properties are not going to be predictive of my properties, while the properties of people here might be very predictive. So you do not need to go too far; of course you can make this per-step part more expressive, deeper if you like, but these are two notions of depth. All right. Yes? If you have many nodes coming in, do you need to refresh the parameters? I think we are now talking about two different things. One is that every node defines a computation graph, so if I pick a random subset of the nodes and train on them, I am able to generalize to the nodes I did not pick. Now, if you think about the network evolving, the nodes that come later might be fundamentally different from the ones you trained on, and then of course you will have trouble generalizing to them; but if you assume the network is, say, a static object and you train on a random subset of the nodes, then you will be able to generalize to whatever you did not train on. This is useful because you can have a graph with a huge number of nodes, train on a subset of it, and then apply your model to the rest of the network; that becomes very important. All right, good. Let me now move on and talk about graph convolutional neural networks, in particular GraphSAGE. This is what we talked about so far: for a node, we create the neural network structure around it, and then we have these little neural networks inside that aggregate and transform information. So far we did something very simple: the node takes the average message from its neighbors, adds it to its own message, and passes it on. Now let's generalize this. The aggregation will now look as follows: we take our own message from the previous level and use the matrix B to transform it; then we take the messages from our neighbors at the previous level and use some general aggregation function to aggregate them; then we also transform this aggregated message; and finally we concatenate the two messages and send them together through the nonlinearity. So what we are doing here is generalizing the aggregation by using a general aggregation function, and I will give you examples of that; and we also change the combination: rather than having a plus here, which mixes together the information from the neighbors and our own information, we now keep them separate before we push them through the nonlinearity.
Okay, so that is the difference. To show you again: before, this was the type of aggregation we were using; now our aggregation is different. We still have the matrix B down here, so that part is the same, but there is no plus anymore, there is a comma: we concatenate this part with the aggregation part, where the aggregation part is a generalization of the summation; then we have the matrix W and the nonlinearity. So again, two big differences: rather than summing the two things together and losing track of them, we keep them separate by concatenating them; and rather than averaging, we use this general aggregation function. What is interesting is that you can theoretically analyze the properties of different aggregation functions and which kind you should use in which case. What we did before was simply the mean aggregation function: we took the messages from the neighbors, added them up, and normalized by the number of neighbors. We can also do a pooling-type approach, for example element-wise min or max pooling: here we again take the messages from our neighbors, transform them, and then apply some pooling strategy, where we take the maximum across each coordinate, or the mean across each coordinate; that is different from what we had before. And if you like, you could even use a deep neural network like an LSTM to learn how to aggregate the neighbors. Of course, for those who know it, an LSTM is not order invariant, so when you train you could apply it over several random orderings of the neighbors, to make sure it learns that the order is not important; here pi denotes a random permutation of the neighbors of a given node. But now we have made the model much deeper, as you were asking before, because at the aggregation step there is a proper neural network whose unrolled depth is essentially the number of neighbors. So this is how we have generalized the approach: we are able to use different types of pooling strategies, and the pooling strategy can even be a learned one. Yes? The question is that usually you use an LSTM when you care about some sort of temporal or sequential information and want to learn it, whereas in every other aggregation strategy we do not care about the order of the neighbors. Exactly, we do not want it to learn the order, but there is very nice work showing that you can make an LSTM order invariant if you train it over random permutations, random orderings. What is the point? The point is that there is a parameterization, there are parameters in there that are still useful to learn. And of course you can come up with and suggest your own aggregator; that is completely great. People have tried this and that in the past, and if you think you have a better idea for an aggregator, that is an awesome project. There is a project deadline, actually, I forgot, sorry, there is a deadline at midnight tonight, so keep the ideas coming; these are great project ideas, and feel free to steal them from each other if you like, that is fine.
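(A minimal sketch of the GraphSAGE-style update just described: transform the node's own previous message with B, aggregate the neighbors' messages with an order-invariant function, transform the aggregate with W, concatenate the two pieces, and apply the nonlinearity. The mean and element-wise max aggregators are shown; all names are illustrative.)

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def mean_agg(msgs):          # msgs: (n_neighbors, d_in)
        return msgs.mean(axis=0)

    def max_pool_agg(msgs):      # element-wise max pooling, also order invariant
        return msgs.max(axis=0)

    def graphsage_layer(H_prev, neighbors, W, B, agg=mean_agg):
        # B transforms the node's own previous message, W transforms the aggregated
        # neighbor message, and the two pieces are concatenated before the nonlinearity.
        d_out = W.shape[0] + B.shape[0]
        H_new = np.zeros((H_prev.shape[0], d_out))
        for v, nbrs in neighbors.items():
            neigh = W @ agg(H_prev[nbrs])       # order-invariant aggregate of neighbor messages
            self_msg = B @ H_prev[v]            # the node's own transformed message
            H_new[v] = relu(np.concatenate([self_msg, neigh]))
        return H_new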
Yes? The question is whether we use the same aggregator at every layer. You could even combine different aggregators at different levels; that is interesting, and I do not know if people have tried it, but generally you would use the same type of aggregator. The parameters, the B and W, are shared across all the nodes within a layer; you allow them to be different across layers, but the aggregator itself you would generally keep the same everywhere. What you could also do, which for example nobody has tried, is not share the parameters across all the nodes: you could say, for these nodes I will have one set of parameters and for this other set of nodes another set, if you could somehow decide that ahead of time; I do not think anyone has tried that. This is all brand new: the first paper on this was published about two years ago, the GCN paper in 2017, and GraphSAGE in 2018, so this is hot off the press. There is a lot that is unexplored and a lot that is unanswered here, but it is one of the hottest areas of machine learning right now: if you look at the keyword distribution at the latest ICLR, these types of things were as big as all of NLP; the graph neural network keyword was as big as the NLP keyword at the ICLR deadline, which was I think a few weeks ago. These things are hot because they allow us to model structure in the data in very unique ways, and there are a lot of unanswered questions, so I am really showing you the earliest works in this area, and there is tremendous progress being made from week to week. Okay, so let me conclude. What did I talk about? I talked about this notion of graph convolutional neural networks, where the basic variant simply averages the information from the neighbors and then combines it with the node's own information, and we stack these neural networks on top of each other; and then we talked about GraphSAGE, which allows us to generalize this notion of aggregation with different aggregation functions, and with the concatenation that keeps the information from the node itself separate from the information coming from the neighbors. The way you would efficiently implement these methods is to realize that you can write them very nicely as sparse matrix multiplications: everything I wrote about how you aggregate from the neighbors can be written as a matrix multiplication, where A is the adjacency matrix, D is the degree matrix, the same as in spectral clustering, and H is the matrix of messages from the previous layer. So essentially all you have to implement is this type of iteration; it is just a product of matrices, and you get the embeddings for all the nodes. The inverse degree matrix is a diagonal matrix, so this can be made very easy to implement and very easy to scale up.
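(A small sketch of the sparse-matrix view just mentioned, under the assumption that the neighbor averaging for all nodes at once can be written as D^-1 A H, with A the adjacency matrix, D the diagonal degree matrix and H the matrix of previous-layer messages; normalization details vary between papers, so treat this only as the idea.)

    import numpy as np
    from scipy.sparse import csr_matrix, diags

    # Toy 4-node graph (invented); A is the sparse adjacency matrix, D the degree matrix.
    A = csr_matrix(np.array([[0, 1, 1, 0],
                             [1, 0, 1, 0],
                             [1, 1, 0, 1],
                             [0, 0, 1, 0]], dtype=float))
    deg = np.asarray(A.sum(axis=1)).ravel()
    D_inv = diags(1.0 / deg)                     # inverse of the diagonal degree matrix

    def layer(H, W, B):
        # Average the neighbor messages for every node in one sparse product, combine with
        # each node's own transformed message, and apply the nonlinearity.
        return np.maximum(D_inv @ A @ H @ W.T + H @ B.T, 0.0)

    rng = np.random.default_rng(0)
    H0 = rng.normal(size=(4, 3))                 # layer-0 node features
    W1, B1 = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
    H1 = layer(H0, W1, B1)                       # (4, 5) next-layer representations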
As I said, this is now only about two years old, the GCN paper, and there is a lot of follow-up work, all from last year and this year; there are nice tutorials and overviews. We will talk in the last fifteen to twenty minutes about attention-based methods, which is the last thing I want to cover. There is also nice work about embedding entire graphs, and interesting work about embedding nodes with methods that are different from this local network expansion, because these graph neural network approaches, even if you make them infinitely deep, will not be able to capture the structure of certain graphs that have interesting symmetries. So that is super cool; there are very nice connections with spectral clustering, so there are spectral approaches to graph neural networks; there are extensions that do not embed into Euclidean space but into hyperbolic space, which allows you to model hierarchies, which is super cool; there is interesting work on pre-training and on explanations for these models, and more and more. So there is really a lot more I did not talk about. Are there any questions? Yes: how do you incorporate the weights of the edges? One way, if you have a weight on an edge, is to use that information in your aggregation function: the most trivial way would be to do a weighted average rather than a plain average. Another way to incorporate edge weights is to come up with this notion of attention. All right, great, so let me tell you about attention; great question, thank you for asking it. Good, so let's recap. For graph convolutional neural networks, for GCN, this was the formula we talked about: you take the average of the messages from your neighbors and transform it, and then you combine it with your own message, transformed as well. Now the question is: how would you learn that certain nodes may give you more information than certain other nodes? Maybe your strongest friends reveal more information about you than some weak friends or acquaintances. So the idea is that I do not want to just aggregate information uniformly across all the neighbors; I want to have some weight, let's call it alpha, that tells me how important a given edge is. Right now, from the averaging point of view, all the edges are of equal importance: the edge weight is defined explicitly, based on the structural properties of the graph, so that all the neighbors of a given node are equally important. But we do not want all the neighbors to be equally important. So the question is: can we do better than assigning equal importance to every neighbor and aggregating? Can we let this weighting factor, this weight alpha, be implicitly defined; can we learn it? The goal will be to allow, in principle, arbitrary importances for the different neighbors of each node in the graph. The way we are going to do this is that we will compute the embedding of a given node at a given layer and then apply what is called an attention strategy: we will attend to different nodes, pay more attention to some nodes in the network than others, where nodes attend over their neighborhood's messages, implicitly specifying different weights to different nodes in the neighborhood. So we need to figure out how to define this weight. Let's do it the following way: alpha_uv will be computed as a byproduct of an attention mechanism a, and we will let a compute attention coefficients e_uv across pairs of nodes based on their messages. The idea is that the attention coefficient e_uv of an edge will simply be some combination of the content of the message at node u and the content of the message at node v.
The idea is that this e_vu indicates the importance of node u's message to node v. Then, to set the edge weights, we normalize the coefficients using the softmax function, so that they are comparable across neighborhoods of different sizes: the edge weight alpha_vu is the exponent of the coefficient computed by the attention mechanism, normalized by the sum over the neighborhood. As I explained last time, if you take the exponent of something and normalize it by the sum of the exponentiated values, then the value that has the maximum will basically take all the weight and the rest will be very close to zero; that's the intuition, and that is why this is called the softmax. So now in my aggregation, what is different is that I am again aggregating over the neighbors, but I also have this weight alpha_vu in the sum, and alpha_vu comes from those attention coefficients.

Now the question we need to decide next is: what is the form of this attention mechanism, and how would we learn it? Here is the approach to choosing this attention mechanism function a. We are fairly agnostic about this function; in some sense you could use a simple single-layer neural network. The function a can have parameters, which we need to estimate, and the point is that these parameters of a are trained jointly with the aggregation neural network itself. So we learn the parameters of the attention mechanism a together with the weight matrices W and B that we talked about, in an end-to-end fashion. For example, the ICLR paper from last year by Veličković and colleagues showed that you can even do multi-head attention, where the idea is that for every edge you learn R different weights and then attend over all these R weights at the same time. Essentially what this means is that, to stabilize the learning process of the attention mechanism, the attention operations in a given layer are independently replicated R times; in some sense we learn R different attention weights for each edge, and then the outputs are aggregated or concatenated across these R different functions.

So what are some key benefits of this, and why would you want to put these learned weights on the edges? It allows us to learn, or specify, different importance values for different neighbors, and these importance values can be learned from the feature representations, the messages, that the neighbors are sending to you. It is computationally efficient, because the computation of the attention coefficients can be parallelized, and the aggregation can also be parallelized across all the nodes of the graph. It is storage efficient, because all we need to store is the information about the nodes plus the information about the edges, so this is still linear in the size of the graph, and it has a fixed number of parameters regardless of the graph size, because the attention mechanism is just one function that we apply across the whole graph. It is localized, in the sense that it attends only over the local graph neighborhood. And it is still inductive, meaning we can apply it to nodes we have never seen; we can apply it to graphs we have never seen.
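To make this concrete, here is a minimal sketch of such an attention layer in PyTorch. The class name, the dense-adjacency simplification, and the LeakyReLU/ELU nonlinearities are my own choices for illustration; this is not the exact implementation from the lecture or from the GAT paper, just the idea of computing e, softmaxing it into alpha, and using alpha in the weighted aggregation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """One attention head: compute e_vu, softmax into alpha_vu, aggregate."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared message transform
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # single-layer attention mechanism

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) 0/1 adjacency with self-loops
        z = self.W(h)                                      # transformed messages, (N, out_dim)
        N = z.size(0)
        z_v = z.unsqueeze(1).expand(N, N, -1)              # message of the target node v
        z_u = z.unsqueeze(0).expand(N, N, -1)              # message of the neighbor u
        # raw coefficient e[v, u]: importance of u's message to v
        e = F.leaky_relu(self.a(torch.cat([z_v, z_u], dim=-1))).squeeze(-1)
        e = e.masked_fill(adj == 0, float("-inf"))         # only attend over actual neighbors
        alpha = torch.softmax(e, dim=1)                    # normalized per neighborhood
        return F.elu(alpha @ z)                            # weighted aggregation instead of a plain average

# R-head attention: run R independent heads and concatenate their outputs
h = torch.randn(5, 16)                                     # 5 toy nodes with 16-dim features
adj = (torch.rand(5, 5) > 0.5).float()
adj.fill_diagonal_(1.0)                                    # self-loops so every row has a neighbor
heads = [SimpleGATLayer(16, 8) for _ in range(4)]
h_next = torch.cat([head(h, adj) for head in heads], dim=-1)   # (5, 32)
```

Masking non-edges to negative infinity before the softmax is what makes the weights distribute only over each node's actual neighborhood; the last two lines mirror the multi-head replication described above.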
To give you an example, here is a citation network where every node is a paper, edges are papers citing each other, different papers belong to different subject areas, different areas of science, and the goal is to predict the subject area of a given paper, here indicated by color. If you only use the text of the paper, your accuracy would be 55%: if I just take the text of the paper and try to predict its discipline, I get 55. If I instead use graph-based methods, like DeepWalk that we talked about in the last lecture, DeepWalk gets 67, so far better than 55. But if you look at the graph convolutional networks we talked about today, that raises the performance to 81 percent. So we went from 55 to 67 to 81, and if we now add the graph attention mechanism that I just described, that further increases the accuracy by about two points, to 83. So the way to think of this is: the text information alone gives 55, DeepWalk only looks at the graph information and gives 67, combining both of them using the graph neural network gives 81, and adding the attention further improves the performance. So this is one example of how learning attention can really help. Are there any questions?

Yes: so let me understand what you are saying. Rather than having one attention weight per edge, you are proposing to learn an attention weight per edge and per coordinate; that is what your vector idea would do. For every part of the message you would learn a separate attention weight, because instead of a dot product you would do a coordinate-by-coordinate product, so you would be scaling each dimension of the message independently, whereas right now we scale all components of the message equally. That's an interesting thought. The question becomes how unstable the learning process gets and whether you could make it stable, but you are right that your idea would be strictly more powerful. All right, anything else? Yes, go ahead: what about weighted networks? Good question. If you had a weighted network, I would consider two options. One option is to simply take the weighted average, so you use the edge weight as your attention; that's the simplest, clearest, first thing you would try. The second thing you could try is to extend your attention mechanism a to depend both on the messages and on the weight of the edge. That means you would be able to learn how to trade off between the content of the message and the weight of the edge, which I think is also a good idea. So those are the two ideas that come to mind right now.
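As a rough illustration of those two options, here is a sketch in the same style as the layer above; the names and shapes are hypothetical, not code from the lecture. The first option just normalizes the given edge weights into the alphas, while the second feeds the weight into the attention mechanism alongside the two messages:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Option 1: use the given edge weights directly as the attention,
# i.e. a weighted average over the neighborhood.
def weights_as_attention(z, w_adj):
    # z: (N, d) transformed messages; w_adj: (N, N) weighted adjacency, 0 = no edge
    alpha = w_adj / w_adj.sum(dim=1, keepdim=True).clamp(min=1e-9)
    return alpha @ z

# Option 2: let the attention mechanism see the edge weight too, so it can
# learn to trade off message content against edge weight.
class WeightAwareAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.a = nn.Linear(2 * d + 1, 1, bias=False)       # [z_v, z_u, w_vu] -> scalar

    def forward(self, z, w_adj):
        # assumes every node has at least one weighted neighbor (e.g. a self-loop)
        N = z.size(0)
        z_v = z.unsqueeze(1).expand(N, N, -1)
        z_u = z.unsqueeze(0).expand(N, N, -1)
        w = w_adj.unsqueeze(-1)                            # (N, N, 1)
        e = F.leaky_relu(self.a(torch.cat([z_v, z_u, w], dim=-1))).squeeze(-1)
        e = e.masked_fill(w_adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)
        return alpha @ z
```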
All right, good. Let's see what I can do in the last five minutes: I can show you an interesting industrial-scale application of this that has had a huge impact. This is the application I like to show. This is Pinterest. On Pinterest, people take images and save them into different collections, and the point is that the same image, maybe of some kitchen with a blue fireplace, can get saved into many different collections. Maybe some architect creates a collection called blue accents and saves images that have some blue accent in them; somebody else might be interested in renovating kitchens, so they say, okay, here are some kitchen ideas; and somebody else is interested in fireplaces, so they have some fireplaces saved into their collection. The point is that you now have a huge graph: you have these images, and they are saved to collections, and this is a bipartite graph. You have about, let's say, four or five billion of these images, another four billion of these collections, and about 200 billion edges, where the same image can be saved tens of thousands of times into different collections. What you can do is interpret this as a graph in order to do better recommendation or better understanding of the content.

The idea is the following. If you just look at the image of each of these objects, then computer vision will many times make silly mistakes, like confusing a bed rail and a garden fence, confusing brown rice and soil, or confusing the carpet on the floor with the kind of decorative carpets that go on the wall and are more like art. On these types of things computer vision makes a lot of mistakes. So the question is, can you use this bipartite graph to learn how to borrow information from nearby nodes in the graph, so that you can really resolve what is similar and what is different? A bed that might be confused for a garden fence will appear in the graph together with other beds, and you can use the image information from those other beds to really say this is a bed; a garden fence like this one will appear in the graph together with other garden fences that may not look like beds, and you can use that information to learn to distinguish these things. And if you have these types of embeddings of nodes, which are now images, you can then do various kinds of tasks.

So let me show you the idea. We have this bipartite graph, where images belong to collections, and we will take these images and embed them into the embedding space such that similarity in the embedding space corresponds to some kind of user engagement signal. The way we will create the neural network is that we will unfold this graph a couple of times, so it's clear how this node collects information from here, this one collects information from that pin, and so on and so forth; we will unroll it. And the way we will learn the embedding is to look at what users like to click together. If a user is looking at a cake and then clicks on a related cake, but does not click on a very nice comfy sweater, then the way we create the objective function is to say that the embeddings of cake one and cake two should be closer together than the embeddings of cake one and the sweater. That is essentially what we will be doing. One thing I did not make super explicit, but I hope everyone sees, is how you can unroll this bipartite graph in several steps to create a neural network, as I was showing you before.
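Here is a minimal sketch of that kind of objective, assuming we already have embeddings produced by some graph encoder; the margin value and the names (query, positive, negative) are my own illustration, not the production training code:

```python
import torch
import torch.nn.functional as F

def max_margin_loss(z_query, z_pos, z_neg, margin=0.1):
    """Push the clicked item (cake 2) to be more similar to the query (cake 1)
    than the non-clicked item (the sweater), by at least `margin`."""
    pos_sim = (z_query * z_pos).sum(dim=-1)    # dot-product similarity, as elsewhere in the lecture
    neg_sim = (z_query * z_neg).sum(dim=-1)
    return F.relu(neg_sim - pos_sim + margin).mean()

# toy usage with random stand-in embeddings
z_cake1, z_cake2, z_sweater = torch.randn(3, 1, 64).unbind(0)
print(max_margin_loss(z_cake1, z_cake2, z_sweater))
```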
Essentially, what this means is that to create the embedding for a given image, you borrow from other images that are next to it in the graph, and you learn how to best combine your own information with the information coming from your neighbors. If you do this, here is how well it works. If you just use the visual information, you get, let's say, 0.23; this is mean reciprocal rank, the exact metric doesn't matter. If you use annotations or tags only, you do worse than with the image, but not too much worse. If you combine the two by just concatenating them, you get 0.37. If you use the graph-based method that combines the visual and textual data, you get 0.59, which is almost 50% better than without it.

Let me show you two examples. Here is one where you ask: given this query, what are the nearest neighbors in the embedding space, based on visual features only versus using this approach? You can see how this porcelain doll gives you all kinds of porcelain dolls, and here you can see some mistakes, things that are similar in style but are not porcelain dolls. Or, to give you another example with cats: if this is your query, it's not entirely clear why the visual-only results don't work so well, but notice how here you get all kinds of cats; these results are not necessarily visually similar, but they are all cats with this kind of funny hat. So why did I tell you all this? The point is that using the graph information we can learn how to enrich the information at a given node with the information coming from nearby nodes, and this greatly improves the embeddings we can create. So with this, thank you very much. I'll see you next week, when we'll do a hands-on session and teach you how you can code some of these methods. Thanks a lot.
Info
Channel: Machine Learning TV
Views: 45,461
Keywords: Graph, Node Embedding, Machine Learning, Data science, Word2vec, Deepwalk, NBNE, random walk, GraphSage, Dimensionality reduction
Id: 7JELX6DiUxQ
Length: 85min 50sec (5150 seconds)
Published: Fri Nov 01 2019