Graph Representation Learning (Stanford University)

Captions
Okay, great. Hello guys, welcome to the class. Today we will do an interesting lecture about a very young field, graph representation learning. It is all about how we learn representations of nodes automatically, so that we can then use those representations for various kinds of tasks. This is very much motivated by recent developments in deep learning, and I will connect to that area as well; you will see how this setting is different, harder, and in many senses much more interesting.

So what do we want to do? Think about doing some kind of predictive task over a network. One example is what is called node classification: you have a graph, some of the nodes are labeled, say some are red and some are green, and the question is how we should assign colors to the gray nodes. Given this graph, I want to infer the missing colors of the nodes; this is called node classification because I am classifying each node, in this case into two classes, green or red. A second example of a prediction task is link prediction, which we introduced in the last lecture with Baharan: I give you a network where most of the links are observed but some are missing, and you should predict which links are missing.

Whenever we try to do these types of tasks, there is always the problem that we need to do feature engineering: we need to take the network and somehow represent the nodes with a set of features. We talked about motifs and graphlets, for example; these were ways to describe a node and the network neighborhood around it with a set of features. So we always have this pipeline: take the raw network data, create structured data where every node is described by a set of features, and then apply some learning algorithm to build a model that makes predictions for some task. The most cumbersome part is exactly how we get from the raw network data to a feature description for every node. Should we use motifs? Graphlets? Something else? Once we have the features, we can apply them to downstream tasks: clustering, outlier detection, link prediction, node classification, whatever you like.

What we would like to do is skip this step of manual feature engineering and learn those features automatically. The way to think about this is that we want task-independent feature learning in networks: take a network and map every node into some vector of numbers. We will call this vector of numbers an embedding, or a feature representation; you can view it as a latent vector describing a node in the graph. Our goal will be to figure out what the coordinates of every node should be so that something about the network is preserved.
When we say graph representation learning or graph embeddings, the idea is that I want to take every node and embed it in some d-dimensional space, where d is a parameter, think of it on the order of a hundred or a thousand. So we want to map each node of the network into a low-dimensional space, and this is very useful because once nodes live in a low-dimensional space we know how to reason about data points: we can measure distances, we can ask how they fill the embedding space, whether any clusters emerge, and so on. The main goal is that similarity in the embedding space should somehow reflect similarity in the network, whatever that will mean. This allows us to encode network information and generate a representation for each individual node. The idea is that we take our network, form the adjacency matrix, and come up with an embedding for every node; once every node is described by its embedding, its set of coordinates, we can do anomaly detection, clustering, link prediction, node classification, whatever we would like to do at the end.

To give you a concrete example, here is a demonstration of how this would look if I take a network, in this case the Zachary karate club network which we talked about before, and embed its nodes in two-dimensional space, so d = 2. Every node gets two coordinates, and I am plotting the locations of the nodes after the embedding. What you see is that nodes that are in similar parts of the network also get embedded close together: these nodes here get embedded up there, the red nodes get embedded there, the violet ones are here, and the green ones, nodes 25 and 26, are these two over here. So we took this network and were able to embed its nodes into two-dimensional space, and the spatial positions of the nodes correspond to their network positions and network relationships. Of course, we won't be embedding in two dimensions but in hundreds or thousands of dimensions.

What I want to do next is explain how this relates to the notion of embedding and feature learning coming out of deep learning and deep neural networks, and why this is an interesting problem, very different from what we can do in computer vision, speech, or text. The modern deep learning toolbox is designed for simple data types: simple sequences or grids. You can take an image, represent it as a matrix, a grid graph, and pass it through a very deep neural network to get whatever output you desire. If you think of text, text is a sequence, so you can represent it as a chain graph and reason about it at the level of a sequence; again, in some sense you have a fixed-size input, because the window through which you move over the sequence is limited. And with images you always resize them to have the same size, so again the input is very simple: a fixed-size matrix, a grid graph if you want to think of it that way.
The problem is that real graphs are much more complex; they are not these nice fixed-size grids you can feed in. Graphs have much more complex topological structure, and they don't have this spatial, grid-type locality. In computer vision you have a filter that you slide across the image; it is not clear how you would design these kinds of sliding filters over graphs, because graphs are much more complex objects. Another important difference between graphs and, say, images is that in graphs there is no fixed node ordering: I can take the adjacency matrix, permute its rows and columns, and it is still the same graph, even though the matrix looks totally different. In an image, if I permute rows and columns, if I scramble them, it is not the same image anymore; it's not a cat anymore, it is something else. In a graph I can renumber the nodes and the matrix will look very different, but topologically it is still the same graph. Graphs can also be dynamic and can have features or values on nodes and edges, which makes things even more interesting. So representation learning on graphs is very challenging, in some sense more challenging than representation learning for text or images, because those are much simpler input data types.

So let me now tell you how we are going to reason about embedding a graph. We are given a graph with a set of vertices, call it V, and an adjacency matrix A; for now let's assume the adjacency matrix is binary, and we can even assume it is symmetric, meaning the graph is undirected. For now let's also assume there are no node features or any extra information on the nodes; we are just given the wiring diagram. Our goal is to take the nodes of the graph and embed them in an embedding space where every node has a set of coordinates. We will call the embedding of a node z, and in some sense we want to learn a function, which we will call the encoder, that takes a node and determines its coordinates. Our goal will be that the relationships of the nodes in the network are somehow encoded in their spatial positions in the embedding space.

If I take the same picture and make things a bit more precise, I can say that the similarity of a pair of nodes u and v in the original network should match the similarity of their coordinates in the embedding space. Then the question is how to measure similarity. One way would be the Euclidean distance between the two points; what we will do, and what people like to do a lot, especially in high dimensions, is use what is called cosine similarity, which for us will essentially be the dot product of the two embedding vectors.
The dot product between two vectors is tied to the cosine of the angle between them: if the two vectors are orthogonal, the similarity is zero, and if one lies on top of the other, the similarity is essentially as large as the lengths of the vectors allow. So this is the special case of cosine similarity we will use: just the dot product between the coordinates of u and the coordinates of v. Our goal is to find the embedding such that the similarity in the embedding space relates to the similarity in the original network. Our task is to define what this network similarity means and then to determine the coordinates that preserve it: first define a notion of network similarity, then optimize the coordinates so that this similarity is preserved.

So here are the three steps of the algorithm. First, define an encoder, which is a mapping from nodes to embeddings. Second, define a node similarity function that says how similar or how related two nodes are in the network. Third, optimize the parameters of the encoder so that the similarity in the network is as close as possible to the similarity in the embedding.

Now, when I say "optimize the parameters of the encoder", it may sound as if the encoder is some magic function that takes a node and produces the coordinates. We will consider a very simple class of encoders: all they do is memorize the coordinates of every node. So the number of parameters of our encoder is the number of nodes times the embedding dimensionality; we directly learn the coordinates of each node independently, which means the encoder function is just a table lookup. Now that I have told you how we encode a node, by directly estimating its coordinates, the question is how to think about the similarity function in the network. The similarity function specifies the relationship between a pair of nodes in the original network: how similar are u and v in the original network, and that similarity should be encoded by the similarity of the coordinates of the two nodes in the embedding space. I want similarity(u, v) to be, hopefully, equal to the dot product of the positions of the two nodes.

So we do the simplest thing, where the encoder is just an embedding lookup: the encoding of a particular node v is an encoding matrix Z times the one-hot indicator vector of that node, a vector of all zeros except a single one that indicates which node it is. This matrix Z is really what we are trying to estimate. To give you a picture: all I care about is estimating this matrix, which has one column per node; the number of rows equals d, the number of dimensions, the number of columns equals the number of nodes, and each column contains the embedding of one node.
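To make this lookup-style encoder concrete, here is a minimal NumPy sketch. The node count, dimension, and function names are illustrative assumptions rather than anything from the lecture, and in practice Z would be initialized randomly and then optimized.

```python
import numpy as np

num_nodes, d = 6, 4                      # toy sizes: |V| = 6 nodes, d = 4 dimensions

# The encoder's only parameters: one d-dimensional column per node.
Z = np.random.randn(d, num_nodes)

def encode(v: int) -> np.ndarray:
    """Shallow encoder: ENC(v) = Z @ one_hot(v), i.e. a column lookup."""
    one_hot = np.zeros(num_nodes)
    one_hot[v] = 1.0
    return Z @ one_hot                   # identical to Z[:, v]

def similarity(u: int, v: int) -> float:
    """Similarity in the embedding space: the dot product of the two embeddings."""
    return float(encode(u) @ encode(v))
```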
Now my encoder, if I say "give me the coordinates of node 5", goes into the fifth column and returns it; if I say "give me the coordinates of node 10", we go to the tenth column and output that. So the parameters we are trying to estimate are exactly this matrix: the coordinates of every node in the network. Yes — what determines the number of dimensions? That is a user-specified parameter: the dimensionality is a hyperparameter the user decides they want to embed into, while the number of columns, the number of nodes, is determined by the network. Great question, thank you for asking.

So we are working towards the idea that each node is assigned a unique embedding vector, and now we need to figure out how to define the similarity. Today we will talk about two methods: one is called DeepWalk and the other is called node2vec. We will first discuss DeepWalk, then generalize it to node2vec, and at the end of the lecture I will talk not about embedding individual nodes but about embedding entire graphs as well. That's the plan for the lecture.

The key choice for us is how to define node similarity. It's clear how we do the embedding and how we measure similarity in the embedding space; the question is how to define similarity in the network, and there are many ways to do that. Early on, people said two nodes are similar if they are connected and dissimilar otherwise; that's one option. Another option is to define similarity as the number of shared neighbors, or to say two nodes are similar if they have similar structural roles, and so on. What has emerged as the best way to define similarity is to use random walks. So we will use the machinery of random walks, which you are all very familiar with by now, and talk about how to use random walks to generate node embeddings. Here are the links to the papers we will cover: one is from 2014 (DeepWalk) and the other from 2016 (node2vec), so this is very recent work.

How will this work? We want to choose the coordinates such that the similarity in the embedding space between nodes u and v, their dot product, approximates the probability that u and v co-occur on a random walk over the network. To unpack this a bit: we take some node u and simulate random walks starting at u, and we ask, if we start at u, how often do we visit v? The probability of visiting v when starting at u is some value, and I want this probability to be encoded by the spatial positions of u and v, by the angle between the two vectors, because this is essentially cosine similarity: the smaller the angle, the bigger the cosine, the bigger the probability. Notice I am intentionally sloppy here; I only say the similarity has to be proportional to this probability.
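Written out, the relationship we are after is roughly:

$$\mathbf{z}_u^{\top}\mathbf{z}_v \;\approx\; P\big(u \text{ and } v \text{ co-occur on a random walk over the network}\big).$$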
So basically, once I know how likely u and v are to co-occur on random walks, my goal is to figure out the coordinates of the nodes such that this probability of co-occurring on a random walk is encoded in the similarity between the two vectors, which in our case is the cosine similarity, the dot product. Yes — is this the best way? When I say best, I mean that state-of-the-art results on empirical datasets have been achieved using these methods, and there are other, computational benefits as well. Why do you want to do this? First, you get a flexible, stochastic definition of node similarity that incorporates both local and higher-order neighborhood information; it incorporates a lot of the network through the random walk mechanism. Second, you don't need to consider similarities between all pairs of nodes while training; you only need to consider pairs that co-occur on random walks, so you can simulate the walks, get your training data, and then estimate the coordinates.

Yes, question? Great question — the question is whether nodes that are very close together in the graph are always embedded close together, and whether nodes that sit in different parts of the graph could be embedded close together because they have similar structural roles. Bear with me for about 17 slides and we will address exactly that question. Any other questions? Don't be shy. All right, we will come back to your question.

So these are some of the reasons why this is a good idea. The intuition is that we want to find an embedding of the nodes in d dimensions that preserves similarity, and the idea is to learn node embeddings such that nodes that are nearby in the network are also nearby in the embedding space. The question then is: given a node u, how do we define what is close to u? We define a set N_R(u), the neighborhood of node u obtained by some strategy R. I have some network exploration strategy, I start at node u, I ask what is close to it, and whatever is close I put into N_R(u). Usually there is another hyperparameter that says how big this set should be; say I want a network neighborhood of node u of size 10, and I need to go and identify it.

Here is how you can do that: run fixed-length, short random walks and simply record the nodes the walk visits. Imagine I pick the walk length to be seven; I start at node u, simulate a random walk of length seven, record the node IDs I visited, and put them into N_R(u). I can create multiple instances of this set, because I can simulate multiple random walks starting at u; if I sample, say, 200 walks per node, I create 200 of these sets. So now I have defined a notion of a neighborhood for node u: a multiset of neighbors determined by random walks of a fixed number of steps, where the number of steps is a hyperparameter, think five, seven, ten, something like that.
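As a rough sketch of this neighborhood sampling, here is one way it could look in Python; the adjacency-list format, function names, and default hyperparameters are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict

def random_walk(adj, start, walk_length):
    """Simulate one unbiased fixed-length random walk and return the nodes it visits.
    `adj` maps each node to a list of its neighbors."""
    walk = [start]
    for _ in range(walk_length):
        neighbors = adj[walk[-1]]
        if not neighbors:                       # isolated node: nowhere to go
            break
        walk.append(random.choice(neighbors))
    return walk

def build_neighborhoods(adj, walk_length=7, walks_per_node=200):
    """Collect N_R(u) for every node u: the multiset of nodes visited on
    fixed-length random walks starting from u."""
    neighborhoods = defaultdict(list)
    for u in adj:
        for _ in range(walks_per_node):
            neighborhoods[u].extend(random_walk(adj, u, walk_length)[1:])
    return neighborhoods
```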
Now that I have this neighborhood set, I want to define an optimization problem that I will use to figure out what the coordinates should be. I will say: let me optimize the coordinates of the nodes, the parameters Z, such that given node u I can predict who its neighbors are. I am at node u and I want to predict who my neighbors are, based on their coordinates and on my coordinates. Here is the optimization problem: I want to maximize a function L that goes over all starting nodes u, then over all neighbors v of u, and sums the log probability of visiting node v given that I started at u; I use the coordinates of u and the coordinates of v to estimate this probability. The question then becomes how to model this probability.

Yes? It doesn't really matter, I can call it a maximization or a minimization problem. And why do I have a log? Because it is a product of probabilities: if I take the log, the product becomes a sum, and as I'm sure they have taught you in your machine learning courses, you should never work with raw products of probabilities, because multiplying them destroys you numerically; you want to work with logs, and then the world is nice to you. Yes — and to be even more precise, there should be another sum here indexing the individual random walks I start at node u: for each random walk starting at u I have a separate set, not a union of all the random walks, so in some sense I am missing another index i that sums over the i random walks started at u, each giving a different set. You can think of it that way. Great question.
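Written out, the objective just described is:

$$\max_{Z}\; \mathcal{L} \;=\; \sum_{u \in V}\;\sum_{v \in N_R(u)} \log P\big(v \mid \mathbf{z}_u\big),$$

i.e., we learn the embeddings so that, for each starting node u, the nodes observed in its random-walk neighborhood N_R(u) are predicted with high probability.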
So now I need to keep unpacking what this really means. My optimization problem is: given that I started at node u, I want to predict that neighbor v is in its neighborhood. In other words, I am trying to maximize the likelihood of the random walk: u and v co-occur on a random walk, v gets visited by a walk starting from u. The question is how to parameterize this probability, and here is how I will do it, with a function called the softmax. I take the coordinates of node u and the coordinates of node v, compute the dot product, exponentiate it, and then divide, or normalize, by the sum of exponentiated dot products of node u with every other node in the network, so that the result is between 0 and 1. The reason this is called the softmax is that the normalizing sum in the denominator is approximately equal to the maximum over all the nodes, so the probability will be close to 1 when node v is the most similar node to u. That's why we model it this way.

Now it is clear how I want to set the coordinates: so that this giant sum is as large as possible. Here is everything written out: a sum over all nodes u, a sum over all nodes v seen on random walks starting from u, and inside, the predicted probability of u and v co-occurring on a random walk, using the coordinates of u and the coordinates of v. This softmax is one particular choice of how to model the co-occurrence probability; one could choose other ways, but we choose the softmax for the reasons I mentioned. Optimizing the random-walk embeddings therefore means finding the embeddings z_u that maximize this function L.

What are the problems with this formulation, and what do we have to resolve? The problem is that this equation is not tractable: it has quadratic complexity, because there are two nested summations over all the nodes. There is the outer sum that goes over all the nodes, and then for each of those nodes the normalization factor in the softmax again goes over all the nodes. So to evaluate the function L I need to go over all pairs of nodes; the complexity is quadratic in the size of the network, which is prohibitively expensive for any realistic network size. So we will take this normalization term, which takes linear time to evaluate, and approximate it so that we can evaluate it in roughly constant time. The normalization term of the softmax is the problem, and we will approximate it using a technique called negative sampling.

Here is what we do. We have the softmax and its log; if I take the log of the fraction, I get the log of the numerator minus the log of the denominator. Rather than working with the exponent, we replace it with the sigmoid function, so this becomes the dot product passed through a sigmoid, with the log in front as before. And rather than summing over all the nodes in the normalization, I only sum over k nodes that I pick at random. That is where the approximation happens: I am not summing over all the nodes of the network, only over k randomly chosen nodes, which are called negative samples. And since the TAs were asking me this: yes, the negative samples can include node u or node v itself; it doesn't really matter. The point is that you approximate the normalization by summing over a fixed number of randomly selected nodes rather than over everything, and this becomes much more scalable.
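In formulas, the softmax parameterization and the negative-sampling approximation as described in the lecture are:

$$P(v \mid \mathbf{z}_u) \;=\; \frac{\exp\!\big(\mathbf{z}_u^{\top}\mathbf{z}_v\big)}{\sum_{n \in V}\exp\!\big(\mathbf{z}_u^{\top}\mathbf{z}_n\big)}, \qquad
\log\frac{\exp\!\big(\mathbf{z}_u^{\top}\mathbf{z}_v\big)}{\sum_{n \in V}\exp\!\big(\mathbf{z}_u^{\top}\mathbf{z}_n\big)} \;\approx\; \log\sigma\!\big(\mathbf{z}_u^{\top}\mathbf{z}_v\big) \;-\; \sum_{i=1}^{k}\log\sigma\!\big(\mathbf{z}_u^{\top}\mathbf{z}_{n_i}\big),$$

where σ is the sigmoid and the n_i are the k negative samples, drawn at random (in practice with probability proportional to node degree, as discussed next).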
Why do I want this replacement with the sigmoid? What is nice about the sigmoid is that it takes any real value and squishes it between 0 and 1, crossing 0.5 at zero. So rather than taking the exponent and worrying about normalization, by passing the dot product through a sigmoid I already get something that looks like a probability, something between 0 and 1. You can think of it as a different way of measuring similarity, and then I have the similarity between u and v minus the normalization factor, approximated using the set of negative samples, the random nodes, rather than summing over all of them. That is what the negative samples are. One more thing that turns out to work better empirically: when you draw these negative samples, you don't sample them uniformly at random, but with probability proportional to the degree of the node, so that higher-degree nodes get sampled more often. And k is another hyperparameter I get to choose, the number of negative samples, and there is a balance to strike: a higher k gives more robust estimates because the normalization term is approximated better, but a higher k also corresponds to a higher prior on negative events and is computationally more expensive, so you have to balance the two by choosing the right value of k.

Question: why is the summation outside the log? Ah, good question. The way to think about it is that I want to approximate this part, and I do so by sampling. In general this is of course not equivalent, because you cannot just pull the summation out of the log; the reason it is fine here is that all these terms are between 0 and 1, and it is an approximation that works well in practice. Why are we not scaling here? You could scale it, but because it is the same constant for everyone, it doesn't matter: you would be multiplying everything by the same constant, so it won't change the optimization. Yes, we do give something up, but what we gain is that everything is nicely between 0 and 1 and it allows us to apply all these tricks. Okay, good.

So let me step back and tell you how this all works. You sample fixed-length random walks over the network, and for every node u you collect the neighborhood N_R(u), the multiset of nodes visited on random walks starting from u. Then the goal is to optimize the embeddings z_u so that this objective is maximized. How do we optimize these embeddings? With the simplest optimization method of them all, stochastic gradient descent: I can compute the gradient of this function with respect to Z, and then, given an example pair u and v, determine how to update the coordinates so that the likelihood improves; I just keep doing this in small mini-batches, updating the coordinates.
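To make that concrete, here is a minimal sketch of one stochastic update for the negative-sampling objective written above. It is a gradient-ascent step I derived from that formula, so treat the exact update rules, learning rate, and function names as illustrative assumptions rather than the lecture's reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(Z, u, v, negatives, lr=0.025):
    """One stochastic gradient-ascent step on
        log sigmoid(z_u . z_v) - sum_i log sigmoid(z_u . z_{n_i})
    for a single positive pair (u, v) observed on a random walk and a list of
    k negative samples. Z has shape (d, num_nodes); columns are node embeddings."""
    z_u = Z[:, u].copy()                 # copies, so gradients use the old values
    z_v = Z[:, v].copy()

    # Positive term: d/dx log sigmoid(x) = 1 - sigmoid(x); pull z_u and z_v together.
    g_pos = 1.0 - sigmoid(z_u @ z_v)
    grad_u = g_pos * z_v
    Z[:, v] += lr * g_pos * z_u

    # Negative terms: push z_u away from the sampled negatives.
    for n in negatives:
        z_n = Z[:, n].copy()
        g_neg = 1.0 - sigmoid(z_u @ z_n)
        grad_u -= g_neg * z_n
        Z[:, n] -= lr * g_neg * z_u

    Z[:, u] += lr * grad_u
```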
Of course, the problem is non-convex, so we will get stuck in some local minimum, but in general these methods, because the number of parameters is large enough, tend to converge to a good part of the space; we converge to some local solution, but empirically the quality of these local solutions turns out to be quite good. So that is what I wanted to say about running the random walks and then optimizing the objective function to find the coordinates.

What I have talked about so far, more generally, is how to optimize embeddings given random-walk statistics. Now the question becomes: what strategy should we use to generate these random walks? One idea is simply to run fixed-length, unbiased random walks starting from each node, which is what I presented so far. The issue is that, as the earlier question pointed out, you may want different notions of node similarity, and fixed-length unbiased random walks are not necessarily the best solution. So the question is whether we can make these random walks more interesting, to capture richer notions of node similarity. Now I am answering that earlier question; it was slide 24 and we are at 33, so ten slides, you didn't have to wait seventeen.

Here is the idea: this is the second method, called node2vec. Again we want to embed nodes so as to encode similarity, and as before we frame this as a prediction task: which nodes co-occur on random walks, which nodes are neighbors of each other. The key observation is that we need a flexible notion of node neighborhood, so that we can capture different types of node similarity, and the way to get that is through what we'll call a biased, second-order random walk that generates these network neighborhoods for us.

The intuition comes from the following idea: if I want to explore the network and I am given k steps to do it, there are two extreme exploration strategies. One is breadth-first-search-style: start at u and try to stay as close to the starting point as possible while visiting k nodes; from node u I could go to s1, s2, s3. The other extreme is depth-first-search-style: get away from node u as fast and as far as possible; with three steps to take, I would walk straight out. What is the trade-off? With BFS-style exploration I get a really good sense of how the network looks very locally, and with DFS-style exploration I get a better sense of how the network looks at longer distance scales; the two strategies reveal very different statistics about the network. In our case we generate random walks of a fixed length, say three, and a BFS-style exploration would generate the neighborhood set {s1, s2, s3}, while a DFS-style exploration would generate, say, {s4, s5, s6}. Breadth-first search gives us a very local view of the network, and depth-first search gives a very global view.
So now the question is how to interpolate between these two extreme exploration strategies; we want a soft way to trade them off. The way we do that is to define a second-order random walk with two parameters, p and q. Parameter p is what we'll call the return parameter: how likely we are to go back to the previous node, the one we just came from. The second parameter, q, is what we'll call the in-out parameter: it tells us whether we would like to move away from node u or stay close to it. That's the idea; now let me tell you how we parameterize the walk with this p and q.

Why do we call it second order? Because the walk remembers where it came from. When we talked about PageRank, for example, that was a first-order random walk: you arrive at a node and you have forgotten where you came from. Here our walk has one step of memory, so it knows where it came from, and that is why it is called second order.

Now the observation is the following. Suppose the walk started at node u and has just moved from u to some node w. From w there are, in some sense, only three directions you can go, measured by distance from u: you can go to some node, call it s2, that is at the same distance from u as w is; you can go one step farther from u; or you can go one step back, closer to u. In a single step there is nothing else you can do: step backward and you get closer, step forward and you get farther, or side-step and you maintain the distance. And because the walk knows where it came from, it can go back. So here is what we do: we put weights on the edges out of w based on where we came from, and these weights model whether we stay at the same distance from u, go farther away from u, or come back closer to u. If we are currently at w, we put a weight of 1 on edges towards nodes at the same distance from u as we currently are, a weight of 1/q on edges to nodes one step farther from u, and a weight of 1/p on edges back to nodes closer to u. These p and q are again model hyperparameters: p is the return parameter, "let's make a step back towards where we came from", and q is the walk-away parameter, "let's escape the orbit of u and move farther out". By playing with these parameters I can force the walk to stay close to where it started or to go as far away as possible: if I make q small, the walk will want to escape; if I make p small, the return edge gets a high weight and the walk will want to go back. That's essentially the idea.
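Here is a rough sketch of such a biased second-order step; it is not the authors' implementation, and the adjacency format (node → set of neighbors), function names, and defaults are assumptions. For simplicity the weights are computed relative to the node the walk just came from, which is how the one-step memory is realized here.

```python
import random

def biased_step(adj, prev, curr, p=1.0, q=1.0):
    """One step of the biased second-order walk: unnormalized weights are
    1/p for going back to `prev`, 1 for neighbors of `prev` (same distance),
    and 1/q for nodes one step farther away. `adj` maps a node to a set of neighbors."""
    neighbors = list(adj[curr])
    weights = []
    for x in neighbors:
        if x == prev:
            weights.append(1.0 / p)      # return to where we came from
        elif x in adj[prev]:
            weights.append(1.0)          # stay at the same distance
        else:
            weights.append(1.0 / q)      # move one step farther out
    return random.choices(neighbors, weights=weights, k=1)[0]

def biased_walk(adj, start, walk_length, p=1.0, q=1.0):
    """A fixed-length walk whose first step is unbiased."""
    walk = [start, random.choice(list(adj[start]))]
    while len(walk) <= walk_length:
        walk.append(biased_step(adj, walk[-2], walk[-1], p, q))
    return walk
```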
This is what I just said: if I want breadth-first-search-like exploration, I want the return weight to be high so that the walker is likely to go back, so I set a low value of p. If I want depth-first-search-like exploration, if I want to get away from the starting point, I want q to be small so that the outgoing edge has high weight and the walker moves that way. So if I am at w, each edge gets one of three unnormalized transition probabilities, depending on whether the target node is at the same distance from u as we currently are, one step farther away, or one step closer. And then I run the random walk for a fixed number of steps and see what nodes it visits.

Yes — the random walk has one step of memory: it knows where it came from and at what distance that node is. If I know where I came from, then I know what it means to be at the same distance, to take a step farther, or to take a step closer; I can compute the weights of the edges and uniquely determine which neighbors are closer, which are farther, and which are at the same distance. Exactly — in practice you have to check those pairwise distances between the current node's neighbors and the node you came from. Are p and q the same across all walks? Right, you could make this more complicated, but for now these are two hyperparameters that you set and that are equal across all nodes of the graph. Actually, it's a good question whether changing them per node would buy you anything; I don't know, maybe a good project idea.

The point is the following: if I set a low value of p, my walk keeps making a step forward and a step back, a step forward and a step back, so it explores the network very close to where it started. If I set a low value of q, the walk moves farther and farther away from where it started, because it is always more likely to step farther out than to stay at the same distance or turn back, so I get a more macroscopic view of the network. To give you an example, this is a social network of the characters in a novel. What was done here is that we set different values of p and q, learned the embedding, and then colored nodes that are close together in the embedding with the same color. Here, for example, we have a high value of p and a low value of q, so we are doing more depth-first-search-style exploration, going farther out in the network, and you notice that nodes that are close in the embedding correspond to what are almost communities in the network. But if we change the parameters, making q high so that the walk stays near where it started, then the nodes that are close in the embedding, again colored the same, correspond more to structural roles: all these blue nodes are connectors between the red nodes, the yellow nodes are low-degree peripheral nodes, and so on. So we get different types of embeddings depending on how we set up the network exploration pattern, how we configure the random walk that explores the network.
One last thing I want to show you about how this works: these methods also behave really well when the network data is incomplete. Here are two experiments: in one we start with a network and remove edges at random, so the network gets sparser and sparser; in the other we start with the network and add edges at random, so it gets denser and denser, but we are just throwing random noise at it. We then ask how good an embedding we get, measured on a node classification task where higher numbers are better. You see that you can remove, for example, 50% of the edges of the network and the predictive performance only slightly drops, or you can add half of the edges at random and again the predictive performance does not deteriorate drastically. That is very good news and shows the robustness of these methods.

I also want to say that there is a very rich literature and a very active research area around this: how to design different kinds of biased random walks, how to come up with different optimization schemes and optimize the parameters in different ways, and what kinds of network pre-processing to apply so that the notion of network similarity is most appropriate for the application. So there are many techniques that specialize further beyond what I described. Yes, I'll take your question — can you bias the walk by edge weights? Of course. Everything I talked about applies automatically to weighted graphs: you make the transition probability proportional to the edge weight. So you can easily do DeepWalk, the unbiased random walk, on a weighted graph, or you can even combine the edge weights with these p's and q's that further modify them. You can definitely do that — super question, thank you. Any other questions? All right, good.

So let me tell you how you can use these embeddings. After you have computed the embedding, you can use the coordinates in the following ways. You can simply run a clustering technique and ask which nodes are similar to one another, to discover structural roles or to do community detection. If you want to do node classification, you take the inferred vector z_i for a given node i and pass it through your favorite classifier: a support vector machine, a decision tree, a logistic regression, whatever you like. And if you want to do link prediction, you now have to reason about pairs of nodes: for a pair (i, j) I have a vector of coordinates for node i and another for node j, and I can combine them in different ways. I can concatenate the two vectors and treat that as the feature vector for the edge prediction; I can take the element-wise product, the Hadamard product, of the two vectors; I can take a simple sum or the average of the two vectors; or I can even use the L2 distance between the two vectors as a way to predict whether there will be a link or not.
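A minimal sketch of those pairwise feature constructions; the function name and the mode labels are my own, and the resulting vector would then be fed to whatever downstream classifier you prefer.

```python
import numpy as np

def edge_features(z_i: np.ndarray, z_j: np.ndarray, mode: str = "hadamard") -> np.ndarray:
    """Combine two node embeddings into a feature vector for link prediction."""
    if mode == "concat":
        return np.concatenate([z_i, z_j])
    if mode == "hadamard":                       # element-wise product
        return z_i * z_j
    if mode == "sum":
        return z_i + z_j
    if mode == "average":
        return (z_i + z_j) / 2.0
    if mode == "l2":                             # a single distance feature
        return np.array([np.linalg.norm(z_i - z_j)])
    raise ValueError(f"unknown mode: {mode}")
```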
So there are different strategies for taking a pair of nodes, taking their corresponding feature vectors, and creating a new feature vector, either by some aggregation, a distance metric, a summation, an element-wise product, or simple concatenation, and then feeding it into your favorite classifier.

Let me quickly summarize. So far we built on the basic idea that we want to embed nodes such that distances in the embedding space reflect similarities in the original network. We talked about using different notions of node similarity, and in particular we did a deep dive into random-walk approaches to defining node similarity. Now the question becomes: which method should you use for your task? A lot of work has been done comparing different methods, and the conclusion is that no method wins in all cases. For example, node2vec tends to perform really well on node classification but less well on link prediction tasks; random-walk approaches are generally computationally more efficient than competing approaches; but in general, when you do this kind of embedding, you must choose the definition of node similarity that matches your application. Depending on how you construct the network and how you want to measure similarity, the best thing is to use that notion to directly optimize your embeddings.

With this I can finish here and take a few questions. Yes — the question is: the learning is very independent of the task, but if you could find some way for the task to affect the random walk, wouldn't that give you embeddings that are better for the task at hand, so that you wouldn't need a downstream SVM or anything? It's a good question: why do we want to do things in a task-independent way? There are several reasons. Sometimes you don't know what the task is; sometimes you don't care so much about the task as about capturing the structural information of the network. It's the same as with text: when you embed words, you don't care about embedding them so that you do really well on one particular task; you care about embedding them so that you capture the semantics of the language. Here the goal is to embed things so that we capture the "semantics" of the network. And of course there are methods where you say "I really care about this task, so let's compute embeddings that are optimal for solving it"; here the idea was the more general one of using random walks to define a topological notion of similarity and learning the embeddings that way.

You had a question — what would you do if you have bipartite graphs, or graphs with different types of nodes? Great question: all the methods we talked about still apply very nicely, essentially as-is, to bipartite graphs.
What you can additionally do, if you have heterogeneous nodes or different types of edges, is further tweak how the random walk explores the network; you can specialize the random walk much more based on your application and still learn the embeddings. Yes — great question: are there pre-computed embeddings for graphs that are often used? There are embeddings that have been published, but I think the situation is different from language: there are the couple hundred English words that we all use, but there is no set of nodes that appears in every single graph, so the parallel with language only goes so far. Good question.

All right, in the last few minutes let me talk about how to embed not individual nodes but entire graphs. The idea is the following: I would like to take an entire graph and put it somewhere in the embedding space. The reason I would want to do this is that sometimes I want to reason about multiple graphs at once. Maybe I have graphs representing molecules and I want to predict whether each molecule is toxic or not, or I am given transaction graphs and I want to say whether this is fraud or not. In many cases I have access to many small graphs and I want to classify them in different ways. So the question is: how can I take an entire graph and map it to a single point in some low-dimensional space? I will give you three approaches.

Here is the simplest idea people have proposed: run a standard node embedding technique to embed all the nodes of the graph, and then, to create the embedding of the graph or of a subgraph, simply sum up the embeddings of the nodes appearing in it. This is very similar to what is done in language, where you compute embeddings of individual words and, if you want an embedding of a document, you just sum up the embeddings of the words in the document. Here we do the same: the embedding of the subgraph, or of the entire graph, is simply the sum over the embeddings of the individual nodes. This approach was used, for example, in 2016 to classify molecules based on their graph structure: compute embeddings of the individual nodes and sum them up to obtain the embedding of the graph. That is simple idea number one.

Simple idea number two, a bit more involved, is to create a virtual node that connects to all the nodes of the subgraph or graph we want to embed, and then say that the embedding of the subgraph is simply the embedding of that super-node. To embed a subgraph of three nodes, we create a virtual super-node linked to all three of them, run our favorite node embedding technique, and whatever the embedding of that virtual node is, that is the embedding of the subgraph. That is another interesting technique.
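A tiny sketch of the first, sum-based approach; Z is the node-embedding matrix from before, and the helper name is just illustrative.

```python
import numpy as np

def embed_graph_by_sum(Z: np.ndarray, node_ids) -> np.ndarray:
    """Approach 1: the embedding of a (sub)graph is the sum of the embeddings of
    its nodes. Z has shape (d, num_nodes); columns are node embeddings."""
    return Z[:, list(node_ids)].sum(axis=1)
```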
The last thing I want to cover is a paper that was published this June, and it's kind of cool, so I wanted to give you some intuition; there are many more details in the paper. The paper introduces the notion of anonymous walks, and I think the idea is neat. We talked about random walks where the states of the walk were simply the IDs of the nodes the walk visited. These authors propose a different version, which they call an anonymous walk, where the states of the walk correspond to the index of the step at which each node was first visited by the walk.

Let me give you an example and then things should be clear. Imagine the walk goes A → B → C → B → C. The anonymous walk for this random walk is 1 2 3 2 3: node A is first visited at position 1, node B at position 2, node C at position 3; then when I return to B, which was first visited at position 2, I write a 2, and when I return to C, first visited at position 3, I write a 3. A different random walk, C → D → B → D → B, has the same anonymous walk: C is the first node we visit, D the second, B the third, then we return to D (first visited at step 2, so we write 2) and to B (first visited at step 3, so we write 3). The reason this is called an anonymous walk is that we forget the IDs of the nodes but remember when in the past we first visited each of them. One more example: A → B → A → B → D becomes 1 2 1 2 3, because I am back at A (1), back at B (2), and then I visit the new node D (3).

The number of distinct anonymous walks grows exponentially with the length of the walk. Here is a plot I took from the ICML paper: the length of the anonymous walk versus the number of different anonymous walks. For example, the number of anonymous walks of length three equals five. As a little exercise, what are the five anonymous walks of length 3? One is 1-1-1: I start at a node and stay on it for all three steps. The next is 1-1-2: I stay at the same node twice, and on the third step I visit a new node. Then 1-2-1: I start somewhere, go to a new node, and come back to the starting node. Then 1-2-2: I go to a new node and then stay at that node (a self-loop or something like that). And 1-2-3 means I visit three different nodes during the three steps of my walk.

So what do the authors propose for computing a feature representation of the graph? One option is to enumerate all possible walks of a given length l in the graph, where l is my hyperparameter, anonymize them, count how often each anonymous walk occurs, and build the empirical probability distribution: what fraction of the time each possible anonymous walk occurs.
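A small sketch of the anonymization step; the helper is my own, but it is checked against the lecture's examples.

```python
def anonymize(walk):
    """Map a walk over node IDs to its anonymous walk: each node is replaced by the
    (1-based) index of the step at which it was first visited."""
    first_seen = {}
    anonymous = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1
        anonymous.append(first_seen[node])
    return anonymous

# The examples from the lecture:
assert anonymize(["A", "B", "C", "B", "C"]) == [1, 2, 3, 2, 3]
assert anonymize(["C", "D", "B", "D", "B"]) == [1, 2, 3, 2, 3]
assert anonymize(["A", "B", "A", "B", "D"]) == [1, 2, 1, 2, 3]
```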
So how do the authors propose to compute a feature representation of the graph? The first option is to enumerate all possible walks of a given length L in the graph, where L is a hyperparameter, anonymize them, and simply count how often each anonymous walk occurs, creating an empirical probability distribution that says what fraction of the time each possible anonymous walk occurs. For example, if I set L = 3, I could take my graph, enumerate all possible walks of length 3, anonymize them, and ask how often each of the five different anonymous walks occurs. I would capture this with a five-dimensional vector and think of that vector as an embedding of my graph, where the i-th coordinate tells me the probability of anonymous walk a_i occurring in my graph. That is the first idea for how to represent the graph. The challenge is that enumerating all possible anonymous walks in the graph and counting them is very, very expensive.

The second way to make this better, idea two, is to do sampling rather than complete counting. The idea is the following: generate a set of m independent random walks, anonymize them, and calculate the empirical distribution of the anonymous walks. So I say I want to generate, say, 100,000 random walks; I pick a starting node at random, generate a random walk of length L, anonymize it, and increment a counter in my vector. I keep doing that 100,000 times, I get a vector of counts, and I normalize it to sum to one so that I get a probability distribution. That vector is my representation of the graph.

What is beautiful is that you can actually compute how many walks you need to estimate the distribution over anonymous walks to a certain accuracy. You get a classical epsilon-delta argument: I want the distribution I estimate through sampling to have error of no more than epsilon with probability at least 1 - delta; how many walks do I need? The paper gives a formula for how many walks to sample to estimate the distribution accurately, where eta is the number of possible anonymous walks of length L — this is the bar chart with the green bars I showed before. To give you an example, there are 877 anonymous walks of length seven, so almost 900 of them, and if I set my epsilon to 0.1 and my delta to 0.01, I would need to generate around 122,000 random walks to estimate the probability distribution over anonymous walks of the graph. That is the second way to do it.
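Here is a minimal sketch of the sampling idea and of the sample-size bound, assuming the graph is given as a plain adjacency dictionary (node mapped to a non-empty list of neighbours); the exact bound is the one stated in the paper for this epsilon-delta guarantee, reconstructed here from the numbers quoted in the lecture (eta = 877, epsilon = 0.1, delta = 0.01 giving roughly 122,000 walks), so treat its precise form as coming from the paper rather than from this transcript.

import math
import random
from collections import Counter

def sample_anonymous_walk_distribution(graph, walk_len, num_samples, seed=0):
    # Estimate the distribution over anonymous walks of a fixed length by
    # sampling random walks, anonymizing them (anonymize() from the sketch
    # above), and normalizing the counts.  `graph` is assumed to be an
    # adjacency dict: node -> non-empty list of neighbours.
    rng = random.Random(seed)
    nodes = list(graph)
    counts = Counter()
    for _ in range(num_samples):
        node = rng.choice(nodes)                # random starting node
        walk = [node]
        for _ in range(walk_len - 1):
            node = rng.choice(graph[node])      # uniform random neighbour
            walk.append(node)
        counts[anonymize(walk)] += 1
    return {w: c / num_samples for w, c in counts.items()}

def required_samples(eta, eps, delta):
    # Number of sampled walks needed so that the estimated distribution has
    # error at most eps with probability at least 1 - delta, where eta is the
    # number of distinct anonymous walks of the chosen length (bound as stated
    # in the paper).
    return math.ceil((2.0 / eps ** 2) * (math.log(2 ** eta - 2) - math.log(delta)))

# Length-7 walks: eta = 877, eps = 0.1, delta = 0.01 gives roughly 122,500
# samples, matching the "around 122,000" figure mentioned in the lecture.
print(required_samples(eta=877, eps=0.1, delta=0.01))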
The last way to use this notion of anonymous walks to learn an embedding for the graph is to learn an embedding z_a for every anonymous walk a and then aggregate those embeddings, sum them up or average them, to come up with an embedding for the entire graph. The question, then, is how to embed a walk. The idea is that we want to embed walks, denoted here by w, in such a way that we can predict which anonymous walks occur in the neighborhood of a given node: I say that at node u I have seen these different anonymous walks, and now I want to predict what the next anonymous walk I am going to see is.

In the same way as we did this for node embeddings, we can now do it for walk embeddings. Now z is an embedding of an anonymous walk, and there is only a finite number of anonymous walks. I will simulate these walks starting at some arbitrary node u: I simulate capital T different random walks starting at node u, each of length L. As before I have a set N, but where before N was a set of neighbors, now N is a set of random walks; each of them is a sequence of L node IDs, and with each random walk there is an associated anonymous walk. Multiple different random walks can map to the same anonymous walk. My goal is to learn the embeddings of the anonymous walks such that I can predict which walks co-occur.

Here is what I want to maximize. I have one more parameter, capital Delta, which gives me my context window size, and I slide a window of size Delta over the capital T random walks I created: given the walks I have seen in the past, I want to predict the next one. How do I write out this prediction problem? The probability of the walk I see at time t, given the previous Delta walks, is simply e raised to y of that walk, divided by a normalizing constant summed over all possible anonymous walks in my graph. And how does this y function look? It says: take the context of walks I have seen, anonymize them, take their embeddings, average them, and apply a learned transformation matrix to come up with a score for the walk I expect to happen next. This way I can again use stochastic gradient descent to figure out the coordinates z of the walks. What is interesting is that there is one z per anonymous walk, and based on these I am predicting which anonymous walk I expect to see next at a given node. So it is very similar to what we did before, in that we generate random walks and anonymize them; the difference is that the set N is now a set of random walks rather than a set of nodes, and with a very similar idea to before — given the previous random walks, predict what the next random walk is going to be — I learn embeddings of the corresponding anonymous walks such that walks that tend to co-occur are predicted from each other.
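To make the prediction objective concrete, here is a minimal PyTorch sketch, not the authors' implementation: one learnable embedding per anonymous walk, a linear map over the averaged context embeddings, and a softmax (applied via cross-entropy) over all anonymous walks. The training loop, the variable names such as sequences and delta, and the per-example SGD updates are illustrative assumptions.

import torch
import torch.nn as nn

class AnonymousWalkModel(nn.Module):
    # One learnable embedding z per anonymous walk, plus a linear map that
    # turns the averaged embeddings of the Delta context walks into scores
    # over all possible anonymous walks (the softmax is applied by the loss).
    def __init__(self, num_anon_walks, dim):
        super().__init__()
        self.z = nn.Embedding(num_anon_walks, dim)
        self.proj = nn.Linear(dim, num_anon_walks)

    def forward(self, context_ids):
        # context_ids: (batch, Delta) integer indices of anonymized context walks
        ctx = self.z(context_ids).mean(dim=1)   # average the context embeddings
        return self.proj(ctx)                   # unnormalized scores for the next walk

def train_step(model, optimizer, sequences, delta):
    # `sequences` holds, for each starting node u, the T anonymous-walk indices
    # obtained by sampling T random walks of length L and anonymizing them.
    loss_fn = nn.CrossEntropyLoss()
    total = 0.0
    for seq in sequences:
        for t in range(delta, len(seq)):
            context = torch.tensor([seq[t - delta:t]])   # previous Delta walks
            target = torch.tensor([seq[t]])              # walk to predict
            loss = loss_fn(model(context), target)
            optimizer.zero_grad()
            loss.backward()                              # stochastic gradient step
            optimizer.step()
            total += loss.item()
    return total

# Example usage with illustrative numbers (877 anonymous walks of length 7):
# model = AnonymousWalkModel(num_anon_walks=877, dim=128)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# train_step(model, optimizer, sequences, delta=4)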
So let me just summarize. There are generally three ideas I covered for how you can embed entire graphs. You can embed the nodes and then simply sum their coordinates to get an embedding of the graph. You can create a super-node that spans the graph and then embed that super-node; that is idea number two. And idea number three was this idea of anonymous walks. What is nice about anonymous walks is that they take the super-large space of random walks and compress it down to the space of anonymous walks, which gives you a nice dimensionality reduction: you can think of describing the graph simply by the frequencies of the different types of anonymous walks that occur in it. So you can either represent the graph with the exact distribution over anonymous walks, which is computationally intractable; idea two is to sample walks, anonymize them, and measure the distribution over the sampled walks; and idea number three is to embed the anonymous walks themselves and then, to embed the graph, aggregate the embeddings of the anonymous walks.

That is what I wanted to cover today. I know we covered a lot, both embeddings of nodes and embeddings of entire graphs. The slides include pointers to the papers we covered, so there is much more detail in the papers. One thing we did not cover is what optimization techniques you use to actually learn these coordinates; as I hinted, the optimization techniques are very simple but work really well in practice, and they are all based on stochastic gradient descent. We will learn about stochastic gradient descent further down in the class, but I wanted to cover this material a bit earlier because a lot of projects are on these types of topics and many of you are excited about them, so I wanted to give you some way to think about these problems. Thank you very much, and I'll see you.
Info
Channel: Machine Learning TV
Views: 70,005
Keywords: Graph Representation Learning, Graphs, Machine Learning, Embedding, Node embedding, Link prediction
Id: YrhBZUtgG4E
Length: 76min 53sec (4613 seconds)
Published: Thu Oct 03 2019