Lecture 13. Graph Embeddings

Video Statistics and Information

Captions
All right, there you go. All right guys, so today we're going to continue with machine learning on graphs, and the topic is graph embeddings. It will be a short lecture today, and you will learn more of the material at the seminar.

First of all, what do we mean by a graph embedding? It is a very simple idea: to every node of the graph we want to assign a coordinate in some space, and that's about it. In some sense, when you try to draw a graph and place its nodes on the two-dimensional plane, you are already embedding it into two-dimensional space, and depending on which embedding method you choose, different nodes will get different positions on the plane. Sometimes you will see the graph nicely; sometimes you will not see it well, because the embedding might, for example, collapse nodes on top of each other. But that is literally the idea: assign coordinates to every node. If you assign only one coordinate to every node, you embed the graph into one-dimensional space, onto a line; two coordinates per node embed it into two-dimensional space, onto the plane; three coordinates embed it into three-dimensional space. But you can just as well assign 10 coordinates, or 50, or as many coordinates as you want to every node, and that will be an embedding of the node into that space.

There are different names for this. Sometimes people call it a projection, sometimes an embedding, and another word for assigning coordinates is encoding — the modern term, especially with neural networks, is that you encode nodes as low-dimensional vectors. What is important is that this projection is into a continuous space: the coordinates are real-valued.

Now, the question is how to project. If the objective is to draw the graph in the best possible way, you might project so as to minimize, for example, the number of edge intersections, or to spread the nodes evenly. There are different objective functions you can come up with. The one we are going to talk about today is related to the graph connectivity pattern: we want to project in such a way that nodes connected by edges land next to each other, so that we preserve geometric relations. The idea is not only to preserve local relations — the nearest neighbors — but possibly also second-order or even higher-order relations. Yet another term for this is representation learning, because we are learning a new representation for a node: the node lives in the graph, but we represent it by its coordinates in the embedding space. We select a particular type of projection and then apply it to all the nodes, so each node is projected independently of the others. That is what graph embedding is.

Here is an example — again, assigning coordinates to the nodes. This is our favorite karate club graph, which we have looked at before and which I have shown you many times.
Previously, the layout was usually a spring-based one; the layout you see here is produced by DeepWalk, again projecting into 2D. If you look at it, it does the grouping of the clusters quite nicely: it preserves clusters, so it preserves local information.

The simplest projection you can think of — and one you have probably seen before — is the singular value decomposition (SVD), sometimes appearing as the principal component analysis (PCA) projection. The idea is that we can take a matrix and factor it; that is why this is also called a matrix-factorization type of approach, and we briefly talked about it last time. Imagine we have an adjacency matrix — it is shown here as m by n, but for an adjacency matrix it is square — and we can write it as a product of three matrices, A = U S Vᵀ. This factorization is called the singular value decomposition. The way to understand how it works is the following: take one column vector of U; S is a diagonal matrix — I am just reminding you that in the SVD the middle matrix is diagonal — and take a row vector of Vᵀ. When you multiply a column vector by a row vector you get a matrix, and here you also multiply by the scaling factor on the diagonal of S. Call that rank-one term A_i. The entire matrix A can then be written as the sum of such products.

Why are we talking about this? Because you can interpret it as an embedding into this space: you can think of these singular vectors as coordinate axes, and of the numbers attached to them as coordinates in that vector space. What is interesting about the SVD is that if you use all the terms, you reconstruct the matrix exactly; but if you keep only a few of them — from the first column up to, say, the k-th — you build an approximation A_k, and it is the best rank-k approximation of A in terms of the norm, the best approximation overall. So you can think of this as embedding the matrix into a low-dimensional, k-dimensional space, where the singular vectors are the axes and these values are the coordinates for the embedding. Remembering that if A is symmetric it can represent a graph, this instantly assigns coordinates to the nodes of the graph. This has been known for as many years as PCA has existed — probably 50 years or more.
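As an illustration of this matrix-factorization view (not code from the lecture): a minimal sketch in Python, assuming numpy and networkx are available, that embeds the karate club graph into k dimensions via a truncated SVD of its adjacency matrix. The choice of k = 2 and of scaling the left singular vectors by the singular values are my choices for the sketch.

```python
import networkx as nx
import numpy as np

# Karate club graph and its (symmetric) adjacency matrix
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)

# Full SVD: A = U S V^T
U, S, Vt = np.linalg.svd(A)

# Keep the first k singular triplets: k-dimensional coordinates per node
k = 2
coords = U[:, :k] * S[:k]                      # shape (n_nodes, k)

# A_k = U_k S_k V_k^T is the best rank-k approximation of A in the norm
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(coords.shape, np.linalg.norm(A - A_k))   # (34, 2) and the approximation error
```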
The next method of projecting is much more powerful than the plain SVD, but it is very closely connected to it, and in fact we already learned it last module when we talked about spectral clustering. Let me briefly remind you the idea. In spectral graph partitioning we said: suppose we have a bunch of nodes and we want to assign each node a label, plus one or minus one, saying which class it belongs to. Once we do that, we can calculate the value of the cut — how many edges connect the plus and the minus class — and we wrote it in terms of s_i − s_j, where i and j are the nodes connected by an edge. We then worked with the sum, arrived at a formula, and converted it into an optimization problem by replacing the labels s with real values x that would help us assign nodes to clusters.

What is interesting is this: if you look at what we have done and think for a second, we replaced the labels s_i with values x_i that belong to R, real values. So I could just write x_i − x_j here — but then what we have really done is give those nodes coordinates. And we are not just giving them coordinates: we are summing the squared differences (x_i − x_j)² over all connected pairs of nodes. Because this is a minimization problem and we want to find those x, what this really says is: try to put the connected nodes next to each other, because if you do that, the sum will be small. We also discussed that, of course, the trivial minimum is to put all the nodes into one point, where the sum is zero; to prevent that, we say that this solution is not acceptable and we look for any solution other than collapsing everything to the same value.

So in fact, when we solved for the graph Laplacian, we could have taken the values in the eigenvectors and used them as coordinates. We used just one dimension — the second eigenvector — but we could use the second and third eigenvectors, or the second, third, and fourth. Those eigenvectors are orthogonal, so we can embed the graph into two dimensions, three dimensions, or any other dimensional space, where that space is defined by the eigenvectors of the graph Laplacian matrix.

To sum up this approach: x_i and x_j are the coordinates we assign to the graph nodes, and we add up the squared differences with weights given by the adjacency matrix A_ij, so we sum only over those pairs of nodes that have an edge between them. If two nodes are connected, there is a contribution to the sum; if they are not, there is none. From here it is clear that, if this is an energy for the embedding, then placing connected nodes next to each other — embedding them next to each other in the space, where the distance is just the Euclidean distance between x_i and x_j — is a good idea, because it minimizes the energy. This difference can be written in terms of the Laplacian matrix, and we look for the minimum of this U(x). In this case it gives a one-dimensional embedding: it assigns each node a coordinate along a single axis.
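For reference, here is the step the lecture states in words written out as a formula, with A the adjacency matrix, D the diagonal degree matrix, and L = D − A the Laplacian; the particular constraints shown (unit norm, orthogonality to the constant vector) are the standard way of ruling out the collapsed solution the lecture mentions.

```latex
U(x) \;=\; \sum_{i,j} A_{ij}\,(x_i - x_j)^2
     \;=\; 2\sum_i d_i x_i^2 \;-\; 2\sum_{i,j} A_{ij}\,x_i x_j
     \;=\; 2\,x^{\top}(D - A)\,x \;=\; 2\,x^{\top} L\, x,
\qquad
\min_{x}\; x^{\top} L x
\quad \text{s.t.} \quad x^{\top} x = 1,\; x \perp \mathbf{1}.
```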
If you want to go to higher dimensions, no problem: instead of the scalars x_i and x_j you take rows of a coordinate matrix X, with components x_{ik}, and instead of the simple sum the objective becomes a trace. We also have to impose some constraints, again to prevent the collapse of all the nodes onto the same point. This is the spectral embedding, also called the graph Laplacian embedding — it has very many names — and it actually works not badly. The idea, again, is that every node gets its coordinates, and here we explicitly say that nodes that are connected — connected because of A_ij — should be projected next to each other, should live next to each other in this space. This is an optimization, so we cannot guarantee that this will happen, but the objective function takes a smaller value if we manage to find such locations. Here is an example of that embedding, again for the karate club graph, shown here with two clusters — and it also looks not bad at all.

Those two methods are, if you wish, the older methods that have been used by the graph community for quite a while, and in particular for drawing graphs, because drawing large graphs is not easy and these methods were quite popular to try. Now we want to generalize this and perhaps expand the kind of information that we preserve. One thing I also want to mention is that in this spectral graph projection approach we only take information about neighbors — it is nearest-neighbor information that is being used to project.
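A minimal sketch of this spectral (Laplacian) embedding, again assuming numpy and networkx rather than reproducing the lecture's slides: take the eigenvectors of the Laplacian with the smallest nonzero eigenvalues and use their entries as node coordinates.

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
D = np.diag(A.sum(axis=1))
L = D - A                                # unnormalized graph Laplacian

# Eigen-decomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(L)

# Skip the trivial constant eigenvector (eigenvalue ~ 0) and use the
# next k eigenvectors as k-dimensional coordinates for the nodes
k = 2
coords = eigvecs[:, 1:1 + k]             # shape (n_nodes, k)
print(coords[:5])                        # coordinates of the first five nodes
```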
So let us see what the overall embedding setup can look like. We have a graph, and we map it onto some embedding domain. You can call this mapping a projection, or you can call it an encoder, and you get those vectors. The overall idea is that when we do this mapping — this encoding — we want to make sure that certain properties are preserved. For example, we may require that if two nodes are next to each other in the original graph, then the points they map to are next to each other in the embedding space; that is sort of the minimum amount of information. Or we can say: here is a node, here are its neighbors, and here are its next-nearest neighbors — please preserve that, make sure the node, its neighbors, and its next-nearest neighbors are mapped to locations close to each other. Or we may think about the structural role — the centrality of a node — and require that nodes with the same centrality are mapped to similar locations; or the structural role in the sense of structural similarity, which we discussed before, so that structurally similar nodes are mapped near each other; or we may want to preserve community structure, meaning that nodes in the same community of the graph end up in the same community in the embedding.

How do we check this? The idea is that we can come up with some similarity metric on the graph, then do the mapping, then apply the same kind of similarity in the embedding, and make sure that the two — whether computed on the graph itself or on the projection — give approximately the same result. Of course, you might want to learn to completely reconstruct the graph from the embedding, but that is probably not going to work, because you would need to encode too much information, and an embedding is quite often a compression of the original data.

So the encoder maps nodes to embeddings, and we select one particular similarity function. The similarity in the original network can be whatever we choose — for example, simply being nearest neighbors, or being in the same neighborhood. In the embedding space we use the dot product as the similarity. Why? Because in a multi-dimensional space, when we compare vectors, it is better to look at the angles between them: two nodes are similar if the angle between their vectors is small, which means the dot product is large, and dissimilar if the angle is large and the dot product is small. So we project the nodes, measure this similarity between them, and we want that if the original nodes are similar in some way on the graph, then their projections are similar in the embedding space. Does this make sense? Okay, so that is the concept.
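To make the encoder/similarity pairing concrete (my notation, not the lecture's): a shallow encoder is just a lookup of a learned vector per node, and the similarity in the embedding space is the dot product. The node count and dimension below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

n_nodes, dim = 34, 16                            # e.g. karate club size, 16-d embedding
Z = rng.normal(scale=0.1, size=(n_nodes, dim))   # one learnable vector z_u per node

def encode(u):
    """Shallow encoder: node id -> embedding vector (a simple lookup)."""
    return Z[u]

def similarity(u, v):
    """Similarity in embedding space: the dot product z_u . z_v."""
    return encode(u) @ encode(v)

print(similarity(0, 1))
```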
Now, what do we do with this concept? Here comes the method — in fact, a class of methods — called random walk embeddings. There are three popular methods, and I am going to talk about one of them today, the original one, DeepWalk. The idea is the following. We want to sample some of the connectivity of the graph. It is very hard to use all of the connectivity information for every node: if you take a node and compute, for example, shortest paths from it to all other nodes, you run into a lot of computation, and in fact it may be overkill to compute, for every node of the graph, the shortest path to every other node. What we want instead is to sample some of the neighborhood. We could of course use breadth-first search or depth-first search, but the idea here is to add a little bit of stochasticity, because with breadth-first or depth-first search on a graph like a power-law graph the neighborhood grows extremely fast, and in a few steps, as you remember, you would be covering a large part of the graph. To avoid that, we instead generate random walks — and we have done random walks many times before. You pick one node, start from it, run a random walk; then you pick another node and run a random walk; and so on.

What we want to capture is this: if we take two nodes u and v, do they co-occur on the same walk? Here u is the starting node and v is a node visited on the random walk. Of course, we do not want very long walks; we can say the random walk has 10 steps, or however many. Then we actually run those walks and see how often they end up at a particular node. Obviously, the random walks will visit the nearest neighbors of the selected node u more frequently than nodes further away, but they will still visit those too. So just by running those walks — literally running them — we can estimate the probability of reaching node v given that the random walk starts at node u, and we can do this starting walks from every node.

The idea is then the following. We map these two nodes onto points in some multi-dimensional space — say a two-dimensional space — and these points are their projections. If the probability is high, because the nodes are close to each other in the graph, we want the angle between the two embedding vectors to be small; and if we take this node and some other node for which the probability is low, we want the two embedded vectors to be far away from each other. In other words, we want the dot product between the two embedding vectors to be proportional to the probability of starting from node u and ending at node v. That is really the idea.

So how do we do this? We are given a graph, and our goal is to find this embedding, this projection — or, if you wish, a function that takes a node and maps it onto coordinates. As always, instead of working with the probability directly, we use a log-likelihood objective, the logarithm of the probability. Since the random walks started from the different nodes are independent, we can multiply the probabilities, or equivalently sum their logarithms, so we sum over all the nodes of the graph — we start random walks from every node. Here N_R(u) is the neighborhood of node u covered by the random walks, and z_u is the embedding of that node into the multi-dimensional space we want to find. What we want to do is maximize this log-likelihood over all the nodes: this value is the model's prediction of the probability of visiting a node in the neighborhood of u, given that you start from u and given the embedding of that node. So, given the embedding, predict the probability of visiting the nodes that are actually visited around that node in the graph. That is the problem setup.
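A minimal sketch of the sampling step just described, assuming networkx; the walk length, the number of walks per node, and the name N_R are my choices for illustration, not values from the lecture.

```python
import random
from collections import defaultdict
import networkx as nx

def random_walk(G, start, length=10):
    """One unbiased random walk of fixed length starting at `start`."""
    walk = [start]
    for _ in range(length):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:                      # dead end (isolated node)
            break
        walk.append(random.choice(neighbors))
    return walk

def sample_neighborhoods(G, walks_per_node=10, length=10):
    """N_R(u): the multiset of nodes visited by walks started at u."""
    N_R = defaultdict(list)
    for u in G.nodes():
        for _ in range(walks_per_node):
            N_R[u].extend(random_walk(G, u, length)[1:])
    return N_R

G = nx.karate_club_graph()
N_R = sample_neighborhoods(G)
print(len(N_R[0]))         # number of (start, context) pairs generated for node 0
```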
You can reformulate this: instead of the probability of the whole neighborhood — which means going over all the walks in that neighborhood — you write it as a sum over the nodes in the neighborhood. The question now is how to define this probability, and here comes a somewhat arbitrary idea that nevertheless makes things work: parameterize the probability using what is called a softmax. The idea is the following. If we just take the dot product, we want to amplify the distinction between vectors that are next to each other and vectors that are further away — we want the first value to be much, much larger than the second — but plain cosine similarity does not necessarily give us that, so we want to make it much sharper. To do so, instead of looking at the value x itself, you look at its exponent, e to the power of x, which amplifies the effect. Of course, that by itself cannot be thought of as a probability, so we take the exponent and normalize it over all possible nodes. Then we can plug it back into the objective.

This sounds reasonable, but there is a problem. If I put these values in as just described — notice that we have a summation here, and we also have a summation where we do the normalization — then when we put the whole thing together we get: a sum over all the nodes of the graph; then a sum over the neighborhood of every node, produced by the random walks which may be a few steps long; and then, right inside, once again a sum over the entire graph coming from the normalization. That instantly makes the computation quadratic, and for large graphs that is not going to work.

The idea the authors of the DeepWalk paper came up with is what they called negative sampling: instead of summing over all possible nodes in the normalization, sum only over some number of nodes that are not connected — not connected, obviously, so that they serve as negative examples. I will send you to the paper to understand why this sampling works. The sigma here is the classical sigmoid, σ(x) = 1 / (1 + e^(−x)); this is just a way to rewrite the expression. By doing negative sampling — selecting a subset of nodes that are not connected — we normalize over them instead of over the whole graph, and we instantly get rid of the double summation: instead we sum over some number of sampled example nodes.

After we have done that, look at what happened: this is the function we want to minimize, and I should write it as a function of z, where z is the projection — the coordinates we are looking for. So what we want to do is find the minimum over those z. How do we do this? Numerically, we just use gradient descent: on every step we change z in such a way that the value of the objective decreases, moving along the gradient. With this function it is very easy — since it is built from exponents, the derivatives are easy to compute — so we can calculate the gradients and use them to descend. Since the problem is quite large and we want a stable solution, typically we use stochastic gradient descent to do this.
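A sketch of the negative-sampling objective and one stochastic gradient step in numpy. The number of negative samples, the learning rate, and the uniform negative distribution are simplifications on my side (DeepWalk, following word2vec, samples negatives from a degree-based distribution), so treat this as an illustration rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(Z, u, v, nodes, k=5, lr=0.025):
    """One stochastic step for a single (start, context) pair (u, v).

    Minimizes  -log sigma(z_u . z_v) - sum_n log sigma(-z_u . z_n)
    over k sampled negative nodes n.
    """
    negatives = rng.choice(nodes, size=k)           # uniform negatives (simplification)
    grad_u = (sigmoid(Z[u] @ Z[v]) - 1.0) * Z[v]    # gradient from the positive pair
    Z[v] -= lr * (sigmoid(Z[u] @ Z[v]) - 1.0) * Z[u]
    for n in negatives:
        grad_u += sigmoid(Z[u] @ Z[n]) * Z[n]       # gradients from negative samples
        Z[n]   -= lr * sigmoid(Z[u] @ Z[n]) * Z[u]
    Z[u] -= lr * grad_u

# Usage: Z initialized randomly, pairs (u, v) drawn from the sampled walks N_R
n_nodes, dim = 34, 16
Z = rng.normal(scale=0.1, size=(n_nodes, dim))
sgd_step(Z, u=0, v=1, nodes=np.arange(n_nodes))
```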
All right, so that is pretty much the original paper by Perozzi and his co-authors, which goes back to 2014. Since then, lots and lots of papers have been proposed. Among the structure-preserving, random-walk-embedding-based methods there is DeepWalk, there is the node2vec method, and there is a method called LINE — these are the most well-known algorithms. More algorithms have been proposed that preserve, or try to preserve, for example the centrality of the nodes, or the clustering coefficients, or the fact that nodes belong to the same cluster.

That is pretty much it for this lecture. Today in the seminar you will be able to practice, I believe, DeepWalk and node2vec. And with this we are done — any questions?
Info
Channel: Leonid Zhukov
Views: 167
Rating: 5 out of 5
Id: MNgKx4A1pXM
Length: 36min 48sec (2208 seconds)
Published: Thu Jun 10 2021