CS224W: Machine Learning with Graphs | 2021 | Lecture 5.2 - Relational and Iterative Classification

Video Statistics and Information

Captions
So first we are going to talk about relational classification and iterative classification. Let me explain the ideas for this part of the lecture: first, what relational classification is; then iterative classification; and last, belief propagation, as I said earlier.

The idea of a probabilistic relational classifier is the following: the class probability of node v is a weighted average of the class probabilities of its neighbors. For labeled nodes, we fix the class label to the ground-truth label we are given. For unlabeled nodes, we initialize the belief uniformly, say a probability of 0.5 of being green. Then nodes iteratively update their beliefs about their own color based on the colors of the nodes around them in the network, and this iterates until it converges. Note that here we only use node labels and network structure; we do not yet use node attributes, so this works even if nodes only have colors and no features attached to them.

How do we do this update? We use the following formula: the probability that our node of interest v is of class c is

P(Y_v = c) = (1 / Σ_{(v,u) ∈ E} A_{v,u}) · Σ_{(v,u) ∈ E} A_{v,u} · P(Y_u = c),

where the first sum over the weights A_{v,u} of the edges adjacent to v essentially counts the (weighted) degree of node v, and the second sum goes over the same set of edges, adding the weight of each edge times the probability that the neighbor u belongs to class c. In other words, node v says: the likelihood that I belong to class c is the average likelihood that my neighbors belong to that same class c. A node updates its belief or prediction about its own label based on the beliefs or predictions of its nearby nodes. Here we make the most basic assumption, namely that two nodes connected in the network tend to share a label. We can think of A_{v,u} as 0/1 for an unweighted graph, or as an edge weight, in which case different neighbors have different influence on the label of v because of different connection strengths; the leading factor is just a normalization so that the weighted sum stays between zero and one.

A few important things to note: convergence of this update is not guaranteed, and, as I said, this model cannot use node feature information. It uses only the labels of the nodes and the network structure; the network information enters because the summations go over the edges, that is, over the neighbors in the network.
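To make the update rule concrete, here is a minimal Python sketch of this relational classifier. The adjacency-list representation, the function name, and the tiny toy graph at the end are illustrative choices of mine, not from the lecture; nodes are updated in place, in a fixed order, just as in the walkthrough below.

```python
def relational_classification(adj, labels, num_iters=100, tol=1e-4):
    """Probabilistic relational classifier (label propagation over neighbors).

    adj:    {node: [(neighbor, weight), ...]}, assumed symmetric for an undirected graph
    labels: {node: 1.0 or 0.0} for labeled nodes (probability of the "green" class)
    Returns a dict with P(Y_v = green) for every node.
    """
    # Labeled nodes keep their ground-truth probability; unlabeled nodes start at 0.5.
    p = {v: labels.get(v, 0.5) for v in adj}
    for _ in range(num_iters):
        max_change = 0.0
        for v in adj:                      # update nodes in a fixed order
            if v in labels:                # labeled nodes stay fixed
                continue
            total_w = sum(w for _, w in adj[v])
            if total_w == 0:
                continue
            # Weighted average of the neighbors' current class probabilities.
            new_p = sum(w * p[u] for u, w in adj[v]) / total_w
            max_change = max(max_change, abs(new_p - p[v]))
            p[v] = new_p
        if max_change < tol:               # convergence is not guaranteed in general
            break
    return p


if __name__ == "__main__":
    # Tiny hypothetical 4-node graph, not the 9-node graph from the slides:
    # node 1 is labeled green (1.0), node 2 red (0.0); nodes 3 and 4 are unlabeled.
    adj = {
        1: [(3, 1.0)],
        2: [(4, 1.0)],
        3: [(1, 1.0), (4, 1.0)],
        4: [(2, 1.0), (3, 1.0)],
    }
    print(relational_classification(adj, labels={1: 1.0, 2: 0.0}))
```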
Let me now show you how one could apply this simple method on an example. First, we initialize all labeled nodes with their true labels, and we fix them, so the probabilities of the green and red nodes remain fixed. For example, node 7 belongs to the green class with probability 1, while node 2 belongs to the green class with probability 0 because it is red. All the unlabeled (gray) nodes are assigned probability 0.5 of belonging to the green, positive, class; they are undecided about which class they belong to.

Now we are going to update the probabilities P of the gray nodes. To do this we need some order in which to update the nodes; in our case, let's assume the order is simply by node id: 1, 2, 3, 4, 5, and so on.

First we update node 3. It has three neighbors: two of them have probability 0 of belonging to the green class, and one gray neighbor has probability 0.5. So the update is (0 + 0 + 0.5) divided by 3, because node 3 has three neighbors and the graph is unweighted, and node 3 now has probability 0.17 of belonging to the green class. Next, node 4 wakes up and updates its predicted probability of being green: it goes over its four neighbors and sums 0 + 0.17 + 0.5 + 1, divided by 4, so node 4 says its probability of being labeled green is 0.42. Now that node 4 has updated its label, node 5 updates by summing over its neighbors and arrives at a belief that its probability of being green is 0.73. Then node 8 updates its label, and node 9 updates its label.

After the first iteration of updates over all the unlabeled nodes, these are our beliefs, the predicted probabilities of each node being green; the pre-labeled nodes do not get updated and remain fixed, and all the gray nodes get updated. For example, node 9 is quite sure it is green, because the only neighbor it has is green with probability 1, while node 8 is a bit less sure, because one of its neighbors, node 5, is not yet entirely sure about its own color. That was one iteration.
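As a quick check, the first two updates of that iteration can be written out directly in the update formula from above (the numbers are the lecture's):

```latex
P(Y_3 = \text{green}) = \tfrac{1}{3}\,(0 + 0 + 0.5) \approx 0.17, \qquad
P(Y_4 = \text{green}) = \tfrac{1}{4}\,(0 + 0.17 + 0.5 + 1) \approx 0.42.
```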
In the second iteration, node 3 updates again by averaging its neighbors' current probabilities, 0, 0, and 0.42, divided by three, so its probability changes a bit. Then nodes 4, 5, and 8 update, and node 9 updates but its probability does not change, so we say node 9 has converged: it has settled on its belief and that is how it will stay. Notice how the beliefs have changed across the network. I then run another iteration, again updating nodes in the order 3, 4, 5, 8; the probabilities change a bit again, but node 8 does not change its probability, so we say it has converged and we fix it. That leaves three nodes, and in the next iteration nodes 4, 5, and 3 do not change, so they have also converged.

Now node 3 believes it is of the green class with probability 0.16, node 5 thinks 0.86, which is quite high, and node 4 is still undetermined: it leans slightly toward green, but it is unclear. If I run one more iteration, the probabilities on this classification task converge, and these are the values they converge to. What do we conclude? Nodes 9, 8, and 5 belong to the green class; node 4 also belongs to the green class, because its probability is just above 0.5, though only very slightly; and node 3 has a very low belief that it is green, so node 3 is red. Our predictions, then, are that nodes 4, 5, 8, and 9 belong to the green class and node 3 belongs to the red class. We obtained this by iteratively propagating label information, with each node updating the belief about its own label based on the labels of the other nodes. In this approach we used the network, but we did not use any features or signals attached to the nodes. That was relational classification based on labels alone, and you saw how the probabilities spread from the labeled nodes to the unlabeled nodes as nodes iteratively updated their probabilities based on the probabilities of their nearby nodes, their neighbors in the network.

Now we turn to iterative classification, a different type of approach that uses more of the information: it uses both the node features and the labels of nearby nodes, so it combines the network structure with the feature information of the nodes. The relational classifier we just discussed does not use node attributes, only the network structure and the labels, so the question is how we can harness the information in the node attributes. The main idea of iterative classification is to classify node v based on its attributes f_v as well as a summary z_v of the labels of the nodes in its neighborhood N_v. How is this going to work? As input we are given a graph, each node has a feature vector associated with it, and some nodes are labeled with labels y; the task is to predict the labels of the unlabeled nodes. The way we do this is by training two classifiers: a base classifier phi_1 that predicts the label of a node from its feature vector alone, and a second classifier phi_2 that takes two inputs, the features of node v and an additional vector z that summarizes the labels of v's neighbors. So phi_2 makes predictions using two types of input, the feature information captured in the feature vector f and the labels of the neighbors captured in the vector z, and in this way the approach uses both the feature information and the network information.
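In symbols, using f_v for the feature vector of node v, N_v for its neighbors, and z_v for the label summary (the hat and iteration superscripts below are my own shorthand, not the lecture's notation):

```latex
\hat{y}_v^{(0)} = \phi_1(f_v), \qquad
z_v^{(t)} = \mathrm{summarize}\!\left(\{\hat{y}_u^{(t)} : u \in N_v\}\right), \qquad
\hat{y}_v^{(t+1)} = \phi_2\!\left(f_v,\, z_v^{(t)}\right).
```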
Let me now explain how we compute the label summary vector z_v of a given node v. The idea is the following: suppose I am a given node in the network, say the blue one, and some of my neighbors believe they are green while other neighbors believe they are red. What I need is a summary vector z that tells me what my neighbors think about their own labels. There are many choices for this vector. For example, z could simply be a histogram counting the number, or the fraction, of each label in the neighborhood: for the blue node we could say that two neighbors think they are green and one neighbor thinks it is red, or equivalently that 66 percent of the neighbors think they are green and 33 percent think they are red. We could also extend z with, for example, the most common label among the neighbors, or the number of distinct labels among them. Essentially, z captures how the labels around the node of interest v are distributed among its neighbors in the network, and there are many choices for how to construct it.

These iterative classifiers work in two phases. In the first phase we classify node labels based on node attributes alone: using the training data, we train the two classifiers, each of which could be, for example, a linear classifier, a neural network, a support vector machine, or a decision tree. We train phi_1 to predict a node's label from its feature vector, and we apply phi_1 so that every node in the network has a predicted label. Using the training data we also train the classifier phi_2, which takes two inputs, the features of the node of interest and the summary vector z that captures the labels of the nodes in its network neighborhood, and predicts the label of the node of interest from both.

Having applied phi_1 and trained phi_2, we move to the second phase, where we iterate until convergence. On the test set, that is, on the unlabeled nodes, we first use phi_1 to assign initial labels, then compute the label summaries z, and then predict the labels with the second classifier phi_2. We now repeat this process: update the vector z, apply phi_2 to update the node predictions, and, since the node labels have changed, update the label summaries z again and reapply phi_2, iterating until the class labels stabilize and converge, or until some maximum number of iterations is reached. Note that in general, or in the worst case, convergence is not guaranteed, so we set some maximum number of iterations, say 10, 50, or 100; not too many. That is the idea. What I want to do next is show you how this works on a simple toy example.
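Here is a minimal sketch of that two-phase procedure in Python. The choice of logistic regression for phi_1 and phi_2, the histogram definition of z, the plain (unweighted) adjacency lists, and all the names are illustrative assumptions rather than the lecture's prescription; any per-node classifier and any label summary would fit the same skeleton.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def label_summary(v, adj, pred, num_classes):
    """z_v: fraction of v's neighbors currently predicted to be in each class."""
    z = np.zeros(num_classes)
    for u in adj[v]:
        z[pred[u]] += 1.0
    return z / max(len(adj[v]), 1)

def iterative_classification(adj, X, y_train, train_nodes, test_nodes,
                             num_classes=2, max_iters=50):
    """adj: neighbor lists per node; X: (num_nodes, d) feature matrix;
    y_train: integer class labels in {0, ..., num_classes-1} for train_nodes."""
    # Phase 1: train phi_1 on node features alone, then phi_2 on [features, z].
    phi1 = LogisticRegression(max_iter=1000).fit(X[train_nodes], y_train)
    pred = phi1.predict(X)                      # bootstrap a label for every node
    pred[train_nodes] = y_train                 # training labels stay fixed
    # z for training nodes uses the fixed training labels plus phi_1's bootstrap
    # predictions for unlabeled neighbors (one reasonable convention).
    Z_tr = np.array([label_summary(v, adj, pred, num_classes) for v in train_nodes])
    phi2 = LogisticRegression(max_iter=1000).fit(
        np.hstack([X[train_nodes], Z_tr]), y_train)

    # Phase 2: iterate on the unlabeled (test) nodes until the labels stabilize
    # or a maximum number of iterations is reached (convergence is not guaranteed).
    for _ in range(max_iters):
        Z_te = np.array([label_summary(v, adj, pred, num_classes) for v in test_nodes])
        new_pred = phi2.predict(np.hstack([X[test_nodes], Z_te]))
        if np.array_equal(new_pred, pred[test_nodes]):
            break                               # no label changed: converged
        pred[test_nodes] = new_pred
    return pred                                 # predicted class for every node
```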
The setup is that we have an input graph; let's say we are thinking about web pages, so we have a graph of how web pages link to each other. Each node is a web page and each edge is a hyperlink between web pages; the edges are directed, meaning one page points to another page. Imagine that every node in the network is described by a set of features; in our case we can simply assume that the features capture what words, what text, a web page uses. The task is to predict the topic of the web page, and in making this prediction we would like to use two facts. First, of course, we want to use the words of the web page to predict its topic. Second, we can assume that pages on the same or similar topics tend to link to one another, so when we predict the topic of a page we do not rely only on the words the page itself uses, but also on the labels, the topics, of the pages that link to the page of interest. That is the idea of what we would like to do.

Let me now show you how this works in this web page classification example. First we train the base classifier phi_1, which classifies pages, that is, nodes, based on their attributes alone. To explain what is happening here: we have a graph on four nodes, where the color denotes the ground truth, the real topic of the web page, and each page also has a feature vector, say describing the words on that page; these are the feature vectors of my four pages, and the pages have directed hyperlinks pointing to one another. Based on the labeled data we apply phi_1. For this web page it predicts topic 0, and the ground truth is 0, so that is a correct classification; but for this other web page, based on its features alone, we predict topic 0 while in reality it is topic 1. The question will be whether network information could help us decide that this page should really be the green topic, not the red topic predicted from the features alone. Then we have these two other pages with their corresponding feature vectors; here the classifier predicts that they are both green, while here it predicts that these two should be red. One way to think about what the classifier is learning is that if the first feature is set to 1 you are red, and if it is set to 0 you are green.

So in the first step, or first phase, we came up with label predictions based on the features alone; we applied the classifier phi_1 from the previous slide. What we do next is create the label summary vectors z: for each node v we create a vector z that captures the neighborhood labels, and because this is a directed graph we use a four-dimensional vector to capture the statistics.
The vector z has two parts, I and O: I captures incoming-neighbor label information and O captures outgoing-neighbor label information. For example, I_0 is set to 1 if at least one of the incoming pages, one of the pages that links to the node of interest, is labeled 0, and we use analogous definitions for I_1, O_0, and O_1. To give you an idea: this is the same graph as before, and these are the features of the web pages, but now four more numbers appear next to each node. These four numbers are the vector z that summarizes the labels of the neighbors around it. I is for incoming neighbors: is any incoming neighbor of class 0, and is any incoming neighbor of class 1? Because this node has one incoming neighbor that is green, we set that entry to 1. On the outgoing side, along this edge, the node has one neighbor that believes its class is 0, so that entry is set accordingly. Notice that the colors correspond to ground truth while the numbers correspond to predicted labels; that is why this node gets 1, 0 on the outgoing side: it has one outgoing edge to a node of class 0 and zero outgoing edges to nodes of class 1, the green class. Similarly, you can look at this other node, its feature vector, and the summaries of its incoming and outgoing edges: it has one incoming neighbor of class 0, this one, and zero incoming neighbors of class 1, and in terms of outgoing edges it has zero outgoing edges to class 0 and one outgoing edge to class 1. So this is the four-dimensional representation, the vector z, summarizing the labels around this particular node, and analogously you can compute it for the other nodes. Now that every node has both the feature vector f and the vector z, we train the second classifier, again only on the labeled training set, in such a way that it uses both the information in f and the information in z. We are now training the classifier phi_2, which uses the feature information as well as the label summary information about what the neighbors of the node of interest think about their own labels, and this gives us the classifier phi_2.
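Here is a minimal sketch of that four-dimensional summary for the two-class, directed case; the function name, the dictionary representation of in- and out-neighbors, and the example call are illustrative assumptions, not from the lecture.

```python
def io_label_summary(v, in_neighbors, out_neighbors, pred, num_classes=2):
    """z_v = [I_0, I_1, O_0, O_1]: I_c is 1 if at least one page linking to v is
    currently predicted to have class c; O_c is 1 if at least one page that v
    links to is currently predicted to have class c."""
    I = [0] * num_classes
    O = [0] * num_classes
    for u in in_neighbors.get(v, []):     # incoming neighbors: pages pointing to v
        I[pred[u]] = 1
    for u in out_neighbors.get(v, []):    # outgoing neighbors: pages v points to
        O[pred[u]] = 1
    return I + O

# Hypothetical example: node 0 has one in-neighbor predicted as class 0 and one
# out-neighbor predicted as class 1, so z_0 = [1, 0, 0, 1].
print(io_label_summary(0, in_neighbors={0: [1]}, out_neighbors={0: [2]},
                       pred={1: 0, 2: 1}))
```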
Now, in the second phase, I apply my classifiers to the graph of interest, that is, to the unlabeled nodes. First I apply the trained base classifier phi_1 to the node features of the unlabeled set to predict initial labels. Given these predicted labels, I update the descriptors, in the sense of the class summaries z around each node. Now that I have created the vector z, I can apply my classifier phi_2 to update the predicted label of every node, and when I do this the labels of certain nodes may change. The idea is that, having used all of this data, meaning the feature vector as well as the summary vector of the labels around the given node, I can more robustly predict the label of the node of interest. Then I go and update the summaries, because some labels may have changed; I update the summary and reapply the classifier phi_2, and I keep reclassifying nodes with phi_2 until the process converges, that is, until the label predictions stabilize. This is because if the prediction for a given node changes, then its vector z also changes, and if the vector z changes then I have to apply the predictor phi_2 again to update the belief, the prediction, about the label of that node, and I keep iterating until the process stabilizes. Here, for example, the prediction for this node flipped from 0 to 1; the vectors z have now changed and been updated, so I reapply my classifier phi_2 over all these nodes, and I keep doing this until the process converges, meaning until no node labels change, or until some maximum number of iterations is reached. That is essentially the idea of what we are trying to do here and how we arrive at the final prediction.

To summarize: so far we have talked about two approaches to collective classification. First, we talked about relational classification, where we iteratively update the probability that a node belongs to a given class based on the labels of its neighbors; this approach uses the network structure and the labels of the nodes, but it does not use the features of the nodes. Then, in the second part, we talked about iterative classification, where we improve over relational classification to also handle the attribute or feature information of the nodes. Here the idea was to classify a given node v based on its features as well as the labels of its neighbors, and the way we achieved this was to train two classifiers: an initial classifier that, given the features of a node, predicts its label, and then, once the nodes have predicted labels, we create for every node the summary vector z describing the labels of its neighboring nodes, and we use the classifier phi_2, which takes both the feature information of the node and the summary vector z of its neighbors' labels, to re-update the prediction for that node. We then keep applying the classifier phi_2 over the network until the entire process converges, or until the maximum number of iterations is reached.
Info
Channel: stanfordonline
Views: 8,072
Keywords: Jure Leskovec, Machine Learning, Machine Learning with Graphs, Graphs, CS224W, Stanford, Iterative Classification, Relational Classification, node classification, relational classifier, collective classification
Id: QUO-HQ44EDc
Length: 29min 20sec (1760 seconds)
Published: Tue Apr 27 2021