Graph Convolutional Networks using only NumPy

Captions
When I started learning about graph neural networks, it wasn't clear to me how the equations and the graph processing would translate into practical code. But it turns out this whole message passing idea, where nodes in a graph send messages to one another through their connections, is as easy as doing one more matrix multiplication. So by the end of this video you will be able to implement graph convolutional networks from scratch using nothing but NumPy. Let's get started.

Before diving in, let me mention that the code for this will be linked in the description below. I'm also going to put a link to my mailing list, where I announce various content releases, blogs or videos, as well as a link to our Discord server, where there's a community of people like you also trying to learn this stuff. It's a great place to ask questions, so I hope to see you there.

The data we'll be using for this task is the famous Zachary's Karate Club graph. The karate club has 34 nodes, which are people, and the edges between these nodes represent social interactions that occurred outside of the context of the karate club itself. Following the lead of Kipf in his GCN paper, I applied a greedy modularity maximization algorithm to come up with community labels for each of the nodes, and it's these community labels that we're going to train our GCN model to predict.

Message passing between nodes on a graph is a fundamental concept in many of these modern methods: we can get the sum of each node's neighborhood features by multiplying the feature matrix by the adjacency matrix, which encodes all the connections in the graph. If you don't know what the adjacency matrix is, or just want to understand this next part better, I have a video dedicated to the adjacency matrix and the message passing algorithm, so I'll put a link to that below. I do think it's a fundamental and pretty straightforward thing to understand, so check it out if you need to. Briefly, the adjacency matrix is sized n by n, where n is the number of nodes, and a 1 is put in location (i, j) if nodes i and j are connected in the graph; otherwise there's a zero. When you multiply a set of node features by this matrix, the zeros cancel the contribution of all non-connected nodes, resulting in a sum over just the connected nodes, so it acts as a sort of graph neighborhood mask.

The original Kipf and Welling paper on GCNs presented a normalized form of the adjacency matrix, and that's what we're going to calculate now for the karate club graph and use through the rest of this video. The paper defines A_tilde to be A plus the identity matrix, which adds self-connections to the graph; I call this A_mod in the code. We want to normalize these connections by the neighborhood sizes, so we count the number of connections for each node and put it on the diagonal of another n by n matrix, which we call D_mod. The final equation takes the inverse square root of D_mod and sandwiches A_mod in between two of them: A_hat = D_mod^(-1/2) @ A_mod @ D_mod^(-1/2). This ultimately just scales each one in the adjacency matrix by 1 / sqrt(d_i * d_j), where d_i and d_j are the neighborhood sizes of the two nodes being connected. The final result is stored in A_hat. Again, check out the other video on message passing if you want more detail on all of this.

Next we define the input features. Since we don't have any node features, we'll just use the identity matrix. Using the identity matrix as input features effectively maps each node in the graph to a column of learnable parameters in the first layer, resulting in fully learnable node embeddings.
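To make this concrete, here is a minimal NumPy sketch of the normalization just described. It assumes networkx is available to build the karate club graph; the variable names A_mod, D_mod, and A_hat mirror the narration rather than the video's exact code.

```python
import numpy as np
import networkx as nx

# Zachary's Karate Club: 34 nodes, adjacency matrix A is 34 x 34.
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
n = A.shape[0]

# A_mod = A + I: add self-connections so each node also keeps its own features.
A_mod = A + np.eye(n)

# D_mod: diagonal matrix holding each node's (self-inclusive) neighborhood size.
D_mod = np.diag(A_mod.sum(axis=1))

# A_hat = D_mod^(-1/2) @ A_mod @ D_mod^(-1/2):
# every connection (i, j) gets scaled by 1 / sqrt(d_i * d_j).
D_inv_sqrt = np.linalg.inv(np.sqrt(D_mod))
A_hat = D_inv_sqrt @ A_mod @ D_inv_sqrt

# No node features are available, so the identity matrix is used as input,
# which gives each node its own column of learnable parameters in layer one.
X = np.eye(n)
```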
Now that we have the labels, the normalized adjacency matrix, and the input features, let's move on to the GCN implementation. A single GCN layer is implemented in the GCNLayer class, which I can then stack into a larger GCN model. To instantiate a GCN layer, we specify the number of input features and the desired number of output features, which constructs the learnable parameter matrix W. I also have placeholders for an activation function and a name string for convenient display.

Here's the GCN equation from the Kipf and Welling paper: a GCN layer updates the node embeddings from H_l to H_(l+1), where l is the layer number and l = 0 refers to the input features. In other words, the update multiplies the incoming feature matrix by A_hat, which does the message passing, then multiplies that result by a learned matrix of parameters W, and finally applies a nonlinear activation function, represented by sigma. If you look at the forward method, you'll see exactly that: the first line uses the normalized adjacency matrix and the input features to do the message passing step, and the result is cached in self._X. Note that the @ operator is shorthand for matrix multiplication in NumPy. It then computes a linear projection with W and passes that through the activation function. A quick note: the reason I store the inputs and outputs in self._X and self._H is that they're needed later for the gradient calculations in the backward pass. But that's the GCN layer. It's identical to a normal dense neural network layer, except you first do the message passing on the graph by multiplying the input by the normalized adjacency matrix. So that's it: one more matrix multiply.

I also created a normal dense layer that takes node embeddings as input and predicts the probability of each class using softmax; in other words, it's just like a linear layer from PyTorch with softmax activation. You specify the input and output dimensions, which creates the projection matrix, and in this case I also have a bias term. The forward pass does not do the message passing step; it just applies the layer to the already computed node embeddings output from the GCN layers, then applies the normal softmax activation, with the added trick of subtracting the max to help numerical stability, which doesn't change the results at all.

I also define a GradDescentOptim class. That's more an artifact of how I designed the parameter update process once the gradients are calculated; it doesn't have anything to do with GCNs and isn't fundamental to the understanding here. If you're curious, its only role is to flow the gradients down the layers during backpropagation: moving from the last layer to the first, each layer calculates its gradients, updates its parameters, and then stores the gradients that need to flow to the next layer in this GradDescentOptim object.
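Here is a minimal sketch of the two forward-pass layers described above. The class names and the cached self._X / self._H attributes follow the narration, but the exact constructor signatures and weight initialization are my own assumptions, not the video's code.

```python
import numpy as np

class GCNLayer:
    """One graph convolution: H_out = activation(A_hat @ H_in @ W)."""
    def __init__(self, n_inputs, n_outputs, activation=np.tanh, name=''):
        self.W = 0.1 * np.random.randn(n_inputs, n_outputs)  # learnable parameters
        self.activation = activation
        self.name = name

    def forward(self, A_hat, H):
        # Message passing: sum each node's (normalized) neighborhood features.
        self._X = A_hat @ H                        # cached for the backward pass
        # Linear projection with W, then the nonlinearity.
        self._H = self.activation(self._X @ self.W)
        return self._H


class SoftmaxLayer:
    """Dense layer with softmax over classes; no message passing here."""
    def __init__(self, n_inputs, n_outputs):
        self.W = 0.1 * np.random.randn(n_inputs, n_outputs)
        self.b = np.zeros(n_outputs)

    def forward(self, H):
        self._X = H                                # cached for the backward pass
        z = H @ self.W + self.b
        # Subtracting the row-wise max improves numerical stability only;
        # it does not change the resulting probabilities.
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)
```

Aside from the A_hat @ H line, GCNLayer is exactly a dense layer, which is the point the video is making.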
Now let's look at the backpropagation piece. We'll start with the softmax layer, since it's the last in the model and the first to calculate gradients. First, I create a mask that zeroes out the loss values for nodes that were not used during training; I keep this list in the GradDescentOptim object. Next, I calculate d1, which is the standard expression for the partial derivative of the cross-entropy loss with respect to the pre-softmax values; in other words, it flows the gradients of the loss function through the softmax activation. When this is matrix multiplied with W, it gives the gradients flowing through the weights of this layer to the next layer, and that's stored in the GradDescentOptim.out attribute, which the next layer will use to calculate its gradients. To calculate the final gradients for this layer's parameters, we use d1: the bias gradient is just d1 directly, averaged across the training batch, and for W we matrix multiply with the input, which we stored in self._X during the forward pass. I also include the option of L2 weight decay for the W parameters. During the actual update, we multiply these gradients by the learning rate and take a step in the negative direction so as to minimize the loss function. This is standard backpropagation, and so far it has nothing to do with graphs.

Now let's look at the GCN layer. dtanh is the derivative of the tanh activation function. d2 takes the incoming gradients from GradDescentOptim.out, which were computed by the downstream layer, and passes them through the activation function by multiplying with dtanh. Next we update GradDescentOptim.out to hold the gradients that will pass to the next layer, which is just d2 projected backward through this layer's weights, the same as in the softmax layer. Also as in the softmax layer, these intermediate gradients reach the W matrix by matrix multiplying with the inputs stored in self._X from the forward pass, and the weight decay and parameter update steps are identical.

So what's the difference? Aside from the activation function, the only difference is the pre-computed input from the forward pass stored in self._X: in the GCN layer it included the message passing step, and in the softmax layer it did not. The actual gradient updates have nothing to do with graphs; it's standard feed-forward neural network stuff, where the incoming gradient is differentiated with respect to the layer's inputs to get the partial derivatives for the W parameters.
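As a sketch of the gradient math described above, here are the two backward steps written as standalone functions rather than the video's GradDescentOptim object; the signatures, learning rate, and weight-decay handling are assumptions. One difference worth flagging: the gradient passed upstream from the GCN layer here is also multiplied by A_hat (exact backpropagation through the message passing step), whereas the narration describes only the projection through W.

```python
def softmax_backward(probs, labels_onehot, train_mask, layer, lr=0.01, wd=0.0):
    """Cross-entropy + softmax gradient step; returns the gradient to pass upstream."""
    n_train = train_mask.sum()
    # d1 = dLoss/d(pre-softmax values), zeroed for nodes outside the training set
    # and averaged over the training batch.
    d1 = (probs - labels_onehot) * train_mask[:, None] / n_train
    grad_out = d1 @ layer.W.T                  # gradient flowing to the previous layer
    dW = layer._X.T @ d1 + wd * layer.W        # inputs cached during the forward pass
    db = d1.sum(axis=0)
    layer.W -= lr * dW                         # step in the negative gradient direction
    layer.b -= lr * db
    return grad_out


def gcn_backward(grad_in, layer, A_hat, lr=0.01, wd=0.0):
    """Gradient step for a tanh GCN layer; returns the gradient to pass upstream."""
    dtanh = 1.0 - layer._H ** 2                # derivative of tanh at the cached output
    d2 = grad_in * dtanh                       # push the incoming gradient through tanh
    # A_hat is symmetric, so this also backpropagates through the message passing.
    grad_out = A_hat @ (d2 @ layer.W.T)
    dW = layer._X.T @ d2 + wd * layer.W        # _X = A_hat @ H_in, cached in forward
    layer.W -= lr * dW
    return grad_out
```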
Now we'll piece it all together and build a GCN model with the GCNNetwork class to predict the community label of each node in the karate club graph. This class creates an input GCN layer that takes the raw number of input features and outputs new embeddings of the desired hidden size, using the GCNLayer we just built for message passing and embedding updates. It then stacks multiple hidden GCN layers to add as many message passing graph convolution operations as we like, and the softmax layer is added at the end to transform the node embeddings into class probabilities. In the forward method, notice that you pass in both A and X, because A is required for the GCN layers. It then calls the embedding method, which initializes H with the raw input and loops through all the GCN layers to incrementally update H, each of which uses A for message passing. Once all the GCN layers have produced the node embeddings, the softmax layer converts those embeddings into class probabilities; notice that A is not required for this calculation.

Let's take a look at the GCN embeddings before we do any training. We've set the dimensionality of the last GCN layer to 2 so that we can straightforwardly plot the embeddings. We already see some separation between the pink and red communities, but there's significant overlap between red and green. If you'd like to understand how GCNs can discover graph structure without any training, check out my blog post that digs a lot deeper into that, along with the code; I'll put a link to it in the description below.

Now let's train for a number of iterations, providing one label for each of the three communities, and see how the embeddings change over time. We see that the model quickly pushes the three communities into opposing corners, where the nodes generally seem well separated by color, while nodes with connections to multiple communities take a more central position. Clearly the training process is more or less working, since the three labeled nodes are pushed into these opposite corners and then exert attractive forces on the nodes they share connections with.

In sum, we showed that message passing can be implemented by matrix multiplying the feature matrix with the adjacency matrix, which works by masking out the non-connections by multiplying them with zero. And that's essentially GCNs: with the exception of this message passing step, everything else is the same. The forward pass is the same, and the gradient update equations are the same. So I hope this takes some of the mystery out of it for you. See you next time.
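To tie the pieces together, here is a hypothetical end-to-end sketch that stacks the layers from the earlier snippets and runs plain gradient descent with one labeled node per community. The hidden sizes, learning rate, iteration count, and the particular labeled nodes are illustrative assumptions, not the video's settings; it reuses A_hat, X, n, GCNLayer, SoftmaxLayer, softmax_backward, and gcn_backward defined above.

```python
class GCNNetwork:
    """GCN layers for message passing and embedding updates, plus a softmax head."""
    def __init__(self, n_inputs, n_outputs, hidden_sizes=(16, 2)):
        sizes = [n_inputs] + list(hidden_sizes)   # last size is 2 so embeddings plot easily
        self.gcn_layers = [GCNLayer(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
        self.softmax = SoftmaxLayer(sizes[-1], n_outputs)

    def embedding(self, A_hat, X):
        H = X
        for layer in self.gcn_layers:
            H = layer.forward(A_hat, H)           # every GCN layer needs A_hat
        return H

    def forward(self, A_hat, X):
        # The softmax head only sees the final embeddings; A_hat is not needed here.
        return self.softmax.forward(self.embedding(A_hat, X))


model = GCNNetwork(n_inputs=n, n_outputs=3)

# One training label per community (node/community pairs chosen for illustration only).
labels_onehot = np.zeros((n, 3))
train_mask = np.zeros(n)
for node, community in [(0, 0), (33, 1), (16, 2)]:
    labels_onehot[node, community] = 1.0
    train_mask[node] = 1.0

for step in range(300):
    probs = model.forward(A_hat, X)
    grad = softmax_backward(probs, labels_onehot, train_mask, model.softmax, lr=0.1)
    for layer in reversed(model.gcn_layers):      # flow gradients from last layer to first
        grad = gcn_backward(grad, layer, A_hat, lr=0.1)

embeddings_2d = model.embedding(A_hat, X)         # 34 x 2, ready to scatter-plot
```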
Info
Channel: WelcomeAIOverlords
Views: 10,147
Rating: 4.9845262 out of 5
Keywords: Machine Learning, Data Science
Id: 8qTnNXdkF1Q
Length: 13min 21sec (801 seconds)
Published: Thu Dec 10 2020