Simple Message Passing on Graphs

Video Statistics and Information

Reddit Comments

Great job with the content and simple examples. I didn’t understand what you meant when you were talking about graph convolutions.

Small critique, you need a bit of work on your camera presence: have some more confidence and some emotion (you’ll stop swiveling in your chair).

👍 5 · u/physnchips · Dec 02 2020

Great explanation!

👍 3 · u/shehzensidiq · Dec 02 2020
Captions
Fundamental to the study of graphs is the adjacency matrix: a matrix that encodes all of the connections between nodes in a graph. You can use the adjacency matrix to implement basic forms of the message passing algorithm, which is a key concept behind many modern graph neural network techniques. This video defines these terms and shows some simple examples using just Python to make the concepts crystal clear.

Before moving on, let me mention that I have a mailing list where I make announcements about new content, either blog posts or videos, and I also manage a Discord server with a community of people like you trying to learn machine learning. It's a great place to ask questions, so please join us; links to both are in the description below.

Let's build the adjacency matrix of a simple example. Here's a small graph: nodes 1, 2, and 3 are connected in a chain, and node 3 has two other dangling connections to nodes 4 and 5. Since we have five nodes, our adjacency matrix will have 5 rows and 5 columns. The first row encodes all the connections of node 1: we cycle through the columns and put a value of 1 in column j if node 1 has a connection with node j, and 0 otherwise. Since node 1 is only connected with node 2, we put a 1 in the second column and a 0 in all the others. In the next row we do the same for node 2: it connects to nodes 1 and 3, so we put a 1 in the first and third columns and zeros everywhere else. The next row is for node 3 and has ones in columns 2, 4, and 5. The remaining two rows only have a connection to node 3, and therefore a 1 appears only in the third column.

You might notice a couple of things when you look at this matrix. First, the last two rows are identical. This makes sense because there's no way to tell these two nodes apart from the perspective of graph structure; they play the same role. Second, this matrix is symmetric: if you swap the rows and columns, so that the first row becomes the first column, the result is identical. That's because we've modeled our graph as having undirected edges, meaning a connection from node 1 to node 2 is also a connection from node 2 to node 1. But that's not a requirement. We could have said, for example, that the edge between nodes 3 and 4 was directional and pointed toward node 4. Then we would need to delete one of the two encoded connections, because one encodes an edge outgoing from 3 and incoming to 4, and the other an edge outgoing from 4 and incoming to 3. There are different conventions for whether the rows or the columns encode incoming versus outgoing edges. Finally, if you sum the adjacency matrix across the rows or columns, you get the number of incoming and outgoing edges of each node. That's called the degree of each node, an important concept we'll use later.
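As a concrete reference, here's a minimal NumPy sketch of that adjacency matrix. The node numbering follows the video, mapped onto zero-based rows and columns; the variable names are my own.

```python
import numpy as np

# Edges of the example graph (undirected): 1-2, 2-3, 3-4, 3-5.
# Row i / column j hold a 1 if node i+1 connects to node j+1, else 0.
A = np.array([[0, 1, 0, 0, 0],   # node 1: connected to node 2
              [1, 0, 1, 0, 0],   # node 2: connected to nodes 1 and 3
              [0, 1, 0, 1, 1],   # node 3: connected to nodes 2, 4, 5
              [0, 0, 1, 0, 0],   # node 4: connected to node 3
              [0, 0, 1, 0, 0]])  # node 5: connected to node 3

print(np.array_equal(A, A.T))    # True -- undirected graph, symmetric matrix
print(A.sum(axis=1))             # node degrees: [1 2 3 1 1]
```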
Now that we have an adjacency matrix, let's multiply it by node features and see how it behaves. For our node features, let's give each node a single number, such that node 1 has a value of 1, node 2 has a value of 2, and so on. Now let's compute the matrix multiplication of our adjacency matrix with that feature vector. Starting with the first row, you see that every element of the feature vector is multiplied by zero except the value corresponding to node 2, which is node 1's only graph connection. In other words, the adjacency matrix masked out all the values except the one node that node 1 is connected with. Doing the same for the next row, the result is the sum of the values 1 and 3, giving a total of 4, and these are exactly the graph connections of node 2: the multiplication summed the values of node 2's connections and ignored all the others. Carrying the rest out, node 3 sums 2, 4, and 5 to get 11, and the remaining two nodes have identical connections and therefore identical results, just the single feature of node 3. The final result of this matrix multiplication is a new feature vector of the same shape as the original, but now each value represents the sum over the connected neighborhood of each node. This is a simple form of message passing, where the messages are the feature vectors and the aggregation function is the sum.

There are a number of useful ways to tweak this. For example, if you wanted to take the average of messages rather than the sum, all you need to do is divide each result by the neighborhood size of the node. One way to implement this is with the degree matrix, which is the same size as the adjacency matrix but zero everywhere except along the diagonal, where the entries are the neighborhood sizes of each node. In our case the diagonal would be 1, 2, 3, 1, 1. Taking the inverse of this matrix amounts to inverting each diagonal element, so the new diagonal values would be 1, 1/2, 1/3, 1, 1. If we then multiply the adjacency matrix by this inverse degree matrix (on the left, D⁻¹A), we effectively assign a weight to each edge such that each value in a row is divided by the neighborhood size of that row, and therefore each row sums to 1. For example, the first row stays the same because there's only one connection, but the second row has two connections, so each value becomes 1/2, and similarly row 3 has three connections, each value being 1/3.
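Here's a minimal sketch of both aggregations in NumPy, assuming the same five-node example (the feature values follow the video; `A_avg` is my own name for the row-normalized matrix):

```python
import numpy as np

A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)

x = np.array([1., 2., 3., 4., 5.])    # one scalar feature per node

# Sum aggregation: each node receives the sum of its neighbors' features.
print(A @ x)                          # [ 2.  4. 11.  3.  3.]

# Average aggregation: left-multiply by the inverse degree matrix, D^-1 A.
D_inv = np.diag(1.0 / A.sum(axis=1))  # diagonal of 1, 1/2, 1/3, 1, 1
A_avg = D_inv @ A                     # each row now sums to 1
print(A_avg @ x)                      # node 3's average is 11/3 ~= 3.67
```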
If you then use this scaled adjacency matrix to multiply the feature vector, you still mask out the non-connections, but now each value is scaled by the neighborhood size, so you compute the average rather than the sum.

Before moving to the example, let me mention a couple of other tweaks that are specifically relevant to graph convolutional networks. First, self-connections are added for each node in the adjacency matrix, which means each node gets an edge connecting it to itself. The implication is that when you do the neighborhood aggregation of messages, you include the node's own value in that aggregation. This comes from the derivation of the convolution operation on graphs, which the paper works through, but I think it makes intuitive sense as well. The result is called Ã (A tilde), defined simply as Ã = A + I, where I is the identity matrix. Next, they apply almost the same scaling we used to define the average, but rather than using D⁻¹Ã, they take the square root of the inverse degree matrix and sandwich Ã between two copies of it: D̃^(-1/2) Ã D̃^(-1/2). The factor on the right scales each message by the neighborhood size of the node it comes from, while the factor on the left scales by the neighborhood size of the node doing the aggregating. If you write out a particular element (i, j) of the resulting matrix, it's pretty simple: Â_ij = Ã_ij / sqrt(d̃_i · d̃_j). In other words, each connection in the adjacency matrix is normalized by the neighborhood sizes of both the source and the destination node, d̃_i and d̃_j. Another way of looking at it: each incoming message is scaled by one over the square root of d̃_j, the neighborhood size it's coming from, in a weighted sum, and the result of that weighted sum is scaled by one over the square root of d̃_i. So the overall equation is Â = D̃^(-1/2) Ã D̃^(-1/2), where the tilde means self-connections were added to the adjacency matrix. I don't think you need to understand this with exceptional clarity at this point; rather, understand that this form has theoretical justification, related to deriving what a convolution operation is on a graph. What it amounts to is adding self-connections to the adjacency matrix and scaling the values so that repeated application of the matrix is numerically stable, rather than blowing up or decaying to zero.

With that out of the way, let's look at a simple example to build intuition for how message passing behaves with this matrix. We'll set up a system and observe its dynamics as we repeatedly apply the matrix. I like to think of this as a water-drop experiment: we have a graph, we spike one of the nodes with a water drop, and the edges encode how the signal can transfer through the graph. By repeatedly applying our Â matrix, we get to see how the signal spreads through the graph via message passing. To simulate the water drop, we initialize the features with a value of 1 on the first node and 0 everywhere else. Multiplying this by Â applies one stage of message passing, producing new values for each node where the water drop has spread to its neighbors. If we then feed that output into another multiplication by Â, we apply yet another stage of message passing and allow the signal to spread further.
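Here's a sketch of the water-drop experiment in plain NumPy, assuming the same five-node graph. The construction of Â follows the formula above, and the iteration count of 10 matches the video's animation:

```python
import numpy as np

A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)

# Add self-connections: A_tilde = A + I.
A_tilde = A + np.eye(5)

# Symmetric normalization: A_hat = D_tilde^(-1/2) A_tilde D_tilde^(-1/2).
deg = A_tilde.sum(axis=1)          # degrees including self-loops: [2 3 4 2 2]
D_inv_sqrt = np.diag(deg ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

# Water-drop experiment: spike node 1, then repeatedly message-pass.
x = np.array([1., 0., 0., 0., 0.])
for step in range(1, 11):
    x = A_hat @ x
    print(f"step {step:2d}: {np.round(x, 3)}")
```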
Here's an animation showing 10 iterations. At the beginning, all of the signal is on the first node, and after the first stage of message passing the signal has only been able to move to the immediate neighborhood of the first node. But with each iteration the signal can spread further and further, until eventually the system converges to a steady state.

To sum this all up: if we take a set of node features and multiply it by the adjacency matrix, we're effectively summing the values of each local neighborhood. If we scale the values of the adjacency matrix, for instance using one over the neighborhood sizes, we can convert that sum into an average, and you can imagine other normalization schemes. Additionally, if we repeatedly apply the adjacency matrix, the signal can travel to more distant neighborhoods in the graph. In our water-drop experiment, we saw that when we continually applied it, what started out as a spike on a single node spread through the entire graph and eventually reached a steady state. I hope this helps you understand the adjacency matrix and how it relates to message passing. See you next time.
Info
Channel: WelcomeAIOverlords
Views: 17,160
Keywords: Machine Learning, Data Science
Id: ijmxpItkRjc
Length: 10min 51sec (651 seconds)
Published: Tue Dec 01 2020