Graph Neural Networks - a perspective from the ground up


Captions
What if someone told you that a machine has discovered an antibiotic that can treat previously untreatable strains of bacteria? Yes, you heard that right! The antibiotic is a chemical named Halicin, and it was discovered by a pioneering machine-learning approach called Graph Neural Networks. This fairly straightforward approach can also be used to find new drugs for other diseases like cancer, improve transportation, solve problems related to social networks, and much more. But what is a Graph Neural Network, and how is it different from the neural networks that have been around for decades? Let's find out!

Imagine that you're Cupid, and... Oh! Hi Andy! Ok, so you've been tasked to find a suitable match for Andy to fall in love with. How would you approach this? First, we know Andy's age, gender, occupation, that he reads fiction, loves cycling, and that he can't dance... Well, that's a bit harsh; how about 'learning to dance'? Let's place all these features about Andy into a list, like so. Next, let's gather information about his social interactions: who does he hang out with most often, and who are some friends of those friends? Alright, we can now begin to analyze this web of information to find Andy a suitable match.

This web of data can actually be viewed as what we call a Graph in computer science. But what is a Graph, exactly? A graph is a structured way to represent data. It consists of nodes to represent entities like people, and edges to represent the relationships between these entities. Like in Andy's network, each node might also have a set of features which describe it. Edges can either be undirected, meaning two-way, or directed, meaning one-way. Extending from undirected to directed graphs is not difficult once we get the basic mechanisms, so let's stick with undirected edges for now.

But why Graph Neural Networks? If you're watching this, I'm sure you've heard of the expansive success that Neural Networks have had on many problems. They are able to learn a variety of desired outputs from several types of inputs, such as numbers like stock prices, words that we use every day, or even images like this low-resolution '4' over here. So, why don't we just feed Andy's network through the all-powerful neural network and let it tell us his best match? Well... it turns out that neural networks have mostly been successful on more structured data types. For example, numbers belong in a series, words that we use every day are found in sentences, and images? Well, they actually come as a grid of pixels. But neural networks still struggled with more complex data such as graphs, which do not have any fixed structure or ordering.

So, in order to reason about graphs, researchers went back to the drawing board to study what made neural networks so successful. The first layer in a common Convolutional Neural Network takes each pixel and extracts information about its region by aggregating the pixel within the context of its neighborhood. The next layer then extracts information about that region within the context of its neighboring regions. Doing this over enough layers, the network will be able to reason over different parts of the whole image. And then, with some linear computations, it can finally identify the relevant object. Take a second to think about how you might apply this mechanism to graphs. If we look closer at an image, we can actually view it as a special type of graph: a grid graph that is structured, where every node is a pixel connected to its 8 neighboring pixels.
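To make the earlier idea of nodes, edges, and per-node feature lists concrete, here is a minimal Python sketch of how a tiny graph like Andy's might be stored. The names, feature values, and friendships are invented for illustration; they are not taken from the video.

```python
# A toy version of Andy's network: each node has a feature list, and
# undirected edges record who hangs out with whom.
# Hypothetical features: [age, reads_fiction, loves_cycling, can_dance]
node_features = {
    "Andy":  [27, 1, 1, 0],
    "Beth":  [25, 1, 0, 1],
    "Carol": [31, 0, 1, 0],
    "Dave":  [26, 1, 1, 1],
}

# Undirected edges: each friendship is stored once but read as two-way.
edges = [("Andy", "Beth"), ("Andy", "Carol"), ("Beth", "Dave")]

def neighbors(node):
    """Return all nodes directly connected to `node`."""
    out = set()
    for a, b in edges:
        if a == node:
            out.add(b)
        if b == node:
            out.add(a)
    return out

print(neighbors("Andy"))  # {'Beth', 'Carol'}
```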
Seeing it like this, you might think to yourself: why don't we just use the same aggregation mechanism that we have seen before? And you're right! We can actually use that same intuition on graphs! Relaxing the properties of fixed structure and ordering, we can similarly gain a better understanding of each node by aggregating the messages received from all neighboring nodes. Now, hold on to your new message, Andy, as your friends complete their first round of message passing. Observe how the color of each message changes to represent the newly aggregated message. And, with each new round of message passing, more and more information gets passed around the graph. Over sufficient rounds of message passing, we will eventually obtain a final representation of each node in the graph that describes it well given the larger context. We can also view the rounds of message passing as a series of layers, just like in our convolutional neural network. Here we go from the input, to round one of message passing, and then round two of message passing. And this produces our final understanding of the nodes.

So... we now know how to understand each node better, but how does this allow us to help Andy again? Well, as humans, we might take a look at Andy's graph and, mentally or visually, process all the information such that people who are more compatible with Andy have a stronger association with him. In other words, in our mind, maybe people who have similar characteristics to Andy will be placed closer to him, and they are therefore stronger candidates. Similarly, in the process of reasoning over a graph, the graph neural network seeks to summarize all that information about each node into a numerical representation, where nodes which are more 'similar' will be closer to each other. We call this representation of nodes node embeddings, and we call the space of all possible representations - you guessed it - the embedding space.

But how does the network know how to come up with these embeddings, and where to place them in the embedding space? Neural networks need an appropriate measurable objective to guide the network into learning something useful for our task. In Andy's case, one way is to tell the network that the embeddings of people who Andy spends a lot of time with should be closer to him. Rearranging the expressions, we can quantify our objective into a loss function that tells the network how well it is doing. Adding in all the other objectives from other nodes and taking the average, we now have a working loss function.

We are finally ready to see how all these come together to find Andy's match! Firstly, from the input, we forward propagate through the message passing layers to get a final representation of each node. We then tabulate the loss and backpropagate it through the network to tell it where it went wrong. The network changes its computations and tries again, this time hopefully with better performance. We can do this as many times as we want, and after the network has learned embeddings of Andy's graph that are good enough to meet our objective, we can simply shortlist a bunch of people who are within a certain radius of him, maybe filter out some Andy deal-breakers, and rank them as potential matches according to numerical distance. And we are done! Our hopefully successful matchmaking task is actually a form of link prediction, where we task the network to recommend new potential edges between nodes.
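One way to picture the objective and the final ranking step is the small sketch below. The 2-D embeddings are made-up numbers standing in for what the network would learn, and since the video only describes the loss informally, the average edge distance here is just one simple choice.

```python
import numpy as np

# Hypothetical 2-D node embeddings after message passing (made-up numbers).
emb = {
    "Andy":  np.array([0.0, 0.0]),
    "Beth":  np.array([0.3, 0.1]),
    "Carol": np.array([1.2, 0.9]),
    "Dave":  np.array([0.4, 0.2]),
}
edges = [("Andy", "Beth"), ("Andy", "Carol"), ("Beth", "Dave")]

# Objective: people who spend time together (an edge) should sit close
# together, so the average distance across edges acts as a loss to minimize.
loss = np.mean([np.linalg.norm(emb[a] - emb[b]) for a, b in edges])
print(f"loss = {loss:.3f}")

# Link prediction after training: rank everyone else by distance to Andy.
ranked = sorted(
    (name for name in emb if name != "Andy"),
    key=lambda name: np.linalg.norm(emb[name] - emb["Andy"]),
)
print(ranked)  # closest first: ['Beth', 'Dave', 'Carol']
```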
Another popular task is node classification, where we use the representation of each node to predict a certain class. We also have clustering, where we break up the embedding space into different groups. And finally, we have graph classification, where we do some form of aggregation over all node representations of a graph and compare it with other graphs. This last task was used to discover Halicin!

You might notice that the key component of these strategies involves the need to learn good node embeddings. Going back, hopefully you can see that this actually hinges a lot on how we perform message passing between the nodes. So, let's dive in to take a closer look at each round of message passing. Message passing is actually a mathematical function, which we call 'f', that updates the receiver node using the messages from each neighboring sender node. In this case, the receiver node is Andy. Note that the aggregation includes the receiver node itself. In most cases we would want to give different importance values to each neighbor. For Andy's case, this could be based on the number of connections that each node has. We express these as constants, since they are fixed by the graph's structure. Similarly, depending on the final task, we might want to weigh each feature of the nodes differently. For Andy's match-finding example, maybe a combination of age and education is more important. Instead of hand-picking which features are more important, we can leverage the power of neural networks and learn them through backpropagation, as represented by the weights 'W'. In order to learn more complex patterns, we would also want to add some non-linearity to the function output. Going one step further, must we only use the sum operation to aggregate the messages? Well, since changing the permutation of the nodes doesn't change a node's neighbors, what we really want is a permutation-invariant function which consistently aggregates information from a node's neighborhood regardless of ordering.

Let's now express this slightly differently, to generalize it to any other node and not just Andy. Firstly, all the neighbors can be denoted as j, and so we can group all these terms and denote the weights as a function. Since we are updating the receiver node, we might want to treat its own features differently from the rest. We can also denote the outermost non-linear function as a general learnable function. And here we arrive at the first of the three general "flavors" of graph neural network layers: Convolutional, where the weights of neighbors are fixed based on the structure of the graph. Previously, we discussed that the weights for each feature can be learned. So, could we also learn the weight given to each neighbor according to their features? It turns out we can, bringing us to our second flavor: Attentional, where the weights of neighbors are learned based on the interactions of the features between the nodes. Finally, so far we have been aggregating a weighted combination of neighbors, but the most general form of message passing would be that each pair of nodes collaborates to produce a specific message between them.

One final note: recall this form of convolutional message passing? Computing the updates for each node sequentially like this sounds slow, doesn't it? In practice, we actually leverage linear algebra! A graph's edges can be summarized using a table which we call an adjacency matrix. Note that for our case, we would need the nodes to connect to themselves.
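Putting those ingredients together, one round of the convolutional flavor can be sketched as a per-node loop: fixed importance constants from the structure, learnable feature weights W, and a non-linearity. The 1/degree normalization and the ReLU below are illustrative choices, not necessarily the exact ones used in the video.

```python
import numpy as np

def message_passing_round(X, adj, W):
    """One convolutional-style round of message passing.

    X:   (n, d) node feature matrix
    adj: (n, n) adjacency matrix with self-loops (nodes connect to themselves)
    W:   (d, d_out) learnable weight matrix
    """
    n = X.shape[0]
    deg = adj.sum(axis=1)                      # number of connections per node
    H = np.zeros((n, W.shape[1]))
    for i in range(n):                         # update each receiver node...
        msg = np.zeros(W.shape[1])
        for j in range(n):                     # ...from each sender, incl. itself
            if adj[i, j]:
                c_ij = 1.0 / deg[i]            # fixed importance from structure
                msg += c_ij * (X[j] @ W)       # learnable feature weighting
        H[i] = np.maximum(msg, 0.0)            # non-linearity (ReLU)
    return H
```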
We can also stack all node features of a graph in a matrix like this. And we now have our three main elements, including the learnable weights 'W', with all these dimensions. But wait, do you notice something missing? You got it: the importance weights for each neighbor are actually applied by changing the adjacency matrix accordingly. And now, hopefully you can see how these functions are exactly the same! With this, we can use some highly optimized matrix multiplication library to perform message passing for all the nodes at once. Isn't that amazing?!

With that, we have come to the end of our introductory graph neural network video! I really hope that it was helpful to you in some way, and if it was, do give it a like or subscribe, so that YouTube's neural network can matchmake this video to people similar to you. Alright, till next time, bye!
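For completeness, here is one way the matrix view described just above might look in code: the neighbor importances are folded into a row-normalized adjacency matrix (again the simple 1/degree choice, purely for illustration), the node features are stacked into X, and one round of message passing for every node becomes two matrix multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 4 nodes, undirected edges 0-1, 0-2, 1-3, plus self-loops.
adj = np.array([[1, 1, 1, 0],
                [1, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1]], dtype=float)

X = rng.normal(size=(4, 3))     # stacked node features: 4 nodes, 3 features each
W = rng.normal(size=(3, 2))     # learnable weights

# Fold the per-neighbor importance (1/degree) into the adjacency matrix,
# then update every node at once: H = ReLU(A_hat @ X @ W).
A_hat = adj / adj.sum(axis=1, keepdims=True)
H = np.maximum(A_hat @ X @ W, 0.0)
print(H.shape)  # (4, 2): one new representation per node
```

Row i of H is the same update that the per-node loop in the earlier sketch would compute for node i, just produced for all nodes in one shot.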
Info
Channel: Alex Foo
Views: 74,359
Id: GXhBEj1ZtE8
Length: 14min 28sec (868 seconds)
Published: Sun Aug 22 2021