What if someone told you that a machine has
discovered an antibiotic that can treat previously untreatable strains of bacteria? Yes, you heard that right! The antibiotic is a chemical named Halicin,
and it was discovered by a pioneering machine-learning approach
called Graph Neural Networks. This fairly straightforward approach can also be used to find new drugs for other diseases
like cancer, improve transportation, solve problems related to social networks, and many
others. But what is a Graph Neural Network, and how
is it different from the neural networks that have been around for decades? Let's find out! Imagine that you're Cupid, and... Oh! Hi Andy! Ok, so you've been tasked to find a suitable
match for Andy to fall in love with. How would you approach this? First, we know Andy's age, gender, occupation,
that he reads fiction, loves cycling, and that he can't dance... Well, that's a bit harsh; how about 'learning
to dance'. Let's place all these features about Andy into
a list, like so. Next, let's gather information about his social
interactions - who does he hang out with most often, and who are some friends of those friends? Alright, we can now begin to analyze this
web of information to find Andy a suitable match. This web of data can actually be viewed as
what we call a Graph in computer science. But what is a Graph, exactly? A graph is a structured way to represent data. It consists of nodes to represent entities
like people, and edges to represent the relationships between
these entities. Like in Andy’s network, each node might
also have a set of features which describe it. Edges can either be undirected, meaning two-way, or directed, meaning one-way. Extending from undirected to directed graphs
is not difficult once we get the basic mechanisms, so let's stick with undirected edges for now.
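If you prefer to see this in code, here is a minimal sketch of how a small network like Andy's could be stored in plain Python; every name, feature value, and edge below is made up for illustration.

```python
# A minimal sketch of how Andy's network could be stored in plain Python.
# Every name, feature value, and edge below is made up for illustration.

node_features = {
    "Andy": [32, 1, 0, 1, 1, 0],  # e.g. age, gender, occupation code, reads fiction, cycles, dances
    "Beth": [29, 0, 2, 1, 0, 1],
    "Carl": [35, 1, 1, 0, 1, 0],
    "Dana": [31, 0, 0, 1, 1, 1],
}

# Undirected edges: each friendship is stored once as a pair.
edges = [("Andy", "Beth"), ("Andy", "Carl"), ("Beth", "Dana"), ("Carl", "Dana")]

# Build an adjacency list so each edge is visible from both endpoints.
neighbors = {name: set() for name in node_features}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

print(neighbors["Andy"])  # e.g. {'Beth', 'Carl'}
```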
But why Graph Neural Networks? If you're watching this, I'm sure you've heard of the expansive success that neural networks have had on many problems.
They are able to learn a variety of desired outputs from several types of inputs such
as numbers like stock prices, words that we use every day, or even images like this low-
resolution '4' over here. So, why don’t we just feed Andy’s network
through the all-powerful neural network and let it tell us his best match? Well… It turns out that neural networks have mostly
been successful on more structured data types. For example, numbers belong in a series, words
that we use every day are found in sentences, and images? Well, they actually come as a grid of pixels. But neural networks still struggle with more complex data such as graphs, which do not
have any fixed structure or ordering. So, in order to reason over graphs, researchers
went back to the drawing board to study what made neural networks so successful. The first layer in a common Convolutional
Neural Network takes each pixel and extracts information about
its region by aggregating the pixel together with its neighborhood. The next layer then extracts information about that region within the context
of its neighboring regions. Doing this over enough layers, the network
will be able to reason over different parts of the whole image. And then with some linear computations, it
can finally identify the relevant object. Take a second to think: how might you apply
this mechanism to graphs? If we look closer at an image, we can actually
view it as a special type of graph - a structured grid graph, where every node is a pixel that is connected
to its 8 neighboring pixels. Seeing it like this, you might think to yourself
- why don’t we just use the same aggregation mechanism
that we have seen before? And you're right! We can actually use that same intuition on
graphs! Relaxing the properties of fixed structure
and ordering, we can similarly gain a better understanding of each node by aggregating
the messages received from all neighboring nodes. Now, hold on to your new message, Andy, as
your friends complete their first round of message passing. Observe how the color of each message changes
to represent the newly aggregated message. And, with each new round of message passing,
more and more information gets passed around in the graph. Over sufficient rounds of message passing, we will eventually obtain a final representation
of each node in the graph that describes it well, given the larger context. We can also view the rounds of message passing
as a series of layers just like in our convolutional neural network. Here we go from the input, to round one of
message passing, and then round two of message passing. And this produces our final understanding
of nodes.
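As a rough sketch of what those rounds might look like in code, here each node simply averages its own feature vector with its neighbors' vectors, and we repeat for two rounds, mirroring the two layers above. Averaging is just one illustrative choice of aggregation, not necessarily the exact update animated in the video.

```python
import numpy as np

# Toy undirected graph: 4 nodes with 2 made-up features each.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.5, 0.5],
                     [1.0, 1.0]])

def message_passing_round(h, neighbors):
    """One round: each node averages its own vector with its neighbors' vectors."""
    new_h = np.zeros_like(h)
    for node, nbrs in neighbors.items():
        group = [h[node]] + [h[j] for j in nbrs]  # include the receiver itself
        new_h[node] = np.mean(group, axis=0)
    return new_h

h = features
for _ in range(2):  # two rounds of message passing, like the two layers above
    h = message_passing_round(h, neighbors)
print(h)  # each row is a node's aggregated representation
```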
So… We now know how to understand each node better, but how does this allow us to help Andy again? Well, as humans, we might take a look at Andy's
graph, and mentally or visually, we will process all the information
such that people who are more compatible with Andy will have a stronger association with him. In other words, in our mind, maybe people who have similar characteristics to Andy will
be placed closer to him - and they are therefore stronger candidates. Similarly, in the process of reasoning over a graph, the graph neural network seeks to
summarize all that information about each node into a numerical representation, where nodes which are more 'similar'
will be closer to each other. We call these representations
of nodes 'node embeddings', and we call the space of all possible representations - you guessed
it - the embedding space. But how does the network know how to come up with these embeddings, and where to place
them in the embedding space? Neural networks need an appropriate measurable
objective to guide the network into learning something useful for our task. In Andy's case, one way is to tell the network
that embeddings of people whom Andy spends a lot of time with should be closer to him.
Rearranging the expressions, we can quantify our objective into a loss function to tell
the network how well it is doing. Adding in all the other objectives from other nodes
and taking the average, we now have a working loss function.
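One possible way to put such an objective into code is a contrastive-style sketch that pulls the embeddings of connected nodes together and pushes unconnected ones apart; the margin and the exact formula below are illustrative choices, not necessarily the ones used in the video.

```python
import numpy as np

def pairwise_loss(emb, edges, non_edges, margin=1.0):
    """Illustrative objective: connected nodes should sit close together,
    while unconnected nodes should be at least `margin` apart."""
    pull = sum(np.sum((emb[i] - emb[j]) ** 2) for i, j in edges)
    push = sum(max(0.0, margin - np.linalg.norm(emb[i] - emb[j])) ** 2
               for i, j in non_edges)
    return (pull + push) / (len(edges) + len(non_edges))

# Tiny example with made-up 2-D embeddings for three people.
emb = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0]])
print(pairwise_loss(emb, edges=[(0, 1)], non_edges=[(0, 2)]))
```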
We are finally ready to see how all these come together to find Andy's match! Firstly, from the input, we forward propagate through the message-passing layers to get a final representation of each node. We then compute the loss, and backpropagate it through the
network to tell it where it went wrong. The network changes its computations, and
tries again, this time hopefully with better performance. We can do this as many times as we want, and, after the network has learned embeddings of
Andy's graph that are good enough to meet our objective, we could simply shortlist a
bunch of people who are within a certain radius of him, maybe filter out some of Andy's deal-
breakers, and rank them as potential matches according to numerical distance. And, we are done!
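In code, that shortlist-and-rank step could look roughly like this; the radius and the 2-D embeddings are made-up values for illustration.

```python
import numpy as np

# Learned 2-D embeddings for Andy and a few candidates (made-up values).
embeddings = {
    "Andy": np.array([0.0, 0.0]),
    "Beth": np.array([0.3, 0.1]),
    "Carl": np.array([1.5, 1.2]),
    "Dana": np.array([0.2, 0.4]),
}

radius = 0.6
candidates = [
    (name, np.linalg.norm(vec - embeddings["Andy"]))
    for name, vec in embeddings.items() if name != "Andy"
]
shortlist = sorted((c for c in candidates if c[1] <= radius), key=lambda c: c[1])
print(shortlist)  # closest candidates first, e.g. Beth, then Dana
```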
Our hopefully successful matchmaking task is actually a form of link prediction, where we task the network to recommend new potential
edges between nodes. Another popular task is node classification - where
we use the representation of each node to predict a certain class. We also have clustering - where we break up the embedding space into different groups. And finally, we have graph classification - where we do some form of aggregation over
all node representations of a graph, and compare it with other graphs. This last task was used to discover Halicin!
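For graph classification, one common choice for that aggregation is to pool all node embeddings into a single graph-level vector and hand it to a classifier; here is a small sketch using mean pooling, with made-up numbers.

```python
import numpy as np

def graph_readout(node_embeddings):
    """Collapse a whole graph into one vector by averaging its node embeddings.
    (Mean pooling is just one illustrative choice; sum or max also work.)"""
    return node_embeddings.mean(axis=0)

graph_a = np.array([[0.1, 0.9], [0.2, 0.8]])              # made-up node embeddings
graph_b = np.array([[0.7, 0.1], [0.9, 0.2], [0.8, 0.3]])

vec_a, vec_b = graph_readout(graph_a), graph_readout(graph_b)
print(vec_a, vec_b)  # graph-level vectors a classifier could compare
```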
You might notice that the key component of all these strategies is the need to learn good node embeddings. Going back, hopefully you can see that this hinges a lot on how we perform message
passing between the nodes. So, let's dive in and take a closer look at
each round of message passing. Message passing is actually a mathematical
function, which we call 'f', that updates the receiver node using the
messages from each neighboring sender node. In this case, the receiver node is Andy. Note that the aggregation includes the receiver
node itself. In most cases we would want to give different
importance values to each neighbor. For Andy’s case, this could be based on
the number of connections that each node has. We express these as constants, since they are fixed by the graph's structure. Similarly, depending on the final task, we might want to weigh each feature
of the nodes differently. For Andy’s match-finding example, maybe a combination of age and education is more
important. Instead of hand-picking which features are
more important, we can leverage the power of neural networks and learn them through
backpropagation as represented by weights 'W'. In order to learn more complex patterns, we would also want to add some non-linearity
to the function output.
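Putting those pieces together, the update we have been building up can be written roughly as follows; the notation is a conventional choice rather than the video's exact symbols, with constants c_ij weighing each neighbor, W weighing the features, and sigma as the non-linearity.

```latex
% Update for receiver node i: sum over its neighbors j (including i itself),
% with fixed importance constants c_{ij}, learnable feature weights W,
% and a non-linearity \sigma applied to the result.
h_i = \sigma\Big( \sum_{j \in \mathcal{N}(i) \cup \{i\}} c_{ij} \, W x_j \Big)
```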
Going one step further, must we only use the sum operation to aggregate the messages? Well, since permuting the
nodes doesn't change a node's neighbors, what we really want is a permutation-invariant
function which consistently aggregates information from a node’s neighborhood regardless of
ordering.
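A quick way to see what permutation-invariant means in code: sum, mean, and max give the same answer no matter how the neighbors' messages are ordered, whereas something order-sensitive like concatenation would not.

```python
import numpy as np

messages = np.array([[1.0, 2.0], [3.0, 0.0], [0.5, 0.5]])  # messages from 3 neighbors
shuffled = messages[[2, 0, 1]]                              # same messages, different order

print(np.allclose(messages.sum(axis=0), shuffled.sum(axis=0)))  # True: sum is invariant
print(np.allclose(messages.max(axis=0), shuffled.max(axis=0)))  # True: max is invariant
```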
Let's now express this slightly differently, to generalize it to any other node and not just Andy's. Firstly, all the neighbors can be denoted as j, and so we can group all these terms
and denote the weights as a function. Since we are updating the receiver node, we
might want to treat its own features
differently from the rest. We can also denote the outermost non-linear
function as a general learnable function. And here, we arrive at one of the three general
“flavors” of graph neural network layers - Convolutional - where the weights of neighbors
are fixed based on the structure of the graph. Previously, we discussed that the weights
for each feature can be learned. So, could we also learn the weight given to
each neighbor according to their features? Turns out we can, bringing us to our second
flavor - Attentional - where the weights of neighbors are learned based on the interactions
of the features between the nodes. Finally, so far, we have been aggregating
a weighted combination of neighbors, but the most general form of message passing would
be for each pair of nodes to collaborate and produce a specific message between them.
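Written side by side in common, slightly simplified notation, where the circled plus stands for any permutation-invariant aggregator over the neighbors j of node i, the three flavors differ only in how each neighbor's contribution is formed; again, the symbols are a conventional choice rather than the video's exact notation.

```latex
% Convolutional: neighbor weights c_{ij} fixed by the graph's structure.
h_i = \phi\Big(x_i,\ \bigoplus_{j \in \mathcal{N}(i)} c_{ij}\,\psi(x_j)\Big)

% Attentional: neighbor weights a(x_i, x_j) learned from the two nodes' features.
h_i = \phi\Big(x_i,\ \bigoplus_{j \in \mathcal{N}(i)} a(x_i, x_j)\,\psi(x_j)\Big)

% Message passing: each pair of nodes produces its own message \psi(x_i, x_j).
h_i = \phi\Big(x_i,\ \bigoplus_{j \in \mathcal{N}(i)} \psi(x_i, x_j)\Big)
```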
One final note: recall this form of convolutional message passing? Computing the updates for each node sequentially
like this sounds slow, doesn't it? In practice, we actually leverage linear
algebra! A graph's edges can actually be summarized
using a table which we call an adjacency matrix. Note that for our case, we would need the
nodes to connect to themselves. We can also stack all node features of a graph
in a matrix like this. And we now have our three main elements, including
the learnable weights, 'W', with all these dimensions. But wait, do you notice something missing? You got it, the importance weights for each
neighbor are actually applied by changing the adjacency matrix accordingly. And now, hopefully you can see how these functions are exactly the same! With this, we can use a highly optimized
matrix-multiplication library to perform message passing for all the nodes at once. Isn't that amazing?!
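Here is a small numpy sketch of that vectorized form: an adjacency matrix with self-loops, a node-feature matrix, and one learnable weight matrix, so a whole round of message passing becomes a couple of matrix multiplications. The degree normalization below stands in for the importance constants and is only one possible choice.

```python
import numpy as np

# Adjacency matrix of a 4-node undirected graph, plus self-loops so
# each node also receives its own message.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_hat = A + np.eye(4)

# One simple choice of importance weights: normalize each row by its degree.
A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)

X = np.random.rand(4, 6)  # node features: 4 nodes, 6 made-up features each
W = np.random.rand(6, 3)  # learnable weights mapping 6 features down to 3

# One round of message passing for every node at once.
H = np.maximum(0.0, A_norm @ X @ W)  # ReLU as the non-linearity
print(H.shape)  # (4, 3): one new representation per node
```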
With that, we have come to the end of our introductory graph neural network video! I really hope that it was helpful to you in
some way, and if it was, do give a like or subscribe, so that YouTube's neural network
can matchmake this video to people similar to you. Alright, till next time, bye!