Graph Neural Networks (GNN) using Pytorch Geometric | Stanford University

Video Statistics and Information

Captions
For the first part, I'll give a brief introduction on how to use PyTorch and some Python basics to build machine learning pipelines. We'll start with a very simple example, MNIST classification, which everyone in machine learning should be familiar with. The packages we're going to use are the usual ones: torch.nn, which we abbreviate as nn, and torch.nn.functional, which we abbreviate as F. nn contains a lot of neural network modules, and functional contains a lot of function definitions related to neural network operations. Two of the imports are specific to the MNIST example, which you don't have to care about, and the last one is sklearn.metrics, which we use to perform evaluations of the model.

The second part just loads the dataset; it takes a little while, so I ran it before I came. The thing you have to know here is that before you start training, you need a sense of the concept of a Dataset. This is the data structure that PyTorch keeps for you to feed inputs to the model. In this case it's the MNIST dataset, and later on we'll use CiteSeer, ENZYMES, and IMDB, which are graph datasets, but they all come in the format of this PyTorch Dataset object. Here is where we define the training set, and this is the loader that loads the training set. A Dataset inherits from an abstract dataset class that behaves like an iterable, and the main functions you implement when you build your own dataset are __len__, the length of the iterable, and indexing, so you can retrieve specific examples. For example, this train_set object is just a PyTorch dataset: you can print its length, and you can index into it, so train_set[10] retrieves the tenth example. (It's downloading the data, so it takes a moment.)

That's how you iterate over a dataset one example at a time, but in practice what we do is mini-batch training: we run stochastic gradient descent, which means at every iteration you take in multiple examples. The DataLoader is how you take multiple examples at a time, say after shuffling. So that's the basic concept of the Dataset: you can print the length of the training set, and you can print a specific example; each example here is an image array of 28 by 28 that represents an MNIST digit.
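As a concrete sketch of the Dataset/DataLoader pattern just described (the transform, root path, and batch size here are my own assumptions, not necessarily what the notebook uses):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# a Dataset wraps the data and implements __len__ and __getitem__
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
print(len(train_set))          # number of training examples
image, label = train_set[10]   # index into a single example
print(image.shape, label)      # torch.Size([1, 28, 28]) and the digit label

# a DataLoader wraps the Dataset and yields shuffled mini-batches
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
images, labels = next(iter(train_loader))
print(images.shape)            # torch.Size([32, 1, 28, 28])
```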
Once you've constructed the dataset, you build the model, and then you link the model and the dataset together to do training and testing. Here is a very simple example: an MLP with one convolution layer. It's a really simple model, just to demonstrate how to use PyTorch to build this kind of network. (Show of hands: how many people are familiar with PyTorch concepts? About half, OK, so I'll explain in a bit more detail.)

Every model you build in PyTorch inherits from nn.Module, which is the superclass of neural network models. It provides a lot of good things that you'll see later: an easy interface for optimization and an easy interface for running inference and training on the model. There are mainly two functions you have to implement: the initialization and the forward function. The initialization defines all the parameters used in the model; this is where you put all the trainable parameters, and later on your optimizer can retrieve these parameters and optimize them. The second part is the forward function, which tells you how to construct the computation graph from the input to the output.

In this case the parameters we care about are a convolution (which you won't need in graph neural nets), then a linear layer, and a third layer which is also linear; these are all the parameters we use to train this model. A linear layer has two arguments, the input dimension and the output dimension, and you'll see a lot of linear layers in graph neural nets. Here the input dimension is 26 by 26, the size of the feature map after the convolution, times 32, the number of output channels of the convolution, and the output dimension is 128, so I have a hidden layer of 128 neurons. The second linear layer takes those 128 hidden units and outputs 10, because this is a classification task: we're classifying which digit the image shows, the possible digits are 0 through 9, so it's a 10-class classification task.

The forward function takes in the input tensor x and builds the computation graph. The notion of a computation graph comes from TensorFlow, but the good thing about PyTorch, which also makes it easier to use, is that you don't have to pre-construct the entire computation graph before feeding in your data. You can, on the fly, retrieve data from the computation graph, look at it, maybe even modify it, and feed it back in; it's dynamic. (Question about the convolution: it's not going to be important for this class, but the convolution has three parameters: the kernel size, i.e. the filter dimensions, which is 3 by 3 here; the input channels, which is 1 because this is a grayscale image; and the output channels, which is 32. It's not going to be used for the graph networks; it's just there for image classification.)
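A minimal sketch of the model as described; the exact layer sizes are reconstructed from the spoken description (a 28x28 MNIST image becomes 26x26 after a 3x3 convolution), so treat the dimensions as assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # parameters are declared in __init__ so the optimizer can find them later
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
        self.fc1 = nn.Linear(32 * 26 * 26, 128)   # 26x26 feature map times 32 channels -> 128
        self.fc2 = nn.Linear(128, 10)             # 10-way digit classification

    def forward(self, x):
        # forward builds the computation graph dynamically, one call at a time
        x = F.relu(self.conv1(x))
        x = x.view(x.size(0), -1)                  # flatten each example to one dimension
        x = F.relu(self.fc1(x))
        return self.fc2(x)                         # raw logits; softmax/cross-entropy come later
```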
(Question about the forward function: the 32 there is the batch size, because everything is batched; and again, it's variable.) There's also a flatten here, which flattens the multi-dimensional array into one dimension per example; this can be very useful in graph neural nets too, by the way. And it's not the only way to do it: flatten is an operation that connects the input tensor x to a different tensor, which we then assign back to x, so you do the computation on x and put the result back into x. That can be a little expensive, so what we often do instead is use the view function, tensor.view: we say we want to reshape it to 32 by whatever the rest of the dimensions are, which is just -1. This does essentially the same thing as flatten, but it doesn't perform an actual computation that changes the shape of x; it just provides a different view of the tensor. A different view means you look at the data differently, but it's the same memory, so it's a little more efficient; computationally the two are equivalent. You can look up view in the PyTorch tensor documentation.

After this computation we apply a nonlinearity and produce the logits; the logits are the ten class scores, since the output dimension is ten. We pass them through a softmax, and the class with the largest value is the predicted class. For those of you not familiar with softmax, it's a differentiable kind of max: you provide a score for each candidate class, and softmax turns it into the probability of the model choosing that class. Then, with this output, we can use cross-entropy to compute the actual loss. Are the PyTorch basics clear so far? Any questions?

OK, so once we define the model we can start actually running things. But first, let's look at a very simple example where we don't train anything, just look at what these tensors are. We can define a = np.ones((2, 2)); this is not a tensor, it's a NumPy array (everyone is familiar with NumPy, right?). We can define b the same way, a 2-by-2 matrix of ones; if you print b it's a 2-by-2 matrix of ones. In NumPy we can do operations like matmul: np.matmul(a, b) gives you the matrix product of a and b. But that's NumPy, which only uses the CPU of your computer. The good thing about PyTorch is that you can do this linear algebra on the GPU; you'll notice our runtime type here is set to GPU, which means we have a GPU available, so we can do the same computation with PyTorch on the GPU.
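A quick sketch of the two things just mentioned, the NumPy matrix multiply and the flatten-versus-view distinction (the shapes are illustrative):

```python
import numpy as np
import torch

# NumPy matrix multiplication runs on the CPU
a = np.ones((2, 2))
b = np.ones((2, 2))
print(np.matmul(a, b))                 # [[2., 2.], [2., 2.]]

# flatten vs. view on a batch of 32 feature maps
x = torch.randn(32, 1, 26, 26)
flat = torch.flatten(x, start_dim=1)   # flattens every dimension after the batch dimension
v = x.view(32, -1)                     # reinterprets the same underlying memory as [32, 676]
print(flat.shape, v.shape)             # both torch.Size([32, 676])
```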
The way we do that is to define this thing called a tensor, the PyTorch object used to perform computation. We make a tensor ta as torch.tensor(a), passing in the NumPy array, so if you print ta it's a PyTorch tensor wrapped around the NumPy data; it also tells you the dtype, in this case float64. We can do the same for the other one: we can do torch.tensor(b), or we can do torch.ones, and you'll notice a lot of similarity between calling NumPy functions and calling the PyTorch functions. If we print tb it looks the same, and we can specify the dtype similarly; this is all very close to the NumPy interface, so it's easy to use. Now you can perform the computation using torch functions: instead of calling np.matmul, I call torch.matmul, which gives you the matrix multiplication in PyTorch, or, as a simple shortcut, ta @ tb also gives the matrix multiplication.

But these tensors are still living on the CPU, because we just used the default way to construct them. The way you use the GPU is, first, to check whether a GPU is available with torch.cuda.is_available(). Because our runtime has a GPU, it says CUDA is available, and since it's available we can transfer the data onto the GPU. The simplest way is to call .cuda(), which puts the tensor onto a GPU; if you now print ta it shows something like device='cuda:0'. The first transfer is a bit slow, but you'll notice there's now a device attribute that wasn't there before, which means it's on the GPU. A lot of times you want to be specific about which GPU to use, and in that case you call .to(device), where device is something like torch.device('cuda:0'); in this case we only have one GPU, so it's equivalent and gives the same result. Is there any question on using the GPU with PyTorch tensors? OK, great.
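The same steps in a small sketch (the device index is an assumption; Colab typically exposes a single GPU as cuda:0):

```python
import numpy as np
import torch

a = np.ones((2, 2))
ta = torch.tensor(a)                       # wraps the NumPy data; dtype comes out as float64
tb = torch.ones(2, 2, dtype=torch.float64)
print(torch.matmul(ta, tb))                # matrix multiply in PyTorch (still on CPU)
print(ta @ tb)                             # shorthand for the same thing

print(torch.cuda.is_available())           # True if the runtime has a GPU
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    ta = ta.cuda()                         # move to the default GPU...
    tb = tb.to(device)                     # ...or to an explicitly chosen device
    print(ta.device, tb.device)            # both report cuda:0
```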
So here it's the same thing: we define the device as cuda:0 if CUDA is available, otherwise we use the CPU as the device, and we use the same .to(device) call as before to move things to the device. This is how we construct the model, and then we define a loss; this is the loss we use to compute the difference between the predicted softmax and the ground-truth label, which is one of the 10 digits. And then this is the optimizer. There are a lot of optimizers available; you can look at torch.optim and you'll find SGD, Adagrad, Adadelta, and a bunch of others, and you can try which of these fits your problem. One thing worth looking at: a lot of times you want the learning rate to anneal over time. Initially the model isn't trained very well, so you want the learning rate to be big so you can take big steps, and later on you want it to decay. For that there's the scheduler in the optimization package; there are different kinds of annealing schedules, like step annealing, linear annealing, exponential annealing, and an interesting one called cosine annealing. You're encouraged to look at these; it's not strictly necessary, but it could improve your performance.

Once you've run this, you can start the training. We have everything ready: the model, the dataset, the optimizer, and the loss, so now we can do the training. The training loop essentially goes like this. There's a fixed number of epochs we go through; an epoch just represents one pass through the entire dataset, and we're also recording the loss and so on. Then, for each epoch, we go through the whole dataset by enumerating over the data loader; remember the DataLoader is a wrapper around the Dataset, which you can index into, and this enumeration means that at every iteration you extract a number of examples equal to the batch size (the batch size is also defined in the loader). At every iteration, the thing retrieved from the dataset is, in this case, an image and a label. We put everything onto the GPU, and then we run the model: when you see the model followed by parentheses with an input, this is calling the forward function of the model, and the input here is x, the image, so we put the image into the model. The loss is of course the cross-entropy loss, which is very standard.

But there's an important step that you really need to do, which is zero_grad. zero_grad sets the gradients of all the parameters to zero. This is very important, because if you don't do it, PyTorch by default accumulates your gradients: say you've already computed the gradients for all your parameters in the previous iteration; if you don't zero them, it adds the gradients of the current computation on top, so you get the sum of the previous gradients plus the current ones, and that's usually not what you want. So you really need to call zero_grad unless you specifically want that accumulation behavior. Then loss.backward() is where we compute the gradients: backward means backprop, and backprop computes the gradients for all the parameters you defined. In case you forgot where the parameters are: for a linear layer, the parameters are the weight and the bias; the number of elements in the weight equals the input dimension times the output dimension, and the number of elements in the bias equals the output dimension. When we call backward, the gradients of all these parameters are computed. Then we call optimizer.step(): when we call optimizer.step(), we're saying that for all the parameters for which we already have gradients, we want to perform one optimization step with the learning rate specified in the optimizer.
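Putting that together, a minimal sketch of the setup and training loop just described; it assumes the SimpleNet sketch and train_loader from earlier, and the choice of SGD with a StepLR scheduler is mine, not necessarily the notebook's:

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = SimpleNet().to(device)                 # SimpleNet from the earlier sketch
criterion = nn.CrossEntropyLoss()              # compares logits against integer labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # optional annealing

for epoch in range(5):                         # number of passes over the training set
    total_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()                  # clear gradients accumulated by the previous step
        outputs = model(images)                # calls model.forward(images)
        loss = criterion(outputs, labels)
        loss.backward()                        # backprop: fills .grad for every parameter
        optimizer.step()                       # one update using the current learning rate
        total_loss += loss.item()
    scheduler.step()                           # decay the learning rate between epochs
    print(f"epoch {epoch}: loss {total_loss:.3f}")
```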
Maybe I should also explain the parameters() call. It's perfectly fine to pass an explicit list of parameters to the optimizer, but model.parameters() is an abbreviation to help you: it says "I want to optimize all the parameters defined in __init__"; you just call parameters() and it gives you all of them. How do you decide whether a parameter is included in model.parameters()? Does anyone know: say I write x = nn.Linear(5, 5) inside __init__; is x included in model.parameters()? No, and the reason is that I didn't add self. If I write self.x = nn.Linear(5, 5), then it is in model.parameters(). This is a very specific thing about nn.Module: you have to assign the layer to self for its parameters to be included, so that the optimizer can optimize them. There's also another caveat: say I want a list of layers, like a three-layer network built with a Python list comprehension, [nn.Linear(...) for i in range(3)]. Does anyone know whether that is included in the parameter list? It will also not be included, even if you assign it to self, because it's a plain Python list. What you do instead is use nn.ModuleList and put the list in there, in which case it will be in model.parameters() and you can optimize it. Sometimes you don't see the loss going down, or you don't see the parameters changing, and it could simply be because the parameters are not actually in model.parameters(), so you never optimized them. It's good to know this and how to debug it: make sure the list of parameters is the list of things you want to optimize.
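A small sketch that makes the registration rules concrete (the layer sizes are arbitrary):

```python
import torch.nn as nn

class Example(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5, 5)                               # registered: assigned to self
        x = nn.Linear(5, 5)                                     # NOT registered: plain local variable
        self.plain_list = [nn.Linear(5, 5) for _ in range(3)]   # NOT registered: plain Python list
        self.module_list = nn.ModuleList(
            [nn.Linear(5, 5) for _ in range(3)])                # registered: nn.ModuleList

model = Example()
print(sum(p.numel() for p in model.parameters()))
# counts only self.fc and self.module_list; the local variable and the plain list are
# invisible to the optimizer, which is a common reason a loss never goes down
```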
OK, so now we can perform training; this is essentially the entire training loop I just explained, up through optimizer.step(), plus accumulating the loss, which is simple Python. Note the evaluation metric here as well; I'll explain it later. But this is how you train your network: you go through a number of epochs; here I'm only recording epoch 0, but you can go through the data multiple times to get results.

Let's talk about testing. I assume you know the basic concepts of a training set and a test set: the idea is that you want to evaluate how good the model is, but you don't want to use the training set for that, because you trained on it and there's a possibility that you overfit to the training set while the model doesn't generalize to data it hasn't seen. That's what the test set is for. One thing about the test phase is that the evaluation metric does not have to be differentiable: your training loss has to be differentiable so you can do gradient descent, but your test metric need not be, and these are typically things like precision, recall, and F1. Any questions? OK, cool. This cell is just training, so I can kill it, but you would see the loss going down and so on.

In the test phase we want to evaluate how good the model is, so we take a different set of data: here, see the test_loader. The test_loader is constructed from the test set, not the training set; you'll notice from the previous code section that these are two different datasets. The idea is the same: we enumerate over the entire test set, move the images and labels to the GPU, and produce the model output, which is the same softmax score vector computed by the model. Let me decompose this: this is the model output, and out of all the scores we pick the maximum as our class. Say the model predicts 0 with probability 0.1, 1 with probability 0.1, and 2 with probability 0.8; then the model thinks the digit is 2, so we pick the argmax. You can treat that as your predictions: torch.argmax over the outputs gives you the index of the maximum, and of course you can flatten as well. So these are your predictions and these are your labels; we can print both, running it for just one iteration so you can see what it looks like. The model predicts seven, the label is actually seven; it's all correct.

But we want to evaluate based on these model outputs and labels, and here we can make use of the library I mentioned at the beginning: sklearn.metrics, which already defines a lot of metric functions, for example accuracy, precision, and so on. Here we're computing accuracy ourselves with tensor operations, but we don't have to; we can use sklearn. The thing is, sklearn is implemented on top of NumPy, so you have to convert the tensor back to NumPy, and the way you do that is to call .cpu(), because the tensor is currently on the GPU, and then .numpy() to convert it to a NumPy array. Is that clear? I do the same for the labels: labels.cpu().numpy(). Then I can use my favorite, sklearn.metrics; there are a lot of different functions in there, and almost every evaluation method you can think of is already implemented. For example precision: remember, precision is true positives over true positives plus false positives. That's how you compute the precision score, and you can print it; I'll just run it for one iteration. Since this is multi-class classification and the default averaging is binary, which only works for binary classification, you want average='micro', the micro-average over the 10 classes. Now the precision prints as 1 for this batch, and you can compute other things the same way: recall, the PR curve, everything; you can do it all this way.
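A sketch of the evaluation loop just described, using sklearn.metrics on CPU copies of the tensors (it assumes the model and device from the earlier sketches):

```python
import torch
from sklearn import metrics
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

test_set = datasets.MNIST(root="./data", train=False, download=True,
                          transform=transforms.ToTensor())
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():                          # no gradients needed at test time
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        preds = torch.argmax(outputs, dim=1)   # class with the highest score
        # sklearn works on NumPy, so move back to CPU and convert
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

print(metrics.accuracy_score(all_labels, all_preds))
# precision_score defaults to binary averaging; micro-average it over the 10 classes
print(metrics.precision_score(all_labels, all_preds, average="micro"))
print(metrics.recall_score(all_labels, all_preds, average="micro"))
```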
OK, let me remove these; that was just to show that you can use sklearn to perform all these evaluations. In this cell we're computing accuracy: just now I printed precision and recall for one mini-batch, but here I aggregate over all the iterations of the dataset to get the overall accuracy. So that's all for the basic PyTorch tutorial. Are there any questions related to training models in PyTorch?

(Question: how are the loss and the optimizer linked together?) They are not linked together directly. What you see here is that you compute gradients of the loss: when we call loss.backward(), it computes the gradients of all the parameters in the model; that's the signal from the loss function. Once you have the gradients for all the parameters, the optimizer only cares about parameters: you'll notice the optimizer takes in the set of parameters, and those parameters already have their gradients computed, so once you call optimizer.step(), it takes all the parameters, looks at their gradients, and performs a one-step optimization. Exactly: the loss is not directly related to the optimizer; the loss gives you the gradients of all the parameters, and based on those gradients the optimizer performs one step of optimization. The loss and the optimizer themselves do not call each other. The general idea is that you compute gradients for all the parameters, and then the optimizer updates based on those gradients; it doesn't need to look at the loss to compute its step because it already knows the gradients of the parameters.

OK, so next we'll cover PyTorch Geometric. Before covering it, I'll just say this is one library we can use, which we think is easy to use, but it's not something you have to follow. There are a bunch of libraries, like Graph Nets from Google if you're interested in TensorFlow, and there's also DGL from Amazon, which supports both PyTorch and MXNet. There are different frameworks you can use, and this is just one example of how to do it. If you want, you can also implement it from scratch; it's actually not as hard as you might think, because, as one of your lecture slides shows, the computation is essentially a matrix multiplication of the adjacency matrix, the weight matrix, and the feature matrix, and you can do that matrix multiplication in PyTorch, so you can even write it from scratch. This library is just a way to provide additional utilities to help you write it.

I've already installed the libraries here, and these are the dependencies we need for the PyTorch Geometric model. Let me briefly explain: you've already seen some of them, and these two are specific to PyTorch Geometric. The torch_geometric.nn module is the neural network module specific to PyTorch Geometric; it implements a lot of things like GraphConv, GINConv, and different kinds of graph convolution models. The torch_geometric.utils module provides graph utility functions.
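For reference, a sketch of the kinds of imports the notebook pulls in; the exact aliases and dataset classes are assumptions based on the description above and what follows:

```python
import torch
import torch_geometric.nn as pyg_nn          # graph layers: GCNConv, GINConv, pooling, ...
import torch_geometric.utils as pyg_utils    # graph utilities (self-loops, conversions, ...)
from torch_geometric.datasets import TUDataset, Planetoid   # IMDB-BINARY / ENZYMES, Cora / CiteSeer
from torch_geometric.data import DataLoader  # batches whole graphs together
                                              # (torch_geometric.loader.DataLoader in newer versions)
import networkx as nx                         # only for visualizing graphs
from tensorboardX import SummaryWriter        # logging losses/metrics for TensorBoard
from sklearn.manifold import TSNE             # 2-D projection of learned embeddings
import matplotlib.pyplot as plt
```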
We also use NetworkX just to visualize the graphs; it's not important, and not necessary. Then the optimizers, and these are the datasets we'll use. The good thing is that a lot of standard graph datasets are already ported to these libraries, so you can download them automatically from PyTorch Geometric, but if you have a custom dataset it's also easy to build one, as I showed with the PyTorch Dataset earlier. And here are a bunch of things that help with visualization. I like tensorboardX, which is basically an interface between PyTorch and TensorBoard. For those who haven't used TensorBoard, it's a way to track your training over time: you can see a plot of the loss with respect to the number of epochs, or a metric like accuracy versus epochs, and so on; it's very easy to track those during training, and I'll show how. This one is for t-SNE visualization: you want to embed the learned embeddings into a 2D space so they're easy to visualize, and this is one way to do it. And finally this is just matplotlib for plotting.

OK, so let's start with how to write a general model for graph convolutions, assuming you use a built-in convolution operation that PyTorch Geometric already defines. This is just a stack of graph convolutions, a very simple model. As you already know from the previous example, a model inherits from nn.Module (the full torch.nn path isn't necessary since we imported it as nn), so there's the initializer, which you have to implement, and the forward function, which you have to implement. It's the same as before, but there are a few more things to put in here. For example, here I'm using the ModuleList I talked about before; this is where you put all your convolution operations. I'm putting in a convolution that goes from the number of input dimensions to the number of hidden dimensions, and I'm also adding two additional layers of convolution that go from hidden to hidden: the input is the hidden dimension and the output is also the hidden dimension. If we look at build_conv_model, it's really simple: if it's node classification, I use the simplest GCNConv, the graph convolutional network layer, and if it's graph classification I use a GIN-style convolution for the graph model; we'll look later at how to implement such a layer ourselves, so you can even plug in your own custom conv. That custom layer is actually what we're asking you to implement in the homework: the class that subclasses the MessagePassing superclass.

After the convolutions I also have additional parameters, in a Sequential. Sequential is very similar to ModuleList: we define a list of layers, but the difference is that if we call a Sequential, it executes all the layers in order, whereas with a ModuleList you can execute them out of order, or in whichever way you like. So this Sequential always executes sequentially: it performs the first linear layer, then dropout, then the second linear layer. Any questions on that? OK, good.
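Here is a consolidated sketch of that model; the forward pass, pooling, and loss it uses are explained in the next couple of paragraphs, and the specific hidden size, dropout rate, and number of layers are my own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.nn as pyg_nn

class GNNStack(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, task="node"):
        super().__init__()
        self.task = task
        # ModuleList of graph convolutions: input -> hidden, then two hidden -> hidden layers
        self.convs = nn.ModuleList([self.build_conv_model(input_dim, hidden_dim)])
        for _ in range(2):
            self.convs.append(self.build_conv_model(hidden_dim, hidden_dim))
        # post-message-passing MLP; a Sequential always runs its layers in order
        self.post_mp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Dropout(0.25),
            nn.Linear(hidden_dim, output_dim))

    def build_conv_model(self, in_dim, out_dim):
        if self.task == "node":
            return pyg_nn.GCNConv(in_dim, out_dim)          # plain GCN layer for node tasks
        # for graph classification, a GIN-style convolution wrapping a small MLP
        return pyg_nn.GINConv(nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)))

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        if x is None:                                        # no node features: fall back to a constant
            x = torch.ones(data.num_nodes, 1, device=edge_index.device)
        for conv in self.convs:
            x = conv(x, edge_index)
            x = F.relu(x)
            x = F.dropout(x, p=0.25, training=self.training) # dropout behaves differently at test time
        if self.task == "graph":
            x = pyg_nn.global_mean_pool(x, batch)            # average node embeddings per graph
        emb = x                                              # kept so we can visualize embeddings later
        x = self.post_mp(x)
        return emb, F.log_softmax(x, dim=1)

    def loss(self, pred, label):
        return F.nll_loss(pred, label)                       # NLL on the log-softmax output
```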
We've already talked about the forward function, but this forward function is somewhat specific to graph neural networks. It takes in data, which is your input: this data is an element of the dataset, remember the PyTorch Geometric dataset I talked about, where you can index into every element and get an object called data. The data consists of x, the feature matrix, whose dimensions are the number of nodes times the number of node feature dimensions. There's also edge_index, which you can think of as a sparse adjacency list: it basically says what the edges in your graph are, so if there is an edge between node 1 and node 2, there is a column [1, 2] in edge_index; it's the adjacency list of the graph. And there's batch, which is a little more complicated, because instead of batching images, which are regular and all the same size, we now need to batch graphs, and as we all know, graphs have different numbers of nodes. If we want to batch a lot of graphs together, especially for graph classification, we need to know which node belongs to which graph, and that's what batch tells you. Internally it's an array where, for every node, you record which graph it belongs to, so in practice you see something like 1 1 1 1 1, which means five nodes belong to graph 1, followed by, say, 2 2 2, which means three nodes belong to graph 2, and so on; that's a batch of some number of graphs. For node classification, as we'll see, a lot of tasks run on only one graph, in which case batch is trivial: every node gets the same index. And here we have a sanity check that if there are no node features, we use a constant feature, which we also talked about in class.
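You can poke at those attributes directly; for example, with the IMDB-BINARY dataset used later in this tutorial (note that this particular dataset has no node features, so x is None, which is exactly why the constant-feature fallback exists):

```python
from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataLoader

dataset = TUDataset(root="/tmp/IMDB-BINARY", name="IMDB-BINARY")
loader = DataLoader(dataset, batch_size=64, shuffle=True)

batch = next(iter(loader))
print(batch.x)            # node feature matrix [num_nodes, num_features]; None for this dataset
print(batch.edge_index)   # sparse adjacency list of shape [2, num_edges]
print(batch.batch)        # for every node, the index of the graph it belongs to
print(batch.y)            # the labels (one per graph here, since this is graph classification)
print(batch.num_graphs)   # how many graphs ended up in this mini-batch
```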
Then we have a number of layers that execute the convolutions. Remember self.convs is a ModuleList where every element is a convolution layer, so for each layer we perform one graph convolution, then pass through a ReLU and a dropout. One thing about dropout: there's a training flag you should pay attention to, because dropout's computation is different at test time. At training time you drop units out, so the magnitude of the outputs is effectively larger, and at test time you want to rescale to compensate, so you have to tell dropout whether you're in training or not; at test time it just passes everything through, and at training time it drops units out. (Question: in the previous part you had GINConv and GCNConv; those are convolutional layers, and you just specify their dimensions? Yes. pyg_nn.GCNConv defines one layer of convolution, and here we simply stack a bunch of these layers with the for loop shown above, and every layer has its own parameters. Those parameters are defined internally in the GCNConv module class, which I'll explain later, so they're not defined directly in this model but inside the individual convolution layers.)

To continue: if this is graph classification, we need to do pooling. This is one way to pool; we can do mean pooling, and I think the homework does max pooling, but it doesn't matter, there are different pooling mechanisms you can call, and you can refer to the package to see which ones are there. In general it's also an interesting research topic how to pool all the nodes together into one graph embedding. Here global_mean_pool just means I'm taking the average of all the node embeddings. Then there's post_mp, which is the sequential MLP defined above, and that's it. log_softmax is again the softmax with a log, so that we can do cross-entropy. I'm also returning the embedding, because later I want to visualize what the embedding that was learned looks like. And for the loss, since we computed a log-softmax, we just take the negative log-likelihood of the predicted log-softmax with respect to the true label distribution, which is a one-hot distribution over the label classes. So as you can see, this handles both node classification and graph classification, and this is the general model; it's also very similar to your homework, where of course we ask you to implement a few more details, and with GraphSAGE we have a specific convolution model, but this is the general form.
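A tiny toy example of the pooling step, to make the role of the batch vector concrete (five nodes, the first three belonging to graph 0 and the last two to graph 1):

```python
import torch
import torch_geometric.nn as pyg_nn

x = torch.randn(5, 8)                             # node embeddings, [num_nodes, hidden_dim]
batch = torch.tensor([0, 0, 0, 1, 1])             # which graph each node belongs to

print(pyg_nn.global_mean_pool(x, batch).shape)    # [2, 8]: average of node embeddings per graph
print(pyg_nn.global_max_pool(x, batch).shape)     # [2, 8]: element-wise max per graph
print(pyg_nn.global_add_pool(x, batch).shape)     # [2, 8]: sum per graph
```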
OK, so now let's talk about what GCNConv and GINConv really are. These convolutions are classes defined by PyTorch Geometric, and all of them, GINConv, GCNConv, and so on, inherit from the MessagePassing class. But maybe we want to come up with our own convolution module; maybe we have special needs, or special ideas we want to implement, and we don't want to follow the standard GCNConv. So here is how you implement a custom convolution layer. Our custom convolution layer also inherits from MessagePassing, and notice that when we call the superclass constructor we specify the aggregation function: here we just add all the messages together, so that's 'add', but we could also take mean or max; we use add here. And, to the earlier question of where the parameters are defined: here is where. We have a linear layer, and the parameters of that linear layer are the parameters defined for this specific convolution. As usual, because MessagePassing is a subclass of nn.Module, we implement both the __init__ and the forward function; it's all the same, modules inside modules.

In this forward function, because this is a graph neural net, there are two things we use as input: the connectivity, which is the adjacency list in edge_index, and the feature matrix, which is x. I also have the shapes listed in comments: N is the number of nodes, this is the number of input dimensions, and edge_index has shape 2 by the number of edges; the 2 just means that every edge connects two nodes, which you have to specify. There's something like add_self_loops here, because in GCN we do something like adding the identity to the adjacency before applying it to the feature matrix; the self-loop basically means I also want to include myself in the convolution, so you not only aggregate your neighbors, you also add yourself. It just means that the matrix you use is not really A but, roughly speaking, A plus the identity. But we can also change this a little; there's a lot of customization that can be done here. You don't have to add self-loops, and there's actually a remove_self_loops call here, so we make sure we don't have self-edges. The reason we don't use self-edges is that we can then add a skip connection on top. What we do is: I transform my x, and then I do the propagation. What propagate does is call the message function, so it computes the messages for all the nodes, and then it does the aggregation to get the new representation based on the neighborhood defined by the edge list. But then I can add a skip connection, which means I define another linear layer, say self.lin_self: the node's own embedding passes through that layer, while the messages pass through the other linear layer. So instead of passing the raw features into propagate, we pass them through the linear layer, and then we add the node's own information, self.lin_self(x), to all the neighborhood messages aggregated by the propagate function. (Question: yes, the edge list is telling it how to aggregate; it uses the edge list to define the neighborhood of the node you want to pass messages to, so you don't have to write that yourself; that's essentially what this package helps you with. The message function just computes the message, and it's called inside propagate. Question: what is super here? super means I'm calling the superclass of the current class; here CustomConv inherits from MessagePassing, so when I call super I'm calling the constructor, the initialization, of the MessagePassing class. Yes, you have to call it, and it's best to call it before everything else in your __init__, so that nn.Module is initialized before you define everything.) In the notebook I'm doing a lot of computation here just for the sake of the listing: this is essentially the GCNConv family, reproducing what GCNConv does.
But you don't have to do all of that. For example, there's a paper showing that simply adding up all your neighbors works well for graph classification, so you don't actually need all of this: you can just use the neighbor embedding itself as the message, or alternatively put the linear transform there; it's all the same idea. Also, the message function can take everything that propagate takes; the two have the same signature. So message can take not only x_j, which is your neighbor's embedding, but also x_i, which is your own embedding, so your message doesn't have to be a function of x_j only; it can also be a function of x_i. You could write something like x_i minus x_j; it's probably not going to work well, but you can define a complex model as a function of x_i and x_j for your messages. The update function means that after the message passing computes the node embedding, you can have additional layers on top to further transform it. For instance, in GraphSAGE what we do is normalize the embedding; I think it's something like an L2 normalization along the last dimension. You can do a normalization, or an MLP, or anything that comes after the message passing.

So once you write this kind of basic message-passing subclass, you can essentially swap it in for the built-in convolution: instead of saying pyg_nn.GCNConv, I can change build_conv_model to use our custom conv. (Question: where do you specify the depth, the number of hops? Remember the GNNStack, where I put all the convolutions into a ModuleList: that essentially gives three layers, one conv from input to hidden, then a second and third from hidden to hidden. In the forward function there's a loop: I take the input, apply the convolution that was defined, get the output, and the output is fed back in as the input x of the next layer. So these few lines mean the module does three hops; that's the depth in the sense of how far away you look in the graph, not the depth of your neural network layers, because each of these convolutions can itself contain multiple layers. For every iteration of the loop you perform the convolution at that layer and the output becomes the input of the next layer; it's just a for loop.) (And on the post-MP part: in practice, for graph classification, it's often beneficial to have a few more linear layers after you finish the message passing, so the post-message-passing part is an nn.Sequential, as I explained, which runs each of those layers in order; it performs a two-layer MLP on top of the graph embedding you just computed. Again, it's not necessary; it's just something we added to show that you can customize the model with different architectures.)
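Putting those pieces together, a minimal sketch of such a custom layer; it follows the description above (sum aggregation, no self-loops, a skip connection on the node's own embedding) but leaves out the notebook's GCN-style normalization, so treat it as one possible variant rather than the exact homework class:

```python
import torch.nn as nn
import torch_geometric.utils as pyg_utils
from torch_geometric.nn import MessagePassing

class CustomConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr="add")          # aggregation over messages: "add", "mean", or "max"
        self.lin = nn.Linear(in_channels, out_channels)        # applied to neighbor messages
        self.lin_self = nn.Linear(in_channels, out_channels)   # applied to the node's own embedding

    def forward(self, x, edge_index):
        # x: [N, in_channels], edge_index: [2, E]
        # drop self-edges so the skip connection below handles the node's own features
        edge_index, _ = pyg_utils.remove_self_loops(edge_index, None)
        self_x = self.lin_self(x)
        # propagate() calls message() for every edge, then aggregates per destination node
        return self_x + self.propagate(edge_index, x=self.lin(x))

    def message(self, x_j, x_i):
        # x_j: the neighbor's embedding for each edge; x_i: the receiving node's own embedding.
        # A message may be any function of both; here we simply forward the neighbor's embedding.
        return x_j

    def update(self, aggr_out):
        # called after aggregation; a natural place for e.g. L2 normalization, as in GraphSAGE
        return aggr_out

# To use it, swap pyg_nn.GCNConv(in_dim, out_dim) for CustomConv(in_dim, out_dim)
# inside build_conv_model in the GNNStack sketch above.
```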
So let's look at the training loop. The training loop is very similar to the previous one, in the sense that we build a data loader; this code is already written for you in the homework, so you can also look there. The thing to note is that we have a train_loader that loads the first 80 percent of the dataset, and the test_loader takes the rest, the remaining 20 percent. In practice you should know there's also a train/validation/test split: as you all know, the test split is something you don't want to look at during training; you look at the validation split at training time, evaluate the performance on validation, and then you can do some kind of early stopping. If the validation accuracy increases and then starts decreasing, you want to stop there, and once you've determined the best epoch, the epoch where your model performs best, you use that model on your test set; the test set is looked at one time, after you finish training. Here, for simplicity, I'm just doing train and test, but the proper thing to do is to train on, say, everything from 0 to 0.8, validate on 0.8 to 0.9, and test on 0.9 to 1, so you have a separate test split. I specify a batch size of 64 and also shuffle the data so that the mini-batches come in a different order.

Here it's the same as in the MNIST framework: construct the model, construct the optimizer, and I still pass in all the model parameters, so the optimizer is optimizing all the parameters of the model. Then I start the training loop, which runs for 200 epochs, and for every epoch it's the same: I get a mini-batch from the loader, call zero_grad (super important), and get a prediction. If this is node classification, there's also this mask. To clarify the mask: in node classification, a lot of the time we have one graph, and within this one graph we want a train/test split, so you train on a subset of the nodes and test on other nodes you haven't seen before. The masking says you can only look at the nodes that are not masked out; those are your training set, and you mask out the validation and test nodes. That's what the mask does, and the mask is defined in the data, which you can also look at in the homework; at test time you instead use the validation mask or the test mask. This is how we train on a single graph with different train, validation, and test nodes in the same graph. If it's not node classification but graph classification, there are typically a lot of graphs, so we just split by graphs: your training set is, say, graphs 1 to 10, validation is graphs 11 to 13, and so on. Then we compute the loss, the loss is added to the total loss, and we monitor it. Note the add_scalar call: this comes from the writer, the TensorBoard piece, which I'll explain in a moment.
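A sketch of that split and loop, reusing the GNNStack sketch from above (the 80/20 split, batch size, and epoch count follow the description; the Adam optimizer and learning rate are assumptions):

```python
import torch
from torch_geometric.data import DataLoader
from torch_geometric.datasets import TUDataset

dataset = TUDataset(root="/tmp/IMDB-BINARY", name="IMDB-BINARY").shuffle()
split = int(0.8 * len(dataset))
train_loader = DataLoader(dataset[:split], batch_size=64, shuffle=True)
test_loader = DataLoader(dataset[split:], batch_size=64, shuffle=True)

model = GNNStack(max(dataset.num_node_features, 1), 32, dataset.num_classes, task="graph")
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    total_loss = 0.0
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        emb, pred = model(batch)
        label = batch.y
        if model.task == "node":
            # single-graph node classification: only the unmasked training nodes count
            pred, label = pred[batch.train_mask], label[batch.train_mask]
        loss = model.loss(pred, label)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * batch.num_graphs
    total_loss /= len(train_loader.dataset)
```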
Instead of printing out the loss at every step, as we just did with MNIST, it's very convenient to write it out to TensorBoard, and that's what this does: I write the loss, with this total loss value, and the epoch, and it will plot the curve for you. Then this is the testing function. Testing is very similar, except there are the three masks; I use the validation mask for node classification, and I check whether the label equals the prediction. It's the same idea: you loop over all the mini-batches in the data loader, but notice this with torch.no_grad() block, which makes it a little faster. You don't have to add it, but inside torch.no_grad() no gradients are computed, which makes it slightly faster, because at test time you don't optimize, so you don't need gradients.

Here is the TensorBoard setup. This is a little tricky because we're running on Colab, but if you're running a normal Python file you don't need all of this; you just import tensorboardX. This is the URL it provides; currently there's nothing there because we haven't started training. This run is a graph classification task, using the IMDB-BINARY graph classification benchmark; in your homework I think you're given the ENZYMES dataset, but it's very similar. We shuffle the dataset and then start training; the training loop is the one we just looked at, where for every epoch you perform the optimization. In the meantime, this other cell is the node classification task: we set task equal to 'node', which means you're actually using the mask for the training and validation split, and this uses another dataset called CiteSeer, a citation dataset very similar to the Cora dataset in your homework, but here used for node classification.

So this is training, and while it's training it starts logging. The SummaryWriter is from tensorboardX; everywhere you call writer.add_scalar, it adds the scalar into the log directory you specify here. The name is arbitrary; you can name the log anything, like your favorite name for your model, this is just an example. It logs into that directory, and the TensorBoard server retrieves data from that log directory and shows you a visualization of it. This is the name we just gave our model, and we can see the loss decreasing: for every epoch we call add_scalar, so we get a plot of it. There's also the test accuracy, but the difference is that, because testing is expensive (you have to loop through the entire dataset), we only run it every 10 epochs, so you see one data point every 10 epochs. And note this is updating live, because we're running it and generating logs in real time. So to recap: all you need to do is import tensorboardX, define the writer with the directory you want to log to, and finally call add_scalar; it's very simple.
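In a plain script the whole pattern is roughly this (the loss and accuracy values here are stand-ins just to keep the snippet runnable; in the notebook they come from the training and test functions described above):

```python
from datetime import datetime
from tensorboardX import SummaryWriter

# the log directory name is arbitrary; a timestamp keeps separate runs apart
writer = SummaryWriter("./log/" + datetime.now().strftime("%Y%m%d-%H%M%S"))

for epoch in range(200):
    total_loss = 1.0 / (epoch + 1)                     # stand-in for this epoch's training loss
    writer.add_scalar("loss", total_loss, epoch)
    if epoch % 10 == 0:                                # testing is expensive, so only every 10 epochs
        test_acc = 1.0 - total_loss                    # stand-in for the test accuracy
        writer.add_scalar("test accuracy", test_acc, epoch)

writer.close()
# then point TensorBoard at ./log to watch the curves update as training runs
```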
You can also log images and so on, but that's a bit more involved and I won't talk about it. Lastly, let's have a look at the visualization. Once you've learned these kinds of node embeddings, whether from DeepWalk, node2vec, or GNNs, you can always visualize them in 2D to get a sense of how good your embedding is: are nodes from the same class close to each other, are they clustered together, and so on; you can answer a lot of these questions just by visualizing. This is also very simple: it's just the standard matplotlib scatter plot that you probably all know, where the xs and ys are the t-SNE embedding. The t-SNE embedding is also simple to compute, because we have the TSNE class from scikit-learn: we construct a TSNE object and call fit_transform, and the input to fit_transform is the embedding that has been learned. Note that this embedding is an output of the model; remember that in our model we actually return the embedding. So we take the embedding from the graph neural net, apply the t-SNE technique to map this high-dimensional embedding into two dimensions so you can visualize it in the 2D plane, and then call scatter with the color representing the class of each node.

As we can see, for CiteSeer, for instance, it's not perfect: there are a lot of nodes clustered together, and in the middle of the brown cluster there are red nodes; it's not a perfect separation, and it actually gets about 70 percent accuracy. (Question: is the issue with the clustering or with the classifier? In other words, if something looks wrong, how do we know whether it comes from the clustering algorithm or from the classifier? I think the answer is that we have to trust these low-dimensional embedding techniques only so far; they're definitely not perfect. If you imagine a 300-dimensional embedding that you want to compress into two components, it's not going to be perfect, so it's very hard to determine whether something is an artifact or not. This is mainly meant as a comparative study: if you have two models, one gives you this visualization and the other gives you that visualization, and you observe that one has better clustering, then, given that you used the same t-SNE algorithm, you'd think that one is doing better. But in an absolute sense it's very hard to tell whether two points being close together is really because your method is not good or because the t-SNE is not good. Good question.)
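The visualization boils down to a few lines; here `emb` and `labels` are stand-ins for the embedding returned by the model's forward pass and the node labels from the dataset:

```python
import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE

emb = torch.randn(200, 32)                 # stand-in for the model's returned node embeddings
labels = torch.randint(0, 6, (200,))       # stand-in for data.y (the node classes)

xy = TSNE(n_components=2).fit_transform(emb.detach().cpu().numpy())  # project to 2-D
plt.scatter(xy[:, 0], xy[:, 1], c=labels.cpu().numpy(), cmap="tab10", s=10)
plt.show()
```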
So that's pretty much the end of the tutorial, and lastly I just want to show you a little extra task that is not required in the homework: link prediction. This is also very important if you're working with knowledge graphs or doing graph reasoning; it's basically asking whether you can do knowledge graph completion, or, in general, predict whether two nodes are linked together in a graph that has a lot of missing edges. This uses the graph autoencoder (GAE), which you can also look up in the PyTorch Geometric documentation. This kind of autoencoder has an encoder part and a decoder part. In the encoder we use graph convolutions to get embeddings for all the nodes; that's the encoder. In the decoder we use something very simple, the inner product: if two embeddings have a large inner product, close to one, then we think there's likely a link between them, and if not, there's no link. So the encoder is graph convolution and the decoder is the inner product. Here we define a very simple encoder, which you're probably already familiar with: two layers of graph convolution, and that's the encoder. The decoder we don't actually specify, because the default decoder is the inner product; if you want something fancier, like a neural network or any other decoder, you would also define the decoder, but overall it's just another nn.Module.

As usual we do train and test, and we use the CiteSeer dataset; this is the same dataset we used for node classification, but now we're doing link prediction on it. We record two metrics, AUC and average precision, both of which are also defined in sklearn, so you can use what I said at the beginning of the class and compute these metrics with sklearn. Link prediction is also pretty fast. One thing worth noting is train_test_split_edges: this is an important function that helps you construct positive and negative examples for link prediction. As you probably remember from class, you construct positive examples, and negative examples by negative sampling, and the objective is the log-likelihood of the positive edges together with a term for the sampled negative edges. This function does that separation between positive edges and negative edges, so it's a handy function for link prediction. Other than that, everything is very similar: you define all the parameters, move things to the GPU, and run the training function we wrote. And again, we can also visualize this; colors isn't defined here, so I have to run that cell again to define it. The visualization is also very similar: the idea is that we have an encoder, which is the graph convolution, and a decoder, which is the inner product, and the thing in the middle is our embedding, so we visualize the embedding in the middle. I don't know why this is slow, but it's the same functionality as you saw in the previous node classification visualization. And that's it for everything related to PyTorch Geometric. Are there any questions? OK, thanks.
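For reference, a minimal sketch of a GAE link-prediction setup along these lines; the two-layer GCN encoder and the hyperparameters are my assumptions, and depending on your PyTorch Geometric version the edge split may be `data.train_test_split_edges()` instead of the functional form used here:

```python
import torch
import torch_geometric.nn as pyg_nn
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import train_test_split_edges

class Encoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = pyg_nn.GCNConv(in_channels, 2 * out_channels)
        self.conv2 = pyg_nn.GCNConv(2 * out_channels, out_channels)

    def forward(self, x, edge_index):
        x = torch.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)            # node embeddings z

dataset = Planetoid(root="/tmp/CiteSeer", name="CiteSeer")
data = train_test_split_edges(dataset[0])           # builds positive/negative train/val/test edge sets

model = pyg_nn.GAE(Encoder(dataset.num_features, 16))   # default decoder is the inner product
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    z = model.encode(data.x, data.train_pos_edge_index)
    loss = model.recon_loss(z, data.train_pos_edge_index)   # positive edges + sampled negatives
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    z = model.encode(data.x, data.train_pos_edge_index)
auc, ap = model.test(z, data.test_pos_edge_index, data.test_neg_edge_index)
print(auc, ap)                                       # AUC and average precision
```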
Info
Channel: Lindsey AI
Views: 30,066
Rating: 4.9886041 out of 5
Id: -UjytpbqX4A
Length: 74min 22sec (4462 seconds)
Published: Thu Jun 18 2020