Lecture 9: Graph Neural Networks Implementation with PyTorch Geometric

Captions
Hi everyone, can everyone hear me okay? Okay, cool. So this session is meant to be a practical session to get you started with building graph neural network machine learning pipelines, because we recognized that there is quite a bit of variance in people's experience with building neural networks and with graph neural nets. This session is just to get everyone on board: how to do training from scratch, how to use PyTorch Geometric, and towards the end we'll briefly talk about the models related to the homework. You're welcome to ask me anything about the models, about how to use PyTorch or PyTorch Geometric, or any question you think is relevant to the homework and to graph neural nets.

The tool we're going to use today is Google Colab. For those not familiar, it's a Jupyter-notebook-style environment with sharing enabled, so you can open this notebook with me and run it on your own Colab instance. The link is already on the website, so if you want to follow along, open the link and follow the steps. (Andrew created the file and I was editing his copy; does everybody have permission now? Okay.)

For the first part I'll give a brief introduction on how to use PyTorch and some Python basics to build machine learning pipelines. We'll start with a very simple example, MNIST classification, which everyone should be familiar with. The packages we'll use are mainly the standard PyTorch ones: torch.nn, which we abbreviate as nn, and torch.nn.functional, which we abbreviate as F. nn contains the neural network modules, and functional contains the function definitions for neural network operations. The next two imports are specific to MNIST, which you don't have to care about, and the last one is sklearn.metrics, which we use to perform evaluation of the model.

I already ran the second cell, which loads the dataset, because it takes a little while. The thing you have to know here is the concept of a Dataset: it's the data structure PyTorch keeps so you can feed input to the model. In this case it stores the MNIST dataset; later on we'll use CiteSeer, ENZYMES, and IMDB, which are graph datasets, but they all follow the same PyTorch Dataset format. Here is where we define the training set, and this is the loader that loads the training set. A Dataset inherits from an abstract dataset class and behaves like an iterable, so the main functions you want to implement when you build your own dataset are __len__, the length of the iterable, and __getitem__, so you can index into specific examples in the dataset. For example, this train_set object here is just a PyTorch Dataset.
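A minimal sketch of what this looks like in code, assuming the standard torchvision MNIST dataset (not necessarily the exact cell from the notebook):

```python
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Download MNIST into ./data; each example is an (image tensor, label) pair.
train_set = datasets.MNIST(root='./data', train=True, download=True,
                           transform=transforms.ToTensor())

print(len(train_set))            # __len__: number of training examples (60000)
image, label = train_set[10]     # __getitem__: index into a single example
print(image.shape, label)        # torch.Size([1, 28, 28]) and an integer class label

# Wrap the dataset in a DataLoader for shuffled mini-batch iteration.
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
images, labels = next(iter(train_loader))
print(images.shape)              # torch.Size([32, 1, 28, 28])
```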
You can do things that are common to any Python iterable: you can print its length, and you can index into it, so train_set[10] retrieves the tenth example. That's how you iterate over the dataset and retrieve a single example from the training set. In practice, though, we do mini-batch training: we do stochastic gradient descent, which means at every iteration you take multiple examples. The DataLoader is how you take multiple examples at a time, say a batch of them, after shuffling. So that's the basic concept of the Dataset. As you can see, it printed the length of the training set, and I can also print a specific example: this array is just a 28×28 image array which represents an MNIST digit.

Once you've constructed the dataset, it's time to construct the model; then we link the model and the dataset together to do training and testing. Here is a very simple example, essentially an MLP with one convolution layer, just to demonstrate how you would use PyTorch to build this kind of network. Show of hands: how many people are familiar with the PyTorch concepts? About half, okay, so I'll explain in a bit more detail.

Every model you build in PyTorch inherits from nn.Module, which is the superclass of a neural network model. It provides a lot of conveniences you'll see later, like an easy interface to optimizers and to running training on the model. The module has two functions you have to implement: the initializer and the forward function. The initializer is where you define all the trainable parameters used in the model, so that later you can retrieve those parameters in your optimizer and optimize them. The forward function tells PyTorch how to construct the computation graph from the input to the output. In this case the parameters we care about are a 2D convolution, which you won't need in graph neural nets, and then two linear layers; these are all the parameters used to train this model. A linear layer takes two arguments, the input dimension and the output dimension, which is useful to know because you'll see a lot of linear layers in graph neural nets. Here the input to the first linear layer is the flattened 26×26 feature map (the 3×3 convolution shrinks the 28×28 image to 26×26) times 32 channels, since the convolution's number of output channels is 32.
So that flattened size is the input dimension, and the output is a hidden layer of 128 neurons; the second linear layer takes that hidden size and outputs 10, because this is a classification task. We're classifying which digit the image is, the possible digits are 0 to 9, so it's a 10-class classification task.

In the forward function we take an input x: the forward function takes in the input tensor and builds the computation graph. The notion of a computation graph comes from TensorFlow, but the nice thing about PyTorch, which also makes it easier to use, is that you don't have to pre-construct your entire computation graph before feeding in data. You can retrieve values from your computation on the fly, look at them, maybe even modify them, and then feed them back in; it's dynamic.

(Question about the convolution:) It's not going to be important for this class, but the convolution has a few arguments: the kernel size, i.e. the filter dimensions, which is 3×3 here; the input channel count, which is 1 because this is a grayscale image; and the output channel count, which is 32. It's not used for graph neural nets, it's just there for image classification. And since we pass a batch of images, the 32 in the flatten call is the batch size.

There's also a flatten here, which flattens the multi-dimensional array to one dimension; this can also be very useful in graph neural nets. Flatten is an operation that connects the input tensor x to a different tensor, which is then assigned back to x: you do the computation on x and put the resulting value back into x. But copies are a little expensive, so what we usually do instead is use the view function, tensor.view: we specify that we want to reshape to 32 by whatever the rest of the dimensions are, which is just -1. This does essentially the same thing as flatten, but the good thing is that it doesn't actually perform the computation of changing the shape of x; it just provides a different view of the same tensor, a different way of looking at the same memory, so it's a little more efficient, although computationally the two are equivalent. You can look at the documentation of view in the PyTorch tensor docs.

After these computations we apply a non-linearity and produce logits, one score per class, so the output has dimension 10, and we pass the logits through a softmax. For those not familiar with softmax, it's a differentiable kind of max: given a score for each candidate class, you exponentiate each score and divide by the sum of the exponentials, and that gives the probability of the model choosing each particular class. It's just the basic softmax, and with this output we can use cross-entropy to compute the actual loss.
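A sketch of the kind of model being described, following the sizes mentioned in the lecture (one 3×3 convolution, a 128-unit hidden layer, 10 output classes); the exact notebook code may differ slightly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 1 input channel (grayscale), 32 output channels, 3x3 kernel:
        # a 28x28 image becomes a 32 x 26 x 26 feature map.
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
        self.fc1 = nn.Linear(32 * 26 * 26, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        # view() reshapes without copying: keep the batch dimension,
        # let PyTorch infer the flattened size with -1.
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        logits = self.fc2(x)
        # log-probabilities over the 10 digit classes
        return F.log_softmax(logits, dim=1)
```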
Are the basics of PyTorch clear? Any questions on PyTorch? Okay. So once we've defined the model we can actually start running things, but maybe let's first start with a very simple example where we don't have to train the model, just to look at what these tensors are. For example, we can define a as a NumPy array — this is not a tensor, it's a NumPy array; everyone is familiar with NumPy, right? We can also define something like np.ones, a 2×2 matrix of ones; if you print b it's a 2×2 matrix of ones. In NumPy we can do operations like matmul: np.matmul(a, b) gives you the matrix multiplication of a and b. But that's NumPy, which only uses your computer's CPU. The good thing about PyTorch is that you can do this linear algebra on the GPU. Notice that our runtime type here is set to GPU, which means we have a GPU available, so we can do this with PyTorch on the GPU.

What we do is define a tensor, the PyTorch object used to perform computation: ta = torch.tensor(a), passing in the NumPy array. If you print ta, it's a PyTorch tensor wrapped around the NumPy data, and it also tells you the dtype, float64 in this case. We can do the same thing for another tensor, torch.tensor(b), or use torch.ones — you'll notice a lot of similarity between calling NumPy functions and calling the PyTorch functions, and you can specify the dtype similarly, so it's very easy to use. Now you can perform computation using torch functions: instead of NumPy's matmul, I call torch.matmul, which gives the matrix multiplication using PyTorch, or as a very simple shortcut you can just write a @ between the tensors, which also gives the matrix multiplication.

But these tensors still live on the CPU, because we only asked PyTorch to construct them. The way you use the GPU is to first check whether a GPU is available with torch.cuda.is_available(). Because our runtime is GPU, it says CUDA is available, and since it's available we can transfer the data onto the GPU; the simplest way is to call .cuda(), which puts the tensor on a GPU. The first transfer is a bit slow, but if you print ta now, you'll notice a device attribute that wasn't there before, which means it's on the GPU.
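A compact sketch of the NumPy-versus-PyTorch comparison just described (the particular values are arbitrary):

```python
import numpy as np
import torch

a = np.random.rand(2, 2)
b = np.ones((2, 2))
print(np.matmul(a, b))          # matrix multiply with NumPy, on the CPU

ta = torch.tensor(a)            # wrap the NumPy array in a torch tensor (dtype float64)
tb = torch.ones(2, 2, dtype=torch.float64)
print(torch.matmul(ta, tb))     # same result, computed by PyTorch
print(ta @ tb)                  # @ is shorthand for matmul

if torch.cuda.is_available():
    ta_gpu = ta.cuda()          # move the tensor onto the default GPU
    print(ta_gpu.device)        # e.g. cuda:0
```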
A lot of times, though, you want to control which specific GPU you use, and in that case you call .to(device), where device is essentially a string like 'cuda:0'. In this case we only have one GPU, so that's the one we're using; it's equivalent and gives the same result. Any questions on GPUs and PyTorch tensors? Great.

Here it's the same thing: we check whether CUDA is available and set the device to cuda:0, otherwise we use the CPU as the device, and then we call .to(device) as before. This is how we construct the model, and then we define a loss — this is the loss we use to compute the difference between our predicted softmax and the ground-truth label, which is one of the ten digits. Then this is the optimizer. There are a lot of optimizers; you can look at torch.optim — there's SGD, Adagrad, Adadelta, a bunch of them — and you can try which one fits your problem. One thing worth looking at: a lot of times you want the learning rate to anneal over time. Initially, because the model is not trained well yet, you want the learning rate to be big so you can take big steps, and later on you want it to decay. There's a scheduler option in the optimization package with different annealing schemes: step annealing, linear annealing, exponential annealing, and there's also an interesting one called cosine annealing that people tend to like. You're encouraged to look at these; it's not strictly necessary, but it could improve your performance.

Once you run this, you can start the training. We have everything ready — the model, the dataset, the optimizer, and the loss — so now we can do the training. The training loop essentially goes like this. There is a fixed number of epochs to go through; an epoch just means one pass through the entire dataset. We're also recording the loss and so on. For each epoch we enumerate over the data loader — remember the DataLoader is a wrapper around the Dataset that you can index into — and this enumeration means that at every iteration you extract a number of examples equal to the batch size (the batch size is also defined in the loader). So at every iteration, this is what's retrieved from the dataset, in this case an image and a label. We put everything onto the GPU and run the model; when you see model(input) with parentheses, it's calling the forward function of the model, and the input here is x, the image, which is why we put the image into the model. And the loss is, of course, the standard cross-entropy loss.
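Putting the pieces just described together — device, model, loss, optimizer, and an optional learning-rate scheduler — might look roughly like this. It builds on the SimpleNet sketch above; the specific optimizer, learning rate, and schedule are illustrative, not the notebook's exact choices:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleNet().to(device)            # model from the earlier sketch

# NLLLoss pairs with the log_softmax output above
# (equivalently: return raw logits and use nn.CrossEntropyLoss).
criterion = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Optional: anneal the learning rate, e.g. step decay or cosine annealing.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```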
But here there's an important step that you really need to do, which is zero_grad. zero_grad sets the gradients of all the variables to zero. This is very important because if you don't do it, by default PyTorch accumulates your gradients: say you've already computed the gradients for all your parameters in the previous iteration; if you don't zero them, it will add the gradient of the current computation on top, so you get the sum of the previous gradient and the current gradient, and a lot of the time that's not what you want. So you really need to call zero_grad if you want the standard behavior.

Then loss.backward() is the function where we compute the gradients. Backward means backprop, and backprop computes the gradients for all the parameters you defined. In case you forgot where the parameters are: for a linear layer, the parameters are the weight and the bias; the number of elements in the weight is the input dimension times the output dimension, and the number of elements in the bias is the output dimension. When we call backward, the gradients of all these parameters are computed, and then we call optimizer.step(). When we call optimizer.step(), we're saying: for all the parameters that already have gradients, perform one optimization step with the learning rate specified here.

Maybe I'll also explain the parameters() call. It's perfectly fine to pass an explicit list of parameters to the optimizer, but model.parameters() is a convenience: if I want to optimize all the parameters defined in __init__, I just call parameters() and it gives me all of them. How do you decide whether a parameter is included in model.parameters()? For example, if I write x = nn.Linear(5, 5) inside __init__, is x included in model.parameters()? No — and the reason, which is a tricky thing, is that I didn't write self. If I write self.x = nn.Linear(5, 5), then it is in model.parameters(). This is a very specific thing about nn.Module: you have to assign the layer to self for its parameters to be registered and included in the optimizer.

There's another caveat. Say I want a list of layers, say a three-layer network built with a Python list comprehension, [... for i in range(3)]. Is that included in the parameter list? It's also not included, even if you assign it to self, because it's a plain Python list. What you do instead is nn.ModuleList, and you put the list there; in that case it will be in model.parameters() and you can optimize it. Sometimes you don't see the loss coming down or the parameters changing, and it could just be because the parameters are not actually in model.parameters(), so you aren't actually optimizing them. It's good to know this and how to debug it, to make sure the set of parameters is really the set of things you want to optimize.
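A sketch of the training loop just described, reusing the model, loader, loss, optimizer, and scheduler from the earlier sketches (the number of epochs is arbitrary):

```python
for epoch in range(5):
    model.train()
    total_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()          # clear gradients from the previous step,
                                       # otherwise backward() keeps accumulating them
        log_probs = model(images)      # calls model.forward(images)
        loss = criterion(log_probs, labels)
        loss.backward()                # backprop: gradients for every registered parameter
        optimizer.step()               # one update step using those gradients
        total_loss += loss.item()
    scheduler.step()                   # anneal the learning rate once per epoch
    print(f'epoch {epoch}: loss {total_loss / len(train_loader):.4f}')
```

Note that only layers assigned to self (or wrapped in nn.ModuleList / nn.Sequential) show up in model.parameters(); a layer held in a plain Python list or a local variable is silently ignored by the optimizer.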
Okay, so now we can perform training. This is essentially the entire training loop I just explained, up to optimizer.step(), plus accumulating the loss, which is very simple Python. Note that this is the label; I'll explain the evaluation metric later. Just to get a sense, this is how you train your network, going through a number of epochs — here I'm recording epoch 0, but you can go through the data multiple times to get better results.

Okay, let's talk about testing. I assume you know the basic concepts of a training set and a test set. The basic idea is that you want to evaluate how good your model is, but you don't want to use your training set for that, because you trained on it, so there's a possibility that you overfit the training set and the model doesn't generalize to data it hasn't seen before. That's what the test set is for. One thing about testing is that the evaluation metric does not have to be differentiable. Your training loss has to be differentiable, because you need to do SGD, but your test metric need not be; test metrics are typically things like precision, recall, and F1. Any questions? Okay. This cell is just training; I can kill it, but you would see the loss going down and so on.

For testing, we want to evaluate how good the model is, so we take a different set of data: see the test_loader, which is constructed from the test set, not the training set. You'll notice from the previous code section that these are two different datasets, but the idea is the same: we enumerate over the entire test set, move the images and the labels to the GPU, and produce the model output. Remember that the model output is the softmax scores we computed in the model's forward pass. Let me decompose this: this is the model output, and out of all the class scores, we want to pick the maximum as the predicted class. Say the model predicts 0 with probability 0.05, 1 with probability 0.15, and 2 with probability 0.8; you'd say the model thinks the digit is 2. So we take the argmax: torch.argmax(outputs) gives you the index of the maximum score, and of course you can flatten it. These are your predictions and these are your labels, and we can print both for one iteration just to show what they look like: the model predicts seven, the label is actually seven — in this batch it's all correct.

But we want to evaluate based on these model outputs and the labels, and for that we can make use of the library I mentioned at the beginning, sklearn.metrics, which already defines a lot of metric functions you can use.
For example accuracy, precision, and so on. Here we're computing accuracy ourselves by counting matches, but we don't have to do that; we can just use sklearn. The thing is, sklearn is implemented on top of NumPy, so you have to convert these tensors back to NumPy, and the way you do that is to call .cpu() — because the tensor is on the GPU right now, remember — and then .numpy() to convert to a NumPy array. Is that clear? You do the same thing for the labels: labels.cpu().numpy(). Then I can make use of my favorite, sklearn.metrics. There are a lot of different functions in it; almost every evaluation method you can think of is already implemented, so you can call a lot of different things — for example precision, which, remember, is true positives over true positives plus false positives. This is how you compute the precision score, and you can print it; I'm just going to run it for one iteration. One caveat: this is multi-class classification, but by default average='binary', which is only for binary classification, so here you want average='micro' to micro-average over the ten classes. Now the precision prints, and you can do other things the same way — recall, PR curves, everything. So you can use sklearn to perform all these evaluations; in this case we're computing accuracy. Just now I printed the precision and recall for one mini-batch; here I aggregate over all the iterations of the dataset to get the overall accuracy.

So that's all for the basic PyTorch tutorial. Is there any question on PyTorch-related training?

(Question:) How are the loss and the optimizer linked together? They are not linked together directly. What you see here is that you compute gradients of the loss: when we say loss.backward(), it computes the gradients of all the parameters in the model — that's the signal from the loss function. Once you have the gradients for all the parameters, the optimizer only cares about parameters: notice that the optimizer takes in the set of all parameters, and their gradients have already been computed, so when you call optimizer.step() it just takes all the parameters, looks at their gradients, and performs a one-step update. Exactly — the loss is not directly related to the optimizer; the loss gives you the gradients of all the parameters, and based on those gradients the optimizer performs one optimization step. The loss and the optimizer never call each other. The general idea is that you compute gradients for all the parameters, and then the optimizer performs updates based on those gradients, so it doesn't need to look at the loss to compute its step, because it already knows all the gradients of your parameters.
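A sketch of the evaluation loop just described, reusing the model, device, and test_loader names from the earlier sketches:

```python
import torch
from sklearn import metrics

model.eval()
correct, total = 0, 0
with torch.no_grad():                            # no gradients needed at test time
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        preds = torch.argmax(outputs, dim=1)     # highest-scoring class per example

        # sklearn works on NumPy, so move tensors back to the CPU first.
        y_true = labels.cpu().numpy()
        y_pred = preds.cpu().numpy()
        correct += (y_pred == y_true).sum()
        total += len(y_true)

        # per-batch metrics; average='micro' because this is multi-class
        prec = metrics.precision_score(y_true, y_pred, average='micro')
        rec = metrics.recall_score(y_true, y_pred, average='micro')

print('accuracy:', correct / total)
```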
okay so will will then cover the piper geometric so before covering the penalty of mastery I'll just say like this is just one library that we can use we think is like easy to use but it's not like the thing that you have to follow there are a bunch of libraries like graph nets from Google if you're interested in cancer flow and there's also DGL which is from Amazon that supports both Python and MX net so there are different kind of frameworks that you can use and this is just like one way to like as an example of how to do this but you can if you want you can also implement it from scratch it's actually not that hard as you would think because I think in your slides there's also like one slide about how to do the computation which essentially says like you just do the matrix multiplication of the adjacency of your weight function of your feature matrix you can do this matrix multiplication in Python so you can even write it from scratch it's not that hard but this is just a way to like help you provide additional utilities for you to write it so I already installed the libraries here and these are the set of dependencies that we would need for the Python model Geppetto geometric model so let me just briefly explain you already seen is and here here these two are specific to panel geometric so this NN module here is the neural network module that's specific to panel geometric so implements a lot of things like graph cough gene cough and like different kind of graph convolution models and the youto here is basically perform some graph graph utilities functions so we also use Network X to just visualize the graph it's not important and it's not necessary optimizers so these are data set that we would use so this the coating is not a lot of standard crafted sets are already like easily put ported to these libraries so you can just autumn download them automatically from from Palo geometric but but if you have custom data set it's also easy to build them like as I showed in the Python data set function okay and here a bunch of things that helps you to perform visualizations so I like tensor board X so this is basically an interface between height coach and our 10 support for those of you who have not used 10 support this is just a way to track your training like how how well you perform over time like you can see like proto graph of loss with respect to the number of epochs or like metric like accuracy with number of airports and so on so it's very easy to track those during our training and I will show how to do this and this is for Disney visualization so you want to impact their learn embedding into like a two dimension think that's easy to visualize this is how this is one way to do it and finally this is just a plot bad for live plot function okay so so let's start with how to write a general model for graph Kampf assuming that you already built a or use a built-in convolution operation that that pie thought you might have already defined right so this is just a stack of graph conversions like very simple model here and as you already know from from the previous previous example that's this is a model so it inherits from the n n dot module the torture is not actually necessary I think so because I already so it's inheriting from an N tomorrow so so there's like the initializer which you have to implement and there is also like the forward which you have to implement so it's the same but but there's a few more things that you have to put in here for example here I was using the module list 
For example, here I'm using the ModuleList I talked about before. You put all your convolution operations into it: I put in a convolution layer that takes the number of input dimensions and produces the hidden dimension, and then I build two additional layers of convolution that go from hidden to hidden — the input is the hidden dimension and the output is also the hidden dimension. We can look at build_conv_model, and it's really very simple: if the task is node classification I'm using the simplest GCNConv, the graph convolutional network, and if it's graph classification I'm using GINConv. We'll look at how to implement these later, so you can even plug in your own custom convolution — this is actually the class we're asking you to implement in the homework, the class that subclasses the MessagePassing superclass.

After the convolutions I also have additional parameters here, a Sequential. Sequential is very similar to ModuleList: we define a list of layers, and the difference is that if we call the Sequential, it executes all the layers sequentially, whereas with a ModuleList you can execute them in whichever order you like. So this Sequential always executes in order: first a linear layer, then dropout, then a second linear layer. Is there a question on that? Okay, good.

Again we have the forward function, but this forward function is specific to graph neural networks. It takes in the data object, which is your input; this data is an element of the dataset — remember the PyTorch Geometric dataset I talked about, where you can index into every element and get an object called data. The data consists of x, the feature matrix, whose dimensions are the number of nodes by the node feature dimension; edge_index, which you can think of as a sparse adjacency list — it basically says what the edges in your graph are, so if there is an edge between node 1 and node 2, one column of edge_index is (1, 2); and the batch. The batch is a little more involved, because instead of batching images, which all have the same size, we now need to batch graphs, and as we all know, graphs have different numbers of nodes. If we want to batch many graphs together, especially for graph classification, we need to know which node belongs to which graph, and that's what batch tells you: internally it's an array with one entry per node recording which graph that node belongs to. In practice you'll see something like 1 1 1 1 1, meaning five nodes belong to graph 1, followed by 2 2 2, meaning three nodes belong to graph 2, and so on — a batch of some number of graphs. For node classification, as we'll see, we often run on only one graph, in which case the batch is trivial (all the same value). Then there's a sanity check: if there are no node features, we use a constant feature, which we also talked about in class.
Then we have a number of layers that execute the convolution. Remember self.convs is a ModuleList, a list where every element is a convolution layer, so for each layer we perform one graph convolution and then pass through a ReLU and dropout. One thing about dropout: there's a training flag you might want to look at, because dropout's computation is different at training time and at test time. As you know, at test time you have to rescale the values — at training time you drop units, so the magnitude of the outputs is effectively larger, and at test time you want to compensate for that — so you have to specify whether you are in training mode so dropout knows what to do: at test time it just passes everything through, and at training time it drops out.

Lastly, if this is a graph task — by which I just mean graph classification rather than node classification — you need a pooling mechanism, which we also discussed: you need to pool all the node embeddings.

(Question:) In the previous part you had the GIN and GCN convolutions; do those define a convolutional layer where you just specify the parameters? Yes — pyg_nn.GCNConv or GINConv defines one layer of convolution, and we are simply stacking a bunch of these layers together, as shown in the for loop. Every layer has its own parameters, and those parameters are defined internally in the GCNConv module class, which I'll explain later — so the parameters are not defined directly in this model, but inside the individual convolution layers.

To continue: if this is graph classification we need to do pooling, and this is one way to pool — global mean pooling, which takes the average of all the node values in a graph. I don't think the homework asks for it, but there are different kinds of pooling mechanisms you can also call; you can refer to the package to see what's available, and in general it's also an interesting research topic: how do you actually pool all the nodes into one graph embedding. Then there's post_mp, which is the sequential MLP defined earlier, and that's it; log_softmax does the softmax with a log so that you can do cross-entropy. I'm also returning the embedding here, because I want to visualize later what the embedding is learned to be. And for the loss, since we computed a log-softmax, we just take the negative log-likelihood of the predicted log-softmax with respect to the true label distribution, which is a one-hot distribution over the label classes. As you can see, this handles both node classification and graph classification, and it's the general model that's very similar to your homework — of course in the homework you're asked to implement a few more details and the specific convolution models — but this is the general form.
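A condensed sketch of the kind of model just described; the layer count, dropout rate, and hidden sizes are illustrative rather than the notebook's exact values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.nn as pyg_nn

class GNNStack(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, task='node'):
        super().__init__()
        self.task = task
        # three rounds of message passing (input->hidden, then hidden->hidden twice)
        self.convs = nn.ModuleList([
            pyg_nn.GCNConv(input_dim, hidden_dim),
            pyg_nn.GCNConv(hidden_dim, hidden_dim),
            pyg_nn.GCNConv(hidden_dim, hidden_dim),
        ])
        # post message-passing MLP, applied to node (or pooled graph) embeddings
        self.post_mp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Dropout(0.25),
            nn.Linear(hidden_dim, output_dim))

    def forward(self, data):
        # x: [num_nodes, input_dim], edge_index: [2, num_edges],
        # batch: [num_nodes] mapping each node to its graph in the mini-batch
        x, edge_index, batch = data.x, data.edge_index, data.batch
        for conv in self.convs:
            x = F.relu(conv(x, edge_index))
            x = F.dropout(x, p=0.25, training=self.training)
        if self.task == 'graph':
            x = pyg_nn.global_mean_pool(x, batch)   # one embedding per graph
        emb = x
        x = self.post_mp(x)
        return emb, F.log_softmax(x, dim=1)

    def loss(self, pred, label):
        return F.nll_loss(pred, label)
```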
Okay, so now let's talk about what GCNConv and GINConv really are. These convolutions are classes defined by PyTorch Geometric, and both GINConv and GCNConv inherit from the MessagePassing class. Here we're saying: maybe we want to come up with our own convolution module — maybe we have special needs or special ideas we want to implement — so we don't want to follow the standard GCNConv. This is how you implement a custom convolution layer. Our custom convolution layer also inherits from MessagePassing, and notice that when we call the superclass constructor, we specify the aggregation function: here we just add all the messages together, so that's 'add', but we could also take 'mean' or 'max'.

The previous student asked where the parameters are defined: here is where. We have a linear layer, and the parameters of that linear layer are defined for this specific convolution. As usual, because MessagePassing is a subclass of nn.Module in PyTorch, we implement both the __init__ and the forward function — it's modules inside modules inside modules.

In this forward function, because this is a graph neural net, there are two things we want as input: the connectivity, which is the adjacency list in edge_index, and the feature matrix, which is x. I've also listed the shapes: N is the number of nodes, in_channels is the number of input dimensions, and edge_index has shape [2, E] — the 2 because every edge connects two nodes you want to specify, and E is the number of edges. Here there's something like add_self_loops: in GCN we effectively multiply a normalized adjacency matrix (with self-loops added) by the feature matrix and a weight matrix and apply a non-linearity, so the self-loop means each node also aggregates itself, not only its neighbors — your A is effectively A plus the identity.

But we can change that a little, and there's a lot of customization that can be done here: you don't have to add self-loops, and here there's actually a remove_self_loops call, to make sure we don't have self-edges. The reason we remove self-edges is so that we can add a skip connection on top. What we do is: I transform my edges, I transform my x, and then I do the propagation. What propagate does is call the message function — it computes the messages for all the nodes and then does the aggregation to get the new representation based on the neighborhood defined by the edge list. But then I can add a skip connection: I define another linear layer for the node itself, and I want the node's own embedding to pass through that linear layer, while the rest — the messages — pass through the other linear layer. So instead of passing x directly into propagate, I pass the linearly transformed x; that's how we propagate all the neighbor information.
And then we can also add the self information — the transformed self embedding we just computed — plus all the neighborhood messages we got from the propagate function. So propagate propagates all the neighborhood information, and the thing being propagated is x passed through a linear layer.

(Question:) Yes, the aggregation happens behind the scenes; you don't have to do it manually. The edge list tells it how to aggregate: it uses edge_index to define the neighborhood of each node where you want to pass messages, so you don't have to write that part — that's exactly what this package helps you with. The message function is just computing the message, and note that the message function is called inside propagate.

(Question: what is super here?) super means I'm calling the superclass of the current class. CustomConv inherits from MessagePassing, so when I call super().__init__ here, I'm calling the constructor of the MessagePassing class. You do have to call super, and it's best to call it before everything else in your __init__, so that the nn.Module machinery is initialized before you define everything else.

Here I'm doing a lot of extra computation just because I wanted to essentially reproduce GCNConv, but you don't have to do all this. If you just want to add up all your neighbors — and there's a paper showing that simply summing all your neighbors works well for graph classification — you don't actually need all of this; you can just use the embeddings themselves as the messages. Also, message can take anything that propagate takes — the two have the same signature — so message can take not only x_j, the neighbor's embedding, but also x_i, the node's own embedding. Your message doesn't have to be a function of x_j only; you could write something like x_i minus x_j — it's probably not going to work well, but you can define a complex model as a function of x_i and x_j as your messages. The update function means that after the message passing, after it has computed the node embedding, you have additional layers on top to transform it further — for instance, in GraphSAGE we do an L2 normalization, something like F.normalize over the last dimension — and you can also put an MLP there, or anything that comes after message passing.

So the story is: once you write this kind of MessagePassing subclass, you can essentially swap it in — instead of saying pyg_nn.GINConv, I can change it to our CustomConv.
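A sketch of such a custom layer along the lines just described (sum aggregation, a skip connection for the node itself, and an L2-normalizing update); the attribute names lin and lin_self are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import remove_self_loops

class CustomConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr='add')          # sum the incoming messages ('mean'/'max' also work)
        self.lin = nn.Linear(in_channels, out_channels)       # transforms neighbor messages
        self.lin_self = nn.Linear(in_channels, out_channels)  # skip connection for the node itself

    def forward(self, x, edge_index):
        # x: [num_nodes, in_channels], edge_index: [2, num_edges]
        # drop self-edges so the node's own embedding only flows through lin_self
        edge_index, _ = remove_self_loops(edge_index)
        self_x = self.lin_self(x)
        # propagate() calls message() for every edge and then aggregates per node
        return self_x + self.propagate(edge_index, x=self.lin(x))

    def message(self, x_j):
        # x_j: embedding of the neighbor (source node) for each edge
        return x_j

    def update(self, aggr_out):
        # optional post-aggregation transform, e.g. L2-normalize each node embedding
        return F.normalize(aggr_out, p=2, dim=-1)
```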
(Question: where do you specify the depth of the GNN?) Remember the ModuleList is where I put all the convolutions, so this essentially does three layers: one conv from the input dimension to hidden, a second conv from hidden to hidden, and so on. In the forward function there's a loop: I take the input, apply the convolution defined at that layer, get the output, and the output is fed back in as the input to the next layer. So, by looking at these few lines, this module does three hops — that's the depth in the sense of how far away you look in the graph. It's not the depth of your neural network in terms of layers, because each of these convolutions can itself have multiple layers; it's the number of hops you want to go out. Because we loop over the three layers, each time you take your input, perform the convolution at that layer, get the output, and feed it back as the input of the next layer — it's just a for loop to go over multiple hops.

(Question about post_mp:) In practice, for graph classification it's often beneficial to have a few more linear layers after you finish the message passing, so the post-message-passing part is an nn.Sequential, as I explained, which executes its layers in order — it performs a two-layer MLP on top of the graph embedding you just computed. Again, it's not necessary; it's just something we added to show that you can customize the model with different architectures.

Let's look at the training loop. The training loop is very similar to the previous one, and this code is already written for you in the homework, so you can also look at the homework. The thing to note is that we have a train_loader which loads the first 80% of the dataset and a test_loader which loads the remaining 20%. In practice, you should note that there should also be a train/validation/test split: the test set is something you don't look at during training; you look at the validation split at training time, evaluate performance on it, and do some kind of early stopping — if the validation accuracy increases and then decreases, you want to stop around the best point — and once you've determined the best epoch, the epoch where your model performs best, you use that model on your test set, which is a one-time thing you don't look at before you finish training. Here, for simplicity, I'm just doing a train/test split, but the proper thing to do is train on 0 to 0.8 of the data, validate on 0.8 to 0.9, and test on 0.9 to 1.0, so you have a separate test split. I specify a batch size of 64 and also shuffle the data, so the mini-batches come in a different order. Then it's the same as in the MNIST framework: construct the model, construct the optimizer — I still pass in all the model's parameters, so the optimizer optimizes all the parameters of the model — and then I start the training loop.
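A small sketch of the loader setup just described, using the IMDB-BINARY benchmark mentioned later; the DataLoader import location depends on your PyTorch Geometric version:

```python
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader   # older PyG: from torch_geometric.data import DataLoader

dataset = TUDataset(root='/tmp/IMDB-BINARY', name='IMDB-BINARY').shuffle()

# simple 80/20 split used in the demo; a proper setup would also hold out
# a validation split (e.g. 80/10/10) for early stopping
n = len(dataset)
train_loader = DataLoader(dataset[:int(0.8 * n)], batch_size=64, shuffle=True)
test_loader = DataLoader(dataset[int(0.8 * n):], batch_size=64, shuffle=False)
```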
The training loop runs for 200 epochs, and for every epoch it's the same: get a mini-batch from the loader, call zero_grad (super important), and get a prediction. If this is node classification, I also have this mask. To clarify the mask: in node classification we often have one graph, and within this graph we want a train/test split — you train on a subset of the nodes and test on other nodes you haven't seen before. The mask says you can only look at the nodes that are not masked out: those are your training set, and you mask out all the validation and test nodes. This mask is defined in the data loader (it's also in your homework, so you can have a look), and at test time your mask is instead the validation mask or the test mask. That's how we train on a single graph with different train/validation/test nodes in the same graph. If this is not node classification but graph classification, there are typically many graphs, so we just split by graphs: your training set is, say, graphs 1 to 10, validation is graphs 11 to 13, and so on.

Then we compute the loss; the loss is added to the total loss, and we monitor it. Note the add_scalar call — this is from the writer, which is the TensorBoard stuff I'll explain in a moment. Instead of printing out the loss at every iteration, as we did for MNIST, it's very convenient to write it out to TensorBoard, and that's what this does: I log the total loss value along with the epoch, and TensorBoard will plot the graph of loss versus epoch.

And this is testing. Testing is very similar, except that instead of the train mask I'm using the validation mask for node classification, and I'm checking whether the label is equal to the prediction. You loop over all the mini-batches in the data loader, but note one thing that makes it a little faster: you don't have to do this, but if you wrap it in `with torch.no_grad()`, no gradients are computed, which makes it slightly faster, because at test time you don't optimize, so you don't need gradients.

Now the TensorBoard part. This is a little tricky because we're running on Colab, but if you're running a Python file you don't have to do all of this; you just need to import tensorboardX. This is the URL it provides — currently there's nothing there because we haven't started training. This run is a graph classification task: we're using the IMDB-BINARY graph classification benchmark (in your homework I think you're given the ENZYMES dataset, but it's very similar). We shuffle the dataset and then start training — the training loop is the one over here, where for every epoch we perform optimization. It's a little slow, so in the meantime, this other one is the node classification task: we set task='node', which means you're actually using the mask for the train and validation split, and it's another dataset called CiteSeer, which is a citation dataset very similar to the Cora dataset that's in your homework, but here used for node classification.
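A condensed sketch of the train and test functions just described, paired with the GNNStack sketch above; the log directory name is an arbitrary illustration:

```python
import torch
from tensorboardX import SummaryWriter

writer = SummaryWriter('./log/my_experiment')   # illustrative log directory

def train(model, loader, optimizer, task, epochs=200):
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for batch in loader:
            optimizer.zero_grad()
            emb, pred = model(batch)
            label = batch.y
            if task == 'node':
                # node classification: only the training nodes contribute to the loss
                pred, label = pred[batch.train_mask], label[batch.train_mask]
            loss = model.loss(pred, label)
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * batch.num_graphs
        total_loss /= len(loader.dataset)
        writer.add_scalar('loss', total_loss, epoch)   # track the curve in TensorBoard

def test(model, loader, task, is_validation=True):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():                              # no gradients needed at test time
        for batch in loader:
            _, pred = model(batch)
            pred = pred.argmax(dim=1)
            label = batch.y
            if task == 'node':
                mask = batch.val_mask if is_validation else batch.test_mask
                pred, label = pred[mask], label[mask]
            correct += (pred == label).sum().item()
            total += label.size(0)
    return correct / total
```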
Okay, so this is training, and while it's training it starts logging. The SummaryWriter is from tensorboardX: everything you log with writer.add_scalar gets written to the log directory you specify here. The directory name is arbitrary — you can name it after your favorite model; it's just an example — and when you log into that directory, the TensorBoard server retrieves data from that log directory and you see a visualization of it. So under the name we just gave our model, we can see the loss function decreasing: for every epoch we call add_scalar, so we get a plot of the loss. There's also the test accuracy, but the difference is that at test time, because testing is relatively expensive (you have to loop through the entire dataset), we only evaluate every 10 epochs, so you see one data point every 10 epochs. Note that this is updating live, because we're running it and generating logs in real time. So that's TensorBoard — it's really very simple: all you need to do is import tensorboardX, define the writer with the directory you want to log to, and call add_scalar. You can also log images and so on, but that's a bit more involved and I won't talk about it.

Lastly, let's have a look at the visualization. Once you've learned node embeddings — whether from DeepWalk, node2vec, or a GNN — you can always visualize them in 2D to get a sense of how good your embedding is: are the nodes from the same class close to each other, are they well clustered — you can answer a lot of these questions just by visualizing. This is also very simple: it's just the standard matplotlib scatter-plot stuff, where the x and y axes are the t-SNE embedding. The t-SNE embedding is also simple to compute, because we have the TSNE class from scikit-learn: we construct a TSNE object and call fit_transform, and the input to fit_transform is the embedding that has been learned. Note that this embedding is an output of the model — remember that in our model we actually return the embedding — so we take the embedding from the graph neural net and t-SNE maps this high-dimensional embedding into two dimensions so you can visualize it in the 2D plane, calling plt.scatter with the color representing the class of each node. As we can see for CiteSeer, for instance, it's not perfect: there are nodes clustered in the wrong place, like a red node in the middle of a brown cluster. It's not a perfect embedding — it actually gets around 70% accuracy.
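A minimal sketch of that visualization step; emb (the node embeddings returned by the model's forward pass) and colors (the class of each node) are placeholders for whatever your model produced:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# emb: [num_nodes, hidden_dim] tensor of node embeddings, colors: per-node class labels
emb_2d = TSNE(n_components=2).fit_transform(emb.detach().cpu().numpy())
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=colors, cmap='Set1', s=5)
plt.show()
```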
(Question: if the picture looks bad, is the issue with the clustering or with the classifier?) The question is how we know, when something looks wrong, whether it comes from the dimensionality-reduction algorithm or from the classifier. I think the answer is that we have to trust these low-dimensional embedding techniques to a degree, and they are definitely not perfect. If you imagine you have a 300-dimensional embedding and you want to compress it into two principal components, it's not going to be perfect, so it's very hard to determine whether something is an artifact or not. This plot is mainly meant as a comparative tool: if you have two models, one gives you this visualization and the other gives you that visualization, and you observe that one has better clustering, then, given that you used the same t-SNE algorithm, you'd conclude that one is doing better. But in an absolute sense it's very hard to determine whether two things being close together is because your method isn't good or because the t-SNE isn't good. Good question.

So that's pretty much the end of the tutorial. Lastly, I just want to show you a little bit about an extra task that is not required in the homework: link prediction. This is very important if you're doing knowledge graphs or graph reasoning — it's basically asking whether you can do completion of a knowledge graph, or predict whether two nodes are linked together in a graph that has a lot of missing edges. Here we use the graph auto-encoder (GAE), which you can also look up in the PyTorch Geometric documentation. This auto-encoder has an encoder part and a decoder part. In the encoder part we use graph convolutions to get embeddings for all the nodes; in the decoder part we use something very simple, the inner product: if two embeddings have a large inner product, close to one, we think there is likely a link between them, and if not, there's no link. So the encoder is a GNN and the decoder is the inner product. Here we define a very simple encoder — you're probably already very familiar with it, just two layers of GCNConv — and we actually don't specify the decoder, because the default decoder is the inner product; if you want a fancier decoder, like a neural network or any other decoder, then you would modify the decoder as well — overall it's just defining another nn.Module.

As usual we do the train/test split, and we use the CiteSeer dataset — the same dataset we used for node classification, but now we're doing link prediction on it — and we record two metrics, AUC and average precision, both of which are also available in sklearn, so you can use what I described towards the beginning of the class to compute these metrics. Link prediction also trains pretty fast. One thing worth noting is the train_test_split_edges function: this is the important function that helps you construct positive and negative examples for link prediction. As you probably remember from class, you have to construct positive examples and then negative examples by negative sampling, and the objective combines the log-likelihood of the positive edges with a term for the negative samples. This function does that separation between the positive edges and the negative edges, so it's a useful function for link prediction.
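A sketch of the GAE link-prediction setup just described; the API shown (train_test_split_edges, GAE.encode/recon_loss/test) matches older PyTorch Geometric versions, and newer versions replace the edge-splitting utility with RandomLinkSplit:

```python
import torch
import torch.nn as nn
import torch_geometric.nn as pyg_nn
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import train_test_split_edges

class Encoder(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = pyg_nn.GCNConv(in_channels, 2 * out_channels)
        self.conv2 = pyg_nn.GCNConv(2 * out_channels, out_channels)

    def forward(self, x, edge_index):
        x = torch.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

dataset = Planetoid('/tmp/citeseer', 'CiteSeer', transform=T.NormalizeFeatures())
data = train_test_split_edges(dataset[0])    # builds positive/negative train/val/test edges

# GAE: the encoder produces node embeddings, the default decoder is the inner product
model = pyg_nn.GAE(Encoder(dataset.num_features, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    z = model.encode(data.x, data.train_pos_edge_index)
    loss = model.recon_loss(z, data.train_pos_edge_index)   # samples negatives internally
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    z = model.encode(data.x, data.train_pos_edge_index)
    auc, ap = model.test(z, data.test_pos_edge_index, data.test_neg_edge_index)
print(f'AUC {auc:.3f}  AP {ap:.3f}')
```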
Other than that, everything is very similar: you define the model and all its parameters, move it to the GPU, and run the training function we built. And again we can also visualize it — the colors aren't defined, so I have to re-run that cell — but the visualization is also very similar: the idea is that we have an encoder, which is the graph convolution, we have a decoder, which is the inner product, and the thing in the middle is our embedding, so we visualize the embedding in the middle. It's the same function you saw in the previous node classification part. And that's it for everything related to PyTorch Geometric — is there any question?
Info
Channel: Hussain Kara Fallah
Views: 9,467
Rating: 4.9776535 out of 5
Keywords:
Id: X_fmiIy_YyI
Channel Id: undefined
Length: 88min 59sec (5339 seconds)
Published: Thu Jul 02 2020