Xavier Bresson: "Convolutional Neural Networks on Graphs"

Video Statistics and Information

Captions
so yeah I'm going to talk about graphs, and first I'd like to say this is joint work with Michaël Defferrard, Pierre Vandergheynst, Federico Monti and Thomas Laurent. I'm going to talk about ConvNet architectures: first I will introduce the original ConvNet architecture, just to make the transition to graphs easier, then I will introduce spectral ConvNets for fixed graphs, and in the last part I will talk about ConvNet architectures for graphs with variable size and connectivity. So you see what I'm going to do: I'm going to relax the data domain from something which is very well structured, which has directions like up, down, left, right, first to an arbitrary graph with no specific directions but with a fixed topology, and then to variable graphs.

So let me start with the standard ConvNet architectures. We know that ConvNets were a breakthrough in computer vision; this is basically the change of paradigm from handcrafted features to learned features, and not only in computer vision, also in speech recognition and language translation. Basically this is an architecture that can learn high-dimensional learning problems. And we know these are high-dimensional data because of the curse of dimensionality: if you take an image, for example 512 by 512 pixels, then the dimensionality of this data is about one million, and if for each variable you take ten sample values, the space is going to have 10 to the power of one million points. These are huge spaces, it is basically impossible to fill them, but still ConvNets are powerful enough to solve these high-dimensional problems, to find meaningful information inside these gigantic spaces.

But there is something here: as was said earlier, there is a very strong assumption about the data we are using. Images, videos and sounds all seem to share a universal property, which is compositionality. It means they are formed of patterns that are local, stationary and multi-scale; I will come back to that to make the transition to graphs. Basically ConvNets are a way to leverage this compositional structure, to extract compositional features and feed them to a classification or regression system, a recommender system, visualization and so on. Then you have a fantastic number of applications of that, so I call this the Swiss Army knife of data science: you can do so many things with it.

So the key properties of compositionality. The first one is locality: we know that one neuron has a receptive field which is local, with compact support, and this is inspired by the human visual cortex neurons. The second key property is stationarity, from which you can get global invariance: if you have a big object like a cat, whether the object is in the right part or the left part of the image, it is the same information, so we have this symmetry. But here what interests us more is local stationarity: we have patches which are the same across the data domain, like the yellow patches, the green patches, the red, the blue and so on, and these are very important to create local invariances. This is usually good because when you want to classify objects, objects in the same class can have local variations, and this is a way to deal with these local variations.
The last property is multi-scale: the idea is that we start from simple structures, then we compose these simple structures to create slightly more abstract structures, and we keep composing more and more abstract structures.

So much for the implementation complexity: ConvNets have linear complexity in the dimensionality of the input data. If you want to create local filters, this is a constant number of parameters. For the stationarity, this is the convolution: you can use the FFT and get n log n, but usually, because the kernels are compact, what you can simply do is slide the kernel with its compact support across the image and you get linear complexity. For the multi-scale part, downsampling and pooling, the cost is again on the order of the dimensionality of your input data. So everything is linear in the complexity.

Then we can stack compositional layers: these are our input features, like RGB for images, these are our output features, and basically it is just a series of pointwise nonlinear activations applied to a linear operator, here the convolution, and then pooling. The pooling can be done in different ways: median pooling, mean pooling or max pooling; usually max pooling is the one that is used. So in one slide, these are all the nice properties of ConvNets: again, locality, stationarity, multi-scale, a constant number of parameters to learn a filter, and O(n) learning complexity.
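To make this pipeline concrete before moving to graphs, here is a minimal NumPy sketch of one convolutional layer on a regular grid: a small filter with compact support slid across the image (linear cost in the number of pixels, since the kernel size is a constant), followed by a pointwise nonlinearity and max pooling. The image size, filter values and pooling stride are toy placeholders, not anything from the talk.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a compact kernel over the image (no padding, stride 1).
    Cost is O(n * k^2) for n pixels and a k x k kernel, i.e. linear in n."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.random.rand(32, 32)                   # toy grayscale image
kernel = np.random.rand(3, 3)                    # 3x3 filter: constant number of parameters
feature_map = np.maximum(conv2d_valid(image, kernel), 0.0)    # pointwise ReLU
pooled = feature_map.reshape(15, 2, 15, 2).max(axis=(1, 3))   # 2x2 max pooling
```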
So the thing is, what is the data domain we are using? If we take an image like this one and zoom in, what we see is a 2D grid, and on top of the grid we have some values, like RGB; this is fundamentally a Euclidean grid. It is the same in 3D if you work with volumes, and the same for natural language processing: if you have a sentence, you have a one-to-one mapping between words and numbers, so you basically have a series of numbers that lie on a 1D grid. The same for sound. So the domain used here is very nice, it is a Euclidean domain, and this is the reason why all these operations of convolution and pooling are fast and well posed.

But the thing is, not all data lie on a Euclidean space. If you take for example a social network, the connectivity between users is not homogeneous, and on top of that, for each user, for each vertex, you can have a lot of data like images, sound, text. So the question is, how do you analyze this data given this non-Euclidean structure? The same for gene regulatory networks: you have connections between the genes, and the genes carry DNA data, so again, how can you analyze this kind of data? The same for brain function: you have structural connections, and on top of the structural connections you have functional activities, so how do you analyze this? And so on for computer graphics, chemistry with molecules, physics and so on. So the idea is that we can use this universal mathematical representation, which is a graph, and on top of the graph we have the data.

The question is, how can we apply ConvNets to this kind of situation? The assumption we are going to make, and this is probably a universal assumption, is that non-Euclidean data are also compositional. Then the question is: if they are compositional, how do you define compositionality on graphs? How do you define graph convolution and graph pooling? And also, how do you make them fast, with linear complexity, if you want to use them in practice?

Let me consider the case of spectral ConvNets for a fixed graph. This is the problem setting: we are going to consider one graph, and only one graph forever, it will be fixed, and on top of this graph we will have a set of signals that we want to analyze using ConvNets. Here is an example, a brain network: it is fixed, it is not going to change, but we still have different activations, different time series, for each region of the brain. Depending on whether I speak or perform different functions, I will get a different signal here, and I want to be able, for example, to classify this signal. So this is the problem setting. Before, in the first part, we had the standard ConvNet: the input is an image, a signal defined on the Euclidean grid, and we can do for example classification. What we will do here is make this modification: we change the Euclidean grid into a graph, a fixed graph, the input signal can be for example a functional MRI signal, and we can do for example classification.

So we need to redefine convolution on graphs. For this we are going to use spectral graph theory; this is actually old, well established mathematics, and we are also going to use some old tricks from graph theory. So, spectral graph theory: first we need a graph. We know what a graph is: we have nodes, vertices, connected by edges, and the edges have some weights. The most important operator here is the graph Laplacian. The graph Laplacian is basically the heat diffusion operator on the graph, a linear diffusion operator: if your function is smooth, the Laplacian applied to this function will be small, but if your function oscillates a lot, then the Laplacian of this function will have large values. This is really the core operator of spectral graph theory, everything starts from this guy. In the continuous domain we have only one definition, the Laplace operator; in the setting of graphs we may have multiple definitions; the most popular are the normalized Laplacian and the random walk Laplacian.

From the Laplacian I can do an eigendecomposition: I can extract eigenvectors and eigenvalues from this guy, and these are known as the Fourier basis functions, or the Fourier modes. So I can rewrite the matrix by the factorization L = Φ Λ Φᵀ, with the matrix Φ of Laplacian eigenvectors and the diagonal matrix Λ of Laplacian eigenvalues. The interpretation is that what I want to do is build the smoothest orthogonal basis on the domain, in this case on the graph, and again a good measure of smoothness is the Laplacian. So what I want to do is find the functions that minimize the Laplacian quadratic form, I want to minimize this guy, with the orthogonality constraint here, and by a standard theorem the solution is given by the eigenvectors of this guy. So this is the system.
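As a small illustration of these objects, here is a sketch, with a made-up toy adjacency matrix, of how one could build the graph Laplacians and extract the Fourier modes in NumPy; the final comment points at the spectral clustering use mentioned just after.

```python
import numpy as np

# Toy symmetric weighted adjacency matrix of a small undirected graph (made up).
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

d = W.sum(axis=1)                                       # vertex degrees
L_comb = np.diag(d) - W                                 # combinatorial Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_norm = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian

# Eigendecomposition L = Phi Lambda Phi^T: the columns of Phi are the graph
# Fourier modes, and the eigenvalues lam play the role of frequencies.
lam, Phi = np.linalg.eigh(L_norm)

# Spectral clustering idea: run k-means on the first few columns of Phi
# to recover communities of the graph.
```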
If we look at these Laplacian eigenvectors on the Euclidean domain, this is very familiar: it is the standard Fourier basis, the sines and cosines, and this is what you use for the Fourier decomposition. For graphs it is more interesting: the Fourier modes of the graph Laplacian are related to the topology of the graph, so the first eigenvectors usually identify the communities of the graph. This is something called spectral clustering: what you basically do is compute, say, the first ten eigenvectors of your graph Laplacian, apply k-means, and then you can identify communities on the graph; there is a large literature on that.

All right, so now I have Fourier basis functions and I can use them to decompose a function. I can decompose my function f as a Fourier series: I take my function, I project it on my Fourier modes, and I get the decomposition. I can do the same on graphs: I take my function f, which is defined on the graph, I project it on the graph Laplacian eigenfunctions, and then I have a decomposition of f on these Fourier modes. What is nice is that I can do everything in matrix-vector notation, so in one line of code I can put the Fourier transform in my computer: I just collect the graph Laplacian eigenvectors in one matrix, apply it to a function on the graph, and I have just done the graph Fourier transform. Of course it is invertible, so I can also go back by doing the inverse Fourier transform.

So why do I want Fourier? Because convolution is easier to do in the Fourier domain, at least for graphs. This is the standard definition of convolution in the Euclidean space; we have shift invariance, I will come back to that, but most importantly we have the convolution theorem: if we want the convolution of f by g, we take the Fourier transform of f, multiply it by the Fourier transform of g, and then take the inverse Fourier transform. And we have a very efficient algorithm, the Cooley-Tukey FFT, which is O(n log n). In the discrete case, the way I can rewrite f convolved with g is with a matrix of the g values times the vector f, and we know that circulant matrices can be diagonalized by the Fourier eigenvectors, so if I put the eigenvectors here and here, I can diagonalize this guy, and the diagonal is going to be the Fourier transform of g. So again I can rewrite this in one line of code with the matrix and the vectors.

What I am going to do now is the same thing for the graph, but it is not a definition, it is an analogy. I am going to say: I can do convolution on the graph by analogy with the Euclidean case. I take my function on the graph, do the graph Fourier transform, multiply by the graph Fourier transform of g, and then do the inverse Fourier transform. So it is an analogy, not a definition. In matrix-vector notation, this is convolution on graphs.
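Here is a small self-contained sketch of that analogy, on a toy path graph, with an arbitrary low-pass spectral filter chosen only for illustration: the graph Fourier transform, its inverse, and the spectral filtering that plays the role of convolution.

```python
import numpy as np

# Toy domain: a path graph on n vertices.
n = 8
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W          # combinatorial graph Laplacian
lam, Phi = np.linalg.eigh(L)            # Fourier modes and "frequencies"

f = np.random.randn(n)                  # a signal on the graph vertices

f_hat = Phi.T @ f                       # graph Fourier transform (one line)
f_back = Phi @ f_hat                    # inverse transform, recovers f

# "Convolution" by analogy with the convolution theorem: multiply in the
# spectral domain by a filter g(lambda), here an arbitrary low-pass choice.
g_hat = np.exp(-2.0 * lam)
f_filtered = Phi @ (g_hat * f_hat)      # equals g(L) applied to f
```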
So this is the question I had before: if I do a bit of computation, something I want you to keep in mind is that by doing this convolution, what you are really doing is taking the Laplacian, applying some spectral function to it, and applying that to f. Convolution is applying a spectral function of the Laplacian to f. But this is not shift invariant, I will show you that in the next slide, the coefficients depend on the Fourier modes, so if you change your graph it is not going to work, and it can be very expensive because there is no FFT on graphs, so you get n squared. It is not shift invariant because of this: if you translate, for example, a Gaussian function on the Euclidean domain, the shape will not change, but on the graph, because the structure is not homogeneous, the Gaussian will change shape at every location. That is not actually a problem, though.

OK, so I know how to do graph convolution. The second operation I need is graph downsampling, or graph coarsening. Why do I need that? Exactly as in the standard case, I want to pool together similar local features, because this way I will be able to create global invariance, with multiple layers, to translation for example. Here the challenge is that the grid is not regular, it is really an irregular grid, so I need to be careful: I need what is called a nonlinear multi-scale coarsening technique to respect the irregularity of the grid, and I also want to make it fast.

Graph downsampling is basically the same as graph partitioning: we have the data domain, which is a graph, with some data on it, and if I want to do coarsening, what I actually want is to cluster this graph into meaningful communities. This is the problem of graph partitioning, and it is a very hard problem, so we need approximations to solve it. There is a large literature of graph clustering algorithms; balanced cuts are an excellent class of techniques for this kind of problem, for example the normalized cut, or the normalized association which is equivalent. Here what we use is a heavy-edge matching to do this nonlinear coarsening of the graph. Basically you have two steps: at each coarsening level you take one vertex, you try to find the neighbor that locally maximizes the normalized association, and then you create a super-vertex that will be used at the next coarsening level. You do that for all points, you create a new, coarser adjacency matrix, and you keep going. Here you only have the guarantee of getting a local maximizer of the normalized association, nothing else.

One thing: when you apply the heavy-edge matching you get an indexing of your graph, so for each vertex you get a number; you need to index it. If you do nothing more, the problem is that you will need to keep a table saying, for example, that these vertices merge into this vertex at the next level; you need to keep in memory all the matchings between coarsened graphs, and that is really not efficient. One way to fix this is to structure the pooling: we do not change the topology, we only change the indexing, by doing a binary tree arrangement of the indices. This way, two adjacent vertices are merged at the next level and I do not need to keep any table. And I can do this in parallel: if I want to do one-max-of-two or one-max-of-four pooling, I can do it in parallel, so it is similar to standard 1D grid pooling.
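Here is a rough sketch of one level of greedy coarsening in that spirit. It is simplified: the matching criterion is the raw edge weight rather than the normalized association, and the binary-tree re-indexing with fake vertices for regular pooling is omitted; the function name and toy structure are invented for illustration.

```python
import numpy as np

def coarsen_one_level(W, rng=np.random.default_rng(0)):
    """Greedy heavy-edge-matching style coarsening (simplified sketch).

    Each unmatched vertex is merged with its unmatched neighbour of largest
    edge weight to form a super-vertex; returns the cluster assignment and
    the coarsened weighted adjacency matrix."""
    n = W.shape[0]
    cluster = -np.ones(n, dtype=int)
    next_id = 0
    for u in rng.permutation(n):                     # visit vertices in random order
        if cluster[u] >= 0:
            continue
        cluster[u] = next_id
        free_nbrs = [v for v in np.flatnonzero(W[u]) if cluster[v] < 0]
        if free_nbrs:
            best = max(free_nbrs, key=lambda v: W[u, v])   # heaviest free edge
            cluster[best] = next_id
        next_id += 1
    P = np.zeros((n, next_id))
    P[np.arange(n), cluster] = 1.0                   # vertex -> super-vertex map
    W_coarse = P.T @ W @ P
    np.fill_diagonal(W_coarse, 0.0)                  # drop self-loops in the sketch
    return cluster, W_coarse
```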
OK, so now I have the tools for convolution and graph downsampling, and I can design spectral ConvNets. This is the first paper, by Bruna, Zaremba, Szlam and LeCun, and it is basically the inspiration, the reason why we have this whole workshop today on trying to extend deep learning to graphs; everything started here. The idea is what I said in the problem setting: you have the regular grid, now you want to change it to an arbitrary graph, and then you want to extract compositional features. In the standard case you would learn filters in the spatial domain; here what we do is learn the filters in the spectral domain, because it is easier for us. So we are going to learn spectral filters. Remember, in image processing, when we want to denoise an image, we know that the noise lives in the high frequencies, so we design spectral filters which are low-pass: they kill the high frequencies and denoise the image. That was for denoising; here, depending on the application, classification, regression, anything, given the data, the filter will be learned by backpropagation. We let the data decide for us what the best spectral functions are.

Now we need to take care of some limitations. The first one is that we need spatial localization; remember, that is very important if we want local invariance for objects in the same class. What we need, using the Parseval identity (with K equal to one it is the variance), is that for a filter which is localized in space, the derivative of the filter in the frequency domain must be small: localization in space means smoothness in the frequency domain. So what we can do is parametrize the spectral filter with smooth basis functions like splines, and then what we need to learn are the spline coefficients, again by backpropagation. So we are localized in space, and the number of parameters needed to learn a filter is a constant, the number of spline coefficients.

Next we need to take care of the quadratic complexity of computing the convolution. For this, we parametrize the spectral filter with polynomials of the Laplacian eigenvalues. When we do that, we have two advantages. The first advantage is that you can control exactly the support of your filter. You can see it easily: if you have a heat source on the graph, with value one at one vertex and zero everywhere else, and you apply the Laplacian once, the support is going to increase by one hop; one hop is all the points you can reach with one jump. If you apply the Laplacian a second time, it is going to be two hops. So you get exact control of the size of your support.
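This one-hop-per-application behaviour is easy to check numerically; here is a tiny sketch on a made-up path graph, printing the support of the Laplacian applied k times to a delta signal.

```python
import numpy as np

# Path graph: applying the Laplacian k times spreads a delta by k hops.
n = 10
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

x = np.zeros(n)
x[4] = 1.0                                  # heat source at vertex 4

for k in range(1, 4):
    x = L @ x                               # one more application of the Laplacian
    support = np.flatnonzero(np.abs(x) > 1e-12)
    print(f"support of L^{k} delta:", support)   # grows by one hop per application
```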
More importantly, you get linear complexity for the convolution. You can see it like this: this is the convolution with f, and in the end you need to compute a sequence of operations which are basically multiplying the Laplacian by a vector. Remember that in the real world Laplacians are very, very sparse: they can be huge matrices, but they are really sparse, and for sparse graphs the complexity is the number of edges, which is basically on the order of the dimensionality of your input data; we can see that in the experiments. One observation: with this way of doing the convolution, we no longer need the eigendecomposition of the graph Laplacian, we do not need eigenvectors and eigenvalues, we just multiply the Laplacian by a vector; that is why it is computationally efficient.

Now we still have the issue that this monomial basis is not orthogonal: it means that during learning, if we change one coefficient, it is going to affect all the other coefficients, so we need an orthogonal basis. There are different options; here we chose Chebyshev polynomials, which are orthogonal and can be computed with a recurrence. One thing we still need to fix, and this will be the last part of the talk, is that you basically cannot transfer what you learn from one graph to a different one. These are just sanity checks on MNIST: we build a graph on the MNIST grid and we just want to see whether we get something close to the original LeNet-5 architecture.

Now I am going to talk about the graph convolutional nets (GCN) introduced by Kipf and Welling. This is a simplified version of ChebNet where the polynomial order is two, but what is very nice here is that this is actually the first time this kind of graph neural network had an impact on an application. This is the semi-supervised clustering problem: you have communities, inside each community you have one label, and you want to infer all the communities; and here, if you use these techniques, you get a big improvement, by 6%. That was really the first successful application of this technique.

Then CayleyNets: as I said, the choice of spectral filters is not limited to these basis functions, you can get richer spectral filters; Federico Monti will talk about this in the next talk. Also, here we only talked about one fixed graph, but you can actually extend this to multiple graphs, as long as they are fixed; Federico Monti will also talk about this in the next talk.
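To make the "only sparse Laplacian-times-vector products" point concrete, here is a sketch of Chebyshev-style filtering in SciPy/NumPy; the function name, the assumption of at least two coefficients, and the default lmax (2 is the upper bound for the normalized Laplacian) are my own illustration choices.

```python
import numpy as np
import scipy.sparse as sp

def chebyshev_filter(L, x, theta, lmax=2.0):
    """Filter a graph signal x with a Chebyshev polynomial of the Laplacian L.

    Only sparse matrix-vector products are used, so the cost is O(K * #edges);
    no eigendecomposition of L is ever computed. theta holds the K filter
    coefficients (the parameters that would be learned by backpropagation)."""
    n = L.shape[0]
    L_scaled = (2.0 / lmax) * L - sp.eye(n)           # rescale spectrum into [-1, 1]
    T_prev, T_curr = x, L_scaled @ x                  # T_0 x and T_1 x
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_next = 2.0 * (L_scaled @ T_curr) - T_prev   # Chebyshev recurrence
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out
```

Kipf and Welling's GCN can be seen as keeping only the lowest-order terms of such an expansion, with an extra renormalization of the adjacency matrix.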
For the last part, I am going to talk about ConvNets for arbitrary graphs. Let me summarize what we did. We started with the standard ConvNet: a Euclidean grid, a lot of structure with directions, and it is fixed. Then we relaxed the directional information: we no longer have any specific direction, but the topology is fixed, and we can use spectral graph ConvNets to define a very rich family of filters to extract information from the signals; with spectral techniques you can take advantage of all the geometry of your graph, so that is pretty powerful. Now the question is, what happens if I change only one edge, if I change the topology of the graph, if I go to some variation of the domain: can we still use spectral graph ConvNets? In other words, can we transfer the filters from this graph to another graph? In the general case, no, we cannot: the generalization is not going to work. The problem is that the Fourier modes are not robust to perturbations: if you change one edge, you change the topology of your graph, and you can get a very different topology. You can also see it with this equation: if you have two graphs g1 and g2, these are your Fourier modes, this is one spectral filter learned on one graph, and this is one signal, and this is the same signal on the new graph; usually you do not have equality. You have equality when the topologies are very close to each other; then you might be able to transfer, but you need to be careful.

Maybe what we should do is try to align the Fourier modes, but aligning the Fourier modes is basically solving the graph matching problem, which is a very hard problem; there is a literature about doing that, and it is even harder when the graphs are variable and the Fourier modes are very different from each other, so the comparison is hard to do directly. Also, there is no clear definition of translation on an arbitrary graph; there is some work about this, and functional maps offer some definition, but it is not clear how we can use it here. And with spectral techniques we assume that the size of the graph is fixed, so how can we deal with variable sizes?

So we have this new problem setting. Remember the first one: we had one graph which is fixed, a lot of signals on this graph, and we wanted to analyze these signals using ConvNets. Now this is the new setting: we have some graphs, on top of the graphs we have some signals, and we have a bunch of them, like molecules, and we want to apply ConvNets to do classification, regression, or all the standard tasks. The framework for doing that was given by Scarselli and his collaborators in 2009 and is called graph neural networks; this is a special neural network technique that can deal with arbitrary graphs. The idea is: what is the minimal inner structure you need to design neural networks for arbitrary graphs? First, you want to be invariant to vertex re-indexing: you do not want to solve the graph matching problem each time you get a new graph. Then you want locality; remember, that is very important for discrimination, so only the neighbors should be considered. You also want weight sharing, so we apply the same convolution weights over the neighbors, and you want to be independent of the graph size. This hidden vector update is, I would say, the most generic transfer function you can have from the neighborhood to the central vertex, and the question is, what is the instantiation of F, what is this guy? In the original paper, Scarselli et al. use a multilayer perceptron: basically they sum over all the neighbors using a multilayer perceptron.
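Here is a minimal sketch of one such neighborhood update, instantiating F as a tiny one-hidden-layer MLP applied to summed neighbor features; the weight shapes and the random toy graph are invented for illustration and not the exact formulation of any of the cited papers.

```python
import numpy as np

def gnn_update(A, H, W1, W2):
    """One generic graph neural network update h_i <- F(sum of neighbor h_j).

    A: adjacency matrix (n x n), H: current node features (n x d),
    W1, W2: weights of a small MLP shared by all vertices (weight sharing).
    Summing over neighbors makes the update invariant to vertex re-indexing
    and independent of the graph size."""
    msg = A @ H                                 # aggregate the neighbors' features
    hidden = np.maximum(msg @ W1, 0.0)          # shared MLP, first layer + ReLU
    return np.maximum(hidden @ W2, 0.0)         # shared MLP, second layer + ReLU

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
n, d = 6, 4
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)  # symmetric, no self-loops
H = rng.standard_normal((n, d))
H_next = gnn_update(A, H, rng.standard_normal((d, 8)), rng.standard_normal((8, d)))
```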
In a more recent paper, Li et al. use a GRU cell for F. So that is the recurrent neural network family of architectures. Then you can also consider the graph ConvNet family: Sukhbaatar, Szlam and Fergus introduced in 2016 what I call the vanilla graph ConvNet architecture. It has the very nice property that you can stack multiple layers: let us consider the graph at one layer, this is the next layer, and let us give a vanilla definition of the convolution: we take the central vertex, we take its neighbors, and we do the convolution over them. This will also be discussed by Sukhbaatar on Friday, with an application to multi-agent communication.

When I started working on this and looked at the literature, the common trend, 90% of the papers, was not to use ConvNets to solve this kind of problem but recurrent neural networks. So I was wondering whether there was something special about using recurrent neural networks, because the complexity of the learning system is higher; I did not know exactly. So we did a numerical study to compare both graph architectures on two very basic and representative graph learning problems: first subgraph matching, I will come back to that, which was introduced by Scarselli, and second semi-supervised clustering.

Along the way we also added some inner structure to this vanilla graph ConvNet. On a graph, something very natural, an idea that goes back to Tai, Socher and Manning, is to say that sometimes you want to stop the flow of information: if you have two communities, sometimes you would like to stop the flow of information between the two communities, you want it to stay inside the same community, and sometimes you want the flow to be open. So you need some gates; here we added edge gates, which is just a very natural way to do that. It also adds anisotropy to the system, because spectral filters, for example, are isotropic: they do not care which vertex in the neighborhood is considered; here you can add back some anisotropy, and of course you need to learn these gate parameters as well.
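As a rough illustration of such edge gates, here is a sketch of one gated aggregation step in NumPy: each edge gets a sigmoid gate computed from the two endpoint features, and the gate scales that neighbor's contribution. The parameter names (U, V, A_gate, B_gate) and the exact form are my own simplification of the edge-gate idea just described, not a line-by-line reproduction of any paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_graph_conv(A, H, U, V, A_gate, B_gate):
    """One gated graph convolution step (simplified sketch).

    A: adjacency (n x n), H: node features (n x d), U, V, A_gate, B_gate: (d x d).
    The gate eta controls how much information flows from vertex j to vertex i."""
    n = H.shape[0]
    gate_i = H @ A_gate                      # gate term from the receiving vertex i
    gate_j = H @ B_gate                      # gate term from the sending vertex j
    out = H @ U                              # self contribution
    for i in range(n):
        nbrs = np.flatnonzero(A[i])
        if nbrs.size:
            eta = sigmoid(gate_i[i] + gate_j[nbrs])        # per-edge, per-feature gates
            out[i] += (eta * (H[nbrs] @ V)).mean(axis=0)   # gated neighbor average
    return np.maximum(out, 0.0)              # ReLU; a residual connection could be added
```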
This is the first graph learning problem we considered, subgraph matching, I would say the most basic problem in graph analysis: you want to do pattern matching. You have a pattern like this one, with some values on top of it, and you embed this pattern into larger graphs; here we use stochastic block models, and they are all different: different connectivities, different sizes, and we want to be able to identify the original pattern. What you see here: the x-axis is the number of layers, and this is the accuracy, so the higher the better. We also added residuality; it is important to add residual connections, especially for arbitrary graphs, you really get a big improvement. What you see is that for a small number of layers, the recurrent neural network architectures are actually better, and in the literature we see that people do not use many layers, maybe one or two, which is why recurrent neural networks are used. But if you apply the graph ConvNets with residuality and increase the number of layers, you get a monotonic increase of performance for all the ConvNet architectures. I forgot to say that all the solid lines are ConvNet architectures and the dotted lines are recurrent neural network architectures. At some point their performance decreases: there is too much to learn, so the performance decreases. And this is the time you need, the batch time: when you use ConvNets, it takes less time than when you use recurrent neural networks.

We also used this for the semi-supervised clustering problem: again you have communities of different sizes and different topologies, we give only one label inside each community, and we try to infer all the communities. We have the same conclusion: if we do not use many layers, the recurrent neural networks do better, but if you use more and more layers, then again you have this increase of performance for the ConvNets, and they also run faster; this is the accuracy and this is the time, so they run faster than the recurrent neural networks. So the conclusion is: if you have arbitrary graphs, you should use graph ConvNets with residuality and stack a large number of layers.

There was also one question from the audience: can you compare this learning technique for semi-supervised clustering to a well-known variational, non-learning technique? When we use only one label for each cluster, the performance is 82% for the learning technique and 45% for the non-learning technique, so there is a very large gap; we were actually surprised there is such a huge gap in performance. The test time is also better, because for the variational technique the complexity is the number of edges and you need to solve a sparse linear system of equations, so it takes more time.

OK, I think I am right on time. So the conclusion is: we now have different architectures to apply ConvNets to different data domains. We keep the properties of localization, we have linear complexity, we can do deep implementations with flexible filters and handle multiple, arbitrary graphs, and I think this opens the path to applications in chemistry, with molecules for example. There are some papers, some code, and thank you very much. [Applause]
Info
Channel: Institute for Pure & Applied Mathematics (IPAM)
Views: 27,935
Rating: 4.8932037 out of 5
Keywords: ipam, ucla, math, mathematics, convolutional neural networks, graphs, deep learning, machine learning, xavier bresson, neural networks
Id: v3jZRkvIOIM
Length: 40min 48sec (2448 seconds)
Published: Fri Feb 16 2018