ICLR 2021 Keynote - "Geometric Deep Learning: The Erlangen Programme of ML" - M Bronstein

Captions
Hello, I'm Michael Bronstein. I'm a professor at Imperial College London and head of graph learning research at Twitter, and I will talk about geometric deep learning. I guess you are all familiar with deep learning, so let me decipher the word "geometric", and for this purpose allow me to take you back in history.

For almost 2000 years the word "geometry" was synonymous with Euclidean geometry, simply because no other types of geometry existed. Euclid's monopoly came to an end in the nineteenth century, when Lobachevsky, Bolyai, Gauss, Riemann and others constructed the first examples of non-Euclidean geometries. Together with the development of projective geometry, an entire zoo of different geometries emerged, with mathematicians debating which geometry is the true one and what actually defines a geometry. A way out of this pickle was shown by a young German mathematician, Felix Klein, appointed in 1872 as a professor in the small Bavarian university of Erlangen. In a research paper that entered the history of mathematics as the Erlangen Programme, Klein proposed approaching geometry as the study of invariants, or symmetries: the properties that remain unchanged under some class of transformations. This approach created clarity by showing that different geometries could be defined by an appropriate choice of symmetry transformations, formalized using the language of group theory.

The impact of the Erlangen Programme on geometry and mathematics in general was very profound. It also spilled into other fields, especially physics, where symmetry considerations allowed the derivation of conservation laws from first principles, an astonishing result known as Noether's theorem. It took several decades until this fundamental principle, through the notion of gauge invariance in its generalized form developed by Yang and Mills in 1954, proved successful in unifying all the fundamental forces of nature with the exception of gravity. This is what is called the Standard Model, and it describes all the physics we currently know. I can only repeat the words of the Nobel-winning physicist Philip Anderson, that it is only slightly overstating the case to say that physics is the study of symmetry.

You may wonder at this point what all this has to do with deep learning. I believe that the current state of affairs in the field of deep learning reminds me a lot of the situation in geometry in the nineteenth century. On the one hand, in the past decade deep learning has brought a revolution in data science and made possible many tasks previously thought to be unreachable. On the other hand, we now have a zoo of different neural network architectures for different types of data, but few unifying principles. As a consequence, it is difficult to understand the relations between different methods, which inevitably leads to the reinvention and rebranding of the same concepts. So we need some form of geometric unification in the spirit of the Erlangen Programme, which I call geometric deep learning. It serves two purposes: first, to provide a common mathematical framework to derive the most successful neural network architectures, and second, to give a constructive procedure to build future architectures in a principled way. The term "geometric deep learning" itself I made up for my ERC grant in 2015, and it became popular after a paper we wrote for the IEEE Signal Processing Magazine. Geometric deep learning is now used almost synonymously with graph neural networks, but I hope to show you that it's part of a much broader picture.
If we look at machine learning, at least in its simplest setting, it is essentially a function estimation problem. We are given the outputs of some unknown function on a training set, let's say labeled dog and cat images, and try to find a function from some hypothesis class that fits the training data well and allows us to predict the outputs on previously unseen inputs. What happened in the past decade is that the availability of large, high-quality datasets such as ImageNet coincided with growing computational resources (GPUs), allowing the design of rich function classes that have the capacity to interpolate such large datasets.

Neural networks appear to be a suitable choice to represent functions, because even with the simplest construction, like the perceptron shown here, we can produce a dense class of functions using just two layers, which allows us to approximate any continuous function to any desired accuracy; we call this property universal approximation. The setting of this problem in low dimensions is a classical problem in approximation theory that has been studied to death in the past century, and we have very precise mathematical control of the estimation errors. But the situation is entirely different in high dimensions. We can quickly see that in order to approximate even a simple class of, say, Lipschitz-continuous functions, like the example shown here (a superposition of Gaussian blobs placed in the quadrants of a unit cube), the number of samples grows very fast with the dimension, in fact exponentially. So we get a phenomenon colloquially known as the curse of dimensionality, and since modern machine learning methods need to deal with data in thousands or even millions of dimensions, the curse of dimensionality is always there, making such a naive approach to learning impossible.

This is perhaps best seen in computer vision problems like image classification. Even tiny images tend to be very high-dimensional, but intuitively they have a lot of structure that is broken and thrown away when we parse the image into a vector to feed it into a simple perceptron neural network. If the image is now shifted by just one pixel, the vectorized input will be very different, and the neural network will need to be shown a lot of examples in order to learn that shifted inputs must be classified in the same way. The remedy for this problem in computer vision came from the classical works in neuroscience by Hubel and Wiesel, the winners of the Nobel Prize in Medicine for the study of the visual cortex. They showed that brain neurons are organized into local receptive fields, which served as inspiration for a new class of neural architectures with local shared weights: first the neocognitron of Fukushima, and then convolutional neural networks, the seminal work of Yann LeCun, where weight sharing across the image effectively solved the curse of dimensionality.

Let me now show another example. What you see here is my favorite molecule, caffeine, represented as a graph: the nodes here are atoms and the edges are chemical bonds. If we were to apply a neural network to this input, for example to predict some chemical property like its binding energy to some receptor, we could again parse it into a vector, but this time you see that any arrangement of the node features will do, because in graphs, unlike images, we don't have a preferential way of ordering the nodes.
Molecules appear to be just one example of data with an irregular, non-Euclidean structure on which we would like to apply deep learning techniques. Social networks are another prominent example: these are gigantic graphs with hundreds of millions of nodes. We also have interaction networks, or interactomes, in the biological sciences, manifolds and meshes in computer graphics, and so on. All these are examples of data waiting to be dealt with in a principled way.

So let's look again at the multi-dimensional image classification example that at first glance seemed hopeless because of the curse of dimensionality. Fortunately, we have additional structure that comes from the geometry of the input signal. We call this structure a geometric prior, and it is a general, powerful principle that gives us optimism and hope in dimensionality-cursed problems. In our example of image classification, the input image is not just a d-dimensional vector; it is a signal defined on some domain, which in this case is a two-dimensional grid. The structure of the domain is captured by a symmetry group, the group of 2D translations in our example, which acts on the points of the domain. In the space of signals, the group actions on the underlying domain are manifested through what is called the group representation; in our case it is simply the shift operator, a d×d matrix that acts on the d-dimensional vector.

This geometric structure of the domain underlying the input signal imposes structure on the class of functions f we are trying to learn. We can have functions that are unaffected by the action of the group, what we call invariant functions; a good example is the image classification problem, where no matter where the cat is located in the image we still want to say it's a cat, so this is an example of shift invariance. On the other hand, we can have a case where the function has the same input and output structure, for example image segmentation, where the output is a pixel-wise label mask. There we want the output to be transformed in the same way as the input, or what we call an equivariant function; in this example, what we see is shift equivariance.

Another type of geometric prior is called scale separation. In some cases we can construct a multi-scale hierarchy of domains by assimilating nearby points, producing also a hierarchy of signal spaces that are related by a coarse-graining operator P. On these coarse scales we can apply coarse-scale functions. We say that our function f is locally stable if it can be approximated as the composition of the coarse-graining operator P and the coarse-scale function f'. While the original function f might depend on long-range interactions on the domain, in locally stable functions it is possible to separate the interactions across scales, by first focusing on localized interactions and then propagating them towards the coarse scales.

These two principles give us a general blueprint of geometric deep learning that we can recognize in the majority of popular deep neural architectures: we can apply a sequence of equivariant layers and then an invariant global pooling layer aggregating everything into a single output, and in some cases we can also create a hierarchy of domains by some coarsening procedure that takes the form of local pooling in neural network implementations. This is a very general design that can be applied to different types of geometric structures, such as grids, homogeneous spaces with global transformation groups, graphs, and manifolds, where we have global isometry invariance as well as local gauge symmetries. We call these the "5 Gs" of geometric deep learning.
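To make the notions of invariance and equivariance concrete, here is a tiny numerical sketch of my own (not part of the talk), written in Python with numpy. The shift operator on a one-dimensional periodic grid is just a d×d permutation matrix; an elementwise layer commutes with it (equivariance), a global sum pooling ignores it (invariance), and their composition, as in the blueprint, is again invariant.

import numpy as np

d = 8
x = np.random.randn(d)                      # a signal on a 1D periodic grid
S = np.roll(np.eye(d), 1, axis=0)           # shift operator: (S @ x)[i] = x[i-1]

relu = lambda z: np.maximum(z, 0)           # elementwise layer: shift equivariant
pool = lambda z: z.sum()                    # global sum pooling: shift invariant

assert np.allclose(relu(S @ x), S @ relu(x))          # equivariance: the layer commutes with the shift
assert np.isclose(pool(S @ x), pool(x))               # invariance: pooling ignores the shift
assert np.isclose(pool(relu(S @ x)), pool(relu(x)))   # blueprint: equivariant layer followed by invariant pooling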
The implementation of these principles leads to some of the most popular architectures that exist today in deep learning: convolutional networks, emerging from translational symmetry; graph neural networks, DeepSets and Transformers, implementing permutation invariance; and intrinsic mesh CNNs used in computer graphics and vision, which can be derived from gauge symmetries. I hope to show you that these methods are also very practical and allow us to address some of the biggest challenges, from understanding the biochemistry of proteins and drug discovery to detecting fake news.

Let me start with graphs. Probably each of us has a different mental picture when we hear the word "graph", but for me, maybe because of my work at Twitter, I first think of a social network that models relations and interactions between users. Mathematically, the users of a social network are modeled as nodes of the graph, and their relations are edges, or pairs of nodes, which can be ordered (in this case we call the graph directed) or unordered (in this case the graph is undirected). The nodes can also have some features attached to them, modeled as d-dimensional vectors, say the age, gender or birthplace of the social network users in our example. A key structural characteristic of a graph is that we don't have a canonical way to order its nodes. So if we arrange the node feature vectors into a matrix, we automatically prescribe some arbitrary ordering of the nodes; the same holds for the adjacency matrix that represents the structure of the graph. If we number the nodes differently, the rows of the feature matrix and the corresponding rows and columns of the adjacency matrix will be permuted by some permutation matrix, which is a representation of the permutation group, and we have n! such elements.
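As a small numpy illustration of my own (again not part of the talk): renumbering the nodes turns the feature matrix X into PX and the adjacency matrix A into PAPᵀ, while a simple permutation-invariant readout such as the sum of node features does not change.

import numpy as np

n, d = 5, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))                   # node feature matrix
A = rng.integers(0, 2, (n, n))
A = np.triu(A, 1); A = A + A.T                    # random undirected adjacency matrix

P = np.eye(n)[rng.permutation(n)]                 # a permutation matrix, i.e. a representation of the permutation group
X_perm = P @ X                                    # features under the new node ordering
A_perm = P @ A @ P.T                              # adjacency under the new node ordering

assert np.allclose(X.sum(axis=0), X_perm.sum(axis=0))   # the sum readout is permutation invariant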
If we want to implement a function on the graph that provides a single output for the whole graph, like predicting the energy in our molecular graph example, we need to make sure that its output is unaffected by the ordering of the input nodes; we call such an f permutation invariant. If, on the other hand, we want to make node-wise predictions, for example to detect malicious users in a social network, we want a function that changes in the same way as the input under a reordering of the nodes, or in other words is permutation equivariant.

A way of constructing a pretty broad class of tractable functions on graphs is to use the local neighborhood of a node: we look at the nodes that are connected by an edge to a node i and aggregate their feature vectors together with the feature vector of node i itself. Because we don't have a canonical ordering of the neighbors, this must be done in a permutation-invariant way, so this local aggregation function, which I denote by ϕ, must be permutation invariant. When we apply this ϕ at every node of the graph and stack the results into a feature matrix, we get a permutation-equivariant function F.

The way the local function ϕ is constructed is crucial, and its choice determines the expressive power of the resulting architecture. When ϕ is injective, it can be shown that the neural network designed in this way is equivalent to the Weisfeiler-Lehman graph isomorphism test, a classical algorithm in graph theory that tries to determine whether two graphs are isomorphic. We say that two graphs are isomorphic if there exists an edge-preserving bijection between them, which can be represented by a permutation matrix that rearranges their adjacency matrices such that they are equal. The Weisfeiler-Lehman algorithm is an iterative color refinement procedure that starts with all the nodes of the graph having the same color and then applies an injective function to the neighbor colors. This function has exactly the same structure as the local aggregation ϕ we defined before, and because it is injective, it produces distinct colors for differently structured neighborhoods. In this example, we have nodes with three blue neighbors that become yellow and nodes with two blue neighbors that become green. This procedure is applied once again, and now we have three types of neighborhoods (yellow-yellow-yellow, green-green, and yellow-green) that are mapped into violet, red and grey. If we try to refine the colors further and they don't change anymore, the algorithm stops and outputs a histogram of colors. If another graph has a different histogram of colors, we can say for sure it is not isomorphic, but if we get two equal histograms we actually don't know; the graphs might be isomorphic. So it is a necessary but not sufficient condition. In fact, there are examples of graphs that are deemed equivalent by the Weisfeiler-Lehman test but are not isomorphic, like the example shown here: the graph on the right has triangles while the graph on the left doesn't, and it can be shown that this test cannot count triangles in graphs.

So here is how the local aggregation function typically looks in graph neural networks: we have a permutation-invariant aggregation operator, such as sum or maximum, a learnable function ψ that transforms the neighbor features, and another function ϕ that updates the features of node i using the aggregated features of its neighbors. There are lots of nuances in how to design each of these components, and this is a very active research topic in deep learning on graphs.
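Here is a minimal PyTorch sketch of such a layer, as my own illustration rather than any specific published architecture: a learnable ψ transforms the neighbor features, a sum over the neighbors (read off a dense adjacency matrix) gives the permutation-invariant aggregation, and a learnable ϕ updates each node from its own features and the aggregated message.

import torch
import torch.nn as nn

class LocalAggregationLayer(nn.Module):
    # h_i' = phi(h_i, sum_{j in N(i)} psi(h_j)); applied at every node, this is permutation equivariant
    def __init__(self, d_in, d_out):
        super().__init__()
        self.psi = nn.Linear(d_in, d_out)           # transforms neighbor features
        self.phi = nn.Linear(d_in + d_out, d_out)   # updates node i from its own features and the aggregated message

    def forward(self, X, A):
        # X: (n, d_in) node features, A: (n, n) dense adjacency matrix
        m = A @ self.psi(X)                         # permutation-invariant sum over neighbors
        return torch.relu(self.phi(torch.cat([X, m], dim=-1)))

# usage: out = LocalAggregationLayer(16, 32)(torch.randn(5, 16), (torch.rand(5, 5) > 0.5).float())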
Fortunately, most architectures fall into one of the following three flavors. The first flavor is convolutional; this is what some of the early works on graph neural networks look like, originating from spectral analysis on graphs. In this setting we aggregate the neighbor features weighted by some fixed coefficients c_ij that depend only on the structure of the graph and can be interpreted as the importance of node j to the representation of node i. We will see that on grids this scheme boils down to the classical convolution. The second flavor is based on attention, where the aggregation coefficients now depend on the features themselves. And in the most general flavor we have a nonlinear function depending on both feature vectors of nodes i and j, whose output can be regarded as a message that is sent to update the features of node i. Graph neural networks of this type are called message passing; in chemistry applications they were introduced by Justin Gilmer from DeepMind, and in computer graphics in our collaboration with Yue Wang and Justin Solomon from MIT.

If you look at a typical graph neural network architecture, you will immediately recognize an instance of our geometric deep learning blueprint with the permutation group as the geometric prior: we have a sequence of permutation-equivariant layers, often referred to as propagation or diffusion layers in the literature, and an optional global pooling layer to produce a single graph-wise output. Some architectures also include local pooling layers obtained using some form of graph coarsening, which can also be learnable.

Let me say a few words about some interesting special cases of graph neural networks. First, a graph with no edges is a set, and sets are also unordered. In this case the most straightforward approach is to process each element in the set entirely independently by applying a shared function ϕ to their feature vectors. This translates into a permutation-equivariant function over the set and is a special setting of a graph neural network; this architecture is known as DeepSets in deep learning or PointNet in computer graphics. At the other extreme, instead of assuming that each element of a set acts on its own, we can assume that any two elements can interact, which translates into a complete, or fully connected, graph. In this case the convolutional flavor actually makes no sense, because the aggregation would be over the set of all nodes and thus the second argument of our function would be the same for all nodes. But we can use an attention-based aggregation, which we can interpret as a form of learnable soft adjacency matrix, and I hope you can recognize the famous Transformer architecture that is now very popular in natural language processing applications; it too is a particular case of a graph neural network.

I should say that Transformers are commonly used to analyze sequences of text, where the order of the nodes is given. This ordering information is typically provided in the form of what is called positional encoding, an additional feature that uniquely identifies the nodes. Similar approaches exist for general graphs, and there are several ways we can encode the node positions. We showed one such way in a recent paper with my students Giorgos Bouritsas and Fabrizio Frasca, where we counted small graph substructures such as triangles or cliques, providing in this way a kind of structural encoding that allows the message passing mechanism to adapt to different neighborhoods.
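As a toy illustration of this kind of substructure counting (my own sketch of the general idea, not the exact method of that paper): the number of triangles each node participates in can be read off the diagonal of A³ and concatenated to the node features as an extra structural channel.

import numpy as np

def triangle_counts(A):
    # number of triangles each node participates in, for an undirected adjacency matrix A
    A3 = np.linalg.matrix_power(A, 3)
    return np.diag(A3) // 2          # closed walks of length 3 count every triangle twice per node

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(triangle_counts(A))            # [1 1 1 0]: nodes 0, 1 and 2 form the only triangle
# the counts could then be appended to the node feature matrix, e.g.
# X_aug = np.concatenate([X, triangle_counts(A)[:, None]], axis=1)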
This architecture, which we call Graph Substructure Networks, can be made strictly more powerful than the Weisfeiler-Lehman test by an appropriate choice of substructures. It is also a way to incorporate problem-specific inductive bias. For example, in molecular graphs cycles are prominent structures; in organic chemistry we have an abundance of what are called aromatic rings, and here again you can see the caffeine molecule, which has a six-cycle and a five-cycle. What we observe in experiments with this architecture is that our ability to predict chemical properties of molecules improves dramatically if we count rings of size 5 or more.

So you can see that even in cases where the graph is not given as input, graph neural networks still make sense. And even when the graph is given, we don't necessarily need to stick to it in order to do message passing; in fact, a lot of recent approaches decouple the computational graph from the input graph. This can take the form of graph sampling, usually to address scalability issues, rewiring the graph, or using larger multi-hop filters where aggregation is performed also on the neighbors of the neighbors, like a recent work we did at Twitter which we call SIGN, Scalable Inception-like GNNs. We can also learn the graph on which to run a graph neural network, so that it is optimized for the downstream task. I call this setting latent graph learning: we can make the construction of the graph differentiable and backpropagate through it, and this graph can also be updated between different layers of the neural network. This is what we did in Dynamic Graph CNN, the first architecture to implement latent graph learning, developed with colleagues from MIT.

Perhaps in a historical perspective, latent graph learning can be related to methods called manifold learning, or nonlinear dimensionality reduction. The key premise of manifold learning is that even though our data lives in a very high-dimensional space, it has low intrinsic dimensionality; a metaphor for this situation is the Swiss roll surface. We can think of our data points as if they were sampled from a manifold. The structure of this manifold can be captured by a local graph that we can then embed into a low-dimensional space where doing machine learning, such as clustering, is more convenient. The reason why manifold learning never really worked beyond data visualization is that all these three steps are separate, while it is clear, for example, that the construction of the graph in the first step hugely affects the downstream task. With latent graph learning we can bring new life to these algorithms; I call it manifold learning 2.0. We now have a way to build an end-to-end pipeline in which we build both the graph and the filters operating on this graph as a graph neural network with a latent graph structure. We have recently used a latent graph learning architecture we call the Differentiable Graph Module, or DGM, for automated diagnosis applications, and showed that we can consistently outperform GNNs with handcrafted graphs.
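In its simplest, non-differentiable form, the latent graph can just be a k-nearest-neighbor graph built from the current node features and handed to any message-passing layer; here is a small numpy sketch of that idea (my own illustration; DGCNN and DGM go further and make this step differentiable and layer-dependent).

import numpy as np

def knn_adjacency(X, k):
    # connect each node to its k nearest neighbors in feature space (a directed kNN graph)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                           # exclude self-loops
    idx = np.argsort(d2, axis=1)[:, :k]                    # indices of the k nearest neighbors of each node
    A = np.zeros_like(d2)
    A[np.repeat(np.arange(len(X)), k), idx.ravel()] = 1.0
    return A

X = np.random.randn(10, 16)          # latent node features produced by some previous layer
A = knn_adjacency(X, k=3)            # graph on which the next message-passing layer operates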
Let me now move to another type of geometric structure we are all familiar with, and perhaps show you a different perspective. Grids are also a particular case of graphs, and a grid with periodic boundary conditions, which I show here, is what is called a ring graph. Compared to general graphs, the first thing to notice on a grid is that it has a fixed neighborhood structure; not only that, the order of the neighbors is fixed. I remind you that on graphs we were forced to use a permutation-invariant local aggregation function ϕ because we didn't have a canonical ordering of the neighbors. On the grid we do: we can always put the green first, then the red, and then the blue. If we choose a linear function with the sum aggregation operation, we get the classical convolution, which, if we write it as a matrix, has a very special structure called a circulant matrix. A circulant matrix is formed by shifted copies of a single vector of parameters θ; these are exactly the shared weights in CNNs.

Circulant matrices are very special: they commute, and in particular they commute with a special circulant matrix that cyclically shifts the elements of a vector by one position, which we call the shift operator. So circulant matrices commute with the shift, which is just another way of saying that convolution is a shift-equivariant operation. Now, this statement works in both directions: not only does every circulant matrix commute with the shift, but also every matrix that commutes with the shift is circulant. So what we get is that convolution is the only linear operation that is shift equivariant, and you can see here the power of our geometric approach: convolution automatically emerges from translational symmetry. I don't know about you, but when I studied signal processing nobody really explained where convolution comes from; it was just given as a formula, completely out of the blue.

Let me show you another nice thing. We know from linear algebra that commuting matrices are jointly diagonalizable, or in other words there exists a common basis in which all convolutions amount to pointwise products; they become diagonal matrices. Since all circulant matrices commute, we can pick one of them for the convenience of analysis, and the easiest is to look at the eigenvectors of the shift operator. We can show that the eigenvectors of the shift operator are the discrete Fourier basis, or the DFT, so all convolutions are diagonalized by the Fourier transform. The eigenvalues are actually given by the Fourier transform of the vector of parameters θ that forms the convolution, so you can see that even the Fourier transform also comes from the fundamental principle of symmetry. This relation between the convolution and the Fourier transform is called the convolution theorem in signal processing, and it gives us two ways to perform convolution: either by multiplying by a circulant matrix, which corresponds to sliding a filter along our signal, or in the Fourier domain, as an element-wise product of the Fourier transforms of the signal and the filter.
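These statements are easy to check numerically; here is a short numpy sketch of my own: the circulant matrix built from a filter θ commutes with the shift operator, and multiplying a signal by it gives the same result as an element-wise product in the Fourier domain.

import numpy as np

d = 8
rng = np.random.default_rng(0)
theta = rng.standard_normal(d)        # filter coefficients (the shared weights)
x = rng.standard_normal(d)            # input signal on a periodic grid

C = np.stack([np.roll(theta, i) for i in range(d)], axis=1)   # circulant matrix: shifted copies of theta
S = np.roll(np.eye(d), 1, axis=0)                             # shift operator

assert np.allclose(C @ S, S @ C)                              # circulant matrices commute with the shift
conv_matrix = C @ x                                           # convolution as a matrix product
conv_fourier = np.real(np.fft.ifft(np.fft.fft(theta) * np.fft.fft(x)))   # convolution theorem
assert np.allclose(conv_matrix, conv_fourier)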
Let me now describe a more general case, where our group formalism will be more prominent. We can think of convolution as a kind of pattern matching operation in an image: this is done by sliding a window across the plane. Let me write it a bit more formally: we need to define a shift operator T that will shift the filter, which I denote here by ψ, and an inner product that matches the filter to the image x. If we do it for every shift, we get the convolution. The special thing here is that the translation group can actually be identified with the domain itself: each element of the group, a shift, can be represented as a point on the domain. This is not the general case; in general, the filter will be transformed by the representation of the group, ρ, and the convolution will now have a value for every element of the group g. It is easy to show that this group convolution is equivariant under the group action.

Here is an example of how to do convolution on the sphere, and it is not some exotic construction: spherical signals are pretty important, for example in astrophysics, where a lot of observational data is naturally represented on the sphere, like the cosmic microwave background radiation that I show here. Our group here is the special orthogonal group SO(3) of rotations that preserve orientation, and its action on points of the sphere can be represented by an orthogonal matrix R with determinant equal to one. So the convolution is defined on SO(3): we get a value of the inner product for every rotation R of the filter. If we want to apply another layer of such convolution, we need to apply it on the SO(3) group itself, a three-dimensional manifold whose points are rotations, which I denote by Q.

The sphere in this example is a non-Euclidean space, a manifold, but it is quite special: every point on the sphere can be transformed into another point by an element of the symmetry group of rotations, so in a sense there is complete democracy among the points. In geometry we call such spaces homogeneous, and their key feature is a global symmetry structure. This global symmetry structure obviously does not hold for general manifolds. One thing that we know when we apply a sliding window on an image is that it doesn't matter which way we go from one point to another; we will always arrive at the same result. The situation is dramatically different on a manifold: if we go along the green path or the blue path, we will arrive at different results. In differential geometry this is called parallel transport, and the result of moving a vector on a manifold is path dependent. A crucial difference between manifolds and Euclidean spaces is that manifolds are only locally Euclidean: we can map a small neighborhood of a point to what is called the tangent space. The tangent space can be equipped with an additional inner product structure called a Riemannian metric, which allows us to measure lengths and angles. If the manifold is deformed without affecting the metric, we say it is an isometric deformation, and isometries also form a group.

So we can define an analogy of convolution on manifolds using a local filter applied in the tangent space, and if we make this construction intrinsic, expressed entirely in terms of the metric, we get deformation invariance, or invariance with respect to the isometry group of the manifold. This was in fact the very first architecture for deep learning on manifolds, which we called Geodesic CNNs. One important thing I didn't say is that because we are forced to work locally on the manifold, we don't have a global system of coordinates; we need to fix a local frame at each point, or a "gauge", as physicists call it. The gauge can be changed arbitrarily at every point by applying a gauge transformation, typically a local orientation-preserving rotation, and we need to account for the effect of the gauge transformation on the filter by making it transform in the same way, such that the filter is gauge equivariant. What you can see here is again the comeback of our geometric deep learning blueprint, either in the form of invariance to the isometry group or equivariance to what is called the structure group of the tangent bundle of the manifold.
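To make the group convolution above concrete in the simplest discrete setting, here is a small Python sketch of my own, using the four-element group of 90-degree rotations acting on a planar image instead of SO(3) acting on the sphere: the filter ψ is transformed by every group element and matched against the signal, so the output carries one value for every (rotation, translation) pair.

import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))     # a signal on a 2D periodic grid
psi = rng.standard_normal((3, 3))         # a small filter

# lifting group correlation over the rotation group C4: transform the filter by each
# group element (rotations by 0, 90, 180, 270 degrees) and match it against the image
responses = np.stack([
    correlate2d(image, np.rot90(psi, r), mode='same', boundary='wrap')
    for r in range(4)
])
print(responses.shape)                    # (4, 16, 16): one response map per group element

The same pattern, with rotations of a planar filter replaced by the representation of SO(3) acting on a spherical filter, gives the spherical convolution described above.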
The reason why we care about manifolds is that in computer vision and graphics, two-dimensional manifolds, or discrete surfaces called meshes, are a standard way of modeling 3D objects. What we gain from our geometric perspective is filters that can be defined intrinsically on the object, and this equips our deep learning architecture with invariance under inelastic deformations. One application where dealing with deformable objects is crucial is motion capture, or mocap, which is used in the production of expensive blockbuster movies. What you see here is a cheap markerless motion capture setup from a Swiss company called FaceShift; the company was bought by Apple in 2015, and its technology now powers the Animoji feature on the iPhone. What this video nicely shows, I think, is two prototypical problems in computer vision: the problem of shape analysis, where we are given the noisy face scan of the actor captured by the sensor, which has to be brought into correspondence with some canonical facial model, and the problem of synthesis, where we need to deform this model to reproduce the input expression of the actor.

Ten years ago one would need a 3D sensor to produce this motion capture effect, and I myself was very adamant about it. Since there were no cheap real-time sensors with sufficient resolution on the market at that time, we had to build one. This was our startup Invision, and here you can see a eureka moment from 2011, where an FPGA implementation of our sensor prototype worked for the first time. Invision was acquired by Intel in 2012, where I spent the following eight years building what is now called the RealSense technology. RealSense was released in 2014 with this funny commercial featuring Sheldon Cooper from The Big Bang Theory (I'm sorry that I always forget his real name), and it was the first mass-manufactured integrated 3D sensor that became a commercial success for Intel.

Fast forward 10 years, and we don't need 3D input anymore for something similar to the motion capture video that I showed. We can actually have hybrid geometric deep learning architectures for 3D shape synthesis problems, with a standard 2D CNN encoder that works on the input image or video and a geometric decoder that reconstructs a 3D shape. This was the work of my PhD student Dominik Kulon, and last year at CVPR we showed a demo of full-body 3D avatars with detailed hands from purely 2D video input that ran on an old iPhone 10 times faster than real time. This was a collaboration with a startup, Ariel AI, that was acquired by Snap last year.

Let me now talk about some applications of geometric deep learning, which is probably the part I'm most excited about in this field. If we look at graphs, they are really ubiquitous: we can describe practically any system of relations or interactions as a graph, from the nano-scale, as models of individual molecules, to the micro-scale, looking at interactions between different molecules in our body, all the way to the macro-scale, at which we can model social networks of whole countries or even the entire world. One thing that you often hear in the popular press in relation to social networks is the problem of misinformation, or so-called fake news. There is empirical evidence that fake news spreads differently on the social network, and using graph learning we tried to detect misinformation by looking at the spreading patterns of different stories. We got quite encouraging results, and together with my students I founded a company called Fabula AI that commercialized this technology.
In 2019 Fabula was bought by Twitter, where I currently head a group that does research on graph ML, and as you can imagine, graphs of different types, such as the follow graph or the engagement graph, are among the key data assets for Twitter.

But if you ask me to pick just one application where I believe geometric deep learning is likely to produce the biggest impact, I think it's the biological sciences and drug design. You may know that making new drugs is a very long and extremely expensive business: bringing a new drug to the market takes more than a decade and costs more than a billion dollars. One of the reasons is the cost of testing, where many drugs fail at different stages. The space of possible drug-like molecules that can be chemically synthesized is extremely large, while on the other hand we can test maybe just a few thousand compounds in the lab or in the clinic. So there is a huge gap that has to be bridged, and it can be done by computational methods that perform virtual screening of the candidate molecules, predicting properties such as toxicity and target binding affinity. Graph neural networks have recently achieved remarkable success in virtual screening of drugs; nowadays they are already more accurate and orders of magnitude faster than traditional approaches. Last year the group of Jim Collins at MIT used graph neural networks to predict the antibiotic activity of different molecules, leading to the discovery of a new powerful antibiotic compound called halicin, which originated as a candidate antidiabetic drug.

If we look at traditional small-molecule drugs, one thing that characterizes them is that they are typically designed to attach to, or as chemists say bind, some pocket-like regions on the surface of a target molecule, which is usually a protein. Here you can see again my favorite molecule, caffeine: when I drink from this cup it will get into my bloodstream, cross the blood-brain barrier and attach itself to the adenosine receptor in the brain. Its interface is cut out in this figure so you can clearly see the deep pocket on the protein surface. More recently, the pharma industry has become interested in drugs that disrupt or inhibit protein-protein interactions, or PPIs, because most biochemical processes in our body, including those that are related to diseases, involve proteins that interact with each other. One of the most famous such mechanisms is the programmed death protein complex; it is used in cancer immunotherapy, for which the Nobel Prize in Medicine was awarded in 2018.
Since PPIs typically have flat interfaces, like the programmed death ligand PD-L1 protein I show here, they are usually considered undruggable by traditional small molecules. A promising new class of drugs is based on large biomolecules, peptides, proteins or antibodies, that are engineered to address difficult targets. These drugs are called biologics, and there are already several of them on the market. With my collaborators from EPFL we developed a geometric deep learning architecture called MaSIF (it was on the cover of Nature Methods last year) that allowed us to design new protein binders from scratch; you can see three such examples that were experimentally confirmed to bind the PD-L1 oncological target.

Another promising direction towards cheaper and faster development of therapies is drug repositioning, when existing approved drugs are used against new targets, sometimes in combination with other drugs; this is called combinatorial therapy, or polypharmacy. Many such drug combinations may have unknown, potentially dangerous side effects, and graph neural networks were recently applied to predict them. These ideas are actually not limited to synthetic drugs. With collaborators from Imperial College and the Vodafone Foundation, we took graph-based drug repositioning approaches to the domain of food. You may know that plant-based food ingredients are rich in compounds belonging to the same chemical classes as anti-cancer drugs. With every bite of food we put thousands of bioactive molecules into our body, and most of them still remain largely unexplored by experts, untracked by regulators and unknown to the public at large; how many of you have heard, for example, about polyphenols, flavonoids or indoles? So it is truly the dark matter of nutrition.

The way we model the effect of molecules is by how they interact with protein targets. Since proteins interact with each other, the effect on one target ripples through the PPI graph and affects other proteins, a kind of network effect, because in our body's biochemistry a lot of the biomolecules are interrelated. If we now take a training set, for example of drugs with known anti-cancer effect, we can train a classifier based on a graph neural network that predicts how likely a molecule is to be similar to an anti-cancer drug from the way it interacts with protein targets. We can then apply this classifier to other molecules contained in food, for which we know the interactions with proteins, and this gives us a list of potential anti-cancer food molecules. Now, I'm hugely simplifying here: the biggest part of this work, which appeared in Nature Scientific Reports, was actually to study the pathways affected by these molecules and to confirm their anti-cancer effect and lack of toxicity. But to make a long story short, we constructed the anti-cancer molecular profiles of over 250 different food ingredients, and we see that there are prominent champions, which we call "hyperfoods", for example tea, cabbage, celery and sage. These are all rather common, cheap and, I dare say, boring ingredients that we had better add to our diet. Perhaps the coolest part of this project is that the ingredients we identified were used by the famous chef Bruno Barbieri to present short recipes for Christmas. If you wonder why he is in bed, this was part of the Vodafone Foundation citizen science campaign called DreamLab; we collaborated with them to use the idle power of smartphones at night to run our computations.
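To make the "network effect" idea above a bit more concrete, here is a deliberately toy Python sketch of my own, on entirely synthetic data: each molecule is represented by the proteins it targets, that profile is diffused over a PPI graph for a few steps, and a simple classifier is trained on the diffused profiles. None of the graphs or labels below are real, and the actual pipeline in the paper is far more involved.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_proteins, n_molecules = 200, 60

A = (rng.random((n_proteins, n_proteins)) < 0.03).astype(float)
A = np.maximum(A, A.T)                                   # synthetic undirected PPI graph
P = A / np.maximum(A.sum(1, keepdims=True), 1)           # row-normalized propagation operator

targets = (rng.random((n_molecules, n_proteins)) < 0.05).astype(float)   # which proteins each molecule hits (synthetic)
labels = rng.integers(0, 2, n_molecules)                                 # 1 = "known anti-cancer drug" (synthetic)

profiles = targets.copy()
for _ in range(3):                                       # let the effect ripple through the PPI graph
    profiles = 0.5 * profiles + 0.5 * profiles @ P

clf = LogisticRegression(max_iter=1000).fit(profiles, labels)
# the trained classifier would then score food molecules by their diffused target profiles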
I think it's a good moment to end on this tasty note, so let me conclude. We started with a somewhat irreverent desire to imitate the Erlangen Programme in machine learning, trying to derive different deep learning architectures from fundamental principles of symmetry. This took us all the way from image classification to molecular gastronomy. All the approaches we have seen were instances of a common blueprint of geometric deep learning, where the architecture emerged from the assumptions on the domain underlying our data and its symmetry group. In the past few years geometric deep learning methods, especially graph neural networks, have exploded, leading to several success stories in industrial applications, and I think it's quite indicative that last year two major biological journals featured geometric deep learning papers on their covers, which means that it has already become mainstream and possibly will lead to new exciting results in fundamental sciences. Last but not least, I would like to acknowledge all my amazing collaborators, and thank you for your attention.
Info
Channel: Michael Bronstein
Views: 86,631
Rating: 4.9759498 out of 5
Keywords: geometric deep learning, machine learning, graph neural networks, artificial intelligence, AI, drug design, symmetry, geometry, convolutional neural networks, transformers, manifold learning, computer vision, computer graphics, hyperfoods, proteins, cancer, immunotherapy, erlangen program, deep learning, neural network, CNN, GNN, Transformer, positional encoding, graph learning, group theory, invariance, equivariance
Id: w6Pw4MOzMuo
Length: 38min 26sec (2306 seconds)
Published: Tue Jun 08 2021