GEOMETRIC DEEP LEARNING BLUEPRINT

Captions
So you're talking about the symmetries in data. Could these analogies also be represented using the kind of symmetries that you were talking about? Good question. What do you think of it? Modern machine learning operates with large, high-quality datasets, which, together with appropriate computational resources, motivates the design of rich function spaces with the capacity to interpolate over the data points. This mindset plays well with neural networks, since even the simplest choices of inductive prior yield a dense class of functions. Now, "symmetry, as wide or narrow as you may define its meaning, is one idea by which man through the ages has tried to comprehend and create order, beauty and perfection", and that was a quote from Hermann Weyl, a German mathematician who was born in the 19th century. Since the early days, researchers have adapted neural networks to exploit the low-dimensional geometry arising from physical measurements, for example grids in images, sequences in time series, or position and momentum in molecules, and their associated symmetries, such as translation or rotation. Now folks, this is an epic special edition of MLST. We've been working on this since May of this year, so please use the table of contents on YouTube if you want to skip around; the show is about three and a half hours long. The second half of the show, roughly speaking, is a traditional-style MLST episode, but the beginning part is a bit of an experiment for us, maybe a bit of a departure: we want to make some Netflix-style content, and we've even been filming on location with our guests, so I hope you enjoy the show, and let us know what you think in the YouTube comments. Many people intuit that there are some deep theoretical links between some of the recent deep learning model architectures, particularly the ones on sets actually, and this may be why so many popular architectures keep getting reinvented. Now, the other day Fabian Fuchs from Oxford University released a really cool blog post about deep learning on sets, elucidating a math-heavy paper that he co-authored with Edward Wagstaff. He wanted to understand why so many neural network architectures for sets resemble either Deep Sets or self-attention. Because sets come in any ordering, there are many opportunities to design inductive priors to capture the symmetries, so that raises the question: how do we design deep learning algorithms that are invariant to semantically equivalent transformations while maintaining maximum expressivity? Fabian pointed out that the so-called Janossy pooling framework gives a satisfying explanation. Janossy pooling is when you generate all of the k-tuples of a set and average your target function over those orderings. It gives you a computationally tractable way of achieving permutation invariance: rather than computing n factorial orderings of the examples, you compute n factorial divided by (n minus k) factorial, which for small k is very tractable. Now clearly, setting k to n gives you the most expressive yet the most expensive model, but that would be cool: it would model the high-order interactions between the examples. But it turns out that Deep Sets are this configuration with k equals one, and self-attention is this configuration with k equals two. Fabian also spoke about approximate permutation invariance, which is when you set k to n, the number of examples, but you sample the permutations; it turns out you don't have to sample very many of them to get good results. But anyway, if you want to check that out in a little bit more detail, go and check out Fabian's blog; I've put a link in the video description.
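A minimal numpy sketch of the Janossy pooling idea described above (an illustration, not Fabian's code; the names janossy_pool, phi and psi are made up for this example): averaging a k-ary function over all ordered k-tuples is permutation invariant by construction, with k equals one giving a Deep Sets flavour and k equals two a pairwise, self-attention-like flavour.

```python
import itertools
import numpy as np

def janossy_pool(X, f, k):
    """Average a k-ary function f over all ordered k-tuples of rows of X.

    X is an (n, d) array of set elements; f maps a (k, d) array to a vector.
    Averaging over all n!/(n-k)! ordered k-tuples makes the result
    permutation invariant by construction (a toy, deliberately inefficient
    enumeration rather than a practical implementation).
    """
    tuples = itertools.permutations(range(len(X)), k)
    outs = [f(X[list(idx)]) for idx in tuples]
    return np.mean(outs, axis=0)

# k = 1: a per-element function, averaged -- the Deep Sets configuration.
phi = lambda t: np.tanh(t[0])
# k = 2: a pairwise function -- loosely the flavour of self-attention.
psi = lambda t: np.tanh(t[0]) * np.tanh(t[1])

X = np.random.randn(5, 3)
P = np.random.permutation(5)
# Same output regardless of how the set elements are ordered.
assert np.allclose(janossy_pool(X, phi, 1), janossy_pool(X[P], phi, 1))
assert np.allclose(janossy_pool(X, psi, 2), janossy_pool(X[P], psi, 2))
```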
High-dimensional learning is impossible due to the curse of dimensionality; it only works if we make some very strong assumptions about the regularities of the space of functions that we need to search through. The classical assumptions that we make in machine learning are no longer relevant. In general, learning in high dimensions is intractable: the number of samples grows exponentially with the number of dimensions. The universal function approximation theorem, popularized in the 1990s, states that for the class of shallow neural network functions you can approximate any continuous function to arbitrary precision just by stacking the neurons, assuming that you had enough of them; so it's a bit like sparse coding, if you like. The curse of dimensionality refers to the various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings, such as the three-dimensional physical space of everyday experience. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data effectively becomes sparse. This sparsity is problematic for any method that requires statistical significance: in order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Most of the information in data has regularities. What this means in plain English is that, just like in a kaleidoscope, most of the information which has been generated by the physical world is actually redundant, just many repeated, semantically equivalent replicas of the same thing; the world is full of simulacra. Machine learning algorithms need to encode the appropriate notion of regularity to cut down the search space of possible functions. You might have heard this idea referred to as an inductive bias. Now, machine learning is about trading off three sources of error: statistical error, from approximating the expectations on a finite sample, and this grows as you increase your hypothesis space; approximation error, which is how good your model is in that hypothesis space, so if your function space is too small then the one that you find will incur a lot of approximation error; and finally optimization error, which is the ability to find a global optimum. Now, even if we make strong assumptions about our hypothesis space, say that it should be Lipschitz, or in plain English that it should be locally smooth, it's still way too large; we need a way to search through the space to get anywhere. So the statistical error is cursed by the dimensionality. If we make the hypothesis space, or the function space, really small, then the search space is smaller, but the approximation error is cursed by dimensionality. So we need to define better function spaces to search through. But how? We need to move towards a new class of function spaces, which is to say geometrically inspired function spaces. Let's exploit the underlying low-dimensional structure of the high-dimensional input space: the geometric domain can give us entirely new notions of regularity which we can exploit. Now, using geometrical priors, which is to say only allowing equivariant functions, or ones which respect a particular geometrical principle, will reduce the space of possible functions that we search through, which means less risk of statistical error and less risk of overfitting.
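Written out, the three-way trade-off just described is usually summarized with a decomposition along these lines (standard textbook notation rather than a quote from the proto-book): f-hat is the function actually returned by training, f-tilde is the best empirical fit within the hypothesis class F, f-star is the true target, and R is the true risk.

```latex
\underbrace{R(\hat{f}) - R(f^{*})}_{\text{total excess risk}}
 \;=\;
\underbrace{R(\hat{f}) - R(\tilde{f})}_{\text{optimization error}}
 \;+\;
\underbrace{R(\tilde{f}) - \inf_{f \in \mathcal{F}} R(f)}_{\text{statistical (estimation) error}}
 \;+\;
\underbrace{\inf_{f \in \mathcal{F}} R(f) - R(f^{*})}_{\text{approximation error}}
```

Shrinking the hypothesis space F, for example by imposing a geometric prior, tames the statistical term at the risk of inflating the approximation term; the claim in what follows is that a correct geometric prior shrinks F without moving the approximation term.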
We should be able to do this without increasing approximation error, because we know for sure that the true function has a certain geometrical property, which we bake into the model. So, introducing the geometric deep learning proto-book. Recently Professor Michael Bronstein, Professor Joan Bruna, Dr Taco Cohen and Dr Petar Veličković released an epic proto-book called Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. These researchers are elegantly linking classical theory in geometry and group theory to machine learning and deep learning, which is fascinating. Now, the proto-book is beautifully written; it's presented so well, it even has some helpful margin notes, and honestly I could read sections of it out loud on MLST with scant need to change a single word, it's that well written. I mean, there's a lot of maths in there as well, let's be honest, we can't dodge that, but I often try to impress upon people that if your writing sounds weird when you say it out loud then you're probably writing it the wrong way, and these guys have written it really well. Now, they've essentially created an abstraction, or a blueprint as they call it, which prototypically describes all of the deep learning architectures with the geometrical priors they have described so far. They don't prescribe a specific architecture but rather a series of necessary conditions. The book provides a mathematical framework to study this field, and it's essentially a mindset on how to build new architectures: it gives constructive procedures to incorporate prior physical knowledge into neural architectures, and it provides a principled way to build future architectures which have not yet been invented. The researchers have also recently released a series of 12 brilliant lectures on all of the material in the book, and I've linked these in the video description. Now, what are the core domains of geometric deep learning? In geometric deep learning, the data lives on a domain. This domain is a set; it might have additional structure, like a neighborhood in a graph, or it might have a metric, such as what's the distance between two points in the set. But most of the time the data isn't the domain itself; it's a representation, or a signal, which lives in a Hilbert space. Let's talk about symmetries. Symmetries are really important for understanding this framework. A symmetry of an object is simply a transformation of that object which leaves it unchanged. Now, there are many different types of symmetries in deep learning. For example, there are symmetries of the weights: if you take two neurons in a neural network and you swap them, the neural network is still graph-isomorphic. There are symmetries of the label function, which means that an image is still a dog even if you apply a rotation transformation to it. Actually, if we knew all of the symmetries of a certain class, we would only need one labeled example, because we would recognize any other examples that you give it as semantically equivalent transformations. But we can't do that, because the learning problem is difficult, which means we don't actually know all of the symmetries in advance. Now, in the context of geometric deep learning, we talk about symmetries of the core structured geometric domains that we're interested in, so grids or graphs for example. A symmetry is any transformation which preserves the structure of the geometric domain that the signal lives on: for example, permutations of a set preserve set membership, and Euclidean transformations like rotations or reflections preserve distances and angles.
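Two toy numpy checks of the claim that a symmetry is a transformation preserving the domain's structure (purely illustrative; the rotation angle and the small graph below are arbitrary choices): a planar rotation leaves pairwise distances unchanged, and relabelling the nodes of a graph leaves its edge structure unchanged.

```python
import numpy as np

# 1) A Euclidean symmetry: a planar rotation preserves distances between points.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
pts = np.random.randn(4, 2)
assert np.isclose(np.linalg.norm(pts[0] - pts[1]),
                  np.linalg.norm(R @ pts[0] - R @ pts[1]))

# 2) A graph symmetry: permuting the node labels preserves the edge structure.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])            # path graph 0-1-2
perm = np.array([2, 0, 1])            # an arbitrary relabelling of the nodes
P = np.eye(3)[perm]                   # the corresponding permutation matrix
A_relabelled = P @ A @ P.T            # adjacency of the relabelled graph
# Same edges, just listed under the new names, and the same degree sequence.
assert np.array_equal(A_relabelled, A[np.ix_(perm, perm)])
assert sorted(A.sum(axis=0)) == sorted(A_relabelled.sum(axis=0))
```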
There are a few rules to remember, because the way we deal with this paradigm is to talk about how those symmetries compose: the identity transformation is always a symmetry, the composition of two symmetry transformations is always a symmetry, and the inverse of a symmetry is always a symmetry. We can formulate this with the mathematically abstract notion of a group. Group theory in mathematics is fascinating because it concerns itself only with how elements compose with each other, not what they actually are, so different kinds of objects may have the same symmetry group: for example, the group of rotational and reflection symmetries of a triangle is the same as the group of permutations of sequences of three elements. So let's talk about the blueprint itself. The blueprint has three core principles: symmetry, scale separation and geometric stability. In machine learning, multi-scale representations and local invariants are the fundamental mathematical principles underpinning the efficiency of convolutional neural networks and graph neural networks; they are typically implemented in the form of local pooling, in some sense. Now, these principles give us a very general blueprint of geometric deep learning that can be recognized in the majority of popular deep neural network architectures. A typical design consists of a sequence of locally equivariant layers (think of the convolution layers in a CNN), then a pooling or coarsening layer (you recognize those in CNNs as well), and finally a globally invariant pooling layer, which might be your classification head. These building blocks provide a rich approximation space with prescribed invariance and stability properties when combined together into a scheme that these researchers refer to as the geometric deep learning blueprint. The researchers have also introduced the concept of geometric stability, which extends the notion of group invariance and equivariance to approximate symmetries, or transformations around the group; they quantify this, in some sense, by looking at a metric space between the transformations themselves. This is Professor Michael Bronstein. The problem is that traditional machine learning techniques work well with images or audio, but they are not designed to deal with network-structured data. In order to address this challenge we've developed a new framework that we call geometric deep learning. It allowed us to learn the network effects of clinically approved drugs and to predict anti-cancer drug-like properties of other molecules, for example molecules contained in food. Neural networks have exploded, leading to several success stories in industrial applications, and I think it's quite indicative that last year two major biological journals featured geometric deep learning papers on their cover, which means that it has already become mainstream and will possibly lead to new exciting results in the fundamental sciences. The book I'm holding is called The Road to Reality. It's written by a British mathematician and recent Nobel laureate, Roger Penrose, a professor at Oxford, and it's really probably one of the most complete attempts to write down and describe modern physics and its mathematical underpinnings. You can see it's very heavy, but if I were to compress the thousand-plus pages of this book into just a single concept, I could capture it in one word, and this is symmetry. And symmetry is really a fundamental concept and
fundamental idea that underpins all modern physics as we know it. So, for example, the Standard Model of particle physics can be entirely derived from considerations of symmetry, and that's the kind of idea that we try to use in deep learning: to derive and create new neural network architectures entirely from fundamental concepts and fundamental principles of symmetry. In the past decade, deep learning has brought a revolution in data science and made possible many tasks previously thought to be beyond reach. On the other hand, we now have a zoo of different neural network architectures for different types of data, but few unifying principles. The authors also point out that different geometric deep learning methods differ in their choice of domain, or symmetry group, or the implementation-specific details of those building blocks that we spoke about, but many of the deep learning architectures currently in use fall into this scheme and can thus be derived from common geometrical principles. As a consequence, it is difficult to understand the relations between different methods, which inevitably leads to the reinvention and rebranding of the same concepts. So we need some form of geometric unification, in the spirit of the Erlangen program, that I call geometric deep learning. It serves two purposes: first, to provide a common mathematical framework to derive the most successful neural network architectures, and second, to give a constructive procedure to build future architectures in a principled way. This is a very general design that can be applied to different types of geometric structures, such as grids, homogeneous spaces with global transformation groups, graphs, and manifolds, where we have global isometry invariance as well as local gauge symmetries. We call these the 5G of geometric deep learning. The implementation of these principles leads to some of the most popular architectures that exist today in deep learning, such as convolutional networks emerging from translational symmetry, graph neural networks, Deep Sets and Transformers implementing permutation invariance, and intrinsic mesh CNNs, used in computer graphics and vision, that can be derived from gauge symmetries. People are quite cynical about the interpolative nature of deep learning, and I think that finding this deeper structure could allow us to extrapolate in a way which is significantly better than we can now. And I asked whether he thought deep learning could get us all the way to artificial general intelligence. It's a hard question, because it has several terms that are not well defined. How do you define intelligence? We don't understand what human intelligence is; everybody probably gives a different meaning to this term, so it's hard for me even to define and quantify artificial intelligence. I don't think that we necessarily need to emulate human intelligence, and as you mentioned, in the past we thought of artificial intelligence as being able to solve certain tasks, and it's a kind of moving target: we thought of playing intelligent games, or perception of the visual world like computer vision, or understanding and translating language, or even creativity, and today we have machine learning systems that are able to address, at least to some extent, all of these tasks, sometimes even better than humans. And are we there yet at artificial intelligence? I don't think so, and probably artificial intelligence will look different from human intelligence; it doesn't need to look like human intelligence. It's
of course an interesting scientific question whether we can reproduce a human in silico, but for solving practical problems that will make this technology useful for humanity, for humankind, we probably need something different. It will certainly involve a certain level of abstraction that we currently don't have; it will probably require methods that we currently don't have; but it doesn't necessarily need to look like a recreation of a human. This is Dr Petar Veličković. By now you'll have probably seen or heard something about our recently released proto-book on geometric deep learning on grids, graphs, groups, geodesics and gauges, or as we like to call them, the 5Gs of geometric deep learning, which I've co-authored alongside Michael Bronstein, Joan Bruna and Taco Cohen. And you might be wondering what all the fuss is about, because there are already a lot of really high-quality synthesis textbooks on the field of deep learning in general, and also on some sub-areas of geometric deep learning such as graph neural networks, where Will Hamilton recently released a super-high-quality textbook on that area. This is Dr Taco Cohen. So what we've been trying to do in our book project is to show that this geometric deep learning mindset is not just useful when tackling a new problem, but actually allows you to derive, from the first principles of symmetry and scale separation, many of the architectures and architectural primitives, like convolution, attention, graph convolution and so forth, that have become popular over the last few years, even in cases where these considerations of equivariance and scale separation were not front and center when the methods were first discovered. Now, we think that this is useful for a number of reasons. First of all, it might help to avoid reinventing the same ideas over and over, and this can easily happen when the number of papers that come out every day is far larger than what any one person can possibly read, and when different subfields use different language to describe their ideas. Furthermore, it might help to clarify when a particular method is useful: a geometric deep learning method is useful when the problem domain has the particular symmetries that are built into the architecture. And finally, we hope that by making explicit the commonalities between seemingly different methods, it will become easier for newcomers to learn geometric deep learning. Ideally, one would not have to go through the large number of architectures that have been designed, but just learn the general ideas of groups, equivariance, group representations, feature spaces and so on, and then see, for the particular instances you're interested in, how they fit into the general pattern. To really illustrate why we think that such a synthesis is important and relevant for deep learning research going forward, we have to go way back, to the time of Euclid, around 300 BC. As you might know, Euclid is the founding father of Euclidean geometry, which for many, many years was the only way to do geometry. It relied on a certain set of postulates that Euclid had, which governed all the laws of the geometry that he proposed. All of this started to drastically change around the 1800s, when several mathematicians, in an effort to prove that Euclid's geometry is the geometry to be following, ended up assuming that one of the postulates is false and, failing to derive a contradiction, actually ended up deriving a completely new set of self-consistent geometries, all with their
own set of laws and rules, and also quite differing terminologies. Among the popular variants are the hyperbolic geometry of Lobachevsky and Bolyai and the elliptic geometry of Riemann, and for a very long time, because all of these geometries had completely different sets of rules and they were all self-consistent, people were generally wondering what is the one true geometry that we should be studying. A solution to this problem came several decades later through the work of a young German mathematician by the name of Felix Klein, who had just been appointed to a professorship at the small Bavarian university of Erlangen, the so-called Friedrich-Alexander University of Erlangen and Nuremberg. While he was at this post, he proposed a direction that would eventually enable us to unify all of the geometries that were in existence at the time through the lens of invariances and symmetry, using the language of group theory, and his work is now eponymously known as the Erlangen program. There is no way to overstate how important an effect the Erlangen program had on mathematics and beyond. Because it provided a unifying lens for studying geometry, suddenly people didn't need to hunt for the one true geometry; they had a blueprint they could use to derive whatever geometry was necessary for the problem they were solving. And besides just mathematics, it had amazing spillover effects on other very important fields of science. For example, in physics, the Erlangen program spilled over through the work of Emmy Noether, who demonstrated that all of the conservation laws in physics, which previously had to be validated through extensive experimental evaluation, could be completely derived from the principles of symmetry. Needless to say, this is a very fundamental and game-changing result in physics, which also allowed us to classify some elementary particles in what is now known as the Standard Model. Thinking back towards theoretical computer science, the Erlangen program also had a spillover effect into category theory, which is one of the most abstractified areas of theoretical computer science, with a lot of potential for unifying various directions in mathematics; and actually, in the words of the founders of category theory, the whole field of category theory can be seen as an extension of Felix Klein's Erlangen program. So the Erlangen program demonstrated how it's possible to take a small set of guiding principles of invariance and symmetry and use it to unify something as broad as geometry. I like to think of geometric deep learning not as a single method or architecture but as a mindset: it's a way of looking at machine learning problems from the first principles of symmetry and invariance. Symmetry is a key idea that underpins our physical world and the data that is created by physical processes, and accounting for this structure allows us to beat the curse of dimensionality in machine learning problems. It is really a very powerful principle and a very generic blueprint, and we find its instances in some of today's most popular deep learning architectures, whether it's convolutional neural networks, graph neural networks, Transformers, LSTMs and many more. It is also a way to design new machine learning architectures that are yet to be invented, maybe in the future not based on backpropagation, and to incorporate inductive bias in a principled way. Being a professor and a teacher, I would also like to emphasize the pedagogical dimension of this geometric unification. What I
often see when deep learning is taught is that it appears as a bunch of hacks with weak or no justification, and I think this is best illustrated by how, for example, the concept of convolution is explained: it is often given as a formula, just out of the blue, maybe with a bit of hand-waving. But what we try to show is that you can derive convolution from first principles, in this particular case from translational symmetry, and I think the difference in this approach is best captured by what Helvétius once said: that the knowledge of principles easily compensates for the lack of knowledge of facts. Professor Bronstein has been a professor at Imperial College London for the last three years and received his PhD with distinction from the Technion, the Israel Institute of Technology, in 2007. He has held visiting academic positions at MIT, Harvard and Stanford, and his work has been cited over 21,000 times. His main expertise is in theoretical and computational geometric methods for machine learning and data science, and his research encompasses a broad spectrum of applications ranging from computer vision and pattern recognition to geometry processing, computer graphics and biomedicine. Professor Bronstein coined and popularized the term geometric deep learning. His startup company Fabula AI, which was acquired by Twitter in 2019, was one of the first applications of graph ML to the problem of misinformation detection. I think it's no exaggeration to say that Professor Bronstein is the world's most recognizable expert in graph representation learning research. We are really in probably one of the nicest locations in London, which is Kensington, if you're familiar: we have the Natural History Museum, the Science Museum and the Victoria and Albert Museum, and Imperial College is right here, so I think it's as central as you can get, and if you walk all the way there then you have Hyde Park, which is probably one of the nicest parks in London. I'm a professor in the Department of Computing at Imperial College London and Head of Graph Learning Research at Twitter. I work on geometric deep learning, in particular on graph neural networks and their applications, from computer vision and graphics to computational biology and drug design. Dr Petar Veličković is a senior research scientist at DeepMind in London, and he obtained his PhD from Trinity College, Cambridge. His research has been focused on geometric deep learning, and in particular on devising neural network architectures for graph representation learning and its applications in algorithmic reasoning and computational biology. Petar's work has been published in the leading machine learning venues. Petar was the first author of Graph Attention Networks, a popular convolutional layer for graphs, and Deep Graph Infomax, a scalable unsupervised learning pipeline for graphs. Hi everyone, my name is Petar Veličković and I'm a senior research scientist at DeepMind. Previously I did my PhD in computer science at the University of Cambridge, where I'm actually still based, and today we're actually here in Cambridge filming these shots, and it is my great pleasure to be talking to you today about our work on geometric deep learning and related topics. I first got into computer science through competitive programming contests and classical algorithms, the likes of which you might find in a traditional theoretical computer science textbook, and this was primarily influenced by the way schooling worked for gifted students back in my hometown of Belgrade
in Serbia, where students were generally encouraged to take part in these theoretical contests and try to write programs that finish as fast as possible, or work as efficiently as possible, over a certain set of carefully contrived problems. All of this changed when I actually started my computer science degree here at Cambridge, where I was suddenly exposed to a much wider wealth of computer science topics than just theoretical computer science and algorithms, and for a brief moment my interests drifted elsewhere. Everything started to come back together when I started my final-year project with Professor Pietro Liò at Cambridge. I had heard that bioinformatics is a topic that's brimming with classical algorithms, and competitive programming algorithms specifically, so I thought a project in this area would be a great way to bring these two closer together. Unfortunately that was not to be, as my mentor very quickly drew me into machine learning, and that spiraled out into my PhD topics, where I was for a brief moment focused on computational biology before eventually drifting to graph representation learning and eventually geometric deep learning. My journey into geometric deep learning actually started through investigating graph representation learning, which I think for a very long time has been seen as almost synonymous with geometric deep learning, because almost everything you come up with in the area of geometric deep learning can be, if you squint hard enough, seen as a special case of graph representation learning. What originally brought me into this was an internship at Montreal's artificial intelligence institute, Mila, where I worked alongside Yoshua Bengio and Adriana Romero, initially on methodologies for processing data that lives on meshes of the human brain. We found that the existing proposals for processing data over such a mesh, both graph neural networks and otherwise, were not the most adequate for the kind of data processing that we needed to do, and we needed something that would be aligned more with image convolutions in spirit, in a way that allows us to give different influences to different neighbors in the mesh. This led us to propose graph attention networks, which was a paper that we published at ICLR 2018; it was actually my first top-tier conference publication, and what I'm probably most well known for nowadays. The field of graph representation learning has since spiraled completely out of control in terms of the quantity of papers being proposed: only one year after the graph attention network paper came out, I was reviewing for some of the conferences in the area and I found on my reviewing stack four or five papers that were extending graph attention networks in one way or another. So the field has certainly become a lot more vibrant, partly because the barrier to entry is not too high. Recently Petar has been doing some really interesting research in algorithmic reasoning. Part of the skill of a software engineer lies in choosing which algorithm to use; only rarely will an entirely novel algorithm be warranted. The key guarantee which traditional symbolic algorithms give us is generalization to new situations. Traditional algorithms and the predictions given by deep learning models have very different properties: the former provide strong guarantees but are inflexible to the problem being tackled, while the latter provide few guarantees but can adapt to a wide range of problems. Now, Petar in his paper proposed a neural architecture which can take in natural inputs but output a graph of abstract outputs, as well as natural outputs. Petar believes that neural algorithmic reasoning will allow us to apply classical algorithms to inputs that they were never originally designed for.
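To make the pipeline being described a little more concrete, here is a deliberately crude sketch of an encode-process-decode style setup, which is how I read the idea (the shapes, function names and random linear maps below are placeholders standing in for learned networks, not Petar's actual architecture): an encoder maps natural inputs into an abstract space, a processor imitates the iterative steps of a classical algorithm there, and a decoder reads the answer back out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for learned networks (random linear maps, purely illustrative).
W_enc = rng.standard_normal((16, 8))    # natural input  -> abstract space
W_proc = rng.standard_normal((16, 16))  # one "algorithm-imitating" step
W_dec = rng.standard_normal((4, 16))    # abstract space -> natural output

def encode(x_natural):
    # Map raw observations into the abstract space the processor expects
    # (e.g. node features for a graph algorithm).
    return np.tanh(W_enc @ x_natural)

def process(h, steps=3):
    # The processor imitates the iterative computation of a classical
    # algorithm (e.g. repeated relaxations in a shortest-path routine).
    for _ in range(steps):
        h = np.tanh(W_proc @ h)
    return h

def decode(h):
    # Read the answer back out in the natural output space.
    return W_dec @ h

x = rng.standard_normal(8)          # a "natural" input the algorithm never saw
y = decode(process(encode(x)))      # end-to-end sketch of the pipeline
print(y.shape)                      # (4,)
```

The point of the split is that the processor can be trained on abundant synthetic instances of the algorithm and then reused behind encoders for raw inputs the original algorithm was never designed to consume.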
I am studying algorithmic reasoning, which is a novel area of representation learning that seeks to find neural networks that are as good as possible at imitating the computations of exactly the kind of classical algorithms that initially brought me to computer science. It turns out that this area is remarkably rich and could have remarkably big implications for machine learning in general, because it could bring the best of the algorithmic domain into the domain of neural networks, and if you look at the pros and cons of the two you'll find that they're very complementary, so the fusion of the two can really bring the kinds of benefits we haven't seen before. So I am very pleased to say that I'm among those researchers who are extremely proud of and happy with what they do, because it brings together some of my earliest passions in computer science with the latest trends in machine learning, especially geometric deep learning, which we recently released a proto-book about with Joan, Michael and Taco. We spoke to Christian Szegedy, and he's doing some interesting work with algorithmic reasoning, creating abstract syntax trees to represent mathematical theorems, for example, and he believes that in that representation space (he projects it all into Euclidean space) there are some interesting interpretative points. But again, surely there must be some deeper structure which analogizes mathematics, which would allow us to extrapolate and discover new, interesting mathematics; it just feels that what we're missing is the right kind of structure. I think for mathematics it's relatively easy to formalize it, because we can write logic rules and basically we can build mathematics axiomatically from very basic principles. These methods are already being used for computer proofs of certain theorems. I think it's not well regarded by the purists in pure mathematics, but probably they will need to accept it, and maybe there will be a Fields Medal given for a proof that is done by a computer. I think even recently there have been some important breakthrough results, proofs that were given by a computer, so it is probably just the beginning of a new way of doing science. Even as pure and creative a science as mathematics, which was considered really the hallmark of human intelligence, can maybe be, if not replaced, then assisted by artificial intelligence. Petar invokes Daniel Kahneman's System 1 and System 2.
He thinks that we need something like System 2 to achieve the kind of reasoning and generalization which currently eludes deep learning models. What I'm holding in my hands right now is the international bestseller Thinking, Fast and Slow by the famous Nobel Prize winner Daniel Kahneman. This book can be seen as one of the main theses behind my ongoing work in algorithmic reasoning and what it stands for, because it argues that fundamentally we as humans employ two different systems that operate at different rates: System 1, which primarily deals with perceptive tasks, and System 2, which deals with longer-range reasoning tasks. And it is my belief that where our research in neural networks has currently taken us is to get really, really good at automating away System 1, so being able to perform perceptive tasks from large quantities of observed data, probably in a not too dissimilar manner from the way humans do it. What I feel is really missing from these architectures nowadays is the System 2 aspect: being able to actually take these percepts that we've acquired from the environment and properly do rigorous reasoning over them, in a manner that will stay consistent even if we drastically change the number of objects, slightly perturb the laws of physics, or something like that. In my opinion, algorithmic reasoning, the art of capturing these kinds of reasoning computations inside a neural network that was trained specifically for that purpose, and then slotting that neural network into a different architecture that works with raw percepts, is one potentially very promising blueprint that will take the space of classical algorithms that we have been building in this System 2 space and carry them over to raw perceptive inputs, which these algorithms were very rarely designed to work over. This is Dr Taco Cohen. Hello, I'm Taco Cohen, I'm a researcher at Qualcomm AI Research, and I work on geometric deep learning, equivariant networks, and more recently also on causal inference and causal representation learning. I've been interested for a number of years already, since about 2013, in the application of ideas around symmetry, invariance, equivariance, and the underlying mathematics of group theory and group representation theory, to machine learning and deep learning specifically. So it's been quite exciting to see over the last few years the blossoming of this field that we now call geometric deep learning: many new methods, such as various kinds of equivariant convolutions for different spaces, different groups of symmetries, different geometric feature types, equivariant transformers and attention mechanisms, point cloud networks, graph neural networks and so forth, and along with these new methods also a large number of applications that have been tackled, anything from medical imaging to the analysis of global weather and climate data to the analysis of DNA sequences, proteins and other kinds of molecules. So if you apply this mindset, the first question you ask when faced with a new problem is: what are the symmetries? What are the transformations that I can apply to my data that may change the numerical representation of my data as stored in my computer, but that nevertheless don't change the underlying object we're interested in, whether that's reordering the nodes of a graph, rotating a molecule in 3D, or many other kinds of symmetries? Once you know the group of symmetries, you can then develop a neural network that's equivariant to those symmetries, and hopefully is a universal
approximator among equivariant functions. What we've found, and what many others have found time and again, is that if you build this symmetry prior into your network, if you make your network equivariant, it is bound to be much more data-efficient and to generalize much better. Hey Tim, great to be here. So, yeah, the blueprint. What we realized as we were writing this book is that a lot of different architectures can be understood essentially as special cases of one generic structure that we call the geometric deep learning blueprint. To explain a little bit about what this is all about: the blueprint refers first of all to feature spaces, so I'll explain how we model those, and then it refers to maps between feature spaces, or layers of the network, and these have to somehow respect the structure of the feature spaces. In all cases, whether it's a graph neural network, a network for processing images on the plane, or a network for processing signals on the sphere, like global weather data, we're dealing with data that lives on a domain. The domain in the examples I just gave would be the set of nodes of the graph, or perhaps also the set of edges, the points on the plane, or the points on the sphere, and of course you can think of many, many other examples as well. This is what we call the domain; we write it as omega. It's typically a set, first of all, and it may have some additional structure: in the case of the sphere it has some interesting topology, and typically we also want to think about the geometry, about distances and angles. So it's a set with some kind of structure, and in addition it has some symmetries, meaning there are some transformations we can do to this set that will preserve the structure that we think is important. If we think distances are important on, let's say, the sphere, then the kinds of symmetries we end up with are rotations, three-dimensional rotations of the sphere, and perhaps also reflections. In the case of a graph, the symmetries would be permutations of the nodes, and also a corresponding permutation of the edges: you just change the order of the nodes, and if nodes one and two were connected and you apply a permutation that moves those to nodes three and four, then now three and four have to be connected. So that's a symmetry of our space. Now, the data, and this is a very important point, the data are typically not points in this space. When we're classifying images, our space is the plane, but the data are not points in the plane, they're not two-dimensional vectors; the data is really a signal on the plane, a two-dimensional image, which you can think of as a function: for each point in the plane we have a pixel value. In more general cases it might be something called a field: you might, say, have wind direction on Earth; that's a signal, a vector field on the sphere. Now, the symmetries that we talked about act on this space, and you can show how they automatically also act on the space of signals. So those are the key ingredients to define what a feature space is: you have your space omega, which is a set of nodes, a sphere, whatever; then you have a group of symmetries of that space; and then you have the space of signals, or feature maps, on this space, and you have the group acting on your signals, so you can rotate a vector field, or you can shift a planar image, and so on.
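A tiny numpy sketch of those ingredients for the simplest case of a signal on a cyclic one-dimensional grid (my own toy example, not taken from the book): the translation group acting on the signal via a cyclic shift, and, anticipating the next point about layers, a check that a small convolution-style filter commutes with that action.

```python
import numpy as np

x = np.random.randn(32)              # a signal on a cyclic 1-D grid (the domain omega)
w = np.array([0.25, 0.5, 0.25])      # a small local filter

def shift(s, t):
    # The translation group acting on the signal: move it t steps around the grid.
    return np.roll(s, t)

def conv(s, filt):
    # A circular convolution-style layer: a weighted sum of the signal and its
    # shifts, so it is built entirely out of operations that commute with shifting.
    return filt[0] * np.roll(s, 1) + filt[1] * s + filt[2] * np.roll(s, -1)

# Equivariance: shifting then filtering equals filtering then shifting.
assert np.allclose(conv(shift(x, 5), w), shift(conv(x, w), 5))
```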
So those are the key ingredients to define a feature space, and then we can talk about layers. The layers, or maps, of the network have to respect this structure: if we have a signal on the sphere, and we believe that rotating it doesn't change it in any essential way, or we permute the nodes of a graph but it's still the same graph, then we want the layers of the network to respect that. And to respect this structure means to be equivariant to the symmetries: if we apply our transformation to our signal and then apply our network layer, it should be the same as applying the network layer to the original input and then applying the transformation in the output space. Now, how the transformation acts in the output space could be different from the input space: for example, you could have a vector field as input and a scalar field as output, and they transform differently. That's why for each layer of the network you're going to have a different feature space, with a different action of the group, but it's the same group acting in each feature space, and the maps should be equivariant. These maps include both the linear maps, which are typically the learnable layers, and the nonlinearities. Now, for linear maps you can study all sorts of interesting questions: you can ask what is the most general kind of equivariant linear map, and it turns out that for a large class of linear group actions the most general equivariant linear maps are generalized forms of convolutions. So that's really an explanation for why convolutions are so ubiquitous. The final ingredient that I think is key to the success of many architectures is some kind of pooling or coarsening operation. The structures we've talked about so far are this global space and the symmetries on it, but typically there's also a notion of distance, or locality, in our space, and if we just enforce that our layers have to respect the symmetries, well, that would force us to use a convolution in many cases, as I just mentioned, but it would not force us to use local filters, and we all know that using global filters is not going to be a very effective use of parameters. So locality is the other key thing: locality of the filters, and also locality as exploited via some kind of pooling or coarsening operation.
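Putting those ingredients together, here is a deliberately minimal caricature of the blueprint on the same cyclic-grid toy domain as before (a sketch of the pattern, with made-up weights, not any architecture from the book): local shift-equivariant layers, a local coarsening step, and a globally invariant readout.

```python
import numpy as np

def equivariant_layer(x, w):
    # Local, shift-equivariant layer: a small circular filter plus a nonlinearity.
    return np.tanh(w[0] * np.roll(x, 1) + w[1] * x + w[2] * np.roll(x, -1))

def coarsen(x):
    # Local pooling / coarsening: average neighbouring pairs (scale separation).
    return x.reshape(-1, 2).mean(axis=1)

def invariant_readout(x):
    # Globally invariant pooling: the final answer ignores where things are.
    return x.mean()

def gdl_blueprint(x, w1, w2):
    x = equivariant_layer(x, w1)
    x = coarsen(x)
    x = equivariant_layer(x, w2)
    return invariant_readout(x)

x = np.random.randn(16)
w1, w2 = np.random.randn(3), np.random.randn(3)
# Shifting the input by a multiple of the pooling window leaves the prediction unchanged.
assert np.isclose(gdl_blueprint(x, w1, w2), gdl_blueprint(np.roll(x, 2), w1, w2))
```

Note that the final check shifts the input by the pooling window (two samples): the coarsening step is what trades exact fine-scale equivariance for a multi-scale summary, which is the scale-separation part of the story.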
Now, going forward to, say, the year 2020: deep learning is all the craze right now, and so many different deep learning architectures are being proposed, convolutional neural networks, graph neural networks, LSTMs and so on. When these architectures are being proposed, different terminologies are used, because people tend to come from different areas, and they are usually followed by bombastic statements such as: everything can be seen as a special case of a convolutional neural network; Transformers use self-attention, and attention is all you need; graph neural networks work on graphs, and everything can be represented as a graph; and LSTMs are Turing complete, so why would you ever need anything else? So I hope this illustrates how the field of deep learning in the year 2020 is really not all that different from the state of geometry in the 1800s, and if history has taught us anything about how we can unify such fields together, now is the right time for us to look back, study the geometric principles underlying the architectures that we use, and as a result we might just derive a blueprint that will allow us to reason about all the architectures we have had in the past, but also any architectures that we might come up with in the future. In my opinion, that is the key selling point of our recently released proto-book, and I hope that it is helpful in guiding deep learning research going forward. I should highlight that I was also extremely honored to deliver the first version of the talk presenting our proto-book at the Friedrich-Alexander University of Erlangen, in exactly the same place where the Erlangen program was originally brought up, albeit, because of the existing covid restrictions, I had to do so in a virtual manner. So I think a lot of people understand convolutions in the way of a traditional planar CNN, and the mathematical notion of a convolution, which is very closely related to, you know, a Fourier transform for example; but graph convolutional networks seem to abstract the notion of a convolution into some kind of concept of neighborhood and local connectivity, and some of your work as well, with equivariant convolutional neural networks, I think does the same thing. So when we talk about convolution, are we talking about a very abstract notion of it? Yeah, that's a great question. I think there are so many different ways to get at convolution. You can think of it in terms of sliding a filter over some space: you put it in a canonical position and then you slide it around. That is an idea that you can generalize to not just sliding over a plane, but applying some transformation from a group to your filter. So let's say you have a filter on the sphere; the sphere has a symmetry group, namely the three-dimensional rotation group SO(3), and you can apply an element of SO(3), a three-dimensional rotation, to your filter and in that sense slide it over the sphere. That leads to something called group convolution, so that's one way to think about it. There is indeed, as you mentioned, the spectral way of looking at it. There's the famous Fourier convolution theorem, which says that in the spectrum a convolution is just a pointwise product, so one way to implement a convolution would be to take a Fourier transform of your signal, your feature map, and a Fourier transform of your filter, multiply them pointwise, and then inverse Fourier transform to get the result. And this perspective also generalizes: it generalizes to graphs, where the Fourier transform, or something analogous to it, can be obtained via graph Laplacians, as well as some other ideas; that is actually historically how some of the first graph neural nets were implemented and motivated. There's also a spectral theory for group convolutions, so indeed the Fourier transform can be generalized; the standard Fourier transform that we all know is actually the Fourier transform for the plane, the plane being a particular group, the translation group in two dimensions. So there is a whole, very beautiful theory of, let's say, generalized Fourier transforms for groups, where now the spectrum is indexed not just by the integers, as is the case for the line or the plane, but by something called irreducible representations; in the case of the plane those are indeed indexed by integers, and the spectrum is not just scalar-valued, or complex-scalar-valued, but it can be matrix-valued.
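To pin down the classical case of the convolution theorem just mentioned, here is a short numpy check on a cyclic one-dimensional grid (ordinary circular convolution; the group and graph generalizations discussed above need more machinery than a few lines): computing the convolution directly gives the same answer as multiplying the two spectra pointwise and transforming back.

```python
import numpy as np

N = 64
x = np.random.randn(N)                    # signal on a cyclic 1-D grid
w = np.random.randn(N)                    # filter on the same grid

# Direct circular convolution: (x * w)[k] = sum_n x[n] * w[(k - n) mod N].
direct = np.array([sum(x[n] * w[(k - n) % N] for n in range(N)) for k in range(N)])

# Convolution theorem: pointwise product of the Fourier spectra, then inverse transform.
spectral = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)))

assert np.allclose(direct, spectral)
```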
If you're interested in this sort of stuff and you want a very high-level description, you can check out our paper on spherical CNNs, where we actually implement convolution on a sphere using this kind of generalized Fourier transform. So that's the Fourier perspective on convolutions. And there's a final perspective, which I think is quite intuitive, which is that the convolution is the most general kind of equivariant linear map between certain linear group actions. Specifically, these group actions are the way that groups tend to act on the space of signals on your space: we might have a space of scalar signals on the sphere, and for your feature map you might want another scalar signal on the sphere, or a vector field on the sphere, or something; then you can ask what is the most general kind of equivariant linear map, and the answer is that it's a convolution. And that is also true in a very general setting. So that, to me, is the most intuitive way of understanding convolutions: as the most general kind of linear maps that are equivariant to certain kinds of group actions. Professor Bruna, welcome to MLST, it's an absolute honor to have you on; introduce yourself. Hi Tim, I'm Joan Bruna, I'm an associate professor at the Courant Institute and the Center for Data Science at New York University, and I'm very happy to be here chatting with you on MLST. Joan, you just released this geometric deep learning proto-book; what does it mean to you, what was the kind of intellectual journey that led to this? Yeah, so this journey in fact started many years ago, I would say even during my postdoc. I was doing my postdoc here at NYU with Yann LeCun, and that was the time, 2013, when convnets were already showing big promise in image classification, and discussing with Yann, the question was, okay, what about domains that are not like grids? And that was the beginning of a journey that also included my former collaborator Arthur Szlam, who was another researcher in the lab with a similar background to mine, coming from applied math but looking more and more into deep learning. So I would say that the genesis was our first attempt at extending the success of convolutional neural networks to these irregular domains, and we published a paper at ICLR 2014. After that, things started quite naturally, because other researchers who came from the same background, maybe from a geometric background, also started to realize that there was something there, maybe bigger than these particular papers; in particular Michael Bronstein reached out to us just afterwards saying, oh, we are also looking at similar ideas, I think we should team up and start to think about this more globally. So things developed from there, and we wrote a review paper in 2017 with Arthur, Yann, Michael and Pierre Vandergheynst, who is another very well known figure in this area. From there, things started to slightly take off: we had tutorials at NeurIPS, we had a very successful workshop at IPAM, and after this I think it became clear that maybe at some point in the future we should try to put all these things into a book that tries to reflect something a bit more, let's say, mature; you know, if we
wanted to have some kind of legacy for future generations on how to implement and communicate these methods, how would that look? So I guess that was the genesis of the book. And very soon Michael and I said that we would also like to have some fresh minds and fresh energy on board, so the names of Taco and Petar came very naturally to us, as people who have been doing excellent work in the domain and who would very nicely complement our skills. So the team was created, and that's how the project started. And yeah, I think it's been very interesting so far; of course, as you know, this is still an ongoing project, it's not been finalized yet, but hopefully we're getting interest from the community and this gives us some positive vibes to finish it on time; as you know, writing books is always this never-ending process, so yeah, it's been an interesting endeavor so far, for sure. I asked Professor Bronstein what his most passionately held belief is about machine learning. I think machine learning is a field where holding strong beliefs is often counterproductive. It has happened to me multiple times that something that I thought or said was very quickly overturned, and what I mean is that milestones or progress were achieved much faster than I could have imagined even in a wild dream, so making predictions about machine learning is to some extent an ungrateful job. I do, however, believe that in order to make progress to the next level and make machine learning achieve its potential to become the transformative technology we trust and use ubiquitously, it must be built on solid mathematical foundations. And I also think that machine learning will drive future scientific breakthroughs, and probably a good litmus test would be a Nobel Prize awarded for a discovery made by, or with the help of, an ML system; it might already happen in the next decade. I wanted to go into a few questions about the book. First of all, Joan, in your 2017 paper on the mathematics of deep learning you cited the universal function approximation theorem, which is to say the ability of a shallow neural network to approximate arbitrary functions; but the performance of wide and shallow neural networks can be significantly beaten by deep networks, and you said that one of the possible explanations is that deep architectures are able to better capture invariant properties of the data compared to their shallow counterparts. Could you briefly introduce the universal function approximation theorem, and do you think it's still relevant for today's neural networks? Yeah, that's a very deep and important question. So the universal approximation theorem refers to this very general principle that, once you define a parametric class, let's say you learn functions using neural nets, it describes the ability that, as you put more and more parameters into your class, you're going to be able to approximate essentially anything that data, that nature, throws at you. And this might seem like a very powerful property, but in fact it's something that you have probably already encountered many times during your undergrad: if you have any background in, let's say, signal processing or electrical engineering, there are many ways in which students have learned how to represent data, like signals, for example using the Fourier transform.
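For reference, one standard way of stating the result under discussion (roughly the classical one-hidden-layer form; the notation is mine and glosses over the precise conditions on the activation):

```latex
% For any continuous f on a compact K \subset \mathbb{R}^d, any \varepsilon > 0,
% and a fixed non-polynomial (e.g. sigmoidal) activation \sigma, there exist a
% width N and parameters a_i, b_i \in \mathbb{R}, \; w_i \in \mathbb{R}^d with
\sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{N} a_i \, \sigma\!\bigl(w_i^{\top} x + b_i\bigr) \Bigr| < \varepsilon .
% Crucially, the theorem says nothing about how large N must be,
% which is exactly the gap discussed next.
```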
So Fourier transforms are an instance of a class that has universal approximation. In that sense, going back to the second part of your question, how relevant is it in the context of neural networks, and how far does this push us towards understanding why deep learning works? I would say there are two sides to the answer. On one hand, I think that universal approximation is a tool that, when you combine it with other elements, becomes something that provides good guiding principles. For instance, universal approximation of a generic function: we know that, as I said, we can obtain it with very naive architectures, for example just a shallow neural network without any kind of special structure already has this property; does it actually help us to learn very efficiently? No, and I'm going to get at this afterwards. But when you combine it with, for example, the fact that your data lives on a graph, or your data comes from a physical lab and has certain properties, let's say that it's rotationally invariant, then the first thing a domain scientist would like to know, if you come there and say, look, I have a neural net that fits your data and has very good performance, is, okay, how general is your architecture, can it express anything that they could throw at you? In that sense I would present it more as a check mark that your architecture needs to fulfill: if you add more and more parameters, can you express more and more functions from your class? But then, as I said, this is far from sufficient; sorry, it's a necessary condition, but it's far from being sufficient, in the sense that universal approximation has a flavor of a result that does not quantify how many parameters I need. If I want to approximate functions, let's say I want to classify between different dog breeds, this theorem doesn't tell me how many parameters, how many neurons, I need for that. It's a statement that, in that sense, leaves you with a somewhat bittersweet result; it doesn't really tell you anything, actually. And that's why we enter this other question, which is actually much deeper and to some extent still pretty much open: how do you actually go from this statement to something that is quantitative, something that tells you, okay, you need this many layers, you need this many parameters? And this is where the role of depth in neural networks becomes essentially the key open question. So that quote that you brought from the paper reflected our understanding at the time: that maybe the true power of universal approximation comes when you combine it with this other prior, which is the symmetries of the data, the invariances. I don't want to represent arbitrary functions; I only want to represent functions that are invariant to certain transformations of the input. In fact, my particular view of the problem, our analysis of the problem, has somehow evolved in the last years,
right of course through research that i've done together with my collaborators and this is actually the way we present it in the book we identify two different flavors two different sources of prior information that one needs to bake into the problem to really go beyond this basic approximation result of neural nets the first one is invariance the fact that you actually put symmetries into the architecture is certainly going to have a benefit in terms of sample complexity you are going to learn more efficiently if your model is aware of the symmetries of the world but we now know that this prior is not sufficient if you build your learning system just with these symmetries in mind you are indeed going to become more efficient than a system that is completely agnostic to symmetries but you might not be able to formally establish what we call a learning guarantee with good sample complexity i don't want to go too much into the jargon and the details of what this means but the idea is that if i want to learn this function with a certain precision how many training examples do i need in order to give you a certification a guarantee that i'm going to be able to do that with symmetries alone this is not something we know how to do in fact we have strong beliefs that it's not possible there are examples out there i could construct a function that has the right symmetries the right priors if you want but still needs a lot of examples to learn so what we need is to add something else into the mix and this is something that we call in the book scale separation and if you want i can try to very briefly give you an intuitive idea of what this means so if you think about the problem of classifying an image as a dog or a cat what is given to you is a big bunch of pixels every pixel has a color value and somehow the thing you're looking for lies in the interactions between pixels that's how you get the answer and the question is of course if i have a thousand pixels how many possible interactions do i have between a thousand elements this is where the exponential of the curse of dimensionality appears a priori i should be looking at all possible families of interactions between pixels because this is maybe where my signal would live and of course if i need to look at all of these things it's impossible there are just too many if i tell you that these interactions are such that there is a translation symmetry well you might not need to look at all of them but in fact that alone does not narrow things down enough what is really powerful is if i tell you that maybe the interactions that matter the most are those between a pixel and its neighbors and if you base your initial learning steps on understanding well which local interactions matter maybe you can use them to bootstrap from this neighborhood to slightly bigger neighborhoods and so this idea that you can break a very complicated problem into
families of sub problems that live at different scales this is i would say at the intuitive level at the core at the essence of why these architectures are so efficient this idea as you might imagine is not new and it's not specific to deep learning the idea that you can take a complicated system of interacting particles and break it into different scales is at the basis of essentially all of physics and chemistry when people study even biology or life you have experts at the molecular level then you have doctors that understand things at the level of organ functions and then maybe experts at the level of society it's pretty natural to break a very complicated thing into different scales and neural networks somehow are able to do that we don't have the full mathematical picture of for example why this scale separation is strictly necessary but what we know from empirical evidence that is by now i would say indisputable is that this is an efficient way to do it because when i was reading the proto book i noticed that there was a separation between the symmetries and the scale separation could you explain in simple terms why scale separation is not just a symmetry as well because it seemed a little bit i don't want to say kludgy but it seemed like you had this scale aspect and you dealt with it separately yeah that's a good point so maybe the way to separate these two would be to think about an algorithmic instantiation if you want to have a network that just breaks the problem into different scales it would be a neural network that operates on different patches and for every patch of the image i could be learning an independent set of parameters so that would be a model that is only told that it should break the problem into different regions but it's not told that there is weight sharing some kind of parameter sharing that in essence is helping you to learn with a smaller number of parameters so these two conditions are slightly complementary and as i said there are still a lot of mathematical puzzles as to how these things interact optimally i think one of the reasons why we chose to explain the story with these two different priors is that they both carry over fortunately in the sense that these two principles are again something that you can think about for grids for groups for graphs you can also see these principles appearing everywhere as you study physical systems scale and symmetry are really at the core of many physical theories and i would say that at a more technical level these two priors have been instrumental in giving us a kind of recipe to build architectures maybe now we don't even think about it but when you have a new problem a new domain and you need to build an efficient neural network we automatically have this idea that we are going to learn by composition so we are going to extract the information one layer at a time that's the first appearance of scale
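A minimal sketch of the distinction Joan draws here, written as a hypothetical PyTorch-style example (the module names, patch sizes and shapes are illustrative assumptions, not code from the book): a locality-only model learns an independent linear map for every image patch, while adding the translation-symmetry prior ties those maps together into a single shared convolutional filter, collapsing the parameter count.

```python
import torch
import torch.nn as nn

# locality only: split the image into patches and give every patch its own weights
class PatchwiseLinear(nn.Module):
    def __init__(self, num_patches, patch_dim, out_dim):
        super().__init__()
        # one independent weight matrix per patch -> parameters grow with image size
        self.weights = nn.Parameter(torch.randn(num_patches, patch_dim, out_dim) * 0.01)

    def forward(self, patches):  # patches: (batch, num_patches, patch_dim)
        return torch.einsum('bnd,ndo->bno', patches, self.weights)

# locality + translation symmetry: the same filter is shared across all positions,
# so the parameter count no longer depends on the image size
shared_conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(8, 3, 32, 32)                          # toy batch of 32x32 RGB images
patches = x.unfold(2, 4, 4).unfold(3, 4, 4)            # 8x8 grid of 4x4 patches
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(8, 64, 3 * 4 * 4)

local_only = PatchwiseLinear(num_patches=64, patch_dim=48, out_dim=16)
print(sum(p.numel() for p in local_only.parameters()))   # 64 * 48 * 16 = 49152
print(sum(p.numel() for p in shared_conv.parameters()))  # 16 * 3 * 9 + 16 = 448
```

The first model only encodes scale separation (work locally, patch by patch); the second also encodes the symmetry prior through weight sharing, which is why the two are complementary rather than one being a special case of the other.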
and we know the way we need to organize these layers how do you parameterize a layer that takes some input features and produces maybe better features the idea is that we do that by understanding this kind of equivariance structure we have this notion of filters in convolutional networks we organize everything in terms of filters and when we talk about message passing on graphs we have these kind of diffusion filters so there is this object that we extract from the domain that gives us something very constructive very practical and this is really the underlying group of transformations that is acting on our domain so i would say that from a practitioner's perspective these are the two principles i'm going to learn by composing some fundamental layer that i repeat all the time and the way this layer is organized is structured through this group of transformations this has been i would say a trademark of the success of neural networks of course as we also mention in the book these are i would say proto or meta parameterizations in the sense that there are many variants that people have come up with many modifications of the basic architecture that have made dramatic changes in performance so the devil sometimes is in the details and as we are writing the book we are realizing exactly how some of the changes in the architectures actually fit into what we call this blueprint the geometric deep learning blueprint fascinating okay so i wanted to come back to what you were saying a few minutes ago about the sample efficiency of these models now with graph neural networks for example there are factorially many permutations of adjacency matrices for a given graph and i want to talk about a sorting algorithm francois chollet came on the podcast and he said that in order to learn a sorting algorithm that generalizes you would need to learn point by point you would need to see factorially many examples of permutations of numbers do you think that we could train a neural network to learn sorting i guess what i'm asking in a roundabout way is do you think there's a kind of geometry to computation itself that's a very good question and in fact we have some recent work with collaborators in my group where we take up this question and try to formalize it mathematically and give answers to it many of these computational tasks that you were mentioning for example sorting or algorithmic tasks in general if one wants to put them into some mathematical context the first thing that comes to mind is that these are functions that already enjoy some symmetries for example a sorting algorithm is invariant to permutations of its input so the function that you are trying to learn is an arbitrary function within this symmetry class and as such as i was saying in the beginning you can try to address the question as follows relative to an arbitrary learner that is agnostic to symmetry how much does a symmetric learner gain
so you can basically try to quantify the gains in sample complexity of learning without symmetry versus learning with symmetry and the punch line of this recent work that we completed is that one can actually quantify the sample complexity gains and they are of the order of the size of the group so here in the case of permutations what it means is that if a learner is aware of this symmetry one training example of the symmetric learner is roughly equivalent to n factorial samples of the agnostic learner which is something you would expect if you think in terms of data augmentation if i tell you in advance that your function is invariant to permutations a brute force approach would say okay you give me a training input and instead of just this input i take every possible permutation of the input for which you already know the output and feed those to the learner as well this heuristic intuition turns out to be mathematically correct and precise at least under some conditions but i guess the whole point is that these gains might look amazing like a big boost in sample complexity but as i said before there is a grain of salt here in these general conditions you are already fighting an essentially impossible problem in the sense that the rate at which you learn is what we call cursed by dimension what this means is that if i have a certain generalization error and i ask the question if i want to divide this generalization error by two how many more samples do i need to give you this dependency is in fact exponential in dimension so basically the sample complexity gains from invariance are exponential in dimension but they are fighting an impossible problem that is itself exponential in dimension so what it means at the end of the day and this was at the heart of what i was saying before is that invariance alone might not be sufficient because you are taking a very hard problem and removing an exponential factor but you still have something exponential in the dimension left over and that is really what underpins why we think in terms of combining the symmetry prior with this scale separation prior but certainly algorithmic tasks are a very interesting playground because for the case of sorting scale separation is also a very important ingredient this is basically what is at the heart of dynamic programming approaches these efficient algorithms that are not only efficient statistically but also computationally this idea that you can divide and conquer so algorithmic tasks are also exposed to this dual principle of scale and invariance
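As a toy illustration of the augmentation heuristic Joan describes (a hypothetical sketch, not the construction from the paper he mentions): for a permutation-invariant target, one labelled example can be expanded into every permutation of itself with the same label, or the invariance can be baked directly into the model with DeepSets-style sum pooling so that no augmentation is needed at all.

```python
from itertools import permutations
import torch
import torch.nn as nn

# a permutation-invariant target: the label depends only on the multiset of inputs
x = torch.tensor([3.0, 1.0, 2.0])
y = x.sum()  # any symmetric function of the inputs would do

# brute-force augmentation: one labelled example becomes n! examples with the same label
augmented = [(torch.tensor(p), y) for p in permutations(x.tolist())]
print(len(augmented))  # 3! = 6

# alternative: build the symmetry into the architecture (sum pooling over elements),
# so the model is invariant by construction
class DeepSets(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.rho = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, n)
        h = self.phi(x.unsqueeze(-1))          # encode each element independently
        return self.rho(h.sum(dim=1))          # sum pooling -> permutation invariant

model = DeepSets()
out1 = model(x.unsqueeze(0))
out2 = model(x[torch.tensor([2, 0, 1])].unsqueeze(0))
print(torch.allclose(out1, out2))  # True: the output ignores the ordering
```

The n! blow-up of the augmented set is exactly the "size of the group" factor in the sample complexity gain, and also why building the symmetry into the architecture is preferable to augmenting it away.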
on the curse of dimensionality there's this issue where you have a data point and you want to surround it by other data points in two dimensions to create a convex hull and as you increase the number of dimensions the number of data points you need to create this covering to create a kind of interpolative space increases exponentially and when you get past a certain number of dimensions let's say 16 not very many you would need essentially more data points than there are atoms in the universe so this leads to a very interesting realization i think some people refer to it as the manifold hypothesis which is that most natural data only really varies along very few dimensions and a lot of data falls on very smooth low dimensional manifolds but what are the implications of this essentially all machine learning problems that we need to deal with nowadays are extremely high dimensional even if we take very modestly sized images they live in thousands or even millions of dimensions and if you think of machine learning or at least the simplest settings of machine learning as a kind of glorified function interpolation the standard approaches to function interpolation that just use the data points to predict the values of your function will simply not work because of the phenomenon of the curse of dimensionality as you increase the number of dimensions the number of such points blows up exponentially so what you really need to take into account and probably this is really what makes machine learning work in practice is the assumption that there is some intrinsic structure to the data and it can be captured in different ways it's either the manifold assumption where you assume that the data even though it lives in a very high dimensional space is intrinsically low dimensional or it can be captured in the form of symmetry for example an image is not just a high dimensional vector it has an underlying grid structure and the grid structure has symmetry this is what is captured in convolutional networks in the form of shared weights that translate into the convolution operation i think the symmetries are part of the magic here because it's not just the interesting observation that natural data only varies along so few dimensions there's something magic about symmetries and when we spoke to francois chollet recently he invoked the kaleidoscope effect which is this notion that almost all information in reality is a copy of some other information probably here it would be a little bit of a stretch but i would say that because data comes from nature from the physical processes that produce it physics and nature itself is in a sense low dimensional it's the application of simple rules at multiple scales you can create very complex systems with very simple rules and this is probably how the data that we are mostly interested in in machine learning is structured so you have this manifestation of symmetry and self-similarity through different scales the principles of symmetry invariance or equivariance and of scale separation where you can separate your problem into multiple scales and deal with it at different levels of resolution and this is captured for example in pooling operations in convolutional neural networks and in other deep learning architectures this is what makes deep learning systems work
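To make Tim's covering intuition concrete with a back-of-the-envelope calculation (the grid spacing and the dimensions below are arbitrary assumptions chosen for illustration): covering the unit cube at a fixed resolution needs on the order of (samples per axis)^d points, which overtakes the roughly 10^80 atoms of the observable universe at surprisingly modest d.

```python
# number of grid points needed to cover [0, 1]^d at 10 samples per axis: 10**d
samples_per_axis = 10           # an arbitrary illustrative resolution
atoms_in_universe = 10 ** 80    # rough order of magnitude

for d in (2, 16, 64, 3072):     # 3072 = 32 * 32 * 3, the dimension of a tiny RGB image
    points = samples_per_axis ** d
    verdict = "exceeds" if points > atoms_in_universe else "is below"
    print(f"d = {d:5d}: ~10^{d} covering points, which {verdict} ~10^80 atoms")
```

This is why naive interpolation over a covering of the raw input space is hopeless, and why the manifold, symmetry and scale-separation assumptions discussed above carry so much weight.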
fascinating it's so good that you raised the curse of dimensionality because i was going to ask you about that could you explain in really simple terms so not invoking lipschitz i can't even say it properly not invoking mathematical jargon why exactly in your articulation does geometric deep learning reduce the impact of the curse of dimensionality yeah so the curse of dimensionality refers generally to the inability of algorithms to keep certifying a certain performance as the data becomes more complex and data becoming more complex here means that you have more and more dimensions more and more pixels so this inability to scale basically says that if i scale up the input my algorithm is going to have more and more trouble keeping the pace and this curse can take different flavors the curse might have a statistical reason in the sense that as i make my input space bigger there will be exponentially more functions out there that would explain the training set that would pass through the training points so the more dimensions i add the more uncertainty i have about the true function and i would need more and more training samples to keep the pace the curse can also be on the approximation side in the sense that the number of neurons i need to approximate my target function keeps growing at a rate that is exponential in dimension and the curse can also be on the computational side in the sense that if i keep adding parameters to my model i might have to solve an optimization problem that becomes exponentially harder so you can see that you are basically bombarded from three different angles and an algorithm here in the context of statistical learning or learning theory if you want with a theorem that says yes i can promise you that you can learn needs to actually solve these three problems at once you need to be able to say that in the conditions you're studying you have an algorithm that does not suffer from the approximation nor the statistical nor the computational curse so as you can imagine it's very hard because you need to master many things at the same time so why do we think that geometric deep learning is at least an important piece to overcome this curse as i said before geometric deep learning is really a device to put more structure into the target function basically to make the learning problem easier because we are promising the learner more properties about the target function we are basically making the hypothesis class smaller that said there is still some path to go what we were describing is just a bunch of principles that make these hypothesis spaces smaller and more adapted to the real world but one thing we are still lacking for example is a guarantee in terms of optimization i described that the depth of these architectures is somehow associated with scale the fact that you need to understand things at different scales and as you know on the optimization side there are still open questions and open mysteries as to why gradient descent for example is able to find good solutions so we believe that these architectures can be optimized efficiently just because we have this
experimental evidence that is piling up but we are for example still lacking theoretical guarantees for the approximation it's a bit of the same story we understand very well the approximation properties of shallow networks starting from universal approximation and of course there is a lot of recent interesting work but we are still lagging a little bit behind on approximation properties for deeper networks so as you can see from this discussion yes we have good reasons to believe that these are fundamental principles of learning but there is also a bunch of mathematical questions that are still open and this is also one of the things that i like about writing a book on this topic because it's a very live domain the field is still evolving and i think it's a good time if researchers out there are listening to us it's a good time to think and to work and to join this program amazing will we ever understand the approximation properties of deep networks because with the shallow function approximation setting you can almost think of a neural network as being kind of like sparse coding and the more neurons you have you're just kind of discretely fitting this arbitrary function but you don't have that visual intuition quite so much with the deep networks yeah it's an important and deep question so shallow neural networks really correspond to this idea that you learn a function by stacking a linear combination of basis elements and this is really at the root of essentially all of harmonic analysis and functional analysis you typically think about a basis and you ask questions about linear approximation or nonlinear approximation et cetera deep neural networks introduce a fundamentally different way to approximate functions which is by composition and so you're right our knowledge about this question right now is mostly concentrated in what we call separation results like depth separation which consists in trying to construct mathematical examples of functions that cannot be approximated with shallower networks with a certain number of neurons but can be much better approximated with deeper networks so this is really about understanding which kinds of functions benefit fundamentally from composition rather than from addition and there is a certain mathematical intuition that is building up but of course it's still very far from explaining the true power of depth and just to give you a final pointer here there is a very related question that replaces neural networks as we understand them with what we call boolean circuits these are just circuits that take as input some bit string and manipulate the bits with and gates or gates and so on and you can keep adding gates so this question about the ability of a certain circuit architecture to compute certain boolean functions is actually notoriously hard and has been studied in the theoretical computer science community since the 50s and the 60s and there are actually very deep results and very challenging open questions concerning these things
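A compact way to write the contrast Joan draws between shallow additive approximation and deep compositional approximation (schematic notation, not a formula from the book):

```latex
% shallow: one linear combination of fixed-form basis elements
f_{\text{shallow}}(x) \;=\; \sum_{i=1}^{N} a_i\,\sigma\!\left(w_i^{\top} x + b_i\right)
\qquad \text{(approximation by addition)}

% deep: a composition of simpler layer maps f_\ell
f_{\text{deep}}(x) \;=\; \left(f_L \circ f_{L-1} \circ \cdots \circ f_1\right)(x)
\qquad \text{(approximation by composition)}
```

Depth separation results exhibit target functions for which the compositional form needs far fewer parameters than any shallow additive form of comparable accuracy.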
so this is really touching on questions that are pretty serious at a deep mathematical and theoretical level so yes you should not expect that in the next year or two we will have a complete understanding of the approximation powers of any architecture at any depth but i think you should expect that the theory will continue to try to catch up with the experiments and we are hoping to get a more precise mathematical understanding of the role of depth and as i said before one thing that is fascinating about this domain and maybe very unique to deep learning is this very strong interaction between optimization statistics and approximation maybe it turns out that the huge depth that we have in residual neural networks might not be necessary from the approximation side but in fact it's so useful for the optimization that overall it's a big winner so there are always these twists about this question that fundamentally mix these three sources of error i'm sorry for asking you this question but can neural networks extrapolate outside of the training data that's a good question i would say that the answer depends on your specifications i guess the conservative answer of a statistical learning person would be no because we don't have good theorems right now that tell us this is the case there is a strong effort both from the practical and the theoretical community to really understand this question by trying to formalize it a bit more what do we mean by distribution shift what kind of training procedures can you come up with that would give you precisely this kind of robustness there is of course what we call these biases i can always take a training distribution make a choice of a certain architecture and learn a function and obviously there are some directions if you want in the space of distributions for which my hypothesis will turn out to have good generalization and there might be other directions in which the contrary is true there's actually very nice work from stefanie jegelka's group at mit where they studied this question in the context of relu networks also including graphs and they identify a strong preference for relu networks to generalize along linear directions in the sense that if i decide to shift my data in linear directions then my function has no trouble generalizing but maybe there are other directions in which this is actually catastrophic so i think the question is very important from a practitioner's perspective that's clearly something a user would like to know but from a more mathematical theoretical level i think we are still at the stage of trying to formalize what we exactly mean by extrapolation and what the conditions are under which an architecture can do it so it's an important question but i think we are still pretty far from having a full answer amazing and final question what areas of mathematics
do people need to study before reading the proto book good question so our objective our idea is really to have something that is quite self-contained in the sense that we are going to provide appendices that expand on the areas that are maybe not typically the bread and butter of machine learning people for example we are going to have an appendix on group theory differential geometry and harmonic analysis so these are areas that will be there for you to delve into to get the most out of the book but other than that any basic knowledge of linear algebra statistics and analysis will do so if you have taken a graduate level class on machine learning you should be ready to go rather than just being applied in a lot of really important branches of research problems people outside of pure research have recognized that a lot of the data that comes to us from nature is most naturally represented in a graph structured form very rarely will nature give us something that can be represented as an image or a sequence that's super rare so very often the structure is more irregular more graph-like and therefore graph neural networks have already seen a lot of applications in domains where the data is most naturally represented as a graph in the domain of computational chemistry where you can represent molecules as graphs of atoms and the bonds between them graph neural networks have already proven impactful in detecting novel potent antibiotics that previously were completely overlooked because of their unusual structure in the area of chip design graph neural networks are powering systems that are developing the latest generation of google's machine learning chips the tpu furthermore graph structured data is super ubiquitous in social networks and the kinds of networks maintained by many big industry players and accordingly graph neural networks are already used to serve various kinds of content in production to billions of users on a daily basis in fact the recommendation system at pinterest the product recommendation system at amazon as well as the food recommendation system for ubereats are all powered using a graph neural network that helps serve the most relevant content to users on a daily basis and on a slightly personal note graph neural networks have also been used to significantly improve travel time predictions in google maps which is used by billions of people every day so whenever you type a query how do i get from point a to point b in the most efficient way the travel time prediction that you get is powered by a graph neural network that we have developed at deepmind in collaboration with the google maps team and this is of high importance not only to users that use the app on a daily basis to find the most efficient way to travel it's also used by the various companies that leverage the maps api so they can tell their customers how long it will take for a certain vehicle to arrive so companies such as food delivery companies and ride sharing companies have also extensively profited from this system which in cities such as sydney has reduced the relative amount of negative user outcomes in terms of badly predicted travel times by over 40 percent making it one way in which graph representation learning techniques that i have co-developed are actively impacting billions of people
on a daily basis when i was an undergraduate student i was interested in image processing and was excited about variational methods i think it's a very elegant idea that you can define a functional that serves as a model for your ideal image and then use the optimality conditions to derive a differential equation that flows towards the optimum and a particularly cool approach was proposed by ron kimmel where you could think of an image as a manifold or a high-dimensional surface and use an energy that originated in string theory and particle physics to derive a non-euclidean diffusion pde called beltrami flow that acts as a nonlinear image filter and this is what made me fall in love with differential geometry and i did a phd with ron on this topic i think these were really beautiful and deep ideas that unfortunately now are almost forgotten in the era of deep learning and it's a pity that the machine learning research community has such a short memory because many modern concepts have really ancient roots ironically we've recently used non-euclidean diffusion equations as a way to reinterpret graph neural networks as neural pdes and i think it really helps sometimes to have a longer time window now equivariant networks tend to generalize much better and require much less data if the data indeed has the symmetry that you assumed in your model but people often ask why do we even care about data efficiency when we can just collect more data we live in the era of big data right and i think the answer why you might still be interested in data efficiency is first of all there are applications like say medical imaging where acquiring labeled data is simply very costly you have to get patients you're dealing with privacy restrictions you're dealing with costly highly trained doctors who have to annotate the data come together in a committee to decide on questionable cases and so forth so this is very expensive and if you can improve the data efficiency by a factor of 2 or 10 or whatever it may be you might just take a problem that was in the realm of the economically infeasible and take it into the realm of the economically feasible which is a very useful thing there are other cases like graph neural nets where the group of symmetries is so large in this case an n factorial number of permutations that no amount of data or data augmentation in practice is going to allow you to learn the symmetry or to learn the invariances or equivariances in your network so indeed you see that in the space of graph neural nets everybody uses permutation equivariant network architectures and finally you could think about the grand problems of agi artificial general intelligence and so forth and here i think that we will most certainly need large data sets large networks a lot of compute power and so forth and the current architectures we have are clearly performing far fewer computations than the human brain so there's a ways to go there but i also think that one essential characteristic of intelligence is the ability to learn quickly in new situations situations that are not similar to the ones you've seen in your training data and so data efficiency to me is an essential characteristic of intelligence it's almost like an axiom that you want your method to be data efficient and so i think one of the big challenges for the field right now is to try to think of very generic priors priors that apply in a wide range of situations and they give you even though they're generic
and abstract they give you a lot of bang for the buck in terms of improved generalization and data efficiency i think that the beauty of science and research is in connecting the dots and i find it fascinating that for example graph neural networks are connected to the work of weisfeiler and lehman from the 60s on isomorphism testing which in turn was inspired by problems in chemistry and chemistry was also the field that drove the research into the modern formulation of diffusion equations that were adopted in the image processing community in the 90s and came back recently as a way to reinterpret graph neural networks and i think such connections give a really new and deep perspective and probably the deeper you dive the broader they become but it's really a never-ending story today is an incredibly special episode and we're filming at nine o'clock in the morning it's really rare for us to film this early when i'm still caffeinated many of our guests are over in the states it's an absolute honor to have you both on mlst and professor bronstein could you start by briefly telling us how the young mathematician emmy noether used symmetries to discover the conservation laws in physics maybe i should take a step back and describe the situation that happened in the field of geometry towards the end of the 19th century it was an incredibly fruitful period of time for mathematicians working in this field with the discovery and development of different kinds of geometries so a young mathematician based in germany called felix klein proposed this quite remarkable and groundbreaking idea that you can define geometry by studying the groups of symmetries basically the kinds of transformations to which you can subject geometric forms and seeing how different properties are preserved or not under these transformations these ideas appeared to be very powerful and what emmy noether showed in her work and she actually worked in the same institution where klein ended up in göttingen in germany is that you can take a physical system that is described by a functional as a variational system and associate different conservation laws with different symmetries of this system and it was a pretty remarkable result because before that conservation laws were purely empirical you would make an experiment many times and measure for example the energy or mass before and after some physical process or chemical reaction and you would come to the conclusion that the quantity is preserved this is how for example i think lavoisier discovered the conservation of mass so it was probably the first time that you could derive these laws from first principles you would need to assume in the case of conservation of energy the symmetry of time
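A compact statement of the correspondence Michael describes, in the standard textbook form for a one-dimensional Lagrangian system (quoted as background, not taken from the book): if the Lagrangian has no explicit time dependence, the associated energy is conserved along solutions of the Euler-Lagrange equation.

```latex
% time-translation symmetry of the action S[q] = \int L(q, \dot{q})\,dt
% (i.e. \partial L / \partial t = 0) implies conservation of the energy
E \;=\; \dot{q}\,\frac{\partial L}{\partial \dot{q}} \;-\; L ,
\qquad
\frac{dE}{dt}
\;=\; \dot{q}\!\left(\frac{d}{dt}\frac{\partial L}{\partial \dot{q}} - \frac{\partial L}{\partial q}\right)
\;=\; 0
% the last step uses the Euler-Lagrange equation  d/dt (\partial L / \partial \dot{q}) = \partial L / \partial q
```

The same recipe turns spatial-translation symmetry into conservation of momentum and rotational symmetry into conservation of angular momentum.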
we decided to take a tour into the world of algorithmic reasoning you said in your introduction that algorithmic reasoning seeks to find neural networks that are good at imitating the classical algorithms that initially brought you to computer science so you recently released a paper on this called neural algorithmic reasoning and it's often claimed that neural networks are turing complete and we're told that we can think of training neural networks as being a kind of program search but you argued in your paper that algorithms possess fundamentally different qualities to deep learning methods francois chollet actually often points out that deep learning algorithms would struggle to represent a sorting algorithm without learning point by point which is to say without any generalization power whatsoever but you seem to be making the argument that the interpolative function space of neural networks can model algorithms more closely to real-world problems potentially finding more efficient and pragmatic solutions than those classically proposed by computer scientists so what's your take yeah thanks for asking that tim i think the concept of classical algorithms as opposed to deep neural networks at least the way we're currently applying them makes all these points about turing completeness a little bit moot because there are quite a few proofs out there saying that you can use neural networks or more recently graph neural networks in particular to simulate a particular algorithm perfectly but all of these proofs are sort of a best case scenario they're basically saying i can set the weights of my neural network to these particular values and voila i'm imitating the algorithm perfectly so all these best case scenarios are wonderful but in practice we don't use this kind of best case optimization we use stochastic gradient descent and we're stuck with whatever stochastic gradient descent gives us so in practice the fact that a neural network is capable for a particular setting of weights of doing something doesn't mean that it will actually do that when trained from data essentially this is the big divide that separates deep learning from traditional algorithms and it has a number of other issues as well not just the fact that we cannot find the best case solution also the fact that we are working in this high dimensional space which is not necessarily easily interpretable or composable because you have no easy way of saying for example in theoretical computer science if you want to compose two algorithms you're working with them in a very abstract space which means that you can easily reason about stitching the output of one to the input of another whereas you cannot make that easy a claim about the latent spaces of two neural networks so all these kinds of properties interpretability compositionality and obviously also out of distribution generalization are plagued not by the fact that neural networks don't have the capacity to do this but by the fact that the routines we use to optimize them are not good enough to get across that divide so in neural algorithmic reasoning all that we're really trying to do is to bring these two sides closer together by making changes either in the structure of the neural network or the training regime of the neural network or the kinds of data that we let the neural network see so that hopefully it's going to generalize better and extrapolate better especially on the kinds of classical algorithmic problems that we might see in a computer science textbook and lastly i'd just like to address the point about sorting we have a paper on algorithmic reasoning benchmarks that we are about to submit to the neurips datasets track i think it should be public even now on github because that's a requirement for the conference where we have quite a few algorithmic tasks and we're trying to force gnns to learn them and we do have several sorting tasks in there and at least in distribution these graph neural networks are capable of imitating the steps of say insertion sort so i will say not all is lost if you're very careful about how you tune them but obviously there are a lot of caveats and i hope that later during this chat we'll also get a chance to talk a little bit about how even though we cannot perfectly mimic algorithms we can still use this concept of algorithmic execution to help expand the space of applicability of algorithms
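A tiny sketch of the kind of step-wise supervision Petar alludes to for the sorting tasks (a generic illustration of imitating an algorithm's intermediate states, not code from the benchmark he mentions): instead of only the sorted output, a model is given every intermediate array that insertion sort passes through as a training target.

```python
def insertion_sort_trace(values):
    """Return the list of intermediate states insertion sort passes through."""
    arr = list(values)
    trace = [list(arr)]                 # state before any work is done
    for i in range(1, len(arr)):
        key, j = arr[i], i - 1
        while j >= 0 and arr[j] > key:  # shift larger elements to the right
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
        trace.append(list(arr))         # one supervision target per outer step
    return trace

# each consecutive pair (trace[t], trace[t+1]) can supervise one "step" of a
# step-wise model, rather than only supervising the final sorted output
for state in insertion_sort_trace([5, 2, 4, 1]):
    print(state)
# [5, 2, 4, 1] -> [2, 5, 4, 1] -> [2, 4, 5, 1] -> [1, 2, 4, 5]
```

Supervising intermediate states is one way to nudge stochastic gradient descent towards the algorithmic solution rather than a point-by-point fit of input-output pairs.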
yeah this is absolutely fascinating because this gets to the core of what i think some people point out as being the limitations of deep learning and chollet spoke about this but i don't think that geometric deep learning would help a neural network learn a sorting function because discrete problems in general don't seem amenable to vector spaces either because the representation would be glitchy or the problem is not interpolative in nature or not learnable with stochastic gradient descent so it would be fascinating if we could overcome these problems using continuous neural networks as an algorithmic substrate do you think we could i think that it is possible but it will require us potentially to broaden our lens on what we mean by geometric deep learning and this is something we're already very actively thinking about i think one of our co-authors taco cohen has actually thought much more deeply about this in recent times but basically the idea is we looked at geometric deep learning from a group symmetry point of view which is a very nice way to describe spatial regularities and spatial symmetries but it's not necessarily the best way to talk about say invariants of generic computation which you would find in algorithms it's like i have an input that satisfies certain preconditions and i want to say that once i push it through this function it should satisfy certain postconditions this is not the kind of thing we can very easily express using the language of group theory however it is something that perhaps we could express more nicely using the language of category theory which is an area of math that i still don't know enough about i'm currently actively learning it but basically in the language of category theory groups are super simple categories that have just one node you can do a lot more complicated things if you use this broader abstract language and you can talk about basically anything of interest there in terms of these commutative diagrams and taco actually recently had a really interesting paper called natural graph networks where they basically generalized the notion of permutation equivariance that you might find in graph nets to this more general concept of natural transformations so now suddenly you don't have to have a network that does exactly the same transformation in every single part of the graph what you actually need is something a bit more fine-grained you just need all locally isomorphic parts of the graph to behave the same which in principle gives you a bit more flexibility and i think that this kind of language moving a bit away from the group formalism would allow us to talk about say algorithmic invariants and things like this i don't yet have any theory to properly prove this but it's something that i'm actively working on and i guess the only question is would you still call this geometric deep learning and in my opinion the very creators of category theory have said that category theory is a direct extension of felix klein's erlangen program so since the founders of the field have already made this connection i would expect that it would be pretty applicable under a geometric lens so it seems to come back from both of your sides to essentially graphs and working on
graphs to capture on the one side the sort of symmetries that you either assume in the problem or that you want to impose and on the other side these computations may be well represented in graphs what's so special about graphs in your estimation is there something fundamental to it or is it just another way we had ben goertzel here and i asked him the same question what's so special about graphs and his argument was essentially well it's not about graphs it's simply a good representation of the problem that we can make efficient operations on do you have a different opinion on that is a graph something fundamental that we should look at more than for example a tensor graphs are abstract models for systems of relations or interactions i should maybe specify pairwise relations and interactions and it happens that a lot of physical or biological or even social systems can be described at least at some level of abstraction as a graph so that's why it is such a popular modeling tool in many fields you can also obtain other structures such as grids as particular cases i wouldn't call it fundamental but it is a very convenient and very common i would say ubiquitous model now what i personally find disturbing and we can talk about it in more detail later on is that if you look at the different geometric structures for example that we consider in the book whether it's euclidean spaces or manifolds they all have discrete counterparts you have a plane and you can discretize it as a grid a manifold we can discretize as a mesh but a graph is inherently discrete and this is something that i find disturbing there is in fact an entire field called network geometry that tries to look at graphs as continuous objects so for example certain types of graphs that look like social networks what are called scale-free graphs can be represented as nearest neighbor graphs in a somewhat exotic space with hyperbolic geometry and if we take this analogy i think it is very powerful because now you can consider graphs as a discretization of something continuous and then think for example of graph neural networks as certain types of diffusion processes that are just discretized in a certain way and by making different discretizations possible you will maybe get better performing architectures one of the core dichotomies we talk about on street talk is the apparent dichotomy between discrete and continuous and as yannic was saying there are folks out there who want our knowledge substrate to be discrete but still distributed we call them sub-symbolic folks and this network geometry is fascinating as well because you're saying in some sense you can think of there being some unknown continuous geometry so you're saying there is no dichotomy this is probably a little bit of wishful thinking as it happens with every model so i would phrase it carefully for some kinds of graphs you can make this continuous model for others maybe not fascinating well on to the subject of vector spaces versus discrete geometric deep learning is all about making any domain amenable to vector spaces and indeed artificial neural networks but could these geometric principles be applied to another form of machine learning let's say discrete program synthesis certainly a very important question tim and thanks for asking that i think that there are many ways in which
geometric deep learning is already at least implicitly powering discrete approaches such as program synthesis because there is a pretty big movement on these so-called dual approaches where you stick a geometric deep learning architecture within a discrete tool that searches for the best solution so for example in combinatorial optimization a very popular approach recently for neurally solving mixed integer programs is to have your typical off-the-shelf mip solver that selects variables to optimize one at a time and with these kinds of algorithms they're in principle exponential time but if you're lucky or knowledgeable enough about the order in which you select these variables you can actually solve the problem in linear time which is something we would like to strive towards and the exact way in which we select these variables is a bit of black magic humans have come up with a few heuristics but they don't always work and whenever you have this kind of setting as long as you're assuming that your naturally occurring data isn't always throwing the worst possible or adversarial cases at you you can usually rely on some kind of modeling technique for example a neural network to figure out which decisions the model should be taking so in this case for mip solving deepmind recently published a paper on this where you can treat mip problems as bipartite graphs where you have variables on one side and the constraints on the other and you link them together if a variable appears in a constraint then they run a graph neural network which as we just discussed is one of the flagship models in geometric deep learning over this bipartite graph to decide which variable the model should select next and you can train this either as a separate supervised technique to learn some kind of heuristic or you can learn it as part of a broader reinforcement learning framework where the reward you get is how close you are to the solution or something like this so this is one clear way in which you can have this synergy between geometric deep learning architectures and solutions for for example program synthesis but i would just like to offer another angle in which you can think of program synthesis as nothing other than just one more way to do language modeling because synthesizing a program is not that different from synthesizing a sentence maybe with a more stringent check on syntax and so forth but any technique that is applied to language modeling could in principle be applied to program synthesis and something that we will be discussing i believe later during this conversation is that one of the flagship models of geometric deep learning is indeed the transformer which we show in our book and elucidate why it can be seen as a very specific case of a graph neural network and that's one of the flagship models of language modeling so basically that's also one more way to unify just because the end output is discrete doesn't mean that you cannot reason about it using representations that are internally vector based
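A minimal sketch of the bipartite encoding Petar describes for MIP instances (the data structures and the toy instance are illustrative assumptions, not the representation used in the DeepMind paper): every variable and every constraint becomes a node, with an edge whenever a variable appears in a constraint, and a GNN would then pass messages over this graph to score the next branching variable.

```python
# toy MIP:  maximize x0 + 2*x1  subject to
#   c0:  x0 + x1 <= 4
#   c1:       x1 <= 3
# variables and constraints become the two sides of a bipartite graph

constraints = {
    "c0": {"x0": 1.0, "x1": 1.0},   # coefficients of the variables in each constraint
    "c1": {"x1": 1.0},
}

# bipartite edge list: (variable node, constraint node, coefficient as an edge feature)
edges = [(var, con, coef)
         for con, terms in constraints.items()
         for var, coef in terms.items()]
print(edges)
# [('x0', 'c0', 1.0), ('x1', 'c0', 1.0), ('x1', 'c1', 1.0)]

# one round of mean-style message passing from constraints to variables,
# standing in for the learned GNN that would score branching candidates
var_scores = {}
for var in {v for v, _, _ in edges}:
    incident = [coef for v, _, coef in edges if v == var]
    var_scores[var] = sum(incident) / len(incident)
print(var_scores)   # a real model would learn these scores rather than averaging
```

The point of the encoding is that the same GNN works regardless of how many variables or constraints the instance has, because the bipartite graph carries all the problem structure.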
francois chollet pointed out that there was this dichotomy so you can embed discrete information into a continuous representation but the manifold needs to be smooth it needs to be learnable it needs to be interpolative in nature so i thought that was why we have these discrete program searches but then you have this exponential blow up but maybe that search space because it's interpolative could be found using stochastic gradient descent if you embed the discrete information into some kind of vector space but professor bronstein i wanted to throw it back over to you why is it taken as a given that vector spaces are a good thing because everything we're doing here is embedding discrete information into these euclidean vector spaces why are we doing that there are multiple reasons why vector spaces are so popular in representation learning vectors are probably the most convenient representation for both humans and computers we can do for example arithmetic operations with them like addition or subtraction we can represent them as arrays in the memory of the computer they are also continuous objects so it is very easy to use continuous optimization techniques in vector spaces it is difficult for example to optimize a graph because it is discrete and requires combinatorial techniques but in a vector representation i just have a bunch of points that i can continuously move in a high-dimensional space using standard gradient-based techniques perhaps a more nuanced question is what kinds of structures can be represented in a vector space and a typical structure is some notion of similarity or distance we want the vector representations to preserve the distances between let's say the original data points and here we usually assume that the vector space is equipped with the standard euclidean metric or norm and we have a problem from the domain of metric geometry of representing one metric space in another and unfortunately the general answer here is negative you cannot exactly embed an arbitrary metric in euclidean space but there are of course some results such as bourgain's theorem that for example provide bounds on the metric distortion in such cases and in graph learning spaces with other more exotic geometries such as hyperbolic spaces have recently become popular with for example papers of ben chamberlain my colleague from twitter or max nickel from facebook and you can see that in certain types of graphs the number of neighbors grows exponentially with the radius if you look for example at the number of friends of friends of friends and so on in a social network where you have this small world phenomenon you can see that it becomes exponentially large with the growth of the radius and now when you try to embed this graph in euclidean space it will become very crowded because in euclidean space the volume of the metric ball grows only polynomially with the radius think of the two-dimensional case that we all know from school the area of a circle is pi times the radius squared and in general the volume of the ball is polynomial in the radius with the exponent given by the dimension so we inevitably need to increase the dimension of the embedding to make space for these neighbors in hyperbolic space the situation is very different because the volume grows exponentially with the radius so it is way more convenient to use these spaces for graph embeddings and in fact recent papers show that to achieve the same error you could embed a graph in a hyperbolic space with let's say 10 dimensions where you would require something like a 100 dimensional euclidean space of course i should say that metrics are just one example of a structure so the general answer to the question whether a vector space is a good model for representing data is as usual it depends
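The volume-growth contrast Michael is using can be written down directly (standard formulas for constant-curvature spaces, quoted here as background rather than from the book): Euclidean balls grow polynomially in the radius while hyperbolic balls grow exponentially, which is why trees and scale-free graphs embed so much more comfortably in hyperbolic space.

```latex
% Euclidean space of dimension d: polynomial growth in the radius r
\operatorname{vol}\!\big(B^{\mathbb{R}^d}_r\big)
\;=\; \frac{\pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2}+1\right)}\, r^{d}
\;\sim\; r^{d}

% hyperbolic space of dimension d (curvature -1): exponential growth in r
\operatorname{vol}\!\big(B^{\mathbb{H}^d}_r\big)
\;\propto\; \int_{0}^{r} \sinh^{\,d-1}(t)\, dt
\;\sim\; e^{(d-1)\,r} \quad \text{as } r \to \infty
```

A tree with branching factor b has on the order of b^k nodes within k hops, matching the exponential growth of hyperbolic balls rather than the polynomial growth of Euclidean ones, which is the crowding problem described above.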
you mentioned language petar and maybe to both of you do you think there is a geometry to language itself obviously we know about embedding spaces and close things somehow share meaning and so on but do you think it goes beyond that do you think there is an inherent geometry to language itself and to the meaning of what we want to transmit and how that relates to each other what you probably mentioned is the famous series of papers from facebook where unsupervised language translation can be done by geometric alignment of the latent spaces in my opinion it's not something that describes the geometry of the language it probably describes in a geometric way some semantics of the world even though we have linguistically speaking very different languages let's say english and chinese they describe the same reality the same world where humans act so it is probably reasonable to assume that the concepts they describe are similar and there are also some theories in linguistics about certain universal structures in languages that are shared even though the specifics are different i think it's interesting to look maybe at non-human communications i wouldn't use the term language because it's a little bit loaded and probably some purists would be shocked by me saying that for example whales have a language but we are studying the communication of sperm whales this is a big international collaboration called project ceti and i don't think that you can really model the concepts that whales need to describe and deal with in the same way as we humans do maybe a silly example in human languages and probably it applies to every language we can express the concept that something got wet i don't think that a whale would even understand what it means to be wet because the whale always lives in water i would add to that maybe a slightly different view of geometry it's all about the question of how far you are willing to go and still call it geometry based on our proto book at least i tend to think of graph structures also as a form of geometry even though it's a bit more abstract and within language people might not always agree what this structure is like but i think we can be fairly certain that there are explicit links between individual words as in when you use them in different forms syntax trees or just one word preceding another and so on and so forth and while we may not necessarily be able to easily say what the geometric significance of one word is what we can look at is the local geometry of the words that you tend to use around it and this kind of principle has been used all over the place and has then been extended to graph structured observations generally with models like deepwalk and node2vec basically the same idea treat a node's representation as everything that's around it the reason why i think that analyzing this local topology of how words are used with each other is very powerful has been reinforced for me recently precisely because i've been delving into category theory because in category theory your nodes are basically atoms you're not allowed to look inside them you assume they're this indivisible unit of information and everything you can conclude about the atoms comes from the arrows between them so using this very simple concept with a few additional constraints like compositionality you can for example tell me what are all the elements of a set even though you've abstracted that set to a single point just by analyzing the arrows between it and all other sets
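A small sketch of the "a node is its surroundings" idea Petar mentions, in the spirit of DeepWalk/node2vec (the toy graph and walk parameters are assumptions for illustration): short random walks generate context windows for each node, which can then be fed to any skip-gram style embedding model.

```python
import random

# a toy undirected graph given as adjacency lists
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def random_walk(start, length, rng):
    """Uniform random walk; DeepWalk treats these walks as 'sentences' over nodes."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)
walks = [random_walk(node, length=5, rng=rng) for node in graph for _ in range(3)]

# context pairs (node, neighbour-in-walk) within a window, ready for a skip-gram model
window = 2
pairs = [(w[i], w[j])
         for w in walks
         for i in range(len(w))
         for j in range(max(0, i - window), min(len(w), i + window + 1))
         if i != j]
print(walks[0], pairs[:4])
```

The embedding of a node is then learned entirely from which other nodes co-occur in its walks, exactly as a word embedding is learned from the words that co-occur around it.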
analyzing the arrows between all sets you can tell me what are all the elements inside the set so thinking about this i i do believe that it is possible to reason about geometric you know word to vect for example does this with the assumption that the structure of the are we approaching this at the right level though because people have said for quite a long time that there's a difference between syntax and semantics and you could look at the geometrical structure of spoken language or for example you could look at the topology of the connections in your brain the topology of of connect you know reference frames in your brain is how you actually have learned concepts would looking at the topology of spoken language tell you enough about abstract categories that's a good question um i think that if that kind of information is necessary like if the atoms by themselves won't tell you everything one thing that we actually very commonly do in graph representation learning is assume this sort of hierarchical approach where you have like the ground level with your actual individual notes and then you come up with some kind of additional hierarchy that tells you either something about intermediate topologies in a graph or intermediate structures that you care about in this graph or any abstract concepts you might have extracted and then there's additional links being drawn between these to kind of reinforce the the knowledge that the graph net can capture so i think if you have knowledge of some abstract concepts that are relevant for your particular task you can attach them as additional pieces of information to this topology of course the more exciting part is could we maybe discover them automatically but that is something that i don't think is potentially in scope for this question when uh human interpreters need to translate from one one language to another they often need to deal with uh different sentence structures i think turkish is actually an extreme example where the order of words is completely reversed it implies that you need probably to hold well in computer science terms some kind of a buffer in your brain uh before you can make the translation to another language so it definitely imposes certain biological network structure in the brain another interesting observation that i read somewhere about the the way that people remember certain facts when they speak a certain language so and the particular example that was given is a person can remember a perpetrator of a crime and then give testimony in court and the reason is that in some languages it is more common to use impersonal pronouns and the personal phrases so you can say for example the object was broken and in some languages you would say that somebody broke the object so it appeared that languages were of these more impersonal constructions people speaking these languages i have her time to remember the perpetrator so the the language probably imposes a lot about the the way that we perceive world but it is probably not studied sufficiently but that there may be some fuzzy graph isomorphisms though between the languages it i think there's something really magic about graphs i think that's what we're getting into because your your lecture series inspired me actually uh professor bronstein where you were talking about all the different applications of graphs but um something that a lot of our guests talk about are knowledge graphs right and expert systems and the knowledge acquisition bottleneck were the cause of the abject 
failure of good old-fashioned ai or symbolic ai systems in the 1980s and many hybrid or neurosymbolic folks today are still arguing that we need to have a discrete knowledge graph um either human designed or learned or evolved or emerged or some combination of those things i just said depending on who you talk to now critically many gofai people think that most knowledge we have is acquired through reasoning not learning right which is really really interesting so by reasoning i mean extrapolating new knowledge from existing knowledge um it feels like graph neural networks could at least be part of the solution here and in your lecture series you mentioned the work by cranmer which was explainable gnns where they use some kind of symbolic regression to get a symbolic model from a graph neural network so do you think there's some really cool work we can do here there is a little bit of a divide in the graph learning literature so people working on graph neural networks and working on knowledge graphs even though at least in principle the methods are similar for example you typically do some form of embedding of the nodes of the graph somehow these are distinct communities probably historically they evolved in different fields yeah so the paper of cranmer this is really interesting because they use graphs to model physical systems for example the n-body problem when they have particles that interact you can describe these interactions as a graph and you can use standard generic message passing functions to model the interactions now the step forward that they do is they replace these generic message passing functions with symbolic equations and not only does this allow it to generalize better but you also have an interpretable system you can recover from your data the laws of motion right and if you think of how much time it took historically for people like johannes kepler for example he spent his entire life on analyzing astronomical observations to derive a law that now bears his name that describes the elliptic orbits of planets nowadays with these methods you can probably do it in a matter of seconds or maybe minutes i think the point that particularly caught my attention in what you asked tim was this interplay between graphs and reasoning and extrapolation how that supports knowledge now when it comes to how critical is this going to be it depends on the environment in which you put your agent like is it a closed environment or is it an open-ended environment where new information and new knowledge can come in in principle at any time this like basically do you want to build a neural scientist or do you just want to build a neural exploiter that like takes all the information available right now and then draws conclusions based on that so if the system is closed world you'll probably be able to get away without like very explicit reasoning especially if you have tons of data because we've seen time and time again that large scale models can kind of pick up on these regularities if they've seen it often enough but if i give you a solution that involves stacking for example n objects and now i ask you to do the same kind of reasoning with two times n objects the way in which we optimize neural networks at least today is typically going to completely fall on its back when you do something like this so if you truly want to take whatever regularities you have come across in the world of the training data and hope to at least reasonably gracefully apply
them to new kinds of rules that that come in the future um then you probably want your model to extrapolate to a certain extent and that for this at least my ongoing algorithmic reasoning research algorithms are a very natural area to study under this lens because they trivially extrapolate you write an algorithm that does a particular thing on a set of n nodes you can be you can usually mathematically prove that it's going to do the same thing equally properly maybe a bit more slowly if you give it two times n nodes right this kind of guarantee typically doesn't come that easily with neural networks and we found that you have to very carefully massage the way you train them the kinds of data you feed to them the kinds of inductive biases you feed into them in order to get them to do something like this so if extrapolation is something you truly need then you know i think for artificial general intelligence we're going to want to have at least some degree of extrapolation as new information will become available to our neural scientists just as you follow the era of time um basically for for doing something like this graph neural networks have arisen as a very attractive primitive because there's been a few really exciting theoretical results coming out in recent years saying that the operations of a graph neural network align really really well with dynamic programming algorithms and dynamic programming is a very standard computational primitive using which you can express most polynomial time heuristics so essentially that's a really good you know that's a really good piece of mind result the unfortunate side of it is that it's a best-case result right so you can set the weights of a neural network of a graph neural network to mimic a dynamic programming algorithm more efficiently or with smaller sample complexity but you know there's still a big problem of how do i learn it in a way that it still works when i double the size of my input and that is in a way what algorithmic reasoning has been largely about like we're trying to make that happen it's not easy this if you throw the vanilla graph neural network and just input output pairs of an algorithm it will learn to fit them in distribution the moment you give like ask it to sort in a rate it's twice as big it's going to completely collapse so this is the number one thing that the neurosymbolic people say they say neural networks they don't extrapolate they only interpolate you know it just it it's it's a continuous geometric model learns point by point transforms the data onto some continuous smooth learnable manifold you interpolate between the data points you want to have a smooth you want to have a dense sampling of your data but you're talking about dynamic programming problems these are discrete problems that the structure is discontinuous like how could you possibly learn that within your network well the dynamic programming algorithm could be could have a discontinuous component to it for example if you're searching for shortest paths at some point you will take an arg max over all of your neighbors computed distances and use that to decide what the path is but before you come to the arg max part there is usually some fairly smooth function being computed actually so in the case of shortest path computations you know bellman ford or something like this you say something very simple like i have a value d of s in every single one of my nodes which is initially infinity everywhere and 0 in the source vertex and then at every point 
i say the distance of my particular node is the minimum of all the distances of my neighbors plus the edge weight right and this kind of function is generally more graceful than than taking an argmax and you can also think of for example if you have to compute expected values or something like this using dynamic programming that's also one example where actually summing is what you need to do across all of your neighbors or something like this so yeah it is true that like across individual steps you may be doing like discrete optimization steps but usually it's propelled by some kind of continuous computation under the hood so that's the part that the graph neural network actually simulates and then the part which does the arg max would be some kind of classifier that you stitch on top of that so in principle it's not yeah it's not too challenging to massage it into a neural network framework so one of the i think one of you mentioned this before brought up transformers and um you know in recent years we've had i think about 10 different papers saying transformers are something there is transformers or rnns transformers are hopfield networks and and and also transformers are or graph neural networks or compute some kind of graph neural networks can you maybe speak a bit to that or transformers specifically graph neural networks or are they just so general that you can also formulate a graph problem in terms of a of a transformer okay that's that's a very good question um i would start off by saying like i don't want to start this discussion just by saying yes transformers or graphic neural networks this is why end of story because i feel like you know that doesn't touch upon the whole picture so let's let's look at this from a natural language processing angle which is how most people have come to know about transformers so imagine that you have a task which is specified on a sentence and you want to exploit the fact that words in a sentence interact right it's not just a bag of words there is there's some interesting structure inside this bunch of words that you might want to exploit when we were using recurrent neural networks we assumed that the structure between the words was a line graph so basically every word is preceding another word and so on and so forth and you kind of just linearly process them with the model like lstm or something but you know basically line graphs as we know are not the way language is actually structured there can be super long range interactions inside language so subjects and objects in the same sentence could appear miles away from each other so using the line graph is not the most optimal way of getting that information in the fastest possible in the fastest possible way so okay there's clearly some kind of non-trivial graph structure what is it well it turns out that people cannot really agree what this optimal graph structure is and it may well be task dependent actually so just consider syntax trees for example like there's not always a unique way of decomposing a sentence into a syntax tree and the exact kind of tree you might wish to use to represent the sentence may be different depending on what is the actual thing that you're solving so okay we have a situation where we know that there's some connectivity between the words but we don't know what that connectivity is so in graph representation learning what we typically do when we don't know the graph as long as the number of objects is not huge is to assume a complete graph and let the graph 
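(to make the min-over-neighbours update just described concrete, here is a minimal bellman-ford sketch in python; the toy graph and its edge weights are made up for illustration, and this is the plain classical algorithm, not the graph-neural-network analogue being discussed)

```python
import math

# toy weighted directed graph: edges[u] = [(v, weight), ...] (made-up example)
edges = {
    "s": [("a", 2.0), ("b", 5.0)],
    "a": [("b", 1.0), ("t", 6.0)],
    "b": [("t", 1.5)],
    "t": [],
}

def bellman_ford(edges, source):
    # d[v] starts at infinity everywhere and 0 at the source, as in the discussion
    d = {v: math.inf for v in edges}
    d[source] = 0.0
    for _ in range(len(edges) - 1):          # |V| - 1 relaxation rounds suffice
        for u, neighbours in edges.items():
            for v, w in neighbours:
                # the "distance is the min over neighbours' distances plus the
                # edge weight" update; this is the smooth-ish aggregation a GNN
                # can imitate, before any discrete argmax readout on top
                d[v] = min(d[v], d[u] + w)
    return d

print(bellman_ford(edges, "s"))   # {'s': 0.0, 'a': 2.0, 'b': 3.0, 't': 4.5}
```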
neural network figure out by itself what the important connections are and if i now stitch an attentional message passing mechanism onto this graph neural network i have effectively rederived the transformer model equation without ever like using this specific transformer lingo so from this kind of angle the fact that it's a model that operates over a complete graph of individual words in a way that you know once you've put all the embeddings to them is permutation equivariant this describes the central equations of self-attention that the transformer uses the part which i think causes a bit of a divide here is the fact that transformers like the model that was originally presented are not just the equations of a transformer they're also the positional embeddings of a transformer and that's the part that sort of gives it a bit more of a sentential structure well actually if you look at these sine and cosine waves that get attached to the individual words in an input to a transformer you will see that you can actually derive a pretty good connection between them and the discrete fourier transform whose basis vectors actually turn out to be the eigenvectors of the graph laplacian of a line graph so essentially these positional embeddings are hinting to the model that these words in a sentence are arranged in a particular way and you can use that information but because it's fed in as features the model doesn't have to use any of that information like sometimes bag of words is the right thing to do for example right so essentially the transformer has a bit of a light hint that there's a sentential structure in there in the form of a line graph but you know the model itself is a permutation equivariant model over a complete graph and from our lens of geometric deep learning it is effectively a special case of an attentional gnn now i think this positional embedding aspect is a super important one and it could hint to how we might extend these transformers from sentences to more general structures and i think michael you might have a lot more thoughts on that than i do so maybe you can say a bit about that yeah so positional encoding has been done for graphs as well as petar mentioned in the case of a graph we can straightforwardly generalize these sine or cosine positional coordinates that are used in transformers using the eigenvectors of the laplacian there are other techniques you can actually show that you can make the message passing type neural network strictly more powerful than traditional message passing which is equivalent to the weisfeiler-lehman graph isomorphism test by using a special kind of structure aware positional encoding for example if you can count substructures of the graph such as cycles or rings and so on and this way you have a message passing algorithm that is specialized to the particular position in the graph and can for example detect structures that traditional message passing cannot detect so it is at least not less powerful than the weisfeiler-lehman algorithm and we can actually show examples on which the weisfeiler-lehman algorithm or traditional message passing fails whereas this kind of approach succeeds so the thing i've always wondered about transformer networks are the position tokens i really don't like them and i want them to go away because it feels very impure doesn't it i think what we really want to learn is some kind of higher order structure in the language and it kind of felt like we were using the position tokens to cheat a
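(a small numpy sketch of the connection just mentioned: the eigenvectors of the laplacian of a line / path graph are, up to sign, discrete cosine modes, which is the flavour of sinusoid used in transformer positional encodings; strictly speaking the path graph gives the discrete cosine basis while a circular graph gives the complex fourier basis, and the sentence length n and mode index k below are arbitrary choices for illustration)

```python
import numpy as np

n = 8                                   # number of "words" in the sentence
A = np.zeros((n, n))
for i in range(n - 1):                  # path (line) graph: word i -- word i+1
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(1)) - A               # combinatorial graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)    # eigenvectors, sorted by eigenvalue

# the k-th Laplacian eigenvector of a path graph is (up to sign) the
# discrete cosine mode cos(pi * k * (i + 0.5) / n)
k = 2
dct_mode = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
dct_mode /= np.linalg.norm(dct_mode)

print(np.round(eigvecs[:, k], 3))
print(np.round(dct_mode, 3))            # matches the eigenvector up to a global sign
```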
little bit so what i'm trying to get across here is that i think the position of a token in an utterance should be invariant i mean clearly in different languages the tokens are in different places in turkish the order is completely reversed and i would like to think that our internal language representation ignores the transmission arrangement right given the particular language and the constraint that we only communicate sequential streams of words however i do appreciate what michael is saying above that the position encodings can actually encode more powerful structures like cycles and rings the key question is do we actually need to have these structures in natural language i don't agree that you want to get rid of them so positional encoding it's a kind of combination of two worlds right so if you consider a graph then you're completely agnostic to the ordering of the nodes this is one of the really key characteristics of graphs and sets more generally that you don't have the ordering of the nodes the situations and the problems where transformers are applied you actually do have an order but you use the graph as a better described to model different long-distance relations between different uh tokens or words in a sentence so you want to incorporate this prior knowledge that these nodes are not in arbitrary order that they have some sentence order and this principle applied more generally you can use positional encoding to tell message passing not to apply exactly the same function everywhere on the graph but to make it specialized for different portions of the graphs or at least make it a possibility and then the the training will decide whether to use this information or not or in which way it's strange because i don't know whether the order is just a function of the communication medium so we transmit the tokens in a sequence and could we then represent them in our brains in a completely different domain where the sequence is no longer relevant well actually a lot of neuroscientists think that our brain is a prediction machine and it's a sequence prediction machine so the sequence is kind of fundamentally important yeah it's and it's also a function of the specific language that you use and as we discussed before there are languages which convey the same meaning with a different sentence structure i want to get a little bit into what you said about essentially what we're doing with these positional encodings is we hint we hint to the model that there is something here which is a big break from sort of the old approach let's say of an lstm to say this is the structure right um so with with the the world of geometric deep learning i often have the feeling people talk about they talk about you know symmetries and we need to exploit these symmetries that are present in the world and there's almost to me two different groups of these symmetries so one group is maybe you would call them like exact symmetries or something like this when i think about alpha fold and i think about like a protein it doesn't i don't care which side is up right like the protein is the same the same protein and there's no reason to prefer any direction over any other direction however if i think of like because people have made this argument for cnns for example they say well a cnn is a good architecture because it's translation and variant right and essentially if we want to do object recognition or image classification translation invariance is like a given but but it's not right it's not a given the 
pictures that we feed to these algorithms most often the object is in the middle most often you know it's kind of upright like the sky is on top and the floor yes i can hold my camera like this but i don't right so it it seems to be it seems to be in many cases better to not put the symmetries in there until we're like really really really really sure that these are actual symmetries because with more data it seems the model that does not have the prior inside becomes better than the model that does have the prior inside if that prior doesn't exactly match the world is do you think that's a fair characterization of for example why transformers with large data all of a sudden beat classic cnn models or come close to them i think it's it's always this question of the trade-off between how much your your model and uh how much you're learning i remember when i was a student there was this maxim that that machine learning is always the second best solution and maybe nowadays with deep learning showing some remarkable set of successes i'm probably less confident in this statement but it's probably still quite true that the more you know about your problem the better chances that that machine learning will will work for it to me it makes sense uh to model as much as possible and learn what is hard or impossible to to model and in practice of course there is a spectrum of possibilities of how much of these prior assumptions are hardwired into the architecture and usually it's a trade of between the for example computational complexity availability of the data also hardware friendliness and if you think of what happened in computer vision it's probably a good illustration that that convolutional networks have translational environs for example as you mentioned but in many problems you might benefit from other symmetries such as rotations again depends on the application but imagine that you want to recognize i don't know traffic signs when you can also tilt your car and you may ask why in these applications as well cnns um are still so popular and probably one of the reasons is that they map very well to the single instruction multiple data type of hardware architectures that gpus offer and you can compensate for explicitly not accounting for rotational symmetry with data augmentation and more complex architectures and larger training sets and this is exactly what happened a decade ago with this convergence of three trends the availability of compute power right the gpus algorithms that map well to these computational architectures and these happen to be convolutional networks and also very large uh data sets that you can train these architectures on such as imagenet so many of the the choices that become popular in the literature are maybe not necessarily theoretically the best ones so i think in in hardware design there is this phenomenon that is called the hardware lottery when it's not necessarily the best algorithm and the best hardware that solves the problem it's just some lucky coincidence and their happy marriage that makes them successful we keep raising this point about how transformers can be seen in special case of attentional gnns but you know the status quo is that people will use transformers for very many tasks nowadays and you know there may well be a good argument for considering them in this completely separate light and one possible you know explanation or justification for this is the hardware lottery because yes sure transformers perform permutation equivalent operations over 
a complete graph but they do so in a way that is very very highly amenable to the kind of matrix multiplication routines that we can support very efficiently on gpus nowadays so basically they can be seen as the graph neural network that has won the current hardware lottery even in cases where maybe it will make more sense to consider a more constrained graph especially if you have low data environments or something like this the the potential you know overheads of running a full graph neural network solution with message passing which given its extremely sparse nature doesn't align that well with gpus and tpus nowadays sometimes just using a complete graph neural network is the more economical option when you take all factors into account and that's and also the fact that they use an attention mechanism which is kind of a middle ground between a simple diffusion process on a graph which would just kind of average things together based on the topology and the full-on message passing where you actually compute a full-on vector message to be sent across the edges like it strikes a nice balance of scalability and still being able to represent a lot of functions of interest especially when your inputs are just word tokens right so like you know in a way it's a gnn that strikes a very nice sweet spot and that's probably the reason why it's it's become so it's become so popular in current times now of course there is a chance that hardware and there's actually a pretty high probability that hardware will catch up to the trends in graph representation learning and we will start to see a bit more graph oriented hardware but at least for the time being yeah there's a bit of a combination of what's theoretically making the most sense for your problem and what the hardware that you have right now will support the most easily yeah there is a british startup i think they they have reached recently a unicorn status called graphcore and you can already hear from the name of the company that there is a graph inside they try to develop hardware that goes beyond the traditional paragraphs yeah i'm really interested in the hardware lottery we had we had sarah hooker on massive shout out to sarah and yannick made a video on the hardware lottery paper as well um i mean i'd push back a little bit i think there's also a bit of an optimization and an algorithmic lottery going on i think there's something very very interesting about stochastic gradient descent and the kind of data that we're working with but but this actually gets to my next question which is about why exactly geometry is a good prior and how principled it is so it seems like these geometric prizes are principled and they have utility because they are low-level primitives right they're ubiquitous and natural data but why exactly is it a principled approach to start with things we know which is to say geometric primitives and to work upwards from there you know what would it look like if we went top down instead and what makes a good prior i mean one way to think about it is the actual function space um that you're searching through you know this hypothesis space it's not just about being able to find the function easily in that space or the simplicity of the function you find charlay would say it's the information conversion ratio of that function so you know can you use this function that you found to convert a very small piece of information and experience space into new knowledge or a new ability but how do you find these these functions one of 
the points that we try to make in the book is the separation between the domain and the group that you assume on the domain the symmetry group so um you might have the same domain like two-dimensional grid or or two-dimensional plane and for example the translation group or the group of rotations and translations or the group of rigid motions that also include reflections so these are completely separate notions and which one to choose depends on the problem the choice of the domain really comes from the structure of your data so if your data comes as an image then of course you use a grid to represent it as the choice of the domain now which symmetry group to use is a more subtle point and it really depends on what you are trying to achieve you can think of for example uh traffic sign recognition when the car drives on the road usually the the the signs will have certain orientation it's very unlikely that you will see it upside down so the really the only in variance or the only kind of symmetry you have is translation so cnn's in this case would work perfectly well so if for example you have histopathological samples so you have a slice of tissue that you need to put under the microscope so you can naturally flip the glass you don't know how it is oriented so reflections are also initial transformation so in the traffic size of course this is not physical unless you see your sign in the back mirror but in this histopathology example it is a natural transformation so the choice of the symmetry group and what makes a good geometric prior is really dictated by the specific problem it's it's often very hard though to actually choose because we often don't really know right coming back to what i said before if we if we actually hit the group correctly and in fact um we've we've sort of seen the the more successful approach and this might be hardware specific but but it seems the more successful approach is often to actually make data augmentations with respect to what we assume are symmetries so to you know i i think of all the the color distortions that we do to images we rotate them a bit we we rescale them and so on it would be definitely possible to build architectures that are just invariant to those things however it seems to be a more successful approach in practice to put this all into the into the data augmentations how do you what's your take on on that how do we choose between putting prior knowledge into augmentations versus putting prior knowledge into the architecture it's not a binary choice it's not either your model or your augment with data one of the key principles that we also emphasize in the book is that of course this uh perfect environs or equivalence is a wishful thinking in many cases you you want to get the property that we call geometric stability you have some transformation that is approximately a group or imagine that you have a video where two objects are moving let's say one car moves left and another car moves right so there is no global translation that describes the relation between the two frames in this video the geometric stability principle tells you that if you are close enough to to an element of the group if you can describe this transformation as an approximate translation then you will be approximately in right or approximately equivalent and this is actually what happens in cnn so this was shown by johan and they use this motivation to explain why convolutional and neural networks are so powerful so roughly speaking if i don't have a translation 
but for example if i have an mnist digit and you have different styles of the digits you can think of them as warpings of some canonical digit so in this case even though it's not described as a translation and the neural network will not be invariant or equivariant to this kind of transformation it will be stable under these transformations and that's why data augmentation works in some sense that you're extending your invariance or equivariance class to approximate invariance and taco cohen also had a few interesting things to say about data augmentations versus building the inductive priors into the model taco is your preference towards i mean i assume it is towards creating inductive priors in the architecture around geometry instead of data augmentation oh that's a good question um i think it depends on the problem like in some cases you don't have a choice so graphs are a great example the group of permutations is n factorial elements if n is large you have a thousand nodes you're never ever going to be able to exhaustively sample that group and so it's better to just build it in and that's also why there's no graph neural net that doesn't respect the symmetry nobody's suggesting you should do that by data augmentation in some other cases it is you know somewhat possible to sample a reasonably dense grid of transformations of your group and indeed you know augmentation is turning out to be very important in unsupervised learning and self-supervised learning techniques so i actually look very positively towards that i don't think it's wrong to put in this knowledge using data augmentation but in some cases like let's say when you're classifying medical data like cells in a dish histopathology images or something you just know for sure there's translation and rotation symmetries the cells don't have a natural orientation and in those cases i do think it makes sense to build it into the architecture for the simple reason that if you build it into the architecture you're guaranteed that the network will be equivariant not just at the training data but also at the test data so you never have that the network would make the correct classification for your test data when you have it in one orientation but when you rotate it it suddenly does something different which can happen even when you present your training images in all possible orientations so i think for that reason equivariance does tend to work better in those cases where there's an exact symmetry in the data we have actually demonstrated that empirically where for example in a medical imaging problem of detecting lung nodules in three-dimensional ct scans we started off with a convolutional network with a data augmentation pipeline that was completely tuned it was the state of the art method at the time and we simply replaced all the convolutions by group convolutions that respect the rotational symmetries as well as the translations and we get a very significant improvement in performance so that goes to show that in practice often data augmentation can't get you all the way so for the cases where there's an exact symmetry i think or where the group of symmetries is very large i think building it into the network is the way to go for the foreseeable future but there are many cases where augmentation is also well where augmentation is the way to go that's that's here i'm really
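(as a toy illustration of the gap between hoping for invariance via augmentation and getting it by construction, here is a minimal numpy sketch that makes any classifier exactly invariant to 90-degree rotations by averaging its predictions over the c4 group at test time; the dummy_classifier is a made-up stand-in, and this orbit averaging is a much cruder device than the group convolutions taco describes, which build equivariance into every layer)

```python
import numpy as np

def dummy_classifier(image):
    """Stand-in for a trained network: returns fake class scores.
    (Deliberately not rotation invariant: it looks at the top-left corner.)"""
    return np.array([image[:2, :2].sum(), image.mean(), image.std()])

def c4_invariant(classifier, image):
    """Average predictions over the four 90-degree rotations.
    The result is exactly invariant under C4 by construction, whereas
    training-time augmentation only encourages approximate invariance."""
    scores = [classifier(np.rot90(image, k)) for k in range(4)]
    return np.mean(scores, axis=0)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
print(c4_invariant(dummy_classifier, img))
print(c4_invariant(dummy_classifier, np.rot90(img)))   # identical output
```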
interested in this notion that could these symmetries actually be harmful i mean i know zhuan for example spoke about the three sources of error of machine learning models and one of them is the approximation error and normally when we get signals they come to us in a contrived form don't know they they might uh be projected onto a planar manifold and um you know the sky is always up for example does it really help us having these geometric uh symmetries as primitives in the model yeah that's a that's a good question uh the the simple answer is if your problem doesn't have the exact symmetry then at least in the limit of having infinite data uh building in equivariance is going to be harmful and it's better to to learn the true structure of of the data this approximate symmetry which you should be able to pick up from the from the data alone um so that that's one thing that that still means in a low data regime uh it can be very useful even when the symmetries is approximate um but i would also say that sometimes in machine learning the we have a tendency to put too much faith into the evaluation metrics and data sets so we say we want to solve computer vision and what we mean is we want to get a high score on imagenet and certainly it's true in imagenet the images tend to appear in upright position and then and they are photographed by by humans so the key objects are kind of in the center most of the time etc so these are biases that you could exploit and you might stop yourself from exploiting them uh if you build in the the the symmetry but that that's only a problem if you if you sort of uh put on your blinders and you say imagenet uh accuracy is the only thing that counts right you might very well think if you want to to build a very general uh vision engine it is useful that it still works if suddenly uh the robot falls over and has to look at the world upside down uh so so the symmetry can still be there in principle even if it's not there in practice in your your data set um and you know then there's uh there's maybe a robustness versus computational efficiency trade-off so yes maybe you're willing to acknowledge if your robot falls over you still want it to work so you want that rotation equivalence but then again if we don't have to process the images upside down in the 99 of cases where the images are upright uh we gain some computational efficiency so there's a trade-off there uh and uh yeah there i don't think there's a right or wrong answer uh it's it's it's something you have to look at on a case-by-case basis when i put this to professor bronstein as well he also said that you folks were looking at trying to remember how he described it i think he said there was a kind of representational stability which allowed for approximate symmetries so it's not necessarily that you're going for these precise symmetries you're actually looking for a little um sort of margin of robustness going to moving into approximate symmetries that you might not have explicitly captured i i agree i think that's also a very important uh yeah philosophy and approach to take the group think of it as somehow embedded in a larger group like most of the geometrical symmetries that we think about are somehow a subgroup of diffiomorphisms uh and then if you if you say i don't want invariants with or equivariance but some kind of stability or smoothness to elements in the group plus small diffiomorphisms for instance uh you uh yeah you you might get some of the generalization benefit without unduly uh 
limiting the capacity of your of your model is there a hope though that we can we can get this approx because i see i see your point i think is that or one of one of the points is i think is that if we program like a symmetry into the architecture it will be rather fixed right we make an architecture translation invariant it's it's going to be fully translation invariant are there good ways to bring approximate invariances into the space of the architectures that we work with yeah i mean i have a very quick answer and maybe not too satisfying but one very simple one if you think of neural network blocks as like components that implement different symmetries and then you think of like a calculus of these blocks as you know building your deep learning architecture one very simple representational tool that we can use to allow the model to use the symmetry but also not use the symmetry is the skip connection so essentially you could have a model that processes for example your graph data using a particular connectivity structure that you want to be invariant to and you can also use say a transformer that processes things in a completely permutation invariant way over the complete graph and you can just shortcut the results of one model over the other model if you want to give also the model the choice to ignore the previous one so maybe not a very you know a detailed and satisfying answer but that's one simple way in which we could do something like this and in some of our more recent algorithmic reasoning blueprint papers we do exactly this because one very important thing that we're trying to solve is apply classical algorithms to problems that would need them but the data is super rich and it's really hard to you know massage it into the abstractified form that the algorithm needs for example you want to find shortest paths in a road network a real world road network you cannot just take all the complexity of changing weather conditions changing diffusion patterns on the on the roads and the roadblocks and all these kinds of things and turn that into this obstructified graph with exactly one scalar for each edge so you can apply dijkstra or something like this like it's just not feasible without losing a ton of information so what we're doing here is we make this high-dimensional neural network component that simulates the effects of dijkstra but we're also mindful of the fact that to compute say the expected travel time there's more factors at play than just the output of a shortest path algorithm right there could well also be some flow related elements maybe just some elements related to the current time of day human psychology what not right so we start off by assuming the algorithm does not give the complete picture in this high dimensional noisy world so we always as default as part of our architecture incorporate a skip connection from just you know a raw neural network encoder over the algorithm so in case there's any model free information that you want to extract without looking at what the algorithm tells you you can do that so maybe i don't know yannick if that answers your question about approximate symmetries but that's that's the kind of the vibe i got when i heard the question i mean that's a very that's very practical answer for sure that you know you can actually get out there it even opens the possibility to you know having maybe multiple kinds of skip connections and what not you know having dividing up your symmetries into individual blocks that are run in parallel 
maybe um but that brings me to maybe another thing we've talked about you know there are symmetries you want to incorporate them into your problem and so on and we've also talked about the symbolic regression beforehand to maybe parse out symmetries of the underlying problem what are the current best approaches if we don't know the symmetries so we have a bunch of data we suspect there must be some kind of symmetries at play because there usually are in the world right and if we knew them we could describe our problems in very compact forms and solve them very efficiently but we often don't know so what is the current state of the art when i don't know the symmetries how do i discover what group structure is at play in a particular problem i don't think that there is a single approach that solves this problem in a satisfactory manner and one of the reasons why is because the problem is ambiguous so maybe an example think of objects that mostly translate horizontally but you also have a little bit of vertical translation what is the right symmetry structure to model is it a one-dimensional translation group or a two-dimensional translation group do we want to absorb the slight vertical motions as noise and deal with it as a data augmentation or do you want to describe it in the structure of the group that you discover so there is no single answer so you cannot say that one is correct and another one is wrong yeah i think and this was kind of where i was going with the question of how principled are the symmetries and the symmetries seem to be hierarchical just in the same way that geometries are hierarchical you were saying that for example projective geometry kind of subsumes euclidean geometry but um i had a little thought experiment so imagine i gave you a large data set produced by a recursive fractal pattern now nature is full of fractals trees rivers coastlines mountains clouds seashells and even hurricanes so let's say i didn't tell you the simple rule which produced this pattern now what kind of regularities would you look for in the model that you built i mean it seems obvious that there would be an expanding scale symmetry which might resemble the original rule but it feels like there'd be plenty of other emergent abstract symmetries which are not obviously related to the simple rule which produced the pattern i mean yannic was just saying when you look at computer vision you see a kind of regularity or invariance to color shifts for example so are fractals a good analogy for physical reality and should we be looking for the low-level primitive regularities which i think you're advocating for or should we be looking at more abstract emergent symmetries which appear in the 90s there was a famous paper by michael barnsley on fractal coding and they claimed really unbelievable compression ratios for natural images and the way it worked was to try to reassemble the image from parts of itself and possibly of course you can take parts and subject them to some geometric transformation so roughly if you have a patch of pixels you can approximate it as another patch taken from somewhere else in the image that you translate rotate and scale and then the image was represented as an operator that makes such a decomposition and this operator was constructed in a special way to be contractive and then they used the banach fixed point theorem that you can apply this operator to any image so you can start with noise for example a completely random
image and you have the target image emerge after a few iterations so that this iterative scheme will converge to the fixed point of the operator which is the image itself and it was actually used in industry for example in the microsoft encarta encyclopedia i don't know how many viewers are old enough to remember it but the main issue was the difficulty of building such operators the compression was very asymmetric it was very easy to decode you just take any image and apply this operator multiple times but it was really very hard to encode it in fact some of these constructions that showed remarkable compression ratios were constructed semi by hand i should say that in more recent times in computer vision for example the group of michal irani from the weizmann institute in israel used similar ideas for super resolution and image denoising where you can build the clean or higher resolution image from bits and pieces of the image itself so it's single image denoising or super resolution and what you do is you try to use similarities across different positions and scales that's absolutely fascinating i mean i spend a lot of time thinking about this because you know my intuition is that deep learning works quite well because of these strict structural limitations of the data which is produced by our physical world right and would you say that physical reality is highly dimensional or not um if it's highly dimensional is it because it emerged from a simple set of rules or relations like we were just talking about because i think what you're arguing for is that it could be collapsible in some sense probably the term dimension is a bit frivolously used here but i would say that it's probably fair to say that at some scale many physical systems can be described with a small number of degrees of freedom parameters that capture the system and as we are talking i'm sitting in a room i'm surrounded by probably a quadrillion gas molecules in the air that fly through the room and collide with each other and the walls of the room so at the microscopic level the dimension is very high so it's absolutely intractable if we were to model each molecule and how it collides i would have a huge number of degrees of freedom and yet if we zoom out we can model this system statistically and that's exactly the main idea of thermodynamics and statistical mechanics and this macroscopic system is surprisingly simple it can be described by just a few parameters such as temperature and the example of fractals that you brought up before essentially shows that you can create very complex patterns with very simple rules that are applied locally in a repeated way this might be a question for you petar the geometric blueprint works brilliantly in the ideal world where we can compute all of the possible group actions but graph neural networks for example you know the permutation group is factorial in size which means we need to rely on heuristics like graph convolutions so um how much better would graph neural networks be if we could compute all of the permutations i mean are you happy with these heuristics in general so that is a very good question and yes so let's just start from stating the obvious if you want to explicitly express every possible operation that properly commutes with the graph structure and in that sense is a graph convolution you would not be able to represent that properly as a neural network operation because you have to store in principle a vector for every single
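(a tiny numpy sketch of the fixed-point mechanism behind this kind of fractal coding: a contractive operator applied repeatedly to an arbitrary starting point converges to its unique fixed point; here the "image" is just a 16-dimensional vector and the operator a made-up affine contraction rather than a real collage operator)

```python
import numpy as np

rng = np.random.default_rng(0)

# a made-up contractive affine operator T(x) = A x + b with spectral norm < 1,
# standing in for the image-from-parts-of-itself operator in fractal coding
A = rng.standard_normal((16, 16))
A *= 0.5 / np.linalg.norm(A, 2)               # force spectral norm to 0.5
b = rng.standard_normal(16)

def T(x):
    return A @ x + b

x = rng.standard_normal(16)                   # "start with noise"
for _ in range(60):                           # iterate towards the fixed point
    x = T(x)

x_star = np.linalg.solve(np.eye(16) - A, b)   # exact fixed point of T
print(np.linalg.norm(x - x_star))             # ~0, by the Banach fixed point theorem
```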
element of the permutation group so unless your graph is super tiny that is just not going to work so on one hand this is a potentially annoying result on the other hand it is also exciting because we know that even though we ended up doing most of our graph neural network research in this very restricted regime of i'm going to define a permutation invariant function over my immediate neighbors and that will as a result translate into a permutation equivariant function over the whole graph even though most of our research has happened in that particular area we know from this result that there actually exists a huge wealth of very interesting architectures beyond that and i think one of the potentially earliest examples that demonstrated that there exists this wealth of space is actually one of joan bruna's earlier papers on the graph fourier transform that you know analyzing from a pure signal processing angle they have shown that you can represent basically every proper graph convolution as just parameterizing its eigenvalues with respect to the eigenvectors of the graph laplacian so but the big issue that kind of limits us from going further in this direction right now is the issue that michael highlighted of geometric stability so basically a lot of these additional graph neural networks that do something more interesting than one hop spatial message passing pay the price in being very geometrically unstable so the graph fourier transform in its most generic form will have every single node in the graph be updated based on whatever is located in any other node in the graph very conditional on the graph topology so if you imagine any kind of approximate symmetry any kind of perturbation either in the node features or the edge structure of the graph you basically don't have any protection against that like that error is going to immediately propagate everywhere and as a result you'll end up with a layer that theoretically works really well but in practice is very unstable to these kinds of numerical or inaccuracy issues one thing that's also very important to note is that often in graph neural networks we have this subtle assumption of we have the graph and we're using this graph that's given to us but who guarantees that the graph that's given to you is the correct one actually very often we estimate these graphs based on very weird heuristics ourselves so basically all of these kinds of perfection assumptions are what might limit the applicability of these kinds of layers but that being said i find it comforting that these layers exist which means that there are meaningful ways to push our research forward to potentially discover new you know wonderful basins of geometric stability inside these different layers that may not just do one hop message passing so that's my take on this like it gives me faith that there's more interesting things to be discovered that being said it is pretty tricky to find stable layers in that vast landscape um yeah michael do you have some thoughts on this as well i know you've worked quite a bit on these geometric stability aspects i just wanted to add one thought about it that essentially locality is a feature not a bug in many situations and in convolutional neural networks actually if you look again historically at the early architectures like alexnet they started with very large filters and there were relatively few layers i think something like five or six and nowadays what you
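(a short numpy sketch of the spectral construction mentioned above, in the spirit of bruna et al.'s graph fourier transform: filter a node signal by transforming into the laplacian eigenbasis, reweighting each frequency, and transforming back; the toy adjacency matrix and the exponential filter are arbitrary, and the global nature of the operation is exactly what makes it less geometrically stable than local message passing)

```python
import numpy as np

# small toy undirected graph (made-up adjacency matrix)
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A

eigvals, U = np.linalg.eigh(L)          # graph Fourier basis

def spectral_filter(x, g):
    """Graph convolution in the spectral domain:
    transform, reweight each frequency by g(lambda), transform back."""
    x_hat = U.T @ x                     # graph Fourier transform of the signal
    return U @ (g(eigvals) * x_hat)     # filter and inverse transform

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])                       # a signal on the nodes
smooth = spectral_filter(x, lambda lam: np.exp(-0.5 * lam))   # low-pass / diffusion filter
print(np.round(smooth, 3))
```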
see is very small filters and hundreds of layers one of the reasons why you can do it is because of compositionality properties so you can create complex features from simple primitives so in some other cases like manifolds there are deeper geometric considerations why you must be local related to what is called the injectivity radius of the manifold on graphs well maybe we lack a little bit the theoretical necessity to be local besides of course the computational complexity but in many cases it is actually a good property because many problems do not really depend on distant interactions so if you think of social networks probably most of the information comes from your immediate neighbors is there some sort of let's assume you know i have a graph and i have my symmetries um my groups that i suspect there are in the problem or i want to be invariant to is there like can you give us a bit of a practical blueprint of how i would build a network that you know takes this as an input and applies this how would you go about this you know what would be the building blocks that you choose the orders and so on are there overarching principles in how to do it i don't think that there is really a general recipe so it's problem dependent but maybe one example is applications in chemistry the basic structure that we have in a graph is permutation invariance this has to do with the structure of the graph itself it says nothing about the structure of the features you might also have some secondary symmetry structure in the feature space in the case of molecules for example you might have a combination of features some of them are geometric so it's actually not an abstract topological graph it's a geometric graph a molecule is a graph that lives in three-dimensional space and so the features are the positional coordinates of the nodes as well as some chemical properties such as atomic numbers now when you deal with the molecule you usually don't care about how it is positioned in space you want to be equivariant to rigid transformations and therefore you need to treat accordingly the geometric coordinates of the nodes of this graph and this is actually what has been successfully done so when you do for example virtual drug screening architectures that do message passing but in a way that is equivariant to these rigid transformations actually are more successful than generic graph neural networks also this principle was exploited in the recent version of alphafold where they i think call it invariant point attention which is essentially a form of a latent graph neural network or a transformer architecture with equivariant message passing yeah i'd just like to add one more point to this conversation which is maybe a bit more philosophical but it relates to you know this aspect of building the overarching symmetry discovering procedures which i think would be a really fantastic thing to have in general and you know i hope that some component of a true agi is going to be figuring out like making sense of the data you're receiving and figuring out what's the right kind of symmetry to bake into it i don't necessarily have a good opinion on what this model might look like but what i do say is just looking at the immediate utility of the geometric deep learning blueprint we are like i think very strictly saying that we don't want to use this blueprint to propose you know the one true architecture rather we make the argument that different
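(a minimal sketch of the point about molecules as geometric graphs: if messages are built only from rigid-motion-invariant scalars such as pairwise distances and atomic numbers, the resulting node features are automatically unchanged by rotations and translations of the coordinates; this is the simplest possible such construction, not alphafold's invariant point attention, and the coordinates, atomic numbers and message function below are all made up)

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.standard_normal((4, 3))       # made-up 3D coordinates of 4 atoms
z = np.array([6.0, 1.0, 1.0, 8.0])      # made-up atomic numbers (C, H, H, O)

def invariant_messages(pos, z):
    """Each node aggregates messages that depend only on pairwise distances
    and atomic numbers, so the output is unchanged by rigid motions."""
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)            # E(3)-invariant scalars
    msg = np.exp(-dist) * z[None, :]                # toy message function
    np.fill_diagonal(msg, 0.0)                      # no self-messages
    return msg.sum(axis=1)                          # sum over neighbours

h = invariant_messages(pos, z)

# apply a random rotation (or reflection) plus translation to the molecule
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # orthogonal matrix
h_moved = invariant_messages(pos @ q.T + rng.standard_normal(3), z)
print(np.allclose(h, h_moved))                      # True: features are invariant
```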
problems require different specifications and we provide a common language that will allow say someone who works primarily on grid data to speak with someone who works on manifold data without necessarily thinking that you know somebody might say convnets are the ultimate architecture someone else might say gnns are the ultimate architecture and in some ways they could both be right and they could both be wrong but this blueprint kind of just provides a clear delimiting aspect to these things just like in the 1800s you had all these different types of geometries that basically lived on completely different kinds of geometric objects right so hyperbolic elliptic and so on and so forth and what klein's erlangen program allowed us to do was among other things reason about all of these geometries using the same language of group invariants and symmetries right but in principle the specifics of whether you want to use a hyperbolic geometry or whether you want to use an elliptic geometry partly rests on your assumption of what domain you actually live in right when you're doing these computations so just generally speaking i think that having a divide is a potentially useful thing as long as you have a language that you can use to index into this divide if that makes sense it does make sense but i have a feeling that some people could benefit from geometric deep learning and they're unawares i mean because i don't want you guys to motivate geometric deep learning in general because i think you know a lot of deep learning with a structured prior is already geometric deep learning as you folks demonstrated in your blueprint you know like rnns and cnns for example so you know like it or not we're already all using geometric deep learning but some of the esoteric flavors of geometric deep learning particularly on irregular meshes they seem a little bit out there don't they i mean it's possible that many people could benefit from this but they just don't know about it yet i was thinking that for example if i had a lidar scanner on my phone and the result is a point cloud which is not particularly useful but i would presumably transform it into a mesh which would be more useful but is it possible that loads of data scientists out there are sitting on data sets that they could be thinking about geometrically but they're not manifolds are exotic it's probably in the eyes of the beholder and well in machine learning probably they are to some extent exotic but joking apart manifolds are a convenient model for all sorts of data and the data might be high dimensional but still have a low intrinsic dimensionality or can be explained by a small number of parameters or degrees of freedom and this is really the premise of non-linear dimensionality reduction and for example the reason why data visualization techniques such as t-sne and others work and maybe the key question as you're asking is how much of the structure of the continuous manifold you actually leverage and in the t-sne example the only structure that you really use is local distances so if you think of a point cloud of course you can deal with it as a set but if you assume that it comes from sampling of some continuous surface you can probably say more and this is for example what we tried to do in some of our works on geometric deep learning in applications in computer vision and graphics and meshes one way of thinking of them is as graphs on steroids where we have additional
As to the practical usefulness: computer vision and graphics are obviously the two fields where geometric deep learning on meshes is important, and just to give a recent example of a commercial success, there was a British startup called Ariel AI. It was founded by my colleague and friend Iasonas Kokkinos, I was also one of the investors, and we had a collaboration on 3D hand reconstruction using geometric neural networks. The company was acquired last year by Snap, and this technology is already part of Snap products, so you see it in the form of 3D avatars or the virtual and augmented reality applications that Snap has developed.

Professor Bronstein, I saw that you were doing some really cool stuff with higher-order simplicial complexes on graphs, and I was going to call out your recent work on diffusion operators and graph rewiring; there are so many cool things we can do to graphs to enable a lot of this analysis. But there was a question from my good friend Zak Jost, who is one of our staff members here, and he asks: what do you think is the most important problem to solve with message-passing graph neural networks, and what is the most promising path forward?

Probably we first need to agree about terminology. To me, message passing, and here I agree with Petar, is just a very generic mechanism for propagating information on a graph. Now, traditional message passing as used in graph neural networks uses the input graph for this propagation, and we know, as we discussed, that it is equivalent to the Weisfeiler-Lehman graph isomorphism test, which has limitations in the kinds of structures it can detect. There exist topological constructions that go beyond graphs, such as the simplicial and cell complexes you mentioned, and what we did in our recent works is develop a message-passing mechanism that is able to work on such structures. Of course, you may ask whether we actually encounter such structures in real life. First of all, we do: the meshes that I mentioned before are in fact simplicial complexes. But secondly, what we show in the paper is that we can take a traditional graph and lift it into a cell or simplicial complex, and a good example here is from the domain of computational chemistry. A graph neural network applied to a molecular graph would consider a molecule just as a collection of nodes and edges, atoms and the chemical bonds between them, but this is not how chemists think of molecules; they think of them in terms of structures such as aromatic rings. With our approach we can regard rings as cells, so we have a special new kind of object and we can do a different form of message passing on it, and we can also show, from the theoretical perspective, that this kind of message passing is strictly more powerful than the Weisfeiler-Lehman algorithm.
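As a rough illustration of the "lift rings to cells" idea, here is a small sketch of a hypothetical two-level message-passing scheme I am assuming for illustration: small cycles of a molecular graph are treated as extra ring-cells that exchange messages with the atoms on their boundary. It uses networkx only to find a cycle basis; the actual lifting and the message functions in the cited works are considerably more careful than this toy.

import networkx as nx
import numpy as np

def lift_rings_to_cells(G, max_ring=6):
    """Rings (small cycles) of G, to be treated as higher-order cells."""
    return [frozenset(c) for c in nx.cycle_basis(G) if len(c) <= max_ring]

def cell_message_passing(G, rings, h, rounds=2):
    """Toy two-level message passing: node <-> node and node <-> ring-cell."""
    dim = len(next(iter(h.values())))
    agg = lambda vecs: np.sum(vecs, axis=0) if vecs else np.zeros(dim)
    r = {ring: agg([h[v] for v in ring]) for ring in rings}       # init cell features
    for _ in range(rounds):
        new_h = {}
        for v in G.nodes:
            nbr = agg([h[u] for u in G.neighbors(v)])             # ordinary graph messages
            cel = agg([r[ring] for ring in rings if v in ring])   # messages from rings containing v
            new_h[v] = np.tanh(h[v] + nbr + cel)
        r = {ring: np.tanh(r[ring] + agg([new_h[v] for v in ring])) for ring in rings}
        h = new_h
    return h, r

# Toy "molecule": a 6-ring (think benzene carbons) plus one substituent atom.
G = nx.cycle_graph(6)
G.add_edge(0, 6)
rings = lift_rings_to_cells(G)
rng = np.random.default_rng(0)
h0 = {v: rng.normal(size=4) for v in G.nodes}
h, r = cell_message_passing(G, rings, h0)
print(len(rings), sorted(next(iter(rings))))   # 1 [0, 1, 2, 3, 4, 5]: the ring is one cell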
Do you see entirely new problems opening up that we wouldn't even have dared to touch before, say with classic neural networks, or even with classic graph message-passing algorithms? Do you see problems that are now in reach where previously none of these methods were really better than random guessing?

It's a very interesting question that I'll answer from two angles, because there could be some long-standing problem that you knew about and wouldn't dare to attack, and now maybe you feel a bit more confident attacking it; and there is also the aspect of uncovering a problem, because when you start thinking about things in this particular way you might realize: hang on, to make this work I made some assumptions, and those assumptions don't really hold at all in principle, so how do I make things a little bit better? I'll try to give an example of both.

In terms of a problem that previously was not very easy to attack, and where we might now have tools that could help us attack it better: I have a long-standing interest in reinforcement learning. When I started my PhD I spent six months attacking a reinforcement learning problem with one super tiny GPU, and at that time it was a massive time sink; DeepMind ended up scooping my work some time after that, and I quickly moved to things that were more doable with the kind of hardware I had at the time. But I always had a big interest in this area, and after joining DeepMind I started contributing to these kinds of directions more and more. There are a lot of problems in reinforcement learning concerning data efficiency: you have to learn how to meaningfully act from what is actually a fairly low-dimensional signal, compared to the potential richness of the trajectories you have to go through before you get that useful signal, plus long-term credit assignment and all these kinds of problems. I feel we can start to get more data-efficient reinforcement learning architectures by leveraging geometric concepts and also algorithmic concepts.

To give you one example, some months ago we put a paper on arXiv called the eXecuted Latent Value Iteration Network, or XLVIN, where we captured the essence of an algorithm in RL which perfectly solves the RL problem. The value iteration algorithm, assuming you give it a Markov decision process, will give you the perfect policy for that Markov decision process, so it is a super attractive algorithm to think about when you do RL. Big caveat: you need to know all the dynamics of your environment and all the reward models of the environment before you can apply the algorithm, which obviously limits its use in the more generic deep reinforcement learning setting. But with the knowledge of the underlying geometry of the graph of states that the MDP induces, and the algorithmic reasoning blueprint, we taught a graph neural network, which aligns super well with value iteration, to learn the essence of the value iteration computation in a reasonably extrapolating way on a bunch of randomly sampled MDPs. Then we stitched it into a planning algorithm in a deep reinforcement learning setting, and just by training this pipeline end-to-end with a model-free loss we were able to get interesting returns in Atari games much sooner than some of the competing approaches. It's a very small step, it still requires a hundred thousand to two hundred thousand interactions of playing before some meaningful behaviors start to come out, but it's a sign that we might be able to move the needle and not require billions and billions of transitions before we start to see meaningful behavior emerge.
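For reference, the value iteration procedure that XLVIN learns to imitate in latent space is simple to state when the MDP is fully known; here is a minimal tabular sketch, with a made-up two-state MDP purely for illustration.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Repeatedly applies the Bellman optimality backup
        V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s, a, s') V(s') ]
    until convergence, returning optimal values and a greedy policy.
    This is exactly the "perfect planner" that needs the full model,
    which is the caveat that motivates learning the computation instead.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V            # (S, A) action values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny made-up 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, pi = value_iteration(P, R)
print(V, pi)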
And I think that's very important, because in most real-world applications of reinforcement learning you don't have a budget for billions of interactions before you have to learn a meaningful policy. So that's one side; and just generally in reinforcement learning, graphs appear left, right and center, not just in the algorithms but also in the structure of the environment. So I think that's one area where geometric deep learning could really help us get better behaviors faster: not necessarily solve the problem better than standard deep RL, but get the better behaviors in fewer interactions.

As for a problem that we have uncovered through this kind of observational lens: as I said, in graph representation learning we often innocently assume that the graph is given to us, whereas very often this is not the case. This divide has brought about the new emerging area of latent graph learning, or latent graph inference, where the objective is to learn the graph simultaneously with using it for your underlying decision problem. This is a big issue for neural network optimization, because you're fundamentally making a discrete decision in there, and if the number of nodes is huge you cannot afford to start with the n-squared approach and then gradually refine it. Currently, the state of the art in many regards is to build a k-nearest-neighbor graph in the feature space and just hope that it gets us most of the way there. Usually this works quite well for getting interesting answers, because if you have a dense enough kNN graph you will cover everything you need reasonably quickly. But that raises the issue of what happens if the graph itself is a meaningful output of your problem: if you're a causality researcher who wants to figure out how different pieces of information interact, you probably wouldn't be satisfied with a k-nearest-neighbor graph as the output of the system. So I feel there is a lot of work to be done to do something like this scalably and usefully, and I don't have a better answer than the current state of the art I just described, so a potential open problem for everybody in the audience today.
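To spell out the k-nearest-neighbor fallback just described, here is a minimal sketch of latent graph construction: embed the points with an encoder (here a fixed random projection standing in for the trainable part) and connect each point to its k nearest neighbors in that feature space. In a real latent-graph model the graph would be rebuilt or refined as the encoder trains; the names here are only illustrative.

import numpy as np

def knn_latent_graph(X, encoder, k=5):
    """Build a directed kNN graph in latent feature space.

    X: (n, d_in) raw inputs; encoder maps (n, d_in) -> (n, d_latent).
    Returns a list of (i, j) edges, j being one of i's k nearest latent neighbors.
    Note: the dense n-by-n distance matrix below is exactly the part that does
    not scale; for large n one would switch to approximate nearest neighbors.
    """
    Z = encoder(X)
    D = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    np.fill_diagonal(D, np.inf)             # no self-loops
    nbrs = np.argsort(D, axis=1)[:, :k]     # k closest per node
    return [(i, j) for i in range(len(X)) for j in nbrs[i]]

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))
encoder = lambda X: np.tanh(X @ W)          # stand-in for a learned encoder

X = rng.normal(size=(100, 16))
edges = knn_latent_graph(X, encoder, k=5)
print(len(edges))   # 100 * 5 = 500 directed edges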
Absolutely. You touched on some really interesting things; I think causality is a huge area that we could be looking at with graphs. We also had Dr Tom Zahavy from DeepMind, one of your colleagues, on the show, and he said that he looks at meta-learning and also diversity preservation in agent-based learning, but he thinks that reinforcement learning is just about to have its ImageNet moment, where we can discover a lot of the structure in these problems, which is fascinating.

I would like to bring up one application where a quantitative improvement afforded by geometric deep learning can lead to a qualitative breakthrough, and this is probably structural biology. AlphaFold is one such example: by correctly modeling the geometry of the problem you get a breakthrough in performance, so it is indeed an ImageNet moment that happened in this field. And once you have sufficiently accurate prediction of the 3D structure of proteins, it suddenly enables a lot of interesting applications, for example in the field of drug design; potentially entire pharmaceutical pipelines could be reinvented with the use of this technology, and the impact can be extraordinary.

It's so true. Professor Bronstein, when I was watching your lecture series it blew me away when you were talking about all of the applications. I think Yannic said a minute ago that it's almost as if some of these applications are so ambitious that the thought wouldn't even have crossed our minds that we might be able to do them before; for example, being able to predict facial geometry from a DNA sequence, so that we might look at an old DNA sequence and actually see what that person looked like. That would have been unimaginable just a few years ago, which is incredible. I'm really interested in your definition of intelligence, and whether you think neural networks could ever be intelligent. Douglas Hofstadter, for example, wrote the famous book Gödel, Escher, Bach, a Pulitzer Prize-winning book from the 1970s, and he made the argument that analogy is the core of cognition; he said that analogies are a bit like the interstate freeway of cognition, not little modules on the side. And I think that in a way analogies are also symmetries: when I say that someone is firewalling a person, it means they don't want to talk with that person, and that is a symmetry between the abstract representation of a real network firewall and an abstract social category. So does this require a different neural network architecture, or could geometric deep learning already deliver the goods? It's almost as if it's just a representation problem.

I think it's a very important question, one which I cannot claim to have the right answer to, and my view is, I guess, a little bit skewed by the specific research and engineering approaches that I take. But by and large I agree with the idea of analogy making, and maybe I would take it a step further: you have a particular set of knowledge and conclusions that you have derived so far, a set of primitives that, once new information comes in, you can use to figure out how to recompose things and either discover new analogies or discover new conclusions that you can use in the next step of reasoning. It feels really amenable to a kind of synergy of, as Daniel Kahneman puts it, System 1 and System 2: you have the perceptive component that takes the raw information you receive as input data and transforms it into some kind of abstract conceptual information, and then in System 2 land you have access to reasoning procedures that can take all these concepts and derive new ones from, hopefully, a nice and not very high-dimensional set of rules. This is why I believe that, in terms of moving towards an architecture that supports something like this, we have a lot of the building blocks in place with geometric deep learning, especially if we're willing, as I said, to broaden the definition of geometric deep learning to also include category theory concepts, because that might allow us to reconcile algorithmic computation into the blueprint as well. There's no need to talk at length about all the great perceptive architectures; I think we're already at a point where, if we show our AGI lots and lots of data, it's going to be able to pick up on a lot of the interesting things just by observing; self-supervised and unsupervised learning are already showing a lot of promise there. The question where I think we still have quite a bit of work to do is: once we have these concepts, and let's even assume they're perfect, what do we do with them? How do we meaningfully use them?
And the reason I believe there's a lot of work to be done there is that one of the key things I do on a day-to-day basis is teach graph neural networks to imitate algorithms from perfect data. I give them exactly the abstract input that the algorithm would expect and I ask them: hey, simulate this algorithm for me, please. And it turns out that that is super hard. Well, it's super easy to do in distribution, but you're not algorithmic if you don't extrapolate, and that's one of the big challenges we need to work towards addressing. Will geometric deep learning be enough to encompass the ultimate solution? I have a feeling that it will, based on the empirical evidence we've been seeing in the recent papers we've put out, but I don't have a very strong theoretical reason why I think it's going to be enough.

Yeah, I'm fascinated by this notion that intelligence isn't as mysterious as we might think it is; it's a receding horizon, and it might in the end be disappointingly simple to mechanize.

Actually, if you take the term literally, intelligence comes from the Latin word that means "to understand", and what is meant by understanding is really a very vague question; if you ask different people they will give you different definitions. I would define it as the faculty to abstract information, and in particular the ability to take information obtained in one context and use it in other contexts. This is what we usually call learning. The way that this information is abstracted and represented might be very different in a biological neural network, in our brain, versus an artificial intelligence system. So if you hear some people saying that the brain probably doesn't really do geometric computations, my answer would be that we don't necessarily need to imitate exactly the way the brain works; we probably just need to try to achieve this high-level mechanism that is able to abstract information and apply it as knowledge to different problems. The definitions of artificial intelligence that are being used, like the famous Turing test, I find very disturbing in that they are very anthropocentric, which is actually probably very characteristic of our species more broadly: by judging in this way what is intelligent and what is not, we might rule out other intelligent species because they are very different from us. I am maybe obsessed with sperm whales, because I'm working on studying their communication; they are definitely intelligent creatures, but would they pass the Turing test? Probably not. It's like subjecting a cat to a swimming test, or a fish to a test of climbing a tree.

I would just like to add to Michael's great answer one quote that I think is very popular and applies really well in this setting: the question of whether computers can think is about as relevant as the question of whether submarines can swim. When you built submarines you weren't necessarily trying to copy fish; you were solving a problem that was fundamentally slightly different. That could be relevant in this case as well.

We spend quite a lot of time on this podcast talking about whether we should have an anthropocentric conception of intelligence; you know, a corporation is intelligent, right? And I'm starting to come around to the view of embodiment, and to thinking that there is something very human-like about our particular flavor of intelligence, but maybe there is a kind of pure intelligence as well.
And this was the end of my conversation with Taco Cohen. One of the really interesting things is that you're getting onto some of the work that you've done on group convolutions on homogeneous objects, like spheres for example, but you also moved on to irregular objects, like arbitrary meshes, and you looked into things like fiber bundles and gauge-equivariant convolutions, because these are objects... I think you said a homogeneous object is one where you can apply a symmetry transformation to get from any one part of the object to any other part. So what work did you do there on those irregular objects?

Yeah, that's a great question. When we think of convolution we're really putting together a whole bunch of things, namely this idea of locality, so typically our filters are local, but that is a choice; ultimately convolution doesn't have to use a local filter, although in practice we know that works very well. And there's the idea of weight sharing between different positions, and potentially also between different orientations of the filter, and as I mentioned before, this weight sharing really comes from the symmetry. The fact that you use the same filter at each position in your image when you're doing a two-dimensional convolution is because you want to respect the translation symmetry acting on the plane. In the case of a general manifold or mesh, you typically do not have a global symmetry. If you think of a mesh representing a human figure, or some complicated protein structure, there may not be a global symmetry; or sometimes, in the case of say a molecule, there might be some six-fold rotational symmetry, but this symmetry may not be transitive, as it's called, meaning you cannot take any two points on your manifold and map one to the other using a symmetry. In the case of a sphere you can do that, any two points on the sphere are related by a rotation, so we say a sphere is a homogeneous space; but this protein shape is not homogeneous, even if it has some kind of symmetry. So in that case, if you try to motivate the weight sharing via symmetry, you take your filter at one position and move it to a different position using the symmetry, but you're not going to be able to hit all positions, so you're not going to get weight sharing globally, and that just is what it is. If you say: I have a signal on this manifold and I want to respect the symmetries, well, if there are no global symmetries there's nothing to respect, so you get no constraints and you can just use an arbitrary linear map. Now, it turns out there are certain other kinds of symmetries, called gauge symmetries, that you might still want to respect, and in practice what respecting gauge symmetry does is put some constraints on the filter at a particular position; for example, it might have to be a rotationally equivariant filter. But it doesn't tie the weights of filters at different positions, so if you want that as well, then you can maybe motivate it via some kind of notion of local symmetry. I have something on a local symmetry groupoid to motivate that in my thesis, but there isn't a very principled way to motivate weight sharing between different locations on general manifolds.
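One very stripped-down way to see the kind of per-position constraint being described, for the simplest case of scalar features on a regular grid: if the filter at a point has to be insensitive to how the local frame is rotated, and we discretize those rotations to 90-degree turns, the filter must equal the average of its rotated copies. This toy is an assumption made for illustration, not the gauge CNN construction itself.

import numpy as np

def c4_symmetrize(filt):
    """Project a 2D filter onto filters invariant under 90-degree rotations.

    For scalar (trivially transforming) features, a filter respecting a
    discretized local rotation symmetry must satisfy f = rot90(f); averaging
    over the four rotations enforces this. Note this constrains the filter at
    a given location; it says nothing about sharing weights across locations.
    """
    return sum(np.rot90(filt, k) for k in range(4)) / 4.0

rng = np.random.default_rng(0)
f = rng.normal(size=(3, 3))        # unconstrained filter
f_sym = c4_symmetrize(f)

print(np.allclose(f_sym, np.rot90(f_sym)))   # True: invariant under a 90-degree rotation

# Consequence: the scalar response to a rotated patch equals the response to
# the original patch, a crude discrete analogue of the constraint above.
patch = rng.normal(size=(3, 3))
print(np.allclose(np.sum(f_sym * patch), np.sum(f_sym * np.rot90(patch))))   # True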
Yeah, I'm just trying to get my head around this. So you're saying the whole point of what we're trying to achieve here is a parameter-efficient neural network that uses local connectivity and weight sharing, as we do with, say, a planar CNN, whereas when you have an irregular object it's very difficult to do that. So you're saying that in some restricted domain you can do it, say if you have rotation equivariance, but you can't do the other forms of weight sharing. I'm just trying to get my head around this because, with a graph convolutional network for example, it seems like you can abstract the local neighborhood, this node is connected to these other nodes, and potentially that could translate to a different part of the irregular mesh. So why can't you do more than you suggested?

I think if you want to be precise you just have to say which symmetries we're talking about, and in a graph the most obvious one is the global permutation symmetry: you can reorder the nodes of a graph, and really any graph neural net, whether it comes from an equivariance perspective or not, all graph neural nets in existence that have been proposed respect this permutation symmetry. Typically this happens through, in the simplest kind of graph convolutional nets, like the ones by Kipf and Welling for example, a sum operation over the messages from your neighbors, and the sum does not depend on the order of the summands, so it doesn't depend on the order of the neighbors, and that's why the whole thing becomes equivariant. So that's a global symmetry that all graph networks respect.

On that point, though, could you not create a local version? Say there was a local graph isomorphism, so I have an irregular object but it has a local isomorphism; could I not use something like a GCN, a local version of it, to capture that isomorphism?

Yeah, so actually this is something we proposed to do in our paper on natural graph networks. This paper really has two aspects to it: one is naturality as a generalization of equivariance, which I can talk about, but another key point was that we can develop not just a global natural graph network, as we call it, but also a local one. What the local one does is look at certain local motifs; if you're analyzing molecular graphs, for instance, one motif that you often find is an aromatic ring, some ring with say six corners, and various other kinds of small graph motifs. These motifs might appear multiple times in the same molecule or across different molecules, and what this method does is essentially find those using some kind of local graph isomorphism test, and then make sure that whenever we encounter this particular motif we process it in the same way, i.e. using the same weights. And if the local motif has some kind of symmetry, like this aromatic ring, which you can rotate six times and it gets back to the original, or you can flip it over, so that's a symmetry of this graph structure, an automorphism of this graph, then the weights will also be constrained by this automorphism group, the group of symmetries of the local motif. Various other authors, I think even Michael Bronstein and his students, have developed methods based on similar ideas.
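To make the earlier point about sum aggregation concrete, here is a minimal sketch of a single simplified graph-convolution layer, in the spirit of Kipf and Welling but without their degree normalization, together with a check that relabeling the nodes permutes the output in exactly the same way, i.e. the layer is permutation equivariant.

import numpy as np

def graph_conv(A, H, W):
    """One simplified graph-convolution layer: aggregate neighbors by a sum.

    A: (n, n) adjacency matrix, H: (n, d) node features, W: (d, d_out) shared
    weights. The sum over neighbors does not depend on the order of the
    summands, which is where permutation equivariance comes from.
    (Kipf and Welling additionally normalize A by node degrees; omitted here.)
    """
    A_hat = A + np.eye(A.shape[0])      # include each node's own features
    return np.tanh(A_hat @ H @ W)

rng = np.random.default_rng(0)
n, d = 6, 4
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T          # symmetric, no self-loops
H = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))

# Relabel the nodes with a random permutation matrix P.
P = np.eye(n)[rng.permutation(n)]
out = graph_conv(A, H, W)
out_perm = graph_conv(P @ A @ P.T, P @ H, W)

print(np.allclose(P @ out, out_perm))   # True: permuting inputs permutes outputs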
Awesome, Taco, it's been such an honor having you on the show, and actually you're coming back on the show in a few weeks' time, so we don't want to spoil the surprise. But looking at this proto-book that you've written with the other folks, what's the main thing that sticks out to you as the coolest thing in it?

I don't think there's any one particular thing. What excites me is to put some order to the chaos of the zoo of architectures, and to see that there is actually something that they all have in common. I really think this can help people who are new to the field to learn more quickly and to get an overview of all the things that are out there. I also think this is the start of, or at least one way in which, we can take the black box of deep learning, which is often viewed as completely inscrutable, and actually start to open it and understand how the pieces connect, which can then perhaps inform future developments guided by both empirical results and an understanding of what's going on.

Amazing, thanks so much, Taco. Thanks for having me, it's been a pleasure. Joan, thank you so much for joining us, this has been amazing. No, thank you so much, Tim, it was very fun; best of luck, and let's maybe get in touch. Thank you very much, it's very nice to be talking to you today about these completely random topics.
Info
Channel: Machine Learning Street Talk
Views: 119,357
Id: bIZB1hIJ4u8
Length: 213min 23sec (12803 seconds)
Published: Sat Sep 18 2021