#036 - Max Welling: Quantum, Manifolds & Symmetries in ML

Captions
[Music] Qualcomm AI Research is hiring for several machine learning openings, so please check out their careers website if you're excited about solving the biggest problems with cutting-edge AI research and improving the lives of billions of people. [Music] Today we got to speak with one of our heroes in machine learning, Professor Max Welling. It was good, the questions were really fantastic actually, and I've never done this with the three of you, but having a team of three people asking questions is a good idea, and of course you're really smart people who know what you're talking about, so that went really well I think. It needs three brains to match yours. We asked Max some of your favorite questions from Reddit: hi Max, when will you be changing your last name to Pooling? Max has pioneered the discipline of non-Euclidean, geometric deep learning. So what actually is geometric deep learning? It's the idea of performing deep learning, or machine learning more generally, on data that is not Euclidean in some sense: not a nice chain structure for audio or a planar grid for images, but perhaps a sphere, or a graph, or something more exotic like a manifold with arbitrary curvature. You might want to model weather patterns or social interaction data; there are many types of data out there that are non-Euclidean. Actually, if you've been playing with graph neural networks then you've already been doing non-Euclidean, geometric deep learning. To make this work you just need to abstract some concepts, so the Euclidean distance or your neighborhood becomes a function of connectedness, just like on a social graph: am I connected to John, does John know Bob? It's as simple as that. This kind of abstraction works in many areas of mathematics, which Max will get into today, as well as making neural networks work on non-Euclidean data. The other thing that Max has really pioneered is this idea of recognizing symmetries in different manifolds. In the blank-slate paradigm that we have now in neural networks, we're essentially wasting the representational capacity of the network because we're learning the same thing again and again. For example, a fully connected neural network would have to learn the dog in the top right corner and in the top left corner separately, because there's no translational symmetry built in. It was exactly this reason why convolutional neural networks were so powerful: they introduced the concept of translational weight sharing, a filter that you could shine over the entire planar manifold, which meant those parameters could be reused and you could learn concepts in different parts of the visual field. It was an incredible breakthrough to build this kind of knowledge into a deep learning model. This is called an inductive prior: it means we can take some prior knowledge about how things in the world work and put it into our models. It makes our models more sample efficient and it makes them generalize better. When it comes to sophisticated inductive priors, Max Welling is the king.
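To make the translational weight-sharing point concrete, here is a minimal NumPy/SciPy sketch (our own illustration, not code from the episode) showing that a shared convolution filter is equivariant to translation while a generic fully connected layer is not; circular ("wrap") padding is assumed so the shift-equivariance is exact.

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))   # toy single-channel "image"
kernel = rng.standard_normal((3, 3))    # one shared convolution filter

def conv(x):
    # weight-shared convolution with circular ("wrap") padding
    return convolve(x, kernel, mode="wrap")

def shift(x, dy, dx):
    # translate the image by (dy, dx) pixels with wrap-around
    return np.roll(x, shift=(dy, dx), axis=(0, 1))

# Translation equivariance: convolving a shifted image equals shifting the convolved image.
lhs = conv(shift(image, 5, -3))
rhs = shift(conv(image), 5, -3)
print("convolution is translation-equivariant:", np.allclose(lhs, rhs))   # True

# A fully connected layer has no weight sharing, so the same property fails.
W = rng.standard_normal((32 * 32, 32 * 32))
fc = lambda x: (W @ x.ravel()).reshape(32, 32)
print("fully connected layer is equivariant:",
      np.allclose(fc(shift(image, 5, -3)), shift(fc(image), 5, -3)))      # False
```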
When we think about AI and its capability to actually help us enrich our lives, we know we need to first help machines see and understand like humans do. Take this drone collecting data in 3D, or this autonomous vehicle with cameras covering a 360-degree view. Current deep learning technology can analyze 2D images very well, but how can we teach a machine to make sense of image data from a curved object like a sphere? And because we want this processing to happen on the device itself, for reliability, immediacy and privacy reasons, how can we achieve this in a power-efficient manner? It turns out we can do this by applying the mathematics behind general relativity and quantum field theory to deep learning. Our neural network takes in data on virtually any kind of curved object and applies a new type of convolution to it; we can move the shape around and the AI will still recognize it. This is just one example of the exciting research we're doing at Qualcomm AI Research to shape AI in the near future. Anyway, it turns out that these symmetries are absolutely everywhere. If you wanted any further proof of how useful these kinds of equivariances, symmetries and manifolds can be, look no further than the recent announcement from DeepMind: AlphaFold. It will change everything; DeepMind solves 50-year-old grand challenge; the game has changed. Proteins are structures that fold in a given way. The results of this year's competition came out and they looked something like this: every entry you see here is a team participating in that protein structure prediction competition, and there is one team, DeepMind's system AlphaFold 2, which completely dominates all the others, to the point where the problem is now considered to be solved. By the way, if this is not a great meme template, I don't know what is. Just saying, just saying. They say a folded protein can be thought of as a spatial graph, and this part here is attention-based. OK, so I'm going to guess, for sure, that they've replaced this ConvNet with a transformer style, with an attention layer or multiple attention layers; I would guess this is a big transformer right here. There was a really interesting article that came out called AlphaFold and Equivariance by Justas Dauparas and Fabian Fuchs. I'm so sorry Fabian, I don't know how to pronounce your name, but it does sound like a swear word. Justas and Fabian comment on the announcement from DeepMind, and they said: in short, this module is a neural network that iteratively refines the structure predictions while respecting and leveraging an important symmetry of the problem, namely that of roto-translations. At this point DeepMind has not yet published a paper, so we don't know exactly how they address this; however, from their presentations it seems possible that part of their architecture is similar to the SE(3)-Transformer. What's the SE(3)-Transformer? Lo and behold, our friend Max Welling has had his hands all over it. In the abstract it says: the SE(3)-Transformer, a variant of the self-attention module for 3D point clouds and graphs, which is equivariant under continuous 3D roto-translations; equivariance is important to ensure stable and predictable performance in the presence of nuisance transformations of the input data. By the way, you might be wondering what SE(3) is. Let's have a quick look at the Wikipedia page. We are getting into group theory here, which is quite an abstract area of mathematics, but SE(3), the special Euclidean group, collects the transformations that can be applied to Euclidean data while preserving certain properties, namely the Euclidean distance between two points: these are translations and rotations (the full Euclidean group also includes reflections). Very interesting that you can abstract one level up in mathematics like this, and that's what group theory is.
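As a quick sanity check of what membership of SE(3) buys you, here is a small NumPy sketch (again our own illustration, not from the episode) applying a random roto-translation to a point cloud and verifying that all pairwise Euclidean distances are preserved, which is exactly the property an SE(3)-equivariant network is built to respect.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random rotation: orthonormalize a random matrix and force det(R) = +1
# (a proper rotation, no reflection), then pick a random translation t.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = Q if np.linalg.det(Q) > 0 else -Q
t = rng.standard_normal(3)

points = rng.standard_normal((10, 3))   # a toy 3D point cloud
transformed = points @ R.T + t          # the SE(3) action: x -> R x + t

def pairwise_distances(x):
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Roto-translations preserve every pairwise Euclidean distance.
print(np.allclose(pairwise_distances(points),
                  pairwise_distances(transformed)))   # True
```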
The other comment I want to make is that all of these folks are independently amazing; I've watched presentations by most of them. There's Fabian Fuchs, Daniel Worrall, Volker Fischer, fantastic. By the way, when we look at Fabian's About Me page: he's a machine learning PhD student at Oxford University, and his research topic is learning invariant representations. Simply put, where most of deep learning is concerned with finding the important information in an input, he focuses on ignoring harmful or irrelevant parts of the information; this can be important to counteract biases or to better leverage structure in the data. Structure in the data, that's interesting, that's quite a cool point actually, because if you think about it, if you're doing a vision classifier you could naively just look at all of the pixels and what they are, or, if you're being smart about it, you go one level up and look for the hidden structure in the data, and that is precisely what he's talking about: things like the symmetries that are inherent in pretty much every type of data. OK, one last thing. DeepMind released an official PowerPoint deck on AlphaFold 2, and it says they're on a long-term mission to advance scientific progress; here are some of the protein examples. They specifically call out inductive biases for deep learning models, which is exactly what we're talking about. Clearly convolutional neural networks are one such bias, with their translational weight sharing; it talks about graph networks and recurrent networks and indeed attention networks, which are very much a generalization of pretty much all of the others. They say that they are putting their protein knowledge into the model, so physical insights are built into the network structure, not just the process around it, and these biases reflect their knowledge of protein physics and geometry. You can see here that there are residues in a protein, so they're modeling topologically which residues are connected to which other residues in 3D space, and they specifically call out on the structure-module page that they are building a 3D equivariant transformer architecture. Anyway, if this doesn't motivate you that symmetries and manifolds are an exciting idea in deep learning, I don't know what will. Clearly Max has been in this game for a long time now; back in 2013, with Kingma, he invented the variational autoencoder, and it's only recently, well, relatively recently, that Max has been focusing in on deep learning. Clearly, like any other field, machine learning is subject to fashion, right? And so there are these five-to-ten-year cycles where people get really excited about a certain topic, either because the theory is very beautiful or it just works really well. I started in graphical models, and independent component analysis was the talk of the day, and then support vector machines and basically non-parametric methods, and then it became Bayesian methods and non-parametric Bayesian methods, and now it's all about deep learning. So what you see is that the field is subject to these sorts of fashions, and I think it's fine, because we zoom in on a new, very promising tool, and then we work it out and we get the most out of it. Max is a vice president at Qualcomm, so clearly he thinks that computation is going to be absolutely critical for the future of artificial intelligence, but having said that, he also thinks that we need to be more efficient with our hardware tomorrow than we are today; that's just a reality we all have to accept. The more compute we throw at it, the bigger we make our models, somehow the better they perform, and we don't know
precisely why that is, but we do know that they will use increasingly more energy to do the computations for us, and at some point that's just not a viable economic model anymore, so we'll see a continuation in making deep learning and machine learning more energy efficient. There's a really interesting interplay between priors, experience and generalization. We want machine learning models that generalize really well to things they haven't seen during training, if you move them into a new orientation or a new situation or context, and that's what we think of when we say artificial general intelligence: not something you train on one specific topic, which you then ask it to do and it does very well, but which completely fails when you move it into a new context; that's narrow AI. Humans are clearly much more flexible: if we learn something in one context and then get put into a new context we've never seen before, we can still do very well, and we want our artificial agents to have this property too. Max is also a huge proponent of generative models; he thinks that generative models might be the future of artificial intelligence. Funnily enough, I think Max and Karl Friston, who we had on a couple of episodes ago, would see eye to eye: basically what everybody else in the scientific community does, which is write down a model of the world, which we call a generative model, which is, how do I imagine that the world I'm seeing in my measurement apparatus could have been generated by nature? We all have the Matrix going on inside our heads; we are running simulations of reality and we're kind of integrating over the expected value of those simulations. This is just something that we do all the time, and that seems to be the real trick for intelligence, at least in humans: our ability to generate the world. Max also thinks that we need to be learning causal relationships in our models. Causal relationships have this really interesting property that they generalize better. Max comes up with this wonderful example: a certain color of car in the Netherlands might be associated with a higher accident rate, but that probably wouldn't generalize very well to other countries, because it's just a correlation, whereas male testosterone levels, that's a causal factor, and that's going to generalize far better to other countries. So try to figure out what the true physics of the world is, what causes what, and if you have this causal structure of the world you understand much more about the actual world, and then if you move to a new context you can generalize a lot better in that new context. At this stage Max has bet on so many winning horses that you've got to wonder, how the hell does he do it? So we asked him what his secret is: it's incredibly hard to predict what will become well known; sometimes you just happen to be working on something that takes off like a rocket; when we did things like the VAE or graph neural nets it didn't feel at all like this was going to be a big hit. When we read some of the research from Max's students we were just blown away; sometimes we've got to remind ourselves that these are fairly young folks in their early 20s who have just come out of university; how is this even possible? I've been very blessed with being able, even with my industry funding, to provide this level of freedom to the students, and I think this is really key. One of the things we asked Max was how he selects his research directions.
One of the interesting things is that he's a physicist, right, so many of the things that he's been doing are straight out of his operating playbook from the physics world: things like symmetries and manifolds and even quantum. Symmetries have this deep feeling, right? Symmetries pervade basically all theories of physics, and they have this profound impact on how you formulate the mathematics of a theory, especially when it becomes almost mysterious. Quantum mechanics is almost mysterious: how on earth is quantum mechanics possible? The fascinating thing here, as we discussed on our GPT-3 episode, is that many of these roads actually lead back to computation itself. How does the brain compute things also feels like a very deep question: how do we even compute things, what is computation even, does the universe compute its solution, what does it mean to be predictable, can you compute faster than the universe can compute? One of the key concepts that we talk about in the show this evening is the bias-variance trade-off. Nothing comes for free; there is no machine learning without assumptions. You have to interpolate between the dots, and to interpolate means that you have to make assumptions about smoothness or something like that, and these prior assumptions will help you transfer from one domain to another domain.
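As a toy illustration of the "no interpolation without assumptions" point (our own sketch, not something from the episode), the two models below fit exactly the same dots but embody different smoothness assumptions, and consequently disagree as soon as you ask them about a point that is not in the data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Eight noisy observations of a smooth underlying function.
x_train = np.linspace(0, 1, 8)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(8)

# Two different prior assumptions about the function family:
wiggly = np.polyfit(x_train, y_train, deg=7)   # flexible: passes through every dot
smooth = np.polyfit(x_train, y_train, deg=3)   # stronger smoothness assumption

x_query = 1.1   # a point just outside the training range
print("degree-7 prediction:", np.polyval(wiggly, x_query))
print("degree-3 prediction:", np.polyval(smooth, x_query))
print("true value:         ", np.sin(2 * np.pi * x_query))
# Both models track the training dots, yet they give very different answers
# off the data: the prediction is determined by the prior assumption as much
# as by the observations themselves.
```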
One of the topics we've been discussing a lot on Machine Learning Street Talk recently is this notion of how far we can take data-driven approaches. Will they take us all the way to AGI, or is it just like building a tower and trying to get closer to the moon? Perhaps we could generate more data with data augmentation or even a simulator; perhaps we could use data more efficiently with machine teaching or active learning or some kind of controller on how we train the model. But ultimately, how far can we really go? The big question, in some sense, over time, is: can we simply take the data-driven approach and extend it all the way to AGI? Max tells us about all the different schools of thought in the AI community. Of course, one interesting school of thought is the likes of Gary Marcus and Walid Saba, who we had on the show a few weeks ago; these people, more the classical AI sort of community, think that we need to have an explicit model of the world, that the purely data-driven route will never get us there, and that you really need to imbue these models with the structure of the world. In the show Max tells us where he's placing his bets, but we're not going to spoil the surprise. As we said before, Max is extremely well known for creating these inductive priors and putting them into machine learning models, helping them generalize better and be more sample efficient. The whole endeavor of machine learning is defining the right inductive biases and leaving whatever you don't know to the data; if you put the wrong inductive bias in, things can actually deteriorate. We talk about Hinton's capsule networks: they tell you, well, we'll just keep the abstract nature of what we want, which is some stack of things that transform in some way that we can vaguely specify, and then we ask it to learn all these things. We talk about Professor Kenneth Stanley's book Why Greatness Cannot Be Planned and also Sara Hooker's Hardware Lottery paper; the thing both of these ideas have in common is that they posit that we are locked in by the decisions of our past. I do feel very strongly that as a field we need to open up, so we should value original ideas much more than we currently do. Professor Kenneth Stanley has a fascinating take on this: he thinks that we should be treasure hunters, that we should find interesting and novel stepping stones that might lead us somewhere interesting, and he thinks we should do this in all aspects of our lives. We all want to monotonically increase our objectives, and what we should be is treasure hunters. Yes, science should be about exploration, not exploitation. How do we extend this to peer review in science? Ironically, having a consensus peer review encourages groupthink and convergent behavior; if we genuinely want an exploratory, divergent process, we should almost optimize for people disagreeing with each other in the peer review process. I think the reviewing in our community is far too grumpy. I'm continuously amazed when I read these old papers from, let's say, Schmidhuber, like the first RL papers, that just came up with a bit of an idea, and then they had a bit of toy data and some writing, and that's a paper, and it's cool. There's a dichotomy between, on the one hand, having a stamp of approval, having a paper published and presenting about it, and on the other hand having a continuous stream of research which is peer-reviewed online and with some accountability. Yeah, I think we really need to disrupt the field a little bit. Quantum machine learning is a bit of a mystery to most people, I feel, including myself, and even though I learned something in this conversation, paradoxically it's more of a mystery than before the conversation. Crucially, Max thinks that quantum computing will hugely impact the machine learning world in the future. You can think of quantum mechanics as another theory of statistics, in some sense; essentially, quantum neural networks have nothing to do with particles necessarily, or physics: it's applying the math behind quantum mechanics to machine learning, and building neural networks as layers of functions of these quantum operations that forward-propagate some signal. As Max describes really nicely in this conversation, this is the counterintuitive part: you can have an amplitude for an event, and then an amplitude for another event, and you would think that if there are two ways for the event to happen then the probability of that event should grow, but in quantum mechanics they can cancel, and then the probability that the event happens is suddenly zero. This seems bizarre, but nature has chosen this theory of statistics. I really felt like I got an ELI5 here: instead of calculating with probabilities, you calculate with something like the square root of probabilities, and thus events that can only stack in classical probability theory can all of a sudden cancel each other out, and that gives rise to really interesting math.
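To make that "square root of probabilities" intuition concrete, here is a tiny sketch of our own (not code from the episode, and with the normalization of the classical mixture glossed over) comparing how classical probabilities and quantum amplitudes combine for an event that can happen via two routes.

```python
import numpy as np

# Each route to the detector has an amplitude of magnitude 1/sqrt(2),
# so on its own each route gives probability |a|^2 = 0.5.
a1 = 1 / np.sqrt(2)

for phase in [0.0, np.pi / 2, np.pi]:
    a2 = np.exp(1j * phase) / np.sqrt(2)

    # Classical reasoning: probabilities of the two routes just add (they can only stack).
    p_classical = abs(a1) ** 2 + abs(a2) ** 2      # always 1.0 here

    # Quantum reasoning: add the amplitudes first, then take the squared magnitude.
    p_quantum = abs(a1 + a2) ** 2                  # 2.0, 1.0 or 0.0 depending on the phase

    print(f"phase={phase:4.2f}  classical={p_classical:.2f}  quantum={p_quantum:.2f}")

# At phase = pi the two amplitudes cancel exactly: an event that classical
# probability says must happen does not happen at all.
```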
We talk about Max's recent quantum paper that just got released. That was a paper that we recently pushed to arXiv, Quantum Deformed Neural Networks, where we basically first say, OK, what if we would take a normal neural net and implement it on a quantum computer, and then we slightly deform it into something where states get entangled; by doing it in this particular way we could still run it efficiently on a classical computer. What this paper did was to build a particular type of quantum neural network that can, under the correct assumptions, be simulated efficiently on a classical computer, but also, once we have a quantum computer, it can release its full power. Basically, if you want to do classical predictions, does it actually help to build a neural network that can run efficiently on a quantum computer and do these predictions much better? Can you write down maybe even normal, classical problems more conveniently in this quantum statistics? I found the conversation with Max to be extremely helpful here, and he does a great job of explaining what's going on. Max has another exciting paper out, Probabilistic Numeric Convolutional Neural Networks; it's a paper by Marc Finzi, Roberto Bondesan and, of course, Max Welling, and it looks at what you can do with computer vision models if you move away from the assumption of discretely sampled pixel grids and move to a continuous representation that's more like how an actual object in the real world projected onto a screen behaves. The observation is that when we write down a deep learning algorithm, let's say on an image, we sort of treat the image as pixels and we think that's the real signal we are looking at. But you can also ask yourself: what if I remove every second pixel? Now I actually have a very different neural network, but should I have a very different neural network? Or what if the pixels are actually quite randomly distributed in the plane, just some random places where I do measurements, maybe more in the upper-left corner and fewer in the lower-left corner? The predictor should behave in a certain consistent way, and so of course you come to realize that really what you're doing with the pixel grid is sampling an underlying continuous signal. To get away from this assumption of discrete, even sampling, they use these objects called Gaussian processes to model the data. A Gaussian process is basically a universal function approximator, like a neural network, but it gives you a measure of uncertainty, and the reasons you might want this are many, but in short it allows you to average over every possible model that describes your data and gives you a better result in doing so. You can start to do really interesting things like sub-pixel sampling, or work with very sparse locations, but in order to do that you need to reconceptualize a lot of the familiar operators that work on our linear algebra representations, such as the convolutional translation operation on our weights. The way they got around this was super interesting. There's a very interesting tool called the Gaussian process: it basically interpolates between dots, but in places where you don't have a lot of data it gives you uncertainty, because you don't know what the real signal is. What does it mean to do a convolution on this space? The most interesting way to describe that is by looking at it as a partial differential equation, so they reframe this transformation as a differential equation that can be parameterized, calculated out in closed form and applied directly to the parameters of the model. That means you don't need to do any sampling or anything like that; you literally just calculate this thing and apply it. It would be worth going into the differential equation stuff by itself, but it gets very complicated very quickly; needless to say, it generalizes not just to translation but also to things like rotations and scaling. The way they really did this was by finding very clever representations and boiling everything down to normal distributions, so almost everything could be done in closed form. Things like this have been done with Gaussian processes in the past, but they're typically computationally expensive, so if you can do all these updates without constant recomputation, that's a huge computational advantage.
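Here is a bare-bones Gaussian process regression sketch in NumPy (our own illustration of the interpolation-with-uncertainty idea described above, not code from the paper): the posterior mean interpolates the observed dots, and the posterior standard deviation grows wherever the measurements are sparse.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential covariance between two sets of 1D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

# Irregularly spaced observations: dense on the left, sparse on the right.
x_obs = np.array([0.05, 0.1, 0.15, 0.2, 0.25, 0.8])
y_obs = np.sin(2 * np.pi * x_obs)
noise = 1e-4

# Standard GP posterior: mean = K*x (Kxx + noise I)^-1 y
#                        cov  = K** - K*x (Kxx + noise I)^-1 Kx*
K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
x_test = np.array([0.12, 0.5, 0.95])
K_star = rbf_kernel(x_test, x_obs)

alpha = np.linalg.solve(K, y_obs)
mean = K_star @ alpha
cov = rbf_kernel(x_test, x_test) - K_star @ np.linalg.solve(K, K_star.T)
std = np.sqrt(np.diag(cov))

for x, m, s in zip(x_test, mean, std):
    print(f"x={x:.2f}  mean={m:+.3f}  std={s:.3f}")
# Near the cluster of observations (x=0.12) the std is tiny; in the gap
# (x=0.5) and beyond the last point (x=0.95) the uncertainty is large.
```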
The paper does some really cool things. Some of the benefits are that, first of all, you can work on an unstructured set of points, it doesn't have to be a grid, and you can even learn the positions of those points, so you can now direct the observations to places where you really need observations in order to improve your prediction. It turns out that all of this can be remapped back onto the quantum paradigm. I must admit I'm almost gutted that I didn't study physics at university. Physics seems to be one of the most robust scientific disciplines, and the folks are just so smart, because it's really, really difficult, and what I notice is that it's very, very difficult for external folks to get anything published in the physics world. But there's an asymmetry: the reverse isn't true. Loads of these physicists are coming into the machine learning world and they're just implementing all of these things, whether it's symmetries, manifolds, topology, chaos; it's really interesting to see this unfold. We also get a take from Max about GPT-3: and so you say GPT-3 isn't very good, maybe, but it's a receding horizon, right? I had a chat with my old colleague from Microsoft, Ilia Karmanov, about 18 months ago; he introduced me to Max Welling's work, and it has absolutely fascinated me ever since. And guess what, Ilia left Microsoft and he went to Qualcomm. Hey Tim, how's it going? Ilia, it's going great, how are you? I'm good; different country, different job, different universe it seems, but I'm doing pretty well. Ilia and I used to be work colleagues at Microsoft UK, and I left Microsoft about a year ago, and actually you left as well, didn't you Ilia? Yeah, we had a joint pact, it was like, you have to keep both of us or we leave. Indeed. Now, Ilia and I made a YouTube video just over a year ago, and it was all about Max Welling's work with Taco Cohen, all about symmetries and manifolds, and this work was hugely inspiring for me. How did you discover it? I discovered it because my colleague Matthew, whom you also interviewed, and I (and you should follow up with Robert) were at ICLR and we saw Taco's talk about spherical CNNs, which was already a bit late into his work, which started with group equivariant convolutions, and I think both of us just thought it was really cool. It was our favorite talk of the day because it was so different, and it felt like it was setting up a different stream of research; it wasn't necessarily about chasing SOTA, it was just about really improving, taking what makes convolutions great and making them even better, and that was awesome. Oh, amazing. Well, we made that video together on Machine Learning Dojo, and I must admit it was hugely inspiring for me, and I reached out to Max Welling about two months ago and he actually came onto our podcast; we interviewed him yesterday. But yeah, this all came from you; you introduced all of this stuff to me, and I've been going through some of Max's work with some of his recent students and it's just incredible. It is, because he came from the physics world, and all of this knowledge that he has around quantum and symmetries and topologies and manifolds, that's his operating playbook, and he's just taken it into the machine learning world and he's just been executing on it. Max is involved in a lot of papers, as you would expect, and a few of them are really fascinating. Yeah, one of the things we spoke about was
just how he nurtures his phd students because some of these papers are just incredible and presumably these students have gone from nothing to producing that level of research in a very short period of time but presumably this was one of the reasons why you decided to apply for qualcomm yeah it i i was chasing something that was publishing papers in the field of computer vision and it's one of the places in europe perhaps zurich is another location where you have this kind of research i thought it was extremely different and a super interesting research area so to speak to get into fantastic and what are you working on at the moment we have just submitted actually our paper to cvpr this morning the deadlines in a few days so that's pretty good i think and then maybe after that as well we have a few more uh topics in video basically uh self training how to improve representation learning it's a mix of knowledge distillation and still training and then also we have some interesting work with radio signals so it's like video in the sense that it's from that we extract the spatial and temporal signal but it's extremely different to video and that also makes it super fun amazing when i was discussing machine learning with ilia at microsoft we were fascinated by 3d convolution on your networks and i3d and video action detection and i know you are working on 3d segmentation and a whole bunch of cool things like that but anyway i would love to get you on the show in the next few weeks to talk about some of your research and for those of you in the comments if you want to have more from ilia let us know are you going to give us a demonstration of your front lever okay single leg when ilia comes on the show properly we're going to be doing a front lever competition that's pretty good so not only is ilia a specialist in machine learning he also absolutely smashes it in the body weight game uh no that wasn't smashing it that was after a climbing session it's actually really cool i i met this guy here who's a calisthenics instructor called solly and he just started climbing and yeah so we met up and we went climbing this morning and he was crazy good as you would expect and he gave me some tips on my front lever as well he was saying i should work more on the tucked instead of the single leg uh so hopefully you'll see much better than that in the future amazing amelia thank you so much for coming on the show we look forward to um interviewing you in in a few weeks time yeah thanks for having me and thanks a lot for interviewing max i'm like super excited to see that in a few days anyway i really hope you've enjoyed the show today this has been such a special episode for us because max welling is is literally one of my heroes so um anyway remember to like comment and subscribe we love reading your comments we really do actually we're getting so many amazing comments in the comments section so keep them coming and we will see you back next week welcome back to the machine learning street talk youtube channel and podcast with my two compadres alex stenlake and yannick culture and today we have someone who doesn't really need any introduction at all clearly one of the most impactful researchers in the ml world and has as near as makes no difference 40 000 citations he's on the executive board at eurips he's a research chair and full professor at the amlab university of amsterdam and co-director of the cuva lab and delta lab max welling max is a strong believer in the power of computation and its relevance 
to machine learning which is one of the reasons why he holds a vice president position at qualcomm he thinks the fastest way to make progress in artificial intelligence is to make specialized hardware for ai computation he wrote a response to rich sutton's the bitter lesson but essentially agrees with him in the sense that one should work on scalable methods that maximally leverage compute but max thinks that data is the fundamental ingredient of deep learning and you can't always generate it yourself like an alphago which amounts to an interpolation problem much of max's research portfolio is currently based on deep learning he thinks it's the biggest hammer that we've produced thus far and we witness his impact every single day he thinks that agi is a possibility and it will manifest in a forward generative and causal direction there's a really interesting cross-pollination story here max has a physics background he did a phd in physics he knows all about manifolds and topologies and symmetries and quantum and actually this has been his operating playbook he's brought all of these incredible concepts in from the physics world to machine learning now there's a fundamental blank slate paradigm in machine learning experience and data currently rule the roost but max wants to build a house on top of that blank slate max thinks that there are no predictions without assumptions no generalization without inductive bias the bias variance trade-off tells us that we need to use additional human knowledge when data is insufficient i think it's fair to say that maxwelling has pioneered many of the most sophisticated inductive priors and deep learning models developed in recent years an example of an inductor prior is the cnn which means we can model local connectivity weight sharing and equivariance to translational symmetries in gridded vision data this is imputing human domain knowledge into the architecture it makes the model significantly more robust and sample efficient assumptions are everywhere even fully connected networks assume that there is a hierarchical organization of concepts and even further assumptions about the smoothness of the underlying function we're estimating max and many of his collaborators for example taco cohen took this idea so much further they introduced rotational equivariance and then they built models which would work extremely efficiently on non-geometric curved manifolds meshes or even graphs max wants to reduce the need for data in deep learning models increasing the representational fidelity of neural networks subject to discretization and sampling errors and improving the computational techniques to process them more efficiently max has recently put out two new papers quantum deformed neural networks and probabilistic numeric convolution on your networks which we'll be talking about today anyway max it's an absolute pleasure welcome to the show thank you very much tim for a very nice introduction it almost sounded like it's not me but it was a lot do you feel that this is this it describes you not maybe accurately but do you feel like there's a parts of your work that are overly well known and there may be parts of your work that you wish would be more well known it's hard to say it's overly well known because of course it's very enjoyable and you can make a big impact but what i can say is that it's incredibly hard to predict what will become well known of course if you could predict it you would only write papers with like gazillions of citations when we did 
things like the vae or graph neural nets it didn't feel at all like this was going to be a big hit and some of these things are being singled out and they fly and precisely what makes this these papers fly is you know that's a big puzzle in a way and some other papers you can be very proud of and it takes so much time to actually get published it's a huge uphill battle you think why do the reviewers not understand better what we really want to do here and then yeah and so they i guess there's a lot of good work which disappears into oblivion and from many people and yeah it's mysterious but anyway your hits definitely seem to be more than your misses you're a prolific researcher yourself but you've nurtured some of the best and brightest minds across in not just deep learning but like the wider machine learning field how do you consistently do that is it fantastic mentorship or is it more finding the right spark in a student and nurturing that yeah that's a really good question and i should say that i've been extremely blessed by all these fantastic students right from the beginning but i do think there is something to nurturing talent so i think what doesn't work is to basically tell to be very constrained to a particular topic sometimes you see this happen if you write a grant proposal and then the grant proposal is about topic a and then really the student starts at topic a but figures out after a couple of months that they don't really like topic a and they want to move on to b and it's just very painful then to say no no you cannot do that you have to be doing a and so i've been very blessed with being able even with my industry funding to provide this level of freedom to the students and i think this is really key so the other thing which i find really key is that the relationship you have with the student is very important first of all it changes over the years which is also very beautiful so you start off with much more guidance and towards the end you should actually not be doing any supervision you should just having a conversation at that point on equal footing and you see about halfway through a phd like it's like a flower that opens and then now they get it suddenly right now they get it and they go and they have a wild huge interesting ideas in all directions and they can write all these papers and stuff so that's a beautiful moment when that happens and the other thing i think is that i i think of supervision as nudging in the sense that i have a big a lot of experience and where is where is the interesting stuff to be found right where is the next wave that we can get people enthusiastic about what are the important questions to address in the community and things like this so that's where my experience lies now i'm not doing a lot of coding myself in fact i'm just all doing almost zero coding which i regret so that's life and the other thing is that even in terms of math it's limited right i write but most maybe two pages of math to verify something or to compute something quickly but not like a lot of math anymore i just try to keep up with literature mostly and the students do though so they do the hard work literally so they really should they should do all that work and it's this interesting relationship where you have a discussion where you say i think you know this is an important direction an interesting direction and here are some other things which are connected to it very intuitively right so you may want to look there and then a good student will just pick up 
these ideas and we'll run with it and then come up with new ideas and then you could say it is maybe be careful about this direction don't don't go too deep maybe this is more an interesting direction stuff like that but even there i've learned to be very careful and if a student comes up with a good idea and intuitively i think that's actually not a great idea this is going to be a dead end that i'm not going to tell the student that very soon i'm just going to certainly leave the student about a month to explore that idea for sure and i've been surprised right i've been surprised and basically it turned out it was a great idea and i was wrong and so i've been very careful with these things too so i feel it's a very careful dance between the student and the supervisor with not too much direction also it's a very personal so some students like more direction and other students like less direction but i think it is a bit of an art i've i've learned to appreciate that this is a little bit of an art to have the right type of relationship with students yeah but of course it's all about them they are the ones that need to shine in in the end after four years and they need to get the good jobs and become famous in terms of that guidance and specifically what you said with respect to this direction might be interesting these are the interesting research directions is this something that you just have to develop or do you have some general can you give some high level patterns that you've observed throughout the years where you see recurring things and and you say ah that's another one of those probably like short-term hypes or yeah have you observed some general patterns there yeah so there's two things right so there's some things where i think why what is the big deal why is everybody chasing this particular direction so that's can you predict what the crowd will follow that's one thing seems pretty hard the other one is to find directions which maybe on longer time skills are impactful and interesting and for the second one it is deeply intuitive and it's very hard to figure out precisely what it is what features there are but for me i have to get a sense that there is some something very deep going on that i want to pursue like uh for instance so i clearly in physics so if you can think about gage symmetry like symmetries have this this deep feeling right symmetries pervade basically all theories of physics and they have this profound impact on how you formulate the mathematics of a theory and so there's something very deep about symmetries and and about you know manifolds and doing things on curved spaces and so that's i could sort of naturally drawn into this thing not now it's more quantum mechanics and there's something very deep and especially when it becomes almost mysterious right quantum mechanics is almost mysterious how on earth is quantum mechanics possible if you dive a little bit into this phenomenon of the two-slit experiment where you have these individual photons which which go over two paths and if it's a wave that's perfectly fine they can interfere with each other but now these photons can go one by one and somehow they have to be aware of this other possibility that they could have taken to interfere with that other possibility i just think that's crazy what's going on here and so i'm naturally drawn into sort of these kinds of mysteries in some sense yeah and there's plenty more and the other one is also computation clearly right how does the brain compute things also 
feels like a very deep question right how do we even compute things what is computation even and does the universe compute its solution what does it mean to be predictable can you predict can you compute faster than the universe can compute and so there's all these very deep questions about computation as well that you're going to ask but there's a mixture between things that are attractive in that sort of mysterious sense there's something very deep that needs to be pursued and things which are also highly practical which is sometimes it's also a lot of fun to work on something that where you can actually make a big impact for instance speed up mri imaging with a factor of 10 so now suddenly you can actually both image and radiate cancer at the same time which could have a huge impact in the future right now and feeling that level of impact is also quite exciting i think amazing so i wanted to um frame up some of the work that you've done around symmetries and manifolds it's absolutely fascinating the prevailing idea is that we are wasting the representational capacity of neural networks because we're essentially learning the same thing many times and your work absolutely pioneered this starting with sort of rotational equivariance on on cnns and then moving on to meshes and graphs and and different types of topology it's absolutely fascinating but philosophically the modus operandi in deep learning is this blank slate idea this idea that if we look at data and nothing else then we can learn everything we need to presumably not in a very sample efficient way transformers seems to be going in this direction in the natural language processing world that we just ingest infinite amounts of data and we can learn everything we need to and we spoke to a good old fashioned ai person walid sabha last week and and his argument was that the information is not in the data he was arguing that we have a kind of ontology or knowledge built into us which we can use to disambiguate information that we receive so fundamentally speaking do you believe that we can be data driven and can you introduce some of the work you've done with some of these priors in deep learning yeah so this is a very fundamental debate clearly but i think it's not all that black and white right so there is a basically at the core of machine learning there is basically trade-offs the the buy is variance trade-off for instance it clearly expresses this right the first thing i want to say there is no machine learning without assumptions it's just basically you have to interpolate between the dots and to interpolate means that you have to make assumptions on smoothness or something like that so the machine learning doesn't exist without assumptions i think that's very clear clearly it's a dial right so you can have on the one hand you can have problems with a huge amount of data it has to be available clearly and there you can dial down your inductive biases you can basically say let that the data do most of the work in some sense and let me make my prior assumptions quite minimal and with minimal i think i'm interested in a smooth mapping right the mapping needs to be smooth like that's a very minimal assumption but the disadvantage of that is if you don't put any prior assumptions is that if you need to take whatever you've learned into a new domain where this model wasn't learned it will very quickly break down because these prior assumptions will help you transfer from one domain to another domain and causality does play a big role 
here but we can talk about that later and then on the other hand there is basically what everybody else in the scientific community does which is write down a model of the world which we call a generative model which is how do i imagine that the world that i'm seeing in my measurement apparatus could have been generated by nature and and that's that you can put a lot of intuitive knowledge there because you could think the world is described by a pde or some kind of generated model so so people in our community often call this probabilistic programming models created by probabilistic programs or graphical models but they are highly intuitive highly interpretable and because they describe the generative process they are often also causal because you can think of these variables and one causes the other variable to happen etc and because they are causal they really generalize very well which means that if i train you know someone in one context let's say i learn to drive in the netherlands i'm driving on the right side on the road i have particular kind of traffic signs etc so now i can take whatever i've learned sort of these rules or whatever i've learned and now i can move to another country where you drive on the left-hand side of the road completely different traffic signs and i can still survive so this is typically something that the the the purely data-driven methods have a much harder time doing this sort of generalization so i think this it's basically a trade-off now it's it's the big question in some sense over time is can we simply take the data-driven approach and extend it all the way to agi but there's people on one side of the fence that are claiming that this is possible right of course we also need to amplify computation right so we're just going to build faster and faster computers that can digest more and more data and at some point we'll just have agi emerge out of this kind of process and then on the other side which is where the classical ai sort of community which is no no that's going to be ridiculous you will never be able to do that you really need to imbue these models with the structure of the world which which i take as how does physics work how does the world work can i tell you something about how data really gets generated in this work this will cut down the number of parameters to learn dramatically and it because i'm following causality i can now basically generalize and create agi in this way and so it's going to be very interesting how this is going to play out and now to be honest so i feel that i'm slightly in the camp of you really need to put generative information into your models but i've been continually surprised by what's happening on the other side of course a lot lots of my work is also on the other side in the sense that gpt3 you know is completely 100 data driven and did we expect that it would do so well no so here's another big surprise right and so that's i think that's the fun part but it kind of doesn't do well though it doesn't have any reversibility so if you ask it how many feet fit in a shoe or we did the example last week so the corner table wants another beer it doesn't know that the corner table is a person because that's missing information we would fill in those gaps but it does raise the question though of the dichotomy between memorization and compute and the guy we were speaking to last week just said that even if you had an infinite amount of memory and the the data's just not there you couldn't do it when you were 
responding to rich sutton and you you actually spoke about all the different schools of thought in machine learning so you said compute driven versus knowledge and model driven or data driven and symbolic or statistical and white box or black box and generative and discriminative the generative thing is fascinating because our brains it's a bit like we've got the matrix we've got a simulation going on behind the scenes haven't we we're always thinking about all these potential situations and possibly integrating between them yes i do agree that seems to be the real trick for intelligence at least in humans so our ability to generate the world at least at a symbolic level we don't generate like high resolution videos in our brain but we do generate objects and interactions between objects and sort of how things will play out and this will also help us imagine things like what would have happened if i would have done this so now i can play out this alternative world and say that was bad let me not do this now so i think that is going to be a key so that's a generative part of the modeling because you can generate you understand how the world works the physics of the world works and so you can generate possible futures to me i feel that's going to be a really important part of intelligence and i do agree that it's for me also very hard to see that you can generate enough data to cover all corner cases it's just very tough if you do it in the wrong direction which is the discriminative direction but again i have been surprised by how good these models really are and so you say gpt3 isn't very good maybe but it's a receding horizon right people may have not thought this was true or bet on something like gpt3 before it appeared and then it appeared and people were extremely impressed and then of course some people poke it and say but it doesn't understand this and this and then excitement goes away again a little bit but it is a bit of a receding horizon but i have generally be very impressed also with for instance the fact that we can now generate faces of people that that don't exist we can create billions of faces that don't exist on this planet and they look absolutely realistic would i have expected this no probably not so no and then of course there's alphago and things like this which we also wouldn't have expected right before it happens let me play a bit of devil's advocate with respect to building priors into models it it's of course like some of the easiest priors we can think of are let's say translation invariants in a cnn you can also extend this to rotational invariance and so on but if we look at a true practical problem we say yes it makes sense that there is rotational invariance in the world however on the imagenet data set like for a real practical problem the sky is usually up and the object is usually in the center it's not like to the side it's usually in the center so in a way it seems like if we actually hit the true invariance that the world adheres to it's certainly beneficial but if we even slightly deviate if we build in a different invariance it seems like there is a level of accuracy and if we want to get past that these invariants seems to be hurting do you have a sense of can it be counterproductive or when is it counterproductive to build in such invariances it's a very good question and so this goes to the point of the bias variance decomposition again so if you hit the right bias then it can be beneficial if you you know impose the wrong bias then it's going 
to hurt you and this is a well-known trade-off so of course the whole endeavor of machine learning is defining the right inductive biases and leaving whatever you don't know to the data and then basically learning to focus your models on the data that you're actually seeing but i agree if you put the wrong inductive bias in it things will things can actually deteriorate now i should say here that for the rotation invariance or equivalence things are not as bad as you might think so you said if you just have slightly wrong inductive bias then it hurts but that happens to be not so much the case because there's objects inside images that do have if you turn a cat upside down or a tree upside down we still recognize it as a tree in some sense and it does give you a sort of robustness to to certain transformations on these objects that you would otherwise maybe try to model by data augmentation and stuff like that now for the sky maybe you're right that's similarly in the digits a six and a nine you know you will start to confuse a six and a nine if you build in rotation equivariance right and so there it will actually hurt but has been surprisingly robust actually because basically because you also cut down on the number of parameters and by cutting down on the number of parameters you will you can actually help the system generalize better so the inductive bias doesn't have to be perfect and it can still help could we touch on the dichotomy between the work you've done and capsule networks for example as well as the sample efficiency thing for example with translational equivariance it means that you can you can move the dog and then the response map the dog has moved as well and much of that is about allowing neural networks to learn patterns more easily because they can map in every single layer so with capsule networks that's still a blank slate philosophy so you don't explicitly say what the capsules are whereas with with your approaches you explicitly define the priors with capsules it seems to be defined by the data you give it so if you train a capsule network on mnist data it might inadvertently learn that one of the capsules is how bendy the stroke width is on the seven or it might learn that there's a rotation on the car because you've given it lots of rotated versions of the same car but it seems quite arbitrary and the algorithm is hideously inefficient and what's much more exciting to me is the kind of baked in prize that you've designed in the encoder stage so could you draw the dots up between those two approaches yeah i think you actually you said it quite right so one is a much more constrained system than the other one but the actual representations that we put in our hidden layers in both cases are very similar they are stacks vectors and these vectors transform under certain operations so if i rotate the input then there is some operation on the stack of vectors which is they do rotate in the x y plane but they also permute in the sort of vector dimension and so that we tell it very explicitly how to transform we just say under these transformations you have to transform like this and we can do this because these these geometric transformations we know them that they appear in the real world but so it's also constraining because there is many other transformations that either we don't know precisely what the mathematics for the representations looks like or for instance like groups that are that are not compact maybe we have looked at scaling but it's already more of a 
but of course you can do many other types of transformations that don't even have to be groups there could be other types of transformations like lighting changes or whatever if you wanted to incorporate all of these you would have to build the mathematical representation theory for each of them and then it would also explode in the number of feature maps that you would have to maintain and it's not a very practical approach so this works up to the transformation groups that we understand and that are everywhere around us if we want to go beyond that then basically something like capsules is very nice because they tell you well we'll just keep the abstract nature of what we want which is some stack of things that transform in some way that we can vaguely specify and then we ask it to learn all these things and we are actually ourselves also looking at these sort of more relaxed notions of equivariance where we don't tell the system precisely how to change we just want this to emerge automatically and again here the connection with the brain is very interesting in the brain we do seem to have all sorts of filters which are related not only by rotations but by all sorts of other transformations and they are topographically organized so the ones that are related like a slightly rotated version is sitting right next to the other one in your brain and so presumably your brain has figured this out by just looking into the world for a long time and it's organized all these filters that way and it's known that if you prevent let's say a cat from seeing then it will not come up with this nice organization so you really have to get that by looking into the world a lot and that's super fascinating and i think that's where some of our research is being directed now can we learn from how this happens in the brain is there a connection between these topographic maps and equivariance somehow and between capsules and all of these things and i believe that it is a good strategy to take the general idea of equivariance and then slowly relax it and let the system learn more and more so these capsule networks they've been a bit hyped when they were not really developed but named first by jeff hinton and he has this concept or at least had it at the beginning that it's some sort of inverse rendering pipeline so the capsule networks take in the world and they inverse render it into these capsules how much do you agree with that type of formulation it seems what you've described is more of a forward way of looking at capsules where we have these invariances yeah so i very much agree with this idea that you have smaller things and they can be used in multiple ways but you have to align them in a particular way so that they build something at the higher level and obviously you can invert that idea too in order to start at something very abstract and then generate certain things this way and there is a lot of work now actually going into equivariant generative models for instance equivariant flows a prime example is in physics in what's called quantum chromodynamics there's a theory that has a huge number of symmetries called gauge symmetries and if you transform these quarks in a particular way the physics doesn't change you will have exactly the same observations right but still you need all these symmetries to conveniently describe this model and so now when you
generate if you want to generate quark fields or gauge fields or something like this then you can generate all these symmetries and it's not very helpful because if you generate one configuration and then you generate all these sort of equivalent things which are only different by a symmetry then you haven't really done much so understanding how to generate with this equivariance in it is actually a big topic of research in many groups i think danilo rezende and others at deepmind have done a lot of work and there are physicists at mit who are doing work and we with a group of students and physicists at amsterdam are also looking at these types of questions so that's i guess the inverse problem where also inference is playing an increasingly important role not many people are working on capsules i feel they've fallen out of favor with the public because i don't know they're maybe hard to implement or they don't really work as advertised let's say do you have general thoughts about capsule networks i think with many of these things there is an underlying intuition which is correct so i haven't really worked myself on trying to implement them but what you often see in this field is that there is an intuition about how something should work and often that is the correct intuition especially when it's coming from jeff hinton it is very likely to be the correct intuition now then there's the next step which is how do you make something practically implementable and these days that means that you have to run it super fast right you have to be able to implement it on gpus all these kinds of constraints otherwise you will be so much slower than just an ordinary cnn and you will basically not be able to train as long as a cnn and you cannot train as many parameters as an ordinary cnn and you will not beat it and if you don't have the bold numbers it's hard to publish and so that might impede progress in something like this but then what happens is you wait for five or ten years and then the computers have become faster and then people go back to these ideas and think oh that was actually very interesting let me try again and then suddenly things start to work now that's of course the story of deep learning more generally speaking right because we had neural networks a long time ago right in the 80s it was actually quite popular to work on these things but they didn't quite take off because we didn't have the compute power and maybe also not the data to really train them well and it's only when we took them out of the closet again and said hey this thing actually works if you throw a whole bunch of gpus at it that's when they became popular again and so something like this might well happen again with capsules or something like capsules in about five to ten years other than capsule networks are there things that right now we are not looking back on but that would be worthy of a revisit oh yeah that's very tough but i'm personally looking at things like ica and topographic ica so i think there's an interesting body of ideas there of course i now risk mowing the grass away from under my own feet but okay let me entertain that and then probably markov random fields and things like this will make a comeback at some point or graphical models more generally will probably make a comeback or maybe integrated with deep learning
and some people have already attempted going in that direction energy based models have made a comeback already so yeah it is often going back to older ideas and there's probably a lot more that other people can name i do have a prediction maybe for the future something that people really haven't looked at yet in my opinion is going to be quantum things so i think many people don't actually understand the language of quantum mechanics or its mathematics and i do think that first of all that language is very interesting it's a bit different than our normal probabilities it's like square roots of probabilities and with the advent of quantum computers which at some point will come we as a community will have to dive into that and maybe make it part of our curriculum in university and then i think that will start to boom could i just quickly introduce before we get to quantum i don't know if you read sara hooker's the hardware lottery paper and this is fascinating for you of course working at qualcomm but her idea was that there are certain things that cause inertia or friction in the marketplace of ideas so is it a meritocracy of ideas or do the previous hardware decisions and the hardware landscape enslave us ideas succeeded because they're compatible with the hardware and the software at the time and this is what she called a hardware lottery and she says the machine learning community is exceptional because the pace of innovation is so fast it's not like in the hardware world which you all know so well where it costs so much money to develop new hardware and the cost of being scooped is so high but i wanted to just come at this i don't know how you see this right so with capsule networks the reason they're so slow is because it's a kind of sequential computing paradigm and no amount of hardware is going to solve that but there are entirely different paradigms of hardware like quantum which could potentially change the game but how much could they change the game is there still some limit on it there's always a limit clearly but i think the situation is maybe slightly more subtle which is that working in a hardware company i can also see the other side of the coin a little bit so there is also a race in the hardware companies to build asic designs which is specialized hardware to run the latest and greatest machine learning algorithms which are being developed so it's not just about which machine learning algorithms work well on the current hardware with us being enslaved to the hardware there is actually a feedback where now the companies are trying to build asics to first of all run the convolutions very efficiently and soon we'll probably have transformers run very efficiently and so that's the fascinating thing of course it's hard to get your paper published perhaps if you're ahead of the game too much which i see a little bit in the machine learning community if you look at the papers which are published in the physics community they work with images of four by four pixels that's what they can do because otherwise you need a quantum computer obviously to run your algorithm and it's being looked down upon a lot by the machine learners basically saying why is that interesting and i do feel very strongly that as a field we need to open up so we should value original ideas much more than we currently do and i don't know you can probably have a whole conversation on where this is coming from i think
the reviewing in our community is far too grumpy i think if it's not a completely finished polished paper then they'll find a hole somewhere and they start pushing on it and i think you should also look more holistically at how original is this idea right can you be excited about the originality and the creativity of the idea that went into it and trust that maybe it takes the community a couple of years to further develop this and some things will die and that's fine but let all these flowers grow in a way and yeah so i do feel a little bit that it sometimes is a bit negative and that's maybe where some of the friction is coming from yeah just on that it's fascinating we're talking to kenneth stanley on monday and he wrote a book why greatness cannot be planned and his big thing is exactly what you've just said that we have this convergent behavior in so many of our systems whether it's science or academia and it's because of this objective obsession so we all want to monotonically increase our objectives and what we should be is treasure hunters yes science should be about exploration not exploitation exploitation is one step away you already know how to build the bridge we don't seem to have this paradigm at the moment even when you submit your paper to be reviewed there's a consensus mechanism isn't there because you need to have multiple accepts from people and science advances one funeral at a time yes do you think this is a huge problem yeah i think it is a big problem because it will hold us back and it will also hold very brilliant students back so what do i advise my students now i advise them to have a mixed sort of policy which says on the one hand you work on some papers which are easy to score on things that are very popular in the community and then on the other hand you work on things which might be huge innovations that are much more uncertain they might fail but then there might also be really big innovations and that way you get your papers and you can become famous but at least you also work on things which are highly risky but in fact it's a bit cynical to have to do that it would be much nicer if there would be much more appreciation for just originality but i do also believe there is a solution to this so i think we are in a sort of local minimum as a community in this sense but i think there's a way out and one way out which has already been proposed a long time ago i think yann lecun and yoshua bengio were also talking about this and we are actually trying to implement this for a bayesian deep learning workshop is to throw papers on the arxiv and not necessarily submit to conferences and to have open reviewing of that and to give people reputation in doing so so if you give a good review you can publish your own review or state it in your cv and people can rate your review and if you do poor reviews you'll get horrible ratings and then your reputation will come down so there is some kind of system that you can probably design where people are incentivized to give good reviews and to actually use these reviews as half a paper that you can also be proud of and then good things will come up right at some point people will point to interesting ideas maybe we need some kind of recommender to make sure it's a bit unbiased in the sense that it's not only the famous people that will get their
papers exposed but also less famous people so we maybe need to build a recommender around something like that every now and then a conference comes by and it harvests in this field of papers let's say these ones are all reviewed they have great reviews i'll take one or two more reviews anonymously and i'll then publish and invite you to present your paper in our conference that to me sounds like a much more natural way to proceed i also find it very demotivating for my students who have these ideas maybe this is the worst part so you're a student you're working on this thing which is not completely mainstream and then you get rejected two or three times from a conference right this is so demotivating for a student to then continue in this case at least you just push this on the arxiv and you engage with the community around your paper and that's much less demotivating than these constant rejections from the big prizes right the neurips paper or the icml paper that everybody wants this idea is the zeitgeist at the moment a lot of people are talking about this how can we improve peer review like in every field people are moving towards this open review model but collectively as a research community we don't really have the collaboration tools at this point in time to take advantage of it open review is willing to implement this actually so yeah i think it will happen it's a good time to pivot back into new ideas and sort of exciting concepts that are coming out of machine learning at the moment you mentioned quantum computing as this paradigm that's really critical but not really well understood by the machine learning community would you be able to give our listeners the five-minute spiel about quantum probability and how it differs from the probabilities that we're used to yeah so you can think of quantum mechanics as another theory of statistics in some sense right so in ai for everything we can't totally observe we write down probabilities of things happening of course underlying it there are processes but we just don't observe everything and so we describe it by probability now in quantum mechanics it's very similar to taking the square root of a negative number in some sense or let me put it another way it's very much like taking the square root of a probability where that square root can actually become negative so minus one squared is one right or let's say minus two squared is four four is your probability and minus two could be your quantum amplitude this thing can be negative and the bizarre thing is that if you describe a system by these quantum amplitudes these square roots then they can cancel and this is the counterintuitive part you can have an amplitude for an event and then another amplitude for that same event and you would think that if there are two probabilities for that event to happen then the probability that the event happens should grow but in quantum mechanics they can cancel and then the probability is suddenly zero that the event happens so this seems bizarre but nature has chosen this theory of statistics anyway and so it behooves us to look into this more so the first question is can you write down maybe even normal classical problems more conveniently in this quantum statistics and here i always remind myself of when i first learned complex numbers when you learn to solve the damped oscillator equation you can do it in a complicated way or you can go to complex numbers and then suddenly it gets very easy to do it
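As a small numerical illustration of the cancellation just described, here is a sketch in numpy: two paths to the same event whose classical probabilities would add, but whose amplitudes interfere destructively. The numbers are arbitrary choices for illustration.

```python
# A tiny illustration of the point above: classical probabilities of two
# alternative paths add, while quantum amplitudes (square roots of
# probabilities, possibly negative or complex) are added first and only then
# squared, so they can cancel.
import numpy as np

# Two indistinguishable ways for the same event to happen.
amp_path_1 = 1 / np.sqrt(2)      # amplitude of path 1
amp_path_2 = -1 / np.sqrt(2)     # amplitude of path 2 (note the sign)

# Classical reasoning: add the probabilities of the two paths.
p_classical = amp_path_1**2 + amp_path_2**2
print(p_classical)               # 1.0 -- the event looks certain

# Quantum reasoning: add the amplitudes, then square the magnitude.
p_quantum = abs(amp_path_1 + amp_path_2)**2
print(p_quantum)                 # 0.0 -- the two paths interfere destructively
```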
and so you can imagine that there are things to compute in classical statistics where there are actually shortcuts by using quantum mechanics somehow and so the first thing that we've tried to do with quantum mechanics in deep learning is to say can we just design an architecture that would be a natural fit to this quantum mechanical description of the world but we still want to be able to run it on a classical computer we just want to harvest these new degrees of freedom that we have from quantum mechanics and so that was a paper that we recently pushed on the arxiv which is called quantum deformed neural networks where we basically first say okay what if we took a normal neural net and implemented it on a quantum computer and then we slightly deform it into something where states get entangled and this entanglement is another strange phenomenon in quantum mechanics where you can create states which you cannot really create classically superpositions of states and so by doing it in this particular way we could still run it efficiently on a classical computer but it's just a very different beast than a normal neural network so that's already to me very interesting and then of course the big prize the big bonus if you adhere to this way of describing what's happening is that there is the opportunity to be able to run things very efficiently on a quantum computer so now you can design your neural network in such a way that classically it will actually be very hard to simulate but then on a quantum computer you could potentially simulate it very efficiently and of course we don't have quantum computers so it's very hard to actually prove your point but that's also what makes it somewhat exciting in that paper specifically you make lots of references and connections to the bayesian way of doing machine learning what's the connection there because it seems different i agree both are statistics and you already mentioned the square roots of probabilities but how do you connect the sort of uncertainty quantification in the bayesian way with how particles move quantum mechanics is not necessarily about particles you can write quantum mechanics down just on states say a number of classical states like a sequence of zeros and ones and there's an exponential number of these states and then you can say classically i can only be in one of these states but in quantum mechanics i can be in any linear combination of these states which is a bigger space now what we did in that paper was to say we can treat both the world state as well as the parameter state as quantum we describe them by a quantum wave function and then we entangle these different states which is similar to saying that i take my classical state x i multiply it by a matrix of parameters and i get a new state out so here the analogy would be i have my quantum superposition of classical states i have a quantum superposition of parameter states and then there is some process where they get entangled together and then i do a measurement which is now a function of both the parameters as well as the inputs and you train it to give you measurements that with high probability give you the answer that you want so that would be the training process
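This is not the construction from the quantum deformed neural networks paper itself, just a toy numpy illustration of the ingredients mentioned above: superpositions over classical bit strings, an entangled state that cannot be written as a product of single-system states, and measurement via the Born rule.

```python
# Not the model from the paper -- a toy numpy illustration of superposition,
# entanglement of two two-state systems, and measurement probabilities.
import numpy as np

# Basis states |0> and |1> of a single two-state system.
ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

# A product (unentangled) state of two subsystems: |+> tensor |+>.
plus = (ket0 + ket1) / np.sqrt(2)
product_state = np.kron(plus, plus)

# An entangled Bell state: (|00> + |11>) / sqrt(2).
# It cannot be written as kron(a, b) for any single-system states a and b.
bell = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)

def measurement_probs(state):
    """Born rule: probability of each classical outcome = |amplitude|^2."""
    return np.abs(state) ** 2

print(measurement_probs(product_state))  # [0.25 0.25 0.25 0.25] over 00,01,10,11
print(measurement_probs(bell))           # [0.5  0.   0.   0.5 ] -- correlated outcomes
```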
now there is actually a very precise way in which you can relate bayesian posterior inference to quantum mechanics but that's a fairly technical story using density matrices there is a fairly precise way in which you can say i have a state described by a density matrix and if i do a measurement i condition on something and i renormalize and stuff like that so that's possible so there are two things first of all the quantum neural network formulation can be very slow on a classical computer but fast on a quantum computer on the other hand people do run bayesian inference on classical computers what makes the quantum neural networks that much harder to compute yeah it's this entanglement issue but classically i agree there is an analogy in classical statistics where this looks very similar which is for instance if i have an exponentially large state space and i write down a probability distribution over all of these possible states where there is a positive number for each one of these exponentially many states and they sum to one and if i ask you to now compute an average of a function over this probability distribution you can't do it because there's an exponentially large number of things that you would have to sum and so we have ways to deal with it which is sampling from these distributions or variational approximations in any case we have to approximate this state of affairs now in quantum it's fairly similar you face a similar exponential problem and you can also do approximations to get around that but the interesting part is that in quantum mechanics you can for instance do a measurement and a measurement is again a physical thing and it's not very hard to do but it will be an operation which looks like sampling something down to a particular classical state again and it does look like the sampling operation that we do sort of artificially in probability theory but it's also true that quantum computers can in principle compute things that classical computers can't compute and they can actually compute it much faster whether that actually maps to the things that we are interested in is not so clear so it's not at all clear right now that we will actually build quantum neural networks that are generalizing a lot better on classical problems right if you want to do classical predictions does it actually help to build a neural network that can run efficiently on a quantum computer that can do these predictions much better that's not known but that's what makes it exciting in my opinion because you can try to do it now there are also functions that you can't even compute classically you have to do it quantum mechanically but i don't know how relevant they are for ai fascinating can we conceivably say that at least one application or way for these quantum neural networks to come in is in the place where right now we have these normalization problems let's say big word embeddings or as you mentioned things like variational inference anywhere where you have a partition function that you have to sample to compute now we potentially introduce this new way of doing this yes i would say that is a different set of problems so there are some sampling algorithms which can be sped up by quantum sampling algorithms but i think the maximum speed up is like a square root so it's not insignificant but it's also not exponential okay right so you can do something in square root time of what a normal classical computer could do
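To make the classical analogy above concrete, here is a small sketch of the exponential-state-space point: the exact expectation requires summing over 2^n states, while sampling gives an estimate. The value of n, the distribution, and the test function are arbitrary choices for illustration.

```python
# The exact expectation over all 2^n binary states is intractable for large n,
# but sampling gives an unbiased estimate; for small n we can compare both.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 12                                    # 2**12 = 4096 states, still enumerable
theta = rng.normal(size=n)                # parameters of an independent-bit model
p_one = 1 / (1 + np.exp(-theta))          # probability that each bit equals 1

def f(x):
    # an arbitrary function of the state whose expectation we want
    return np.sin(x @ np.arange(1, n + 1, dtype=float))

# Exact expectation: sum over all 2^n states, weighted by their probability.
exact = 0.0
for bits in itertools.product([0, 1], repeat=n):
    x = np.array(bits, dtype=float)
    prob = np.prod(np.where(x == 1, p_one, 1 - p_one))
    exact += prob * f(x)

# Monte Carlo estimate: draw samples instead of enumerating.
samples = (rng.random(size=(20000, n)) < p_one).astype(float)
estimate = np.mean([f(x) for x in samples])

print(exact, estimate)    # the two numbers agree to within sampling error
```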
and then there are these very interesting stories where people thought that they could do things much faster on a quantum computer but then somebody thought really hard about it and they then invented a quantum inspired classical randomized algorithm which would do about the same speed or close to it at least so it's very uncertain precisely what we can speed up but that's what makes it interesting right if you can predict what's going to happen then in some sense it's just a matter of executing right but if you don't know what the low-hanging fruit is and if there is low-hanging fruit and what the possible benefits are the possible bonus that you can get by doing these things then it gets really interesting in my opinion amazing now might be a good time to talk about your other paper that's just come out max which is probabilistic numeric convolutional neural networks and this was also with marc finzi who we just discovered this morning brought out a really interesting paper about equivariance on lie groups so that might be a potential digression later but this work is really fascinating because it's in the setting of irregularly sampled data and we use these gaussian processes to represent that and we can continuously interpolate between them in this convolutional setting absolutely fascinating could you give us the elevator pitch yeah first let me say again that marc finzi was an intern at qualcomm and roberto bondesan is the other person who was also working with me on the quantum stuff so those are my collaborators on this project and of course marc did the bulk of the work for this paper so he deserves much of the credit for it but here's the observation that we had when we write down a deep learning algorithm let's say on an image then we sort of treat the image as pixels and we think that's the real signal that we are looking at but you can also ask yourself what if i remove every second pixel now i actually have a very different neural network but should i have a very different neural network or what if the pixels are actually quite randomly distributed in the plane with just some random places where i do measurements maybe more in the upper left corner and fewer in the lower left corner the predictor should behave in a certain consistent way and so of course then you come to realize that really what you're doing with the pixel grid is sampling an underlying continuous signal so then we just started thinking how do you best deal with this how can you build this in and so there's a very interesting tool which is called the gaussian process it basically interpolates between the dots but in places where you don't have a lot of data you create uncertainty because you don't know what the real signal is so you basically get some kind of interval which says okay i think the signal is somewhere in this interval with 95 percent certainty but i don't know precisely where now the mean function is an actual smooth continuous function and then the next step is to say okay what does it mean to do a convolution on this space this new gaussian process interpolated space and what we found is that the most interesting way to describe that is by looking at it as a partial differential equation and so this ties back into another really interesting line of work which was started by david duvenaud and co-authors on thinking of a neural network as an ode an ordinary differential equation
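Here is a minimal one-dimensional sketch of the Gaussian-process step being described, assuming a standard RBF kernel and noiseless observations; it shows the interpolating mean and the uncertainty that widens away from the samples. This is only the GP ingredient, not the probabilistic numeric CNN itself, and all the constants are illustrative.

```python
# A minimal 1-D Gaussian-process interpolation of irregularly placed samples:
# posterior mean plus a standard deviation that grows where data is sparse.
import numpy as np

def rbf(a, b, lengthscale=0.3, variance=1.0):
    """Squared-exponential kernel k(a, b) on 1-D inputs."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(1)
x_obs = np.sort(rng.uniform(0, 1, size=8))     # irregular sample locations
y_obs = np.sin(2 * np.pi * x_obs)              # noiseless observations of a signal
x_query = np.linspace(0, 1, 200)               # where we want the continuous signal

noise = 1e-6                                   # jitter for numerical stability
K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
K_star = rbf(x_query, x_obs)

# GP posterior mean and variance at the query points.
alpha = np.linalg.solve(K, y_obs)
mean = K_star @ alpha
cov = rbf(x_query, x_query) - K_star @ np.linalg.solve(K, K_star.T)
std = np.sqrt(np.clip(np.diag(cov), 0, None))

# A 95% band is roughly mean +/- 1.96 * std; it is tight near x_obs and wide in gaps.
print(mean[:3], std[:3])
```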
so here we're talking about a pde basically because we have spatial extent and so we are looking at sort of derivatives and second order derivatives in the plane basically which we apply to the continuous function so this is literally what people do when they solve a pde they have some operator which consists of derivatives which they apply to the function and then they have a time component which evolves this thing forward in time basically and it turns out that's a very natural way to describe a convolution you can also add symmetries in a very natural way by looking at that operator that sort of moves things forward and making sure it's invariant under certain transformations we had a bit of trouble really handling the non-linearity that happens then so we had to then project it back onto something that could then again be easily handled by a gaussian process etc so we had to do some work there but in the end this became actually a very general and interesting tool which is apply a gaussian process apply a pde apply a non-linearity repeat and then in the end collect all your information and make a prediction and so some of the benefits are now that first of all you can now work on an unstructured set of points it doesn't have to be a grid and you can even learn the positions of those points so you can now direct the observations to places where you really need to do observations in order to improve your prediction so it basically becomes a numerical integration procedure where you can learn where to move your integration points and what i also found very fascinating is that this same paradigm can be mapped again onto a quantum paradigm where you can think of that pde that evolves now as a schrodinger equation that sort of evolves a wave function so it maps very nicely also again to a quantum problem and that's what we are working on now something that's really fascinating that keeps coming up again and again in these sorts of research programs is the matrix exponential it's our connection to groups and algebras or group representations and algebras and of course we use it to evolve our odes and pdes i guess as a physicist you've probably got a deeper appreciation of this particular object but it's something that's still quite alien to a lot of people i know that work in applied machine learning what's the significance of the matrix exponential why does it connect all these really fundamental objects to things like lie groups and stuff like that yeah so it's interesting that we actually just got a paper accepted at neurips on this and it's called the convolution exponential and you can look it up and emiel hoogeboom is the main author and generator of that idea and yeah i guess it's because it is the solution to the ode or the pde right so if you write down something that's very fundamental that is a first order differential equation which says d dt the derivative with respect to t of a state is some operator times that state then the solution of that thing is that the state over time is the matrix exponential of the operator times t applied to the initial state so that's i think where it comes from and one other way to look at it is that in physics it's called the green's function it's basically the solution to this ode so you can think of a neural net as basically we tend to describe it as a discrete map from one point to another point but if you think of it as a continuous process which is what we learned from the ode description of a neural net
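A quick numerical check of that statement: for a linear first-order ODE the map from the initial state to the state at time t is the matrix exponential of the operator times t, which can be compared against a crude forward-Euler integration. The operator, step count, and dimensions below are arbitrary choices for illustration.

```python
# For dx/dt = A x, the flow map from x(0) to x(t) is expm(A t), which is why the
# matrix exponential keeps appearing once networks are viewed as discretized ODEs.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4)) * 0.5      # the operator on the right-hand side
x0 = rng.normal(size=4)                # initial state
t = 1.0

# Closed-form solution via the matrix exponential.
x_exact = expm(A * t) @ x0

# Crude forward-Euler integration of the same ODE for comparison.
steps = 100_000
dt = t / steps
x = x0.copy()
for _ in range(steps):
    x = x + dt * (A @ x)

print(np.max(np.abs(x - x_exact)))     # small: Euler converges to expm(A t) x0
```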
if you think of it as a continuous process then you can just think of that convolution this map as the matrix exponential solution to this ode in the math literature you call this the green's function so you can think of a convolution basically as the green's function of a partial differential equation i think that's why this feels like a very fundamental object in some sense so in a talk you gave recently on the future of graph neural networks you were talking about a number of ideas from physics that hadn't really made it into machine learning among them things like renormalization chaos and holography would you care to unpack these ideas a little bit and tell us where you see the future in these ideas yeah so the reason i mention these is because i think there are a lot of really cool ideas in physics which still remain unexplored but there are more and more physicists who are moving into the field and some of these ideas are actually being worked out as we speak so i recently saw about two papers on renormalization so renormalization is something in physics where basically you start with a system with a whole lot of degrees of freedom like say particles moving around or something like this and then you coarse grain the system slowly and what that means is that by coarse graining you zoom out and you build an effective theory of the underlying theory in the same sense as thermodynamics is an effective theory of statistical mechanics where basically all the particles are now removed but you now have an effective sort of description of your world this is the same as what happens in neural nets right in neural nets we talk about pixels at the bottom layer and maybe edge detectors and at the very top we're talking about objects and relations between objects which are aggregated emergent properties from this neural net and ideas from renormalization theory might very nicely apply to this particular problem and indeed have already been applied with some success the other one which you mentioned was chaos and i think there is a very nice connection actually with chaos theory going back to work i did a long time ago which i called herding in particular you can think of sampling from a particular distribution as a dynamical system a stochastic dynamical system you're at a particular point and then you propose to go somewhere and then you accept or reject a particular point and as you jump through the space you collect the points that you jumped to then you look at that collection and that collection should then actually distribute according to the probability distribution that you're sampling from now that's a stochastic process but if you think very hard about it in fact it's a deterministic process even if you try to make it stochastic and the reason is that every time you're doing a whole bunch of calculations and now and then you call a random number generator but the random number generator really is a pseudo-random number generator it is also a deterministic calculation that you're doing so the whole thing end-to-end is just a deterministic calculation but because you're calling the pseudo-random number generator it looks very stochastic but truly it is a chaotic process and so you should really be able to describe this system by chaos theory and the theory of nonlinear dynamical systems
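As a concrete illustration of a deterministic sampler in this spirit, here is a simplified herding-style sketch for a plain discrete distribution; it is a toy variant for illustration, not the general algorithm from the herding papers, and the target distribution is arbitrary.

```python
# A minimal sketch of the herding idea mentioned above: a fully deterministic
# update that emits a sequence of states whose empirical frequencies track a
# target distribution, with no pseudo-random numbers anywhere.
import numpy as np

def herd(p, num_steps):
    """Deterministically emit indices so that their frequencies approach p."""
    p = np.asarray(p, dtype=float)
    w = p.copy()                  # weights that accumulate leftover probability mass
    out = []
    for _ in range(num_steps):
        i = int(np.argmax(w))     # pick the state whose weight is largest
        out.append(i)
        w += p                    # every state earns its probability mass...
        w[i] -= 1.0               # ...and the emitted state pays a unit back
    return np.array(out)

target = np.array([0.5, 0.3, 0.2])
seq = herd(target, 1000)
freqs = np.bincount(seq, minlength=3) / len(seq)
print(freqs)    # close to [0.5, 0.3, 0.2]; the gap shrinks roughly like 1/T
```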
now what i've been working on with my postdoc and with roberto is thinking about let's make it a little bit less chaotic so let's make this actually a deterministic system which is maybe at the edge of chaos and again this is one of these very deep questions that's in my head there is something very interesting and deep here which is if you try to do a computation on the one hand you want to store information things that you've calculated and for that things need to be stable on the other hand you want to transform information because that's what a calculation is right and so there you want to be in this sort of more chaotic domain and it turns out that the best place to be is at the edge of the two things right you can go to the right a little bit and be more stable and go to the left a little bit and you can transform things and compute things and so i also think that when you're trying to sample and sampling can be equated with learning if you're bayesian about things because learning is basically sampling from the posterior distribution if you can design samplers that are not completely chaotic like the ones that we described now but more structured and less chaotic and more deterministic moving through the space you can learn a lot faster and then you can actually start to map it onto sort of complexity theory notions if you think of this sampling from a discrete set of states what kind of properties do the sequences that i generate have what is the entropy of the sequences that i'm generating for instance or what kind of substructures are there is it for instance going to be periodic or are there periodic substructures inside of it all these things are studied by the theory of chaos and nonlinear dynamical systems so connecting these two fields feels to me like a very fundamental thing to try and do and some people have tried a few things people have looked at a neural net as an iterated map you map things to hidden layers and if you think of that iterated map and ask is that map chaotic then being on the edge of chaos is the best thing you shouldn't be completely immovable because then everything you put in is going to be mapped to the same point very uninteresting you also shouldn't be super chaotic because whatever you put in is going to go to some random point in space and that's not very predictive so you need to be at this intersection between chaos and non-chaos and then you can do interesting computations so this is the same idea right so to me that's exciting because now suddenly a whole field of exciting mathematics is cracked open and you can start to use all these tools in machine learning awesome thank you that's fantastic now might be a good time to go over to reddit we asked reddit for questions and the top rated question is by tsa hi max when will you be changing your last name to pooling so actually there is a paper that a colleague of mine wrote and i think they had an operator where instead of pooling you could do a welling operator so instead of changing my name i propose that we just change the operators that we use and change to welling operators that's wonderful in the thread on reddit there were a few variations as well so maybe max power and someone asserted that pooling is your brother but anyway red portal says the conventional
approach for analyzing continuous convolution would be fourier analysis what was the rationale behind investigating continuous convolutions using probabilistic numerics that's a good question so to me fourier analysis it's true that i guess i could still do a fourier analysis right because the gaussian process you can decompose in terms of its fourier waves and then it's the primal versus the dual view of any sort of kernel method so i could certainly go to the fourier domain and do my calculations in the fourier domain in quantum mechanics this is just another basis and not only in quantum mechanics in any signal processing sense you just think of it as another basis and it's true that a convolution is easier there because it's just a multiplication on the other hand convolutions are very efficient in modern software packages for gpus so sometimes it's also not necessarily faster to do that but it's a good suggestion and maybe something nice happens when you go to fourier space and i just didn't explore that fantastic we've also got jimmy the ant lion who says hi max i notice your co-authors come from a physics background can you explain why there are so many ex-physicists in deep learning yeah so that's interesting i think there are just a lot of physicists and a fraction of those physicists is looking for greener pastures and i myself am one of those i was looking for greener pastures and they bring a really good toolbox so if you've done physics you have just a very good mathematical toolbox but also very good intuition about pdes and how the world works and symmetries and all these kinds of things you bring and i think in some sense physics is also a bit of a container right if you do physics you can still do anything else afterwards in some sense and i think there are just people who are naturally interested in ai of course ai became very popular at some point and so you automatically have people flock into that field but yeah in general they're smart people so i guess it's nice to work with them maybe just to circle back and close the loop to the beginning we were talking about the research community and kind of the machine learning research field i loved what you suggested and as i understand this is not fully your suggestion but the suggestion of let's say having a more open review kind of system where a review could be as powerful as a paper itself i've been screaming for this for a few years now and could i ask if you ever have the chance to propagate this what do you think of the idea of having continuous research this paper notion that we have now i think is so outdated and once my paper is published i have no incentive to update that thing what if we do research in this much more continuous way and then there's comments and then in response to the comments everything changes and so on yeah it's a very good point this is indeed exactly part of the idea that we are trying to get open review to implement the idea is that in open review you have a conversation with your reviewers and it's nice if the reviewers are not anonymous you just have your conversation and other people can even contribute to the project in a more open sciency way but it is also nice now and then to present your work and so that's why i say now and then a conference might come in and harvest papers and just invite people to present their work in a sort of slightly more formal way and maybe
put a stamp of approval on it and say this conference has published this particular paper with some independent reviews and we think it's a great paper and so you get that stamp and i guess there should also be a way to close off a particular project to move on to a new project but i also have the same view as you have of this being a far more continuous process where if you didn't get picked this time next time some conference will come by and pick you out it's much more like a marketplace where ideas go around conferences come in and ask you to publish things and you then just present it and then you can just continue with your research or stop it and go to a new piece of work or something like this so yeah i share that vision basically that's amazing i'm continuously amazed when i read these old papers from let's say schmidhuber and like the first rl papers that just came up with a bit of an idea and then they had a bit of toy data and that's a paper and it's cool do you have any kind of thoughts about or recommendations for the new generation of researchers that are now flooding the field how can we get to a better field what kind of tips would you give them yeah i think we really need to disrupt the field a little bit and i think it's particularly tough for new researchers because the acceptance rates for these conferences are very low and it feels like much of your future career depends on getting papers in there and it's a fairly random process as well so i think we just need to disrupt the field and there's enough people with influence who want that so it's just a matter of actually executing on it and that's what we are doing now for the bayesian deep learning workshop that we are organizing we want this to be an offshoot from neurips it was a very popular workshop there and somehow we got rejected this year and we thought okay we'll just do it ourselves we do actually have a meet-up but then next year we want it to be our own stand-alone conference and for that conference we want to implement this plan and so we are working with open review to actually implement this for us and yarin gal is working hard to try to actually roll this out and we talked to yoshua bengio about it and he's very supportive and there's a whole lot of people who are supportive of it so if this can help to make this a popular model then that will be a fantastic result of this interview but i think people should just push for it and just say okay i'm just fed up with the current way of doing things we should really change things and just shout out and say this is what we want and let's go for it awesome amazing professor max welling it's been an absolute honor and a pleasure to have you on the show thank you so much for joining us today it was great with the three of you asking questions that works really well fantastic thank you so much thank you amazing it was good the questions were really fantastic actually and i've never done this with the three of you but having a team of three people asking questions is really a good idea and of course you're really smart people and knowing what you're talking about so that went really well i think it needs three brains to match yours [Laughter] anyway i really hope you've enjoyed the show today this has been such a special episode for us because max welling is literally one of my heroes so anyway remember to like comment and subscribe
we love reading your comments we we really do actually we're getting so many amazing comments in the comment section so keep them coming and uh we will see you back next week
Info
Channel: Machine Learning Street Talk
Views: 14,682
Rating: 4.968051 out of 5
Keywords: machine learning, deep learning, max welling, tim scarfe, yannic kilcher, qualcomm
Id: mmDw5glry9w
Length: 102min 32sec (6152 seconds)
Published: Sun Jan 03 2021