Miles Cranmer - The Next Great Scientific Theory is Hiding Inside a Neural Network (April 3, 2024)

Captions
So I'm very excited today to talk to you about this idea of interpreting neural networks to get physical insight, which I view as really a new paradigm of doing science. This is work with a huge number of people. I can't individually mention them all, but many of them are here at the Flatiron Institute. I'm going to split this into two parts. In the first, I'm going to talk about how we go from a neural network to insights, how we actually get insights out of a neural network. In the second, I'm going to talk about this Polymathic AI thing, which is basically about building massive neural networks for science.

My motivation for this line of work is examples like the following. There was this paper led by Kimberly Stachenfeld at DeepMind a couple of years ago on learning fast subgrid models for fluid turbulence. What you see here is the ground truth, so this is some box of a fluid. The bottom row is the learned subgrid model for this simulation. The really interesting thing about this is that the model was only trained on 16 simulations, but it actually learned to be more accurate than all traditional subgrid models at that resolution for fluid dynamics. So I think it's really exciting to figure out how the model did that, and what we can learn about science from this neural network.

Another example is work that I did with Dan Tamayo and others on predicting instability in planetary systems. This is a centuries-old problem: you have some compact planetary system and you want to figure out when it goes unstable. People have literally worked on this for centuries. It's a fundamental problem in chaos. But this neural network, trained on, I think it was, maybe 20,000
simulations, is not only more accurate at predicting instability, it also seems to generalize better to different types of systems. So it's really interesting to think: okay, these neural networks seem to have learned something new; how can we actually use that to advance our own understanding? That's my motivation here.

The traditional approach to science has been that you have some low-dimensional data set or some kind of summary statistic, and you build theories to describe that low-dimensional data. You can look throughout the history of science: Kepler's law is an empirical fit to data, and then of course Newton's law of gravitation was required to explain it. Another example is Planck's law. This was actually an empirical fit to data, and quantum mechanics, partially motivated by this, was required to explain it. So this is the normal approach to building theories. Of course it's not only this; it also involves many other things. But I think it's really exciting to think about how we can involve interpretation of data-driven models in this process, very generally. So that's what I'm going to talk about today.

I'm going to conjecture that in this era of AI, where we have these massive neural networks that seem to outperform all of our traditional theory, we might want to consider this approach where we use a neural network as essentially a compression tool, or some kind of tool that pulls apart common patterns in a data set, and we build theories not to describe the data directly but really to describe the neural network and what the neural network has learned. I think this is an exciting new approach to, really, science in general, I think especially the physical
sciences. The key point here is that neural networks trained on massive amounts of data, with very flexible functions, seem to find new things that are not in our existing theory. I showed you the example with turbulence, where we can find better subgrid models just from data, and we can also do this with the planetary dynamics. So I think our challenge as scientists for those problems is distilling those insights into our language, incorporating them into our theory. I think this is a really exciting way to look at these models.

So I'm going to break this down a bit. The first thing I would like to do is go through what machine learning is and how it works, and then talk about how you apply it to different data sets.

Okay, so going back to the very fundamentals: linear regression in 1D. I would argue that if you don't really have physical meaning for these parameters yet, it is a type of machine learning. These are scalars, right? X and y are scalars, and the two coefficients are scalar parameters: a linear model. You go one step beyond that and you get this shallow network. Again, this has a 1D input x and a 1D output y, but now we've introduced this layer. We have these linear models, so we have three hidden neurons here, and they pass through this function a. This is called an activation function, and what it does is give the model a way of including some nonlinearity. The one that most people would reach for first is the rectified linear unit, or ReLU. Essentially what this does is say: if the input is less than zero, set it to zero; if it's greater than zero, leave it. This is a very simple way of adding some kind of nonlinearity to my flexible curve that I'm going to fit to my data.

The next thing I do: I have these different activation functions, and they have this kind of joint here at
different points, which depend on the parameters, and I'm going to multiply the output of these activations by a number. This might change the direction of it, or change the slope of it. The next thing I'm going to do is sum these up. I superimpose them, and I get the output of one layer in my network. So this is a shallow network. Essentially, it's a piecewise linear model, and the joints here, the parts where it switches from one linear region to another, are determined by the inputs to the first layer's activations.

The one cool thing about it is that you can use this piecewise linear model to approximate any continuous 1D function to arbitrary accuracy. If I want to model this function with five joints, I can get an approximation like this; with 10 joints, like this; with 20, like that. I can just keep increasing the number of these neurons, and that gives me better and better approximations. This is called the universal approximation theorem: with my shallow neural network, which just has one layer of activations, I can describe any continuous function to arbitrary precision.

Now, this alone is not that exciting, because I can do that with polynomials, right? The neural network is not the only thing that does that. I think the exciting part about neural networks is when you start making them deeper.

First, let's look at what it would look like if we had two inputs. Now these activations are activated along planes, not points. So if this is my input plane, I'm basically chopping it along the zero part, and now I have these 2D planes in space.
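
The scale-and-superimpose construction of a shallow ReLU network can be sketched in a few lines for the 1D case. This is a minimal illustration, not code from the talk: the function names (`shallow_relu_net`, `fit_by_interpolation`, `max_error`), the choice of `sin` as the target, and the evenly spaced joints are all assumptions made for the example. Instead of training, the weights are set by hand so that the network linearly interpolates the target between joints, which is enough to see the approximation improve as neurons are added.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: negative inputs become zero, positive inputs pass through."""
    return np.maximum(z, 0.0)


def shallow_relu_net(x, w, b, v, c):
    """One hidden layer, 1D input and output: y = c + sum_k v[k] * relu(w[k] * x + b[k])."""
    hidden = relu(np.outer(x, w) + b)  # shape (n_points, n_hidden)
    return hidden @ v + c              # scale each activation, then superimpose


def fit_by_interpolation(f, lo, hi, n_hidden):
    """Hand-pick weights (no training) so the network interpolates f linearly.

    Assumption for illustration: joints are evenly spaced on [lo, hi].
    """
    knots = np.linspace(lo, hi, n_hidden + 1)
    slopes = np.diff(f(knots)) / np.diff(knots)  # slope of each linear segment
    w = np.ones(n_hidden)
    b = -knots[:-1]                              # neuron k switches on at knots[k]
    v = np.diff(slopes, prepend=0.0)             # change in slope at each joint
    c = f(knots[0])                              # value at the left edge
    return w, b, v, c


def max_error(f, lo, hi, n_hidden, n_test=2001):
    """Worst-case absolute error of the hand-built network on a dense test grid."""
    x = np.linspace(lo, hi, n_test)
    w, b, v, c = fit_by_interpolation(f, lo, hi, n_hidden)
    return float(np.max(np.abs(shallow_relu_net(x, w, b, v, c) - f(x))))


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, "hidden neurons, max error:", max_error(np.sin, 0.0, 2 * np.pi, n))
```

Running the script prints the worst-case error for 5, 10, and 20 hidden neurons; under these assumptions the error should shrink as the count grows, in line with the universal approximation argument. A trained network would place its joints wherever the optimizer finds useful rather than on an even grid.
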
I'm going to scale these, and then I'm going to superimpose them, and this gives me ways of representing arbitrary functions in a 2D space now, rather than just a 1D space. So it gives me a way of expressing arbitrary continuous functions.

Okay, now the cool part is when I want to do two layers. So now I have two layers: this is my first neural network, and this is my second neural network. My first neural network, if I consider it alone, looks like this. My second neural network, if I just cut it out, looks like this. When I compose them together, I get this shared behavior. I'm composing these functions together, and essentially what happens is it's almost like you fold the functions together, so that I experience that function in this linear region, then backwards, and then again. You can see that function is mirrored here, right? It goes back and forth.

So you can make this analogy to folding a piece of paper. If I consider my first neural network like this on a piece of paper, I could essentially fold it, draw my second neural network's function over that first one, and then expand it, and now I have this function. The cool part about this is that I'm sharing computation, because I'm sharing neurons in my neural network. This is going to come up again. This is kind of a theme: we're doing efficient computation in neural networks by sharing neurons, and it's useful to think about it in this way, folding paper, drawing curves over it, and expanding it.

Okay, so let's go back to the physics now. Neural networks are efficient universal function approximators. You can think of them as a type of data compression. The same neurons can be used for different
calculations in the same network. A common use case in the physical sciences, especially in what I work on, is emulating physical processes. If my simulator is too expensive, or I have real-world data that my simulator is not good at describing, I can build a neural network that emulates it. So I have a neural network that looks at the initial conditions in this model and predicts when it's going to go unstable. This is a good use case for them.

Once I have that, so maybe I have this trained piecewise linear model that emulates some physical process, how do I take that and interpret it? How do I actually get insight out of it? This is where I'm going to talk about symbolic regression, which is one of my favorite things.

In a lot of the interpretability work in industry, especially in computer vision and language, there's not really a good modeling language. If I have a model that classifies cats and dogs, there's not a language for describing every possible cat. There's not a mathematical framework for that. But in science we do have a very good mathematical framework. In science we have this very good understanding of the universe, and we have this language for it: mathematics, which describes the universe very well. I think when we want to interpret these data-driven models, we should use this language, because that will give us results that are interpretable. If I have some piecewise linear model with millions of parameters, it's not really useful to me, right? I want to express it in the language that I'm familiar with, which is mathematics. You can look at any cheat sheet and it's a lot of simple algebra. This
is the language of science.

Symbolic regression is a machine learning task where the objective is to find analytic expressions that optimize some objective. Maybe I want to fit that data set, and what I could do is basically try different trees. These are expression trees, right? This equation is that tree, and I basically find different expression trees that match that data. So the point of symbolic regression is that I want to find equations that fit the data set: both the symbolic form and the parameters, rather than just optimizing parameters in some model.

The current state-of-the-art way to do this is a genetic algorithm. It's not really a clever algorithm. I can say that because I work on it. It's pretty close to brute force. Essentially, you treat your equation like a DNA sequence and you evolve it. You do mutations, where you swap one operator for another. Maybe you crossbreed them: you have two expressions which are okay, and you breed those together (not literally, but conceptually) to get a new expression, until you fit the data set. So this is a genetic-algorithm-based search for symbolic regression. The point of this is to find simple models, in our language of mathematics, that describe a given data set.

I've spent a lot of time working on these frameworks, PySR and SymbolicRegression.jl. They work like this: if I have this expression and I want to model that data set, essentially what I'm going to do is search over all possible expressions until I find one that gets me closer to this ground-truth expression. You see it's testing different branches in evolutionary space. I'm going to play that again, until it reaches this ground-truth data set. This is pretty close to how it works: you're essentially finding simple expressions that fit some data set accurately.

Okay, so what I'm going to show you how to do is this. Symbolic regression is about finding symbolic models that I can use to describe a data set. I want to use that to build surrogate models of my neural network. This is a way of translating my model into my language. You could also think of it as like a polynomial fit or a Taylor expansion, in some ways.

The way this works is as follows. If I have some neural network that I've trained on my data set, I train it normally and freeze the parameters. Then I record the inputs and outputs. I treat it like a data-generating process: I try to see what the behavior is for this input, this input, and so on. Then I stick those inputs and outputs into PySR, for example, and I find some equation that models that neural network, or maybe a piece of my neural network. So this is building a surrogate model for my neural network that approximates the same behavior.

Now, you wouldn't just do this for a standalone neural network. This would typically be part of a larger model, and it would give you a way of interpreting exactly what it's doing for different inputs. So maybe I have two pieces, two neural networks here. Maybe I think the first neural network is learning features, or some kind of coordinate transform, and the second one is
doing something in that space; it's using those features for a calculation. Using symbolic regression in this way, which we call symbolic distillation, I can distill this model into equations. That's the basic idea: I replace the neural network with my surrogate model, which is now an equation. You would typically do this for g as well, and now I have equations that describe my model. This is an interpretable approximation of my original neural network.

The reason you wouldn't want to do this directly on the data is that it's a harder search problem. If you break it into pieces, interpreting pieces of a neural network, it's easier, because you're only searching for 2n expressions rather than n². So it's a bit easier, and you're using the neural network as a way of factorizing the system into different pieces that you then interpret.

We've used this in different papers. This is one led by Pablo Lemos on rediscovering Newton's law of gravity from data. This was a cool paper because we didn't tell it the masses of the bodies in the solar system. It had to simultaneously find the masses of all of these 30 bodies we gave it, and it also found the law. So we train this neural network to do this, and then we interpret that neural network, and it gives us Newton's law of gravity.

Now that's a rediscovery, and of course we know that law already, so I think the discoveries are also cool. These are not my papers; these are other people's papers that I thought were really exciting. This is a recent one by Ben Davis and Zehao Jin, where they discover this new black hole mass scaling relationship. It relates, I think, the spirality or something in a galaxy, and the velocity, with the mass of a black hole. They found this with this technique, which is exciting. And I saw this other cool one
recently: they found this cloud cover model with this technique, using PySR. It gets you to this point where it's a fairly simple model and it's also pretty accurate. But again, the point of this is to find a model that you can understand, right? It's not this black-box neural network with billions of parameters. It's a simple model that you can have a handle on.

Okay, so that's part one. Now, in part two, I want to talk about Polymathic AI. This is the complete opposite end: we're going to go from small models in the first part to the biggest possible models. I'm also going to talk about the meaning of simplicity, what it actually means.

Over the past few years, you may have noticed there's been this shift in industrial machine learning to favor foundation models. ChatGPT is an example of this. A foundation model is a machine learning model that serves as the foundation for other models. These models are trained by taking massive amounts of general, diverse data, training a flexible model on that data, and then fine-tuning it to some specific task. You could think of it as teaching this machine learning model English and French before teaching it to translate between the two. It often gives you better performance on downstream tasks. For instance, I've heard that ChatGPT is trained on GitHub, and that teaches it to reason a bit better.

So the general idea is that you collect your massive amounts of data, you have this very flexible model, and then you might train it with self-supervised learning, where you mask parts of the data and the model tries to fill them back in. That's a common way you
train that. For example, GPT-style models are basically trained on the entire internet, and they're trained to predict the next word. That's their only task: you get an input sequence of words and you predict the next one, and you just repeat that for massive amounts of text. Just by doing that, they get really good at general language understanding. Then they are fine-tuned to be a chatbot, essentially: they're given a little bit of extra data on how you talk to someone, how to be friendly, and so on. That works much better than training a model only to do that. So it's this idea of pre-training models.

I think the cool part about these models is that they're trained in a way that gives them general priors for data. Maybe I have some artwork generation model that's trained on different images and generates different art. I can fine-tune this model on Studio Ghibli artwork, and it doesn't need much training data, because it already knows what a face looks like. It's already seen tons of different faces. So just by fine-tuning it on a small number of examples, it can pick up this task much quicker. That's essentially the idea.

The same thing is true in language. If I train a model just to do language translation, starting from scratch and training it only on English to French, it's going to struggle. Whereas if I teach it about the languages first, English and French, and then specialize it on translation, it's going to do much better.

So this brings us to science. In science we also have this idea that there are shared concepts. Different languages share a concept of grammar, for example. In science we also have
shared concepts. You could draw a big circle around many areas of science, and causality is a shared concept. If you zoom in to, say, dynamical systems, multiscale dynamics is shared across many different disciplines. Chaos is another shared concept. So maybe if we train a general model over many different data sets, the same way ChatGPT is trained on many different languages and text databases, it'll pick up general concepts, and then when we finally make it specialize to our particular problem, maybe it'll find it easier to learn. That's essentially the idea.

You can actually see this for particular systems. One example is the reaction-diffusion equation, which is a type of PDE, and the shallow water equations, another type of PDE. Different fields, different PDEs, but both have wave-like behavior. So maybe if we train this massive flexible model on both of these systems, it's going to learn a general prior for what a wave looks like. Then if I have some small data set that I only have a couple of examples of, maybe it'll immediately identify: oh, that's a wave, I know how to do that.

I kind of feel like in science today, what we often do is train machine learning models from scratch. It's almost like we're taking toddlers and teaching them to do pattern matching on really advanced problems. We have a toddler and we're showing them: this is a spiral galaxy, this is an elliptical galaxy, and it has to just do pattern matching. Whereas a foundation model that's trained on broad classes of problems is maybe like a general science graduate. It has a prior for how the world works. It has seen many different phenomena before, and so when you finally give it that data set to pick up,
it's already seen a lot of those phenomena. That's really the idea of this; that's why we think this will work well.

Okay, so we created this collaboration last year. This started at the Flatiron Institute, led by Shirley Ho, to build this thing: a foundation model for science. This is across disciplines, so we want to build these models to incorporate data across many different disciplines, across institutions. We're currently working on scaling up these models. I think the final goal of this collaboration is that we would release these open-source foundation models so that people could download them and fine-tune them to different tasks. So it's really a different paradigm of doing machine learning. In the current paradigm, we take a model and randomly initialize it, so it's like a toddler that doesn't know how the world works, and we train that. In this paradigm, we have this generalist science model and you start from that. It's a better initialization of a model. That's the pitch of Polymathic.

Okay, so we have results. This year we're scaling up, but last year we had a couple of papers. This is one led by Mike McCabe called Multiple Physics Pretraining. This paper asked: what if we have this general PDE simulator, this model that learns to essentially run fluid dynamics simulations, and we train it on many different PDEs? Will it do better on new PDEs, or will it do worse? What we found is that a single model is not only able to match single models trained on specific tasks; it can actually outperform them in many cases. So it does seem like if you take a more flexible model and you train it on more diverse data, it will do better in a lot of cases. It's not unexpected, because we do see this with language and vision, but I think it's still really cool to
see this. I'll skip through some of these. This is the ground-truth data, and this is the reconstruction. Essentially, it's predicting the next step: the next velocity, the next density and pressure, and so on. You take that prediction and run it back through the model, and you get this rollout simulation. This is a task people work on in machine learning. I'm going to skip through these.

Essentially, what we found is that most of the time, by using this multiple physics pre-training, that is, by training on many different PDEs, you do get better performance. The ones on the right side are the multiple-physics pre-trained models, and those seem to do better in many cases. I think it's because they've seen so many different PDEs; it's like they have a better prior for physics. I'll skip this as well.

Okay, this is a funny thing that we observed. During talks like this, one thing that we get asked is: how similar do the PDEs need to be? Do the PDEs need to be, like, Navier-Stokes with a different parameterization, or can they be completely different physical systems? What we found is really hilarious. The bottom line here is the error of the model over different numbers of training examples. This model was trained on a bunch of different PDEs, then it was introduced to this new PDE problem and given that amount of data. That one does the best; it already knows some physics. The one at the top is the worst. This is the model that's trained from scratch. It's never seen anything. This is like your toddler, right? It doesn't know how the physical world works. It was just randomly initialized, and it has to learn physics. The middle models are pre-trained on general video data, a lot of which is cat videos. So even
pre-training this model on cat videos actually helps you do much better than this very sophisticated transformer architecture that has just never seen any data. We think it's because of shared concepts of spatiotemporal continuity. In videos of cats there's a spatiotemporal continuity: the cat does not teleport across the video, unless it's a very fast cat. There are related concepts. That's what we think, anyway, but it's really interesting that pre-training on completely unrelated systems still seems to help.

So the takeaway from this is that you should always pre-train your model. Even if the physical system is not that related, you still see a benefit. Now, obviously, if you pre-train on related data, that helps you more, but anything is basically better than nothing. You could think of it this way: the default initialization for neural networks is garbage. Just randomly initializing a neural network is a bad starting point; it's a bad prior for physics. You should always pre-train your model. That's the takeaway of this.

Okay, so I want to finish up here with some rhetorical questions. I started the talk with interpretability and how we extract insights from our model, and now we've gone into this regime of very large, very flexible foundation models that seem to learn general principles. So my question for you, which you don't have to answer, just think it over, is: do you think 1 + 1 is simple? It's not a trick question. I think most people would say yes, 1 + 1 is simple. And if you break down why it's simple, you say: okay, x + y is simple for x and y integers; that's a simple relationship. Okay, why is x + y simple? You break that down, and it's because plus is a simple operator. Okay, why is plus simple? It's a
very abstract concept. We don't necessarily have plus built into our brains. So I'm going to show this, and it might be controversial, but I think that simplicity is based on familiarity. We are used to plus as a concept. We are used to adding numbers as a concept. Therefore we call it simple. You can go back another step further: the reason we're familiar with addition is that it's useful. Adding numbers is useful for describing the world. I count things, right? It's useful, to live in our universe, to count things and to measure things. Addition is really one of the most useful things. So that is why we are familiar with it, and I would argue that's why we think it's simple.

We have often argued that if something is simple, it's more likely to be useful. I think that is actually not a statement about simplicity. It's actually a statement that if something is useful for problems A, B, and C, then it seems it will also be useful for another problem. The world is compositional. If I have a model that works for this set of problems, it's probably also going to work for this one. That's the argument I would like to make. So when we interpret these models, I think it's important to keep this in mind and really probe what is simple and what is interpretable.

I think this is really exciting for Polymathic AI, because these models that are trained on many, many systems will find broadly useful algorithms, right? They'll have these neurons that share calculations across many different disciplines. So you could argue that that is the utility. And maybe we'll discover new kinds of operators, become familiar with those, and start calling those simple. So it's not necessarily that all of the things we discover in machine learning will be simple. It's that, by definition, the
polymath models will be broadly useful, and if we know they're broadly useful, we might get familiar with them, and that might drive the simplicity of them. So that's my note on simplicity.

The takeaways here are that I think interpreting a neural network trained on some data set offers new ways of discovering scientific insights from that data, and I think foundation models like Polymathic AI are a very exciting way of discovering new, broadly applicable scientific models. So I'm really excited about this direction. Thank you for listening to me today. [Applause]

Great. So, three questions: one was the running, yeah; when it's fully built out, is to be free; (yeah, please use your seat mic) yeah; and three, you're pretty young.

Okay, so I'll try to compartmentalize those. The first question was the scale of training. This is really an open research question. We don't have the scaling law for science yet. We have scaling laws for language: we know that if you have this many GPUs and this size of data set, this is going to be your performance. We don't have that yet for science, because nobody's built this scale of model. So that's something we're looking at right now: what is the tradeoff of scale, and if I want to train this model on many, many GPUs, is it worth it? I do think it'll be large, probably on the order of hundreds of GPUs, trained for maybe a couple of months. So it's going to be a very large model; that's assuming the scale of language models.

Now, the model is going to be free, definitely. We're all very pro open source, and I think that's really the point: we want to open-source this model so people can download it and use it in science. I think that's really the most exciting part about this.

And then I guess the third question you had was about the future, and how it
changes how we teach. I mean, I guess, are you asking about teaching science or teaching machine learning?

Teaching science.

I see. I mean, I don't know; it depends if it works. I think if it works, it might very well change how science is taught. I don't know the impact of language models on computational linguistics. I'm assuming they've had a big impact; I don't know if that's affected the teaching of it yet. But if scientific foundation models had a similar impact, I'm sure it would have an impact. I don't know how much; it probably depends on the success of the models.

I have a question about your foundation models also. In different branches of science the data sets are pretty different: in molecular biology or genetics the data set is a sequence of DNA, versus astrophysics, where it's images of stars. So how do you plan to use the same model for different forms of input data sets?

So you mean how to pose the objective?

Yes.

So I think the most general objective is self-supervised learning, where you basically mask parts of the data and you predict the missing part. If you can optimize that problem, then you can solve tons of different ones: you can do regression, predict parameters, or go the other way and predict rollouts of the model. It's a really general problem to mask data and then fill it back in; that kind of is a superset of many different prediction problems. And I think that's why language models are so broadly useful even though they're trained just on next-word prediction, or, like, BERT is a masked model.

Thanks.

Can you hear me? All right. So that was a great talk. I'm Victor. I'm actually a little bit worried, and this is a little bit of a question. Whenever you have models like this, you said that you train this on many examples, right? So imagine you have already embedded the
laws of physics here somehow, like, let's say, the law of gravitation. But when you think about, like, new physics, we always have this question: whether we are actually reinventing the wheel, or the network is really giving us something new, or it's giving us something that it learned but it's kind of wrong. Sometimes we have the answer to know which one is which. But if you don't have that, let's say for instance you're trying to discover what dark matter is, which is something I'm working on, how would you know that the network is actually giving you something new and not just trying to fit this into one of the many parameters that it has?

I see. So, okay, if you want to test the model by letting it rediscover something, then I don't think you should use this. I think you should use the scratch model, like, train it from scratch, because if you use a pre-trained model, it's probably already seen that physics, so it's biased towards it in some ways. So if you're rediscovering something, I don't think you should use this. If you're discovering something new, I do think this is more useful. So I think a misconception about machine learning in general is that scientists view machine learning with uninitialized models, like randomly initialized weights, as a neutral prior. But it's not; it's a very explicit prior, and it happens to be a bad prior. So if you train from a randomly initialized model, it's kind of always going to be a worse prior than training from a pre-trained model which has seen many different types of physics. I think we can kind of make that statement. So if you're trying to discover new physics, I mean, if you train it on some data set, I guess you can always verify that the predictions are accurate. So that would be, I guess, one way to verify it.
But I do think the fine-tuning here, so, like, taking this model and training it on the task, I think that's very important. I think in language models it's not as emphasized; people will just take a language model and tweak the prompt to get a better result. For science, I think the equivalent of the prompt would be important, but I think the fine-tuning is much more important, because our data sets are so much different across science.

the back that the symbolic lied the dimensionality of the system, so are you introducing also the fine-tuning and transfer learning a way en

Yeah, so the symbolic regression, I mean, I would consider that it's not used inside the foundation model part. I think it's interesting to interpret the foundation model and see if there's kind of more general physical frameworks that it comes up with. I think, yeah, symbolic regression is very limited in that it's bad at high-dimensional problems. I think that might be because of the choice of operators; I think if you can consider maybe high-dimensional operators, you might be a bit better off. I mean, symbolic regression is an active area of research, and I think the biggest hurdle right now is that it's not good at finding very complex symbolic models. So I guess it depends on the dimensionality of the data. I guess if it's very high-dimensional data, symbolic regression is not good with high-dimensional data unless you can have kind of some operators that aggregate to lower-dimensional spaces. Yeah, I don't know if I'm answering your question or not.

Okay, I wanted to ask a little bit. So when you were showing the construction of these trees, each generation, the different operators, I think this is related to kind of general themes of the talk and other questions, but often in doing science, when you're learning it, you're presented
with kind of, like, algorithms to solve problems, like, you know, diagonalize a Hamiltonian or something like that. How do you encapsulate that aspect of doing science, that is kind of the algorithmic side of solving problems?

Right. (Please use your mic.) Oh, yeah. So the question was about how do you incorporate kind of more general, not analytic operators, but kind of more general algorithms, like a Hamiltonian operator. I think, in principle, symbolic regression is part of a larger family of algorithms called program synthesis, where the objective is to find a program, you know, like code, that describes a given data set, for example. So if you can write your operators into your symbolic regression approach, and your symbolic regression approach has that ground-truth model in there somewhere, then I think it's totally possible. I think it's harder to do. I think even for symbolic regression with scalars it's fairly difficult to actually set up an algorithm. I don't know, I think it's really an engineering problem, but the conceptual part is totally there for this.

Yeah, thanks.

Oh, sorry. Okay. This claim that random initial weights are always bad, or pre-training is always good?

I don't know if they're always bad, but it seems like from our experiments we've never seen a case where pre-training on some kind of physical data hurts. Like, the cat video is an example: we thought that would hurt the model; it didn't. That is a cute example. Weird. I'm sure there's cases where some pre-training hurts.

Yeah, so that's essentially my question. So we're aware of adversarial examples; for example, you train on MNIST, add a bit of noise, and it does terrible compared to what a human would do. What do you think adversarial examples look like in science?

Yeah, I mean, I don't know what those are, but I'm sure they exist somewhere, where pre-training on certain data types kind of messes with
training a bit. We don't know those yet, but yeah, it'll be interesting.

Do you think it's a pitfall, though, of the approach? Because, like, I have a model of the sun and a model of DNA, you know.

Yeah, I mean, I don't know. I guess we'll see. It's hard to know. I guess from language we've seen you can pre-train a language model on video data and it helps the language, which is really weird. But it does seem like if there's any kind of concepts, if it's flexible enough, it can kind of transfer those in some ways. So we'll see. I mean, presumably we'll find some adversarial examples there. So far we haven't; we thought the cat was one, but it wasn't. It helped.
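As a rough illustration of the masked self-supervised objective described in the Q&A (mask parts of the data, then predict the missing part), here is a minimal Python sketch. It is not the Polymathic AI training code. The function names, the use of `None` as a mask marker, and the neighbour-averaging "model" are all assumptions made for illustration; a real foundation model would replace the toy predictor with a trained network.

```python
import random


def mask_sequence(values, mask_frac, rng):
    """Hide a random subset of positions; return (masked copy, hidden indices)."""
    n_hidden = max(1, int(round(mask_frac * len(values))))
    hidden = sorted(rng.sample(range(len(values)), n_hidden))
    masked = list(values)
    for i in hidden:
        masked[i] = None  # None is a hypothetical stand-in for a mask token
    return masked, hidden


def fill_by_neighbour_average(masked):
    """Toy predictor (an assumption, not a trained model): fill each hidden
    value with the mean of the nearest visible value on each side."""
    filled = list(masked)
    for i, value in enumerate(masked):
        if value is None:
            left = next((masked[j] for j in range(i - 1, -1, -1)
                         if masked[j] is not None), None)
            right = next((masked[j] for j in range(i + 1, len(masked))
                          if masked[j] is not None), None)
            known = [x for x in (left, right) if x is not None]
            filled[i] = sum(known) / len(known) if known else 0.0
    return filled


def masked_mse(predictions, targets, hidden):
    """Mean squared error scored only on the hidden positions."""
    return sum((predictions[i] - targets[i]) ** 2 for i in hidden) / len(hidden)


rng = random.Random(0)
signal = [0.5 * t for t in range(12)]          # toy 1-D "simulation" output
masked, hidden = mask_sequence(signal, 0.25, rng)
prediction = fill_by_neighbour_average(masked)
loss = masked_mse(prediction, signal, hidden)
print(f"hidden positions: {hidden}, masked MSE: {loss:.4f}")
```

Because the loss is computed only on the hidden positions, the same objective covers several tasks depending on what is masked: hide the future of a sequence and it becomes rollout prediction, hide a parameter field and it becomes regression.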
Info
Channel: Simons Foundation
Views: 181,946
Id: fk2r8y5TfNY
Length: 55min 54sec (3354 seconds)
Published: Fri Apr 05 2024