Predicting the rules behind - Deep Symbolic Regression for Recurrent Sequences (w/ author interview)

Captions
Hello there! Today we'll look at Deep Symbolic Regression for Recurrent Sequences by Stéphane d'Ascoli, Pierre-Alexandre Kamienny, Guillaume Lample and François Charton. This is another paper where the main part will be an interview with the first author, Stéphane, and I'll just briefly introduce the paper right here for ten-ish minutes or so. If you want to skip straight to the interview, feel free; we'll go over the paper just so that you know what's going on, and there is also an interactive demo online where you can try it out, which is a good place to start to see what this paper is trying to do.

In this paper the authors care about symbolic regression for number sequences: they have a model for integer and for float number sequences. This is an example of an integer sequence. You can enter any sequence right here; the sequence that is already entered is the Fibonacci sequence, and you enter as many terms as you want (obviously, the more you enter, the higher the model's chance of success). What the model will do down here is predict an expression. You can see it correctly predicts the expression for the Fibonacci sequence, saying that the current element is the last element plus the element before that, it will predict the next terms for you, and it will extrapolate the sequence that you've input. So you can do anything you want. I'm going to go 1 (I'm very bad at coming up with stuff on the spot), 2, 1, 3, 1, 4, 1, 5. Let's see if it can get that. As soon as you exit the input field... yeah, look at that: the quotient, and I'm not even sure what that operation is exactly, but it divides, I think, the sum of the last elements by the last element. It figured it out somehow. It is not really good at if-conditions, and that is one thing we're going to talk about in the interview, but you can see it correctly predicts the next elements of the sequence right here, so give that a try.

This pinpoints exactly what this paper does: symbolic regression for recurrent sequences. Recurrent sequences are sequences of numbers that can somehow be expressed as a logical rule, as a function of the last elements of the sequence; most sequences can be expressed like this. They give a bunch of examples right here: 1, 2, 4, 7, 11, 16, where you can see that it's always plus one, plus two, plus three, plus four, plus five and so on. Or this function right here: these are simply the squares, so the recurrence relation actually isn't a recurrence relation at all, although it is a special case of a recurrence relation. Or this formula right here: it can get very complicated. They have a bunch of examples of recurrence relations, and as you can see they can get pretty involved, expressing something like the final digit of n times (n + 1) divided by 2, or the final two digits of 2 to the n, or some maximum, or anything like this.

So the goal is that you input a sequence like this and the model outputs the recurrence relation. It will not output the following numbers of the sequence directly; that is what they would call a numeric model, and they also train one as a baseline. The model actually outputs the formula itself, and then you can use the formula to produce the next elements.
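To make "use the formula to produce the next elements" concrete, here is a minimal Python sketch (my own illustration, not the authors' code) that rolls a predicted recurrence forward, using the Fibonacci rule from the demo:

    # Minimal sketch (not the authors' code): extrapolating a sequence by
    # rolling a predicted recurrence u_n = f(n, previous terms) forward.

    def extrapolate(recurrence, initial_terms, n_extra):
        """recurrence(n, u) returns u_n from the index n and the list u of
        all terms computed so far; initial_terms seeds the sequence."""
        u = list(initial_terms)
        for n in range(len(u), len(u) + n_extra):
            u.append(recurrence(n, u))
        return u

    # The Fibonacci rule predicted in the demo: u_n = u_{n-1} + u_{n-2}.
    fib = lambda n, u: u[n - 1] + u[n - 2]
    print(extrapolate(fib, [1, 1], 8))  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]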
Now, the good thing is, we've all seen what happens if you train a numeric model on a bunch of data points. Let's say these are your input data points and you train a numeric model on them: it will perform pretty well on the data you give it, but as soon as you extrapolate too far away from the support of the training data, without very strong inductive biases it will just do whatever, and you can't really predict what it will do where there is no training data. That's also why deep learning relies on lots of training data covering a lot of the input space; whether you call that extrapolation or interpolation or whatnot, we'll leave it at that. But if you have symbolic regression, and the symbolic regression actually predicts the correct formula to match this sequence right here, like "oh, this is just a sine wave", then you can extrapolate indefinitely, and because you have the correct symbolic formula you'll be right in all places. So potentially this is a very strong method for certain types of problems.

This paper treats the task as a sequence-to-sequence problem, so it uses the classic transformer setup: you have an encoder and a decoder stack. The encoder gets fed the input sequence as numbers, so here 1, 1, 2, 3, 5 and so on; that is the input sequence, and it is fixed. The output sequence is the formula that you want to predict, and they predict the formula in prefix (Polish) notation, read off the formula's expression tree. They have an example down here: the cosine of 3x can be expressed as cos of (3 multiplied by x), so you sort of load the tree onto a stack and work your way down it in this prefix-notation manner, which gives cos, mul, 3, x. Then you train your transformer to predict the output autoregressively: first the first token, without seeing any output tokens, and then, once you have the first token, the second token given the input and the first token, and so on. There is multi-head attention in here, there is cross-attention over here, and there is self-attention in here as well: your regular transformer stack.
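As a small illustration of that output format, here is a rough Python sketch (names and structure are my own, not taken from the paper's code) that serializes an expression tree into prefix tokens for the cos(3x) example:

    # Rough sketch (my own naming): serializing an expression tree into the
    # prefix (Polish) notation tokens the decoder is trained to produce.

    from dataclasses import dataclass

    @dataclass
    class Node:
        op: str               # operator or leaf symbol, e.g. "cos", "mul", "3", "x"
        children: tuple = ()  # empty for leaves, one child for unary, two for binary

    def to_prefix(node):
        """Parent-before-children depth-first traversal gives prefix notation."""
        tokens = [node.op]
        for child in node.children:
            tokens += to_prefix(child)
        return tokens

    # cos(3 * x)  ->  ['cos', 'mul', '3', 'x']
    tree = Node("cos", (Node("mul", (Node("3"), Node("x"))),))
    print(to_prefix(tree))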
So this is a classic sequence-to-sequence problem; the only question is how you encode the input and the output. The output we've already discussed, and they have a very detailed description of how they produce the data. What they do is take a bunch of operators, you can see them in this table, and they make random formulas from those operators. They have a bunch of constraints on these formulas, but essentially they build a data set out of just random formulas. First, they sample the number of operators between one and a maximum number; in this case the maximum number of operators is 10. Then they build a unary-binary tree with that many nodes. For example, they might sample three operators, say a ReLU, a sub and a mod, and then build a unary-binary tree out of them: ReLU is a unary operation, so it only has one input; sub is a binary operation, so it needs two inputs; and mod, let's say, again needs two inputs. The second step is to sample the nodes of the tree from the list of operators, which we've already folded into step one here. Then they sample the recurrence degree between 1 and d_max, where d_max is 6, so the formula is allowed to look back at most six elements into the past. This is kind of a Markov condition: your recurrence relation can only look back six terms. That's a limit, but most sequences that humans could come up with don't refer back to the seventh-to-last element; there is usually a way to express them in terms of either the current index or the last few elements, like three or four at most.

Then they sample the leaves of the tree. A leaf is either a constant, with probability p_constant, the current index n, or one of the previous terms of the sequence; all of these probabilities are one third, and they stress that the hyperparameter settings are not very crucial here. So let's do that: here we sample a previous term, which is u_{n-2}, here we sample the index, which is n, and here we sample a constant, which is 3. That results in the formula ReLU(u_{n-2}) minus (n mod 3). Then they need to sample the initial terms of the sequence: since the formula looks back two elements, we need at least two elements at the beginning of the sequence, so let's call them the first and second terms; these are also sampled from a distribution, here just a uniform distribution from -10 to 10. The last step is to sample the sequence length and compute the next l terms: how much leeway do we want to give the model to infer the sequence? Let's say we want to give it five elements, and then we use the formula to calculate the next terms right here. (I tried it live and it didn't quite work out, but it is a rather complicated sequence, I have to say.) So now you see how this stuff is sampled and how the formulas are made: they just define a maximum depth, a maximum length and so on, and then they sample random data from that.
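To make the sampling recipe concrete, here is a rough Python sketch of the pipeline just described, hard-coding the worked example ReLU(u_{n-2}) - (n mod 3); the function and token names are my own, and the real generator of course samples arbitrary random trees rather than this fixed formula:

    import random

    # Rough sketch of the generation recipe (my naming, not the authors' code):
    # pick a recurrence, sample initial terms, roll it forward, and emit the
    # (sequence, formula tokens) pair used as a training example.

    relu = lambda x: max(x, 0)

    # The worked example from the text: u_n = relu(u_{n-2}) - (n mod 3).
    def example_recurrence(n, u):
        return relu(u[n - 2]) - (n % 3)

    # Prefix-notation target for the decoder: sub(relu(u_{n-2}), mod(n, 3)).
    target_tokens = ["sub", "relu", "u_{n-2}", "mod", "n", "3"]

    def sample_training_pair(recurrence, degree=2, n_terms=8):
        # One initial term per look-back slot, drawn uniformly from [-10, 10].
        u = [random.randint(-10, 10) for _ in range(degree)]
        # The remaining terms are computed with the recurrence itself.
        for n in range(degree, n_terms):
            u.append(recurrence(n, u))
        return u, target_tokens

    sequence, tokens = sample_training_pair(example_recurrence)
    print(sequence)  # e.g. [3, -7, 1, 0, 0, -2, 0, -1]
    print(tokens)    # ['sub', 'relu', 'u_{n-2}', 'mod', 'n', '3']

The same forward evaluation can also be used the other way round at inference time: candidate formulas from the beam search are rolled forward on the first few input terms and discarded if they do not reproduce them, which is the re-ranking step described just below.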
They create a data set, and the data set would be this one right here: this would be the input, and the output to predict would be the formula in prefix notation, so it's a sequence-to-sequence task, and that's it. Now, during inference they can do a beam search: they input the sequence, decode several different candidate formulas, and then check which of the formulas actually match the input sequence they already have, and they can discard or rank down formulas that don't match the input sequence on the first few terms. That is an additional benefit they get from doing symbolic regression: ultimately they will end up with a formula that probably fits the input terms and hopefully is simple enough. The simplicity comes from the data set: since shorter formulas are more likely to be sampled than longer ones, the model is implicitly biased towards simpler formulas, which kind of plays into Occam's razor.

So that's the method: they create a massive data set of random formulas, train the model to predict them from the initial terms, and then they evaluate it. As I said, they also have float sequences, but I won't go into that too much. Notably, they do outperform the numeric model, which simply tries to learn the number-to-number mapping directly without going through the symbolic formula; as you can see, the symbolic method is better when evaluating on in-distribution sequences.

When evaluating on out-of-distribution sequences, there is the question of how you even do that. There is the OEIS database of integer sequences, and after a bunch of filtering you end up with a validation set of 10,000 sequences. This validation set consists of human-made number sequences, like the Fibonacci sequence: essentially anything where humans can come up with some sort of logic for how the sequence is generated. On this data set they don't perform as well as the numeric model, as you can see right here, so the numeric model outperforms the symbolic model, but there are good reasons why that might be, and we also discuss this in the interview. Lastly, they also do experiments on robustness to noise, which are very interesting: the model can even tolerate a bit of noise if it is trained with noise, and it can still do symbolic inference. Classically, symbolic systems are usually not that robust to noise, because it's more hit or miss, but if you train appropriately you can handle that.

Also interesting is that they encode the numbers not as continuous values in the transformer, but as tokens: at least for the first 10,000 numbers, each number is its own token, so the number 19 and the number 20 are just two different tokens. But it turns out that if you train the model, the tokens will actually form a continuous, well, not necessarily a line, but a continuous manifold in the embedding space, which is really cool to see: even though you give the numbers as different tokens, the model learns to map them out according to their numerical values. They also investigate the similarities between embeddings and uncover some interesting structures, where similarities also follow arithmetic properties like common divisors, and they give a bit of evidence that there seems to be a kind of natural base for mathematical operations in multiples of 6 and 12. They say that 6 is a natural base for reasoning, reminiscent of much earlier observations by other people, and you might know this group of people, I don't even know what they're called, that says we should just switch to base 12 because it makes everything easier. So there might actually be something behind that, or it might just be an artifact of how we do math, who knows. They also experiment with expression simplification and so on, but the model seems to be quite robust to any of these modifications.
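As a rough sketch of what this tokenization looks like (the sign-plus-base-10000 scheme is described in the interview below; the function and token names here are my own):

    # Sketch of the integer tokenization (sign token followed by base-10000
    # "digit" tokens, each digit 0..9999 being its own token). Naming is mine.

    def encode_int(x, base=10_000):
        sign = "+" if x >= 0 else "-"
        x = abs(x)
        digits = []
        while True:
            digits.append(x % base)
            x //= base
            if x == 0:
                break
        # Most significant digit first.
        return [sign] + [str(d) for d in reversed(digits)]

    print(encode_int(325))     # ['+', '325']
    print(encode_int(-56789))  # ['-', '5', '6789']

For the float setup, the interview below describes an analogous scheme with a sign token, a single mantissa token and an exponent token.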
I think this is really interesting work, in that symbolic inference, I believe, can lead us forward and tackle problems of extrapolation that we aren't necessarily going to solve with the numeric models we currently have. Obviously this has its own limitations and its own biases built in; most notably, how you construct the data set is very crucial to how the model is then going to perform. But it is interesting to see that you can train it like this, and essentially it's free training data, because you can just generate it yourself. So without further ado, I want to jump directly into the interview, because we go over the important aspects of the paper again. Let me know if you like interview content like this; I think it is super duper helpful, and the interview was very fun. I hope you find that as well. Alright, see ya.

Welcome everyone, today I have with me right here Stéphane d'Ascoli, who is the first author of the paper Deep Symbolic Regression for Recurrent Sequences. Stéphane, welcome, and thank you very much for being here.

Yeah, pleasure. Bad timing to have Covid, but I'll try my best.

I hope this goes over relatively smoothly for you. So, this paper, I have to say, gathered quite some hype online, because symbolic mathematics is something that, even though computers are very good at math per se, at numerics, has maybe stayed a little bit more in the human domain, especially this kind of sequence guessing. It seems to be a very human thing, something you would do maybe in high school, trying to figure out some sequence and the rules behind it. What prompted you to go into this direction in the first place? Why do you think this is a fruitful direction, or what made you come up with the idea? I know there's some previous work, but why this?

Yeah, so as you say, this kind of problem is very common, like IQ tests, so that was definitely one of the motivations. Originally this project was born from François and Guillaume, who have both been working on papers on deep learning for symbolic math for a couple of years, and they've been exploring several directions. The first of them was a paper in 2019 called Deep Learning for Symbolic Mathematics, where they did symbolic-to-symbolic manipulations, basically integrating functions, solving ODEs and stuff. Then, more recently, François has been working on a numeric-to-numeric task involving math, which is basically doing linear algebra: taking a matrix and outputting its inverse, or stuff like that. A natural continuation of this was to start from numeric data and go to a symbolic formula, and that's basically symbolic regression, which means you take a function, you only see its values, and you have to try and infer the expression of the function. And indeed it's kind of surprising, because this symbolic regression question has been studied quite a lot, for quite a few decades actually, especially with genetic algorithms and stuff like that, but there hasn't yet been a paper in the machine learning literature working on sequences, and as you said it's a very common setup for us humans. So that was originally the motivation. François came to discuss it with me and Pierre-Alexandre; Pierre-Alexandre is more from the reinforcement learning background, which is also relevant to sequences because you basically have a sequence of states, and for me it's because I came from the physics background, where symbolic regression is also useful, for example for inferring laws from data. So yeah, that's kind of how we got together.

Cool, excellent. And just so we're clear, the kind of sequences we talk about: we have a bunch of examples right here. That would be, for example, here, the final digit of n times (n plus 1) divided by 2; that's kind of the formula for all possible pairwise connections in a group of n points, or is that n times (n minus 1)? Yeah, the sum of integers. Okay, and from that we just want the final digit, so the sequence here is 0, 1, 3, 6, 0, 5, 1, 8, 6, 5. I would call it pretty complicated if you just gave me this as a human, but there is some kind of a rule behind it, right,
that i can figure out and that's the type of sequences you would you would consider this one is actually a good example it's kind of hard to recognize for us and if you look at the formula that the model gave us you can actually figure out why uh it predicted that formula it's u n minus 1 plus n and the reason for that is that n n plus 1 divided by 2 is the formula for the sum of integers and so the way it built this formula is just to take previous turn add n and then take the modulus respect to 10 because that gives you the final digit so it's kind of a clever thing that you know would be kind of um hard to figure out for us yeah so if you if you could um maybe give the the pitch of your model itself like the pitch of your paper itself um just before we we get into more of the details it's always super interesting to hear from the people themselves describing something like a brief pitch of what you did here uh yeah so i think our starting point was uh less ambitious than what it came to so we originally just started off from the uh this sort of thing that um that is quite popular from math lovers which is the oeis database so the online encyclopedia of integer sequences where you have all sorts of sequences you can play around with them you can try and guess the next is quite fun to play around with and um the idea was to try and build a model which could complete the sequences so sort of understand the logic behind these sequences so originally we only started off with integer models so we only wanted to predict integer sequences and um and we actually realized that that was pretty easy uh pretty quickly we managed to get a model working on integer sequences and uh so we then started to think about can we do the same thing for float sequences which are a bit more challenging because you have more freedom in the expressions you can build you have more operators you have um you know cosines and exponentials that come in and and so this is how we sort of i'd say it was a lot of serendipity really in the in this work we started off with this integer sequence problem and then we figured out things as we were going on uh so as you can see on the two tables you have there the uh constant approximation thing which we may discuss a bit later was one of the fun side effects of of trying to guess sequences is that you actually the model actually learns to do stuff it wasn't trained for and so yeah i'd say the goal of the paper isn't to provide a you know a model which is useful for real-world data it's not going to be able to predict you know uh stock market or weather forecast etc it's more of a like proof of concept of what you can do with transformers in terms of math and you specifically restricted yourself to to recurrent uh sequences and it i think it's important to point out sort of what like what kind of inputs does your model take and what kind of outputs does your model give right because a formula like like these they are you know written down uh in many ways there's there's ambiguities and i i would guess the inputs are these numbers right here right so a model gets this as an input and then it's somehow has to predict the corresponding formula so this is the training data is also like this how does it take the input and in what form does it output stuff okay so those are like the two two big questions so maybe we can start with the the inputs uh so that's actually quite a tricky question how do you feed in these these inputs to the model because um you know typically deep learning 
models uh don't don't take like if you think of a sequence which is like an exponential you're going to have very huge numbers uh if the exponential has a positive sign and very small numbers if the exponential has a negative sign and so if you just feed these kind of values into a deep learning model it's not going to learn much especially that here we're dealing with a transformer because essentially what we want to output is a mathematical formula which is just like basically a language and so this is why we use transformers and so transformers need to take in embeddings and so we need somehow to represent our input numbers as embeddings and that's complicated because of course integers just like reals are an infinite set so you have to sometime somehow find them find a way to encode them as a fixed vocabulary and so this is where we really have to distinguish our two setups we basically have two different transformers one for integer sequences and one for float sequences so the integer model what it does is basically it writes numbers in a base b representation uh so for example for the number like yeah exactly like here 325 you could imagine writing it as three to five in which case you only need ten tokens which are numbers actually it turns out that it's better to use a long um a larger base because if you do use a larger baseball you're gonna have a bigger vocabulary but you're gonna have shorter sequences and typically you know transformers have quadratic complexity they struggle a bit with very long sequences uh which is why yeah we we prefer to use a large base here we use ten thousand as our base yeah so this is this would be base 30 and obviously in base 10 000 i think it's important to note that every single number from zero to 9999 is its own token right the model has no inherent knowledge of you know three comes after two and four comes after three and so on all of this has to be learned it seems exactly it it seems so weird to say you know uh it is better to make the model learn essentially the entire ordering of 10 000 numbers rather than you know providing that as some sort of a just to make the sequence a bit shorter right yeah it's funny did you ever think of going with continuous values right because the first my first intuition would be that i feed the actual number right and then it's it's implicit like it's it's in the number that two is larger than one and three is larger than two exactly yes so that's what's really interesting is that that is one approach and actually we had a couple discussions on this like how can we feed in our inductive bias on numbers directly into the model and well i mean the problem with this is that here we're dealing with like just one dimensional vectors in some sense transformers need you know high dimensional vectors as inputs and it's not obvious how you represent these numbers in in a high dimension um you know because the as i was saying just before the problem is that these numbers have very vastly different scales and you know deep learning models usually take normalized inputs and so it's not obvious how you would so what you want to do is basically map these numbers you have onto a sphere and it's not obvious how you would encode you would put these numbers on the sphere and so one very simple way is just to put them randomly on the sphere and let the model uh decide all by itself how to put them in the sphere and this is what we do and what's interesting is that when you add plot after training what the embeddings look like 
you can see that it has learned in some sense our inductive bias of um of putting the numbers in order etc so these are these are t-sne t-sne plots right here the left would be the the um integer embeddings and it sort of forms this this this string what do you make of the t-snape plot here do you think these things are actually you know uniformly on a sphere or does the model just use like a tiny part of the sphere where it can make sort of a continuous path well what's for sure is that the uh it's definitely a low dimensional representation because you can see that the ts knee is actually very uh really shows a smooth pattern usually when you block tsnes of like word embeddings in nlp it's going to be a bit messy like you're going to get clusters but it's not going to be as well organized as here so clearly the the embeddings are lying somehow in a low dimensional manifold and so then you could think okay so why do we need like 512 dimensions if it's only using a small amount of them but that's actually because you know the transformer is going to eventually use these these extra dimensions to perform its calculations really so it's not as if they're wasted they're actually going to be used by the model yeah and the float embeddings are are very similar right in that you encode them as like a a a a sign a mantissa and an exponent and again the mantissa if i understand correctly same deal that you have a token per number between zero and and ten thousands and the and the exponent um is that correct that you have you say you have exponent from negative 100 to 100. so one token would be e minus 100 and then another token would be e minus 99 e minus 98. so these are all different tokens so now the transformer has to learn kind of a two to two different um two different embeddings both are somehow in sequence exactly yeah so for the for the so just to summarize so for the integers uh we encode the integer as the sign followed by a token of the base tokens of the base b representation of the integer and so for floats we also have the sine token then indeed we have the mantissa token so here the difference is that we only have one token for the mantis so we don't have like a base b representation which means that we do lose some information in the discretization process and then indeed to represent the scale of the um of the number we use uh an exponent embedding and and that indeed goes between minus 100 and 100 and so here indeed we do plot the ts and e of the exponents because they really have a logic to them for the mantissa it's it's less obvious if you plot a tns any of the mantissas it would look a bit anarchic but hear the exponents you can and actually just about this plot here um this plot is actually a tiny bit disappointing because we can't see some of the really interested features we had with our first models this is with the very big big model we with embedding dimension 512 actually when we were using a smaller model with a smaller embedding dimension we saw a really neat pattern um which was basically the fact that the model was learning the arithmetic properties of integers so it was basically creating a line with two four six eight ten etc then three six nine etc and here it's a bit less obvious probably because the big model is learning something even more complex that we can't interpret as easily if you go in the appendix you do see actually a figure where we see uh that the model learns like a basic representation the attention um the attention plots you mean uh actually 
not those ones yeah like if you zoom in the lot on the left plot you kind of see these these diagonal lines which are spaced out every six and every 12 showing that basically the model is recognizing numbers which have common uh divisors and is specializing to the base you know 6 or 12 representation which is often considered better than the base 10 representation so these plots just to to make it clear these are the cosine similarities between each of the tokens so the tokens would be distributed on the on the axes here these are tokens and these are tokens and then we put the the cosine similarities between every two tokens so naturally obviously every token is going to be very similar to itself but also very similar to its immediate neighbors so it seems to really learn this uh the ordering of all the tokens but then also yeah what what i found special um there is this this structure of the the common sort of the common factors uh common divisors between the tokens that's that's really cool yeah one thing also that's hard to see in this big model but which was much clearer in the small model is like you could see for example the perfect squares would light would be complete outliers you you would get like uh 9 16 25 49 which would completely stand apart due to yeah like sort of special properties i think that here so here is 49 right that that kind of stands out right this is this is this this gap yeah that's something which we haven't really been able to understand some guy sent me an email actually saying oh maybe i have an idea that there's a gap between 46 and 48 because uh uh like 45 has lots of factors of five and three whereas 48 has lots of twos and and and there must be some explanation or maybe it's just something due to optimization it's very hard to know okay yeah i think in this at this point it's it's a bit um also important that we look at the data generation process you you you give the model a bunch of options right to generate sequences and these are where do i have them so here we have the the operators that it can use on the left hand side are the integer operators and then the float operators would be in addition to the ones on or sorry no they're they're repeated in part but also um there are more in the float formulas and then you just generate in um reverse polish notation is that correct exactly so you generate reverse polish notation formulas given these these things and you can also have integer pre-factors right for for all the things so either you sample integers or you sample you sample the current element index or you sample previous elements of the sequence so the model could express you know if it's the fifth element take phi that current number times the previous element plus 2 times the cosine of something either a constant or again referring to some previous element or something like this is is there a logic behind why you chose the the why you made these choices of how you generate these formulas so actually if you look at this table indeed there are much more operators for the real case at the floating point numbers but you do notice that in terms of binary operators there are two which you can see in the integer set up but you don't see in the float setup which are integer division and modulus and this is this really illustrates that we're trying to learn rather different things in the two setups really in the integer setup we're focusing on sort of arithmetics and arithmetic properties of numbers whereas in the float setup we're really interested 
in a let's say a more classic uh symbolic regression problem with with complex operators yeah and yeah as you said our generation process is basically to build a mathematical tree uh sorry a unary binary tree this is like previous works by by foswan and guillaume and then indeed we fill in the nodes of these trees either with operators so the nodes are filled in with operators either binary or unary and then the leaves of the tree indeed as you said can be either variables or or constants and as you said uh the the choice of generators actually basically the front the the hardest part let's say of this problem because um one thing that's nice when you do these kind of uh symbolic math problems is that you basically have an infinite data set your data is just synthetically generated and so you can train as long as you want you don't have any sort of um you know you don't have any overfitting issues you don't you don't have to regularize that much you don't have to even the hyper parameter choices aren't that important what is really crucial here is like how you build your formulas and that's what makes the problem i think really quite fun to play around with because it's a bit like you know uh teaching a kid how to learn maths like you really have to figure out what is the best thing to show the model at what time and what is gonna you want the day set to be kind of hard uh so they can deal with complex cases but if it's too hard it's gonna learn more slowly i mean it's really an interesting problem how to generate the data and you decided just by playing around because so you you do have as we said you you have these these particular ingredients and i mean you can always say why didn't you have more or less and so on but you know you have a table of a bunch of operations that you can do you decided um as well to make uh to allow the model to use these sort of recurrence relations right to allow the model to say uh not only i want five times n plus two but i maybe i want five times n plus two times the the the previous or the the time step two steps back or something like this uh is there a reason behind you know including these recurrence relation is that just something you thought would be more interesting or did you look at the database and see that that's a lot of how these sequences are made it's true that often people look at the problem they want to solve in order to choose the parameters of their generation for example sometimes people use different weights for how to sample which operators to sample like they'll put more additions than multiplication or they'll here we have for example if you go right to the left here we have these hyper parameters for our generator uh for example you can see here the probability of choosing a constant leaf uh or index leaf so n or a previous term uh well yeah probably we could have like tuned these parameters somehow but here we really wanted to have the simplest choice possible on the rationale that basically our our data set is so huge um that is eventually we're going to see all possible uh formulas at some point it doesn't matter that much the specific values we choose and we don't want to tune them to a specific problem um and so this is why we really chose like very standard uh and also for the operators like we didn't use any particular probabilities with which to sample such and such operator we just left let everything as general as possible and this would be so this is built up as a tree because naturally you can parse uh these things as 
a tree you can generate them as a tree to have the sort of correct grammar but ultimately you end up with as we said this reverse polish notation which is a sequence right it's so this would be this would be one such formula not you you wouldn't have x but you would maybe have n or something like this so but ultimately this results in a sequence of tokens right so the input your model is these numbers encoded in tokens and the output is a sequence of these symbolic uh tokens yeah did you also investigate sort of the um the embedding space of the output vocabulary yes actually a good question so we did look at that and actually it didn't have any particular structure you could have expected maybe like cosine and sine are going to be close to in the embedding space um i think what's happening is that the uh the output space is actually much smaller right because in the input space we have a lot of tokens like we have for integers we have one to ten thousand that's like ten thousand words so it really tries to find a structure in the inputs for the outputs we only have a very small vocabulary compared to usual nlp tasks we only have like about 30 operators and so essentially if you look at the high dimensional space and you do it tsne you won't see much because it's just equally spreading these operators in the sphere or something like that there isn't much logic to it here um and how let's say um how universal are these sequences right how how many sequences that i could come up with freely would be inside of the scope of your model and like are there's are there is there a significant class of sequences that your grammar could not express so um with this unary binary true representation you can pretty much represent any function so of course there are some sequences which don't have any logic to them which aren't generated by a recurrence formula in which case you can't represent these sequences and that typically is the case with most of the sequences from the oeis database there's not an actual so we had to get rid of quite a lot of them and do some filtering now i did say that you can represent any function but there is a limitation though is that some functions are very difficult to express with this uh tree approach if you think for example of the collat sequence uh uh where basically uh for odd numbers you multiply by three add one and for for even numbers you divide by two that's the rule which is possible to express with a mathematical expression essentially what you do is say is write it as n modulus two times what you do if it's uh even plus yeah one minus one minus that yeah but that's kind of an involved way to write it and generally the model is going to struggle to output that because it won't have seen it much during training that's one important thing also which which we might discuss a bit more is that it's like a model i'm sorry sorry is that our model is kind of biased to the most likely experience the likelihood of the expression to be generated during training yeah i wanted to just it's like a hack right that we as programmers uh have for an if condition right it's just something we learned at some point like oh look you can you can make an if you have an if condition you can express it as like if you i don't know people program numpy or something like this that's exactly what you do right you don't say if you make like your mask with one minus you know whatever condition and plot you know and you multiply by this and then you have uh you know that and i think anyone 
who programs numpy or tensorflow or so on knows okay i can do it like this and then my stuff is you know expressible and differentiable as one formula so and but i think that's a that's a hack we learn right and if we just if we just generate data at random like you do you this is not something you come across as often as we come across when we you know program exactly yeah yeah it's uh it's very unlikely to see this formulation in our and uh in our data sets yeah absolutely okay cool um but but you know at the end of the day you generate a giant data set right there you go you go through it with transformers and you you make sort of you emphasize transformers is there something special about transformers like because couldn't can i use any any deep learning thing or you know why transformers well um first of all like previous experience i mean um guillermo and francois have been working on these transformers they've basically always been good at the problems we've given them likely i mean one natural justification is that as we saw for the outputs you can represent math as a language in a very easy way it's actually we can see here that it's much harder to represent the inputs as as tokens but the formulas themselves are very easy to represent as as a language with this polish notation thing and so it's very natural to use transformers because they're our best models to to deal with language um so yeah i think that's the the main reason and um yeah i i i'm not sure what else we could particularly i mean we could use like uh rnns etc but you know these days transformers are so powerful i mean these models we used we didn't even as i was saying before we didn't have to tune them much we just basically took the same architecture that was used in the paper two years ago uh we didn't even have to change the learning rate like it's pretty amazing how easy it is to train these things okay um yeah so the the transformers are natural way to deal with with sequences and from text learning we we kind of know this but we always learn sort of on on human text right and then and that has a particular structure and i want to think if i look at these sequences they're almost like there's so many symbolic formulas that could possibly explain these sequences and yeah i can you say you make you want maybe the simplest sequence or you you know you you don't want your your formulas to blow up that's you even generate only formulas that are let's say relatively simple so there's clearly a bias towards simplicity but i still there are a lot of things that explain the same sequence so i'm i'm thinking more is it like if when we as humans do these tasks um is it like a property of of humanity and civilization that we kind of come up with the same sequences that the person you know who made the riddle came up with is it because we kind of think alike right because of whatever society or or our environments that shaped us or is or is there like a property of math that says that says well if actually if you look for the simplest sequence it is kind of defined even though there are infinite possibilities like you do you know a little bit what i mean is it more like a property of humanity or of of mathematics i think it's probably two different things so as far as humans is concerned indeed we we tend to prefer simplicity that's like our ocam's razor principle we like going for the compressing information and going for the simplest representation um in terms of our algorithm here we didn't put at all this simplicity 
inductive bias from an explicit point of view we didn't tell the model give us the the simplest formula actually we could have done so because we could have for example given a penalty to like the decoder when it generates two long sequences for example but we didn't have to do this at all because the inductive bias comes from from the fact that simple formulas are more likely to be generated by the generator and that's basically the rationale behind our model is that it's always going to be biased towards the most likely formula uh corresponding to the sequence and as we were saying before sometimes that's not good because for the collat sequence it's going to struggle to output the one minus the mask thing um but in general that's kind of what we want in iq tests we ask we ask for the simplest formula to explain the observations is there is there i'm thinking of are there more things rather than just you know number sequences where something like symbolic regression could could be valuable i for example i've always thought of maybe reinforcement learning would be much powerful much more powerful right if we didn't only even if if agents have a world model what they call a world model they usually have like uh look almost like a numeric world model they just forward predict the the values that are going to happen there i always thought well if i had like a symbolic representation of the world i could you know be do much more powerful planning is there are you thinking of applications like these when you develop this right beyond number sequences or is there any interesting ones that you know come to your mind so as i was saying my co-author it comes from reinforcement learning and they've already been a few papers inserting like some symbolic parts into rl um loops uh and that's definitely gonna help indeed as you say i mean if you're a robot and you're trying to understand the world then you're gonna be it's gonna be much easier if you understand newton's law if you manage to if you want to for example predict how objects are going to move it's much easier once you understand newton's law then using like a specific vision uh model to to try and predict that's going to be much more complicated so indeed i think symbolic regression is going to be very useful for rl uh from my point of view i'm more from the physics background and that's also where a domain where symbolic regression would be very useful because typically i mean so we have these two approaches right we have numerical regression and we have symbolic regression and i think they're very complementary in the sense that numerical regression is complex is very good on complex tasks where you don't necessarily have a simple explanation for the for the data and symbolic regression is great for sort of inferring uh data where you have a simple underlying rule typically in physics like inferring laws from observation so yeah i think rl and physics are definitely two huge domains of application for symbolic regression and to to make this a bit a bit clearer so what i've done is in the appendix you actually have some success and failure cases of your model and so i have i have made a little quiz out of them and hidden hidden a bunch of them right here and i just want to draw people's attention a little bit to um some of the some of the so on the left the left three columns are success cases and the right three columns are failure cases both of the integer model right so these are integer valued sequences and do do i have this correctly 
you do consider it only a success if the formula is equivalent or do you consider it already a success if just the predicted values are the same um you can have the two criterions the criteria we choose in the papers we want the uh the evaluations to be the same so even if it comes up with like a different formula it's it's fine as long as like the the ones you tested on uh match yeah that's actually one tricky thing is that indeed you can't really rely on the formula to check if it was correct or not due to the degeneracy and so some papers have circumvented this by using like an rl loop because if you try to really uh supervise the formula then you can't make some you have to evaluate the formula which is non-determined non-deterministic and then you can't like back propagate this and so some people have used sort of rl loops to to provide reward signals from the evaluations uh what we do is directly supervise the tokens of the formula and and that okay maybe we can discuss this a bit later but that's also interesting because you know you could think this is weird because our our model is supervised to a formula and it's going to be penalized if it outputs at training um an equivalent formula yeah but that turns out to not be too bad and we tried we tried we tried expression simplification and it didn't help at all it doesn't really matter but yeah this is very interesting what you're going to come to uh with the uh success and failure cases yeah so the the leftmost column here is is pretty simple these are okay people already know it's success cases so nothing too unexpected right here like it figures out um that for example the middle formula this might be a bit small here even for people to read but this is um n n times the sine of uh gamma and gamma is what exactly uh it's a euler's constant euler's cons okay so n times the s the sine of gamma squared so the entire thing on the right hand side is a oh sorry is a constant right so it's essentially n times a constant yeah uh so the the model what it has to do is it has to somehow figure out the expression for the constant as a formula right because it it can't it it it uh it has to yeah it cannot just predict the number um and then it has to realize that i have to multiply this constant by n and that's why it's a straight line so and the other formulas are similar ish the top one for example is n minus the cosine of n and yeah again reminder these are this is symbolic symbolic regression um now the next ones are weird so here the top one it starts off very very weird but then it continues uh in the same path and you can still you can see sort of okay it's regular enough that the model could you know figure it out from the data points it has by the way the the green background that's the input right the blue background that's that's the what it has to predict so the next one i find particularly interesting it is the formula is the tan of the the tangent of n plus n times the last element and this is what the output looks like so you know how like how can the model from the just the left part figure out that this is the correct formula and then the the end date that just blows my mind like how does that work maybe the log scale would help a bit here because uh there is probably quite a lot of variability in the in the first terms and it's just squashed by the last term which is huge okay yeah i should have made it for the log scale um that's a good question yeah what is it what i find really interesting with these plots so here you're 
showing the success plots and on the right hand side you have the the failure plots is that we really see how symbolic regression is different from numeric regression like in direct regression you have this set of points and basically you're just trying to fit your function you're trying to bend the function so that it goes through the through the input points and so this is typically going to be very prone to overfitting right if you if you can't really understand the process then you're just going to fit a function which goes through the points whereas symbolic regression here isn't biased towards uh overfitting at all it's just trying to find a formula and so when it fails on the right hand side it not only fails outside the input points but also on the input points it's not even able to fit the points you gave it so this really shows a big difference we can see this a little bit i think so on the bottom left there's a there's a nice case where it it can it already fails yeah on the inputs like that's the best formula it can come up with you do have a beam search in there right these ones no no no tends to pull a bit more towards overfitting because in beam search you so the way we rank our beam is that we evaluate how uh how well the um formula matches the input points and so in that sense you're coming a bit closer to like actually overfitting the input points but if you use a beam size of one as using most of our experiments then essentially yeah you're not at all biased towards um overfitting okay yeah i mean this it seems like here it's just misjudged the formula on the top left is an interesting one where it just it looks like it's done everything correctly right it looks like so the red ones are the the outputs that it's supposed to match and the black one is the the line the function it produces what's wrong here is it like off by a tiny bit yeah so the screen is pixelated so i can't see very well but sorry but yeah um essentially we get two kinds of mistakes we get the mistakes where it's very close for example it confuses a like a four with a five and so it's going to be very close but then you have catastrophic failures uh where basically for example to confuse a cosine with an exponential or something like that you know that's just one token error but it's going to give completely wrong predictions and that's something that you typically won't get for for numerical regression you'll always at least fit your inputs however there is one thing where symbolic regression is better than uh numeric regression is that once it does find the correct formula then it's going to get predict you know perfect precision on all all the the subsequent numbers you're going to give it for if you think for example of um of extrapolating the sequence with a numerical model you're always at some point gonna you know uh get wrong predictions because you're not very good at generalizing outside yes typical thing that deep machine learning is good at interpolating but bad extrapolating but with symbolic regression once you've found the correct formula you can basically extrapolate as far as you want you you've got the right formula yeah and and so just saying uh for people who probably even people in the video will not be able to read i can confirm the formulas of these two things are completely different like the one is the sign of something simple and the one that's predicted is a very very complicated formula that just happens to almost fit or or maybe even perfectly fit uh the input data points 
right but then it is just that tiny bit off and that that gets worse and worse as the sort of the output progresses okay yeah so yeah there are a bunch of about a bunch of other funny ones like this one again these the scale here is um the scale here is observed it's like a e the exponent is 224 and there's just this one output that it's supposed to match and i mean that's just that's just mean to the model honestly yeah we do have i mean horrible expressions like our generator uses up to 10 operators and so if you look at expressions here we only chose expressions with three operators yeah you can imagine how horrible the expressions are with with 10 operators yeah and of course the accuracies are much lower i mean if you look at the ablation like our performance at 10 operates is about 10 versus you know 100 when you only have one operator yeah so i will i will uh i'll quickly uncover uh the rest of these but people are i encourage people to actually go and and look at the the success and failure cases also for the floating models i think it's it's really valuable and you can directly see as you say you know the differences between symbolic uh regression and i mean if you did numeric regression even if it has like a pattern like this like a zigzag pattern or something it would quickly degrade uh we've all seen sort of sort of numeric regression although as in your experiment so maybe we'll come to to this last so in your experiments there are um cases where the numeric regression is worse and there are cases where the numeric regression is actually better than the symbolic regression would could you want to maybe comment a little bit on the experiments specifically like in distribution out of distribution so typically in in distribution our um our symbolic model performs better than the direct model because it's it's got the right inductive bias right really we we feed in these these sequences which are generated by a formula and so it's much better than the numeric model at extrapolation because once it's got the correct formula it's going to give perfectly precise accurate predictions extrapolated as far as it wants etc however it is um slightly less good at out of domain uh generalization so one thing you see here uh in is i can't remember where it is in the paper but you see that um for example numeric regression is better when you have uh complex pre-factors right because here the expressions we generate the the pre-factors we have are built from like uh integers between 1 and 10 e and pi yeah and so that's well fitted for the symbolic model but what happens if you replace these prefactors by like prefactors which is sampled from a you know a gaussian distribution so these these two columns right here the difference between those yeah exactly and so what's interesting here is that uh in this case of course the numerical regression performs better than symbolic because numeric doesn't care at all about the fact that you're using these pre-factors because it doesn't really care uh it isn't trying to approximate these complex pre-factors what's interesting though is that the symbolic model still isn't that bad because it's actually able to approximate pre-factors with its own vocabulary and you've probably got a table with a few examples of this um and this is actually a purely something we discovered we weren't expecting this at all we suddenly like plotted the predictions of the model and we realized what it was doing yeah so okay for example here if you use the constants 0.3333 and 
you feed it to our symbolic model well of course it can't directly uh output 0.333 times n because it doesn't have 0.3 in its vocabulary so it's going to have to build somehow this this constant with its own building blocks and you can see that it does that pretty uh remarkably well uh and this is very surprising it's basically what happened is that during training it has seen some expressions because our expressions aren't simplified right so so we don't have something that is going to evaluate the expression so sometimes it sees a formula which has three plus exponential of minus six and it will notice what a numerical value that evaluates to in terms of the sequence and so it kind of learns to build any constant with its own vocabulary and it's important to say that you don't like other if if i see this i would first assume that you have some sort of uh gradient based regressor in there like that approximates these constants for you but you don't right the model actually has learned that uh to to output these symbolic expressions for particular constants yeah that's something i think which is a bit uh rather novel here is that we have an end-to-end transformer usually in symbolic regression you have a model which predicts a skeleton so an expression without pre-factors and then you sort of fill in the pre-factors with a separate solver here our model does uh finding the pre-factors all by itself so that's nice in a sense because it's like mathematically satisfying and it also gives us some quite nice approximations for example here you can see with 1.64493 it outputs pi squared over six and you may know that that's the sum of the inverse of squares and uh i think euler in his time spent quite a lot you know he had to actually found he found this you know numerical value and he spent some time figuring out that it was pi squared over six so that could potentially be useful for mathematicians um of course the drawback of it is that this is a complex process and if you have a very complex equation with lots of complex pre-factors then our model is going to spend a lot of its uh attention to building these pre-factors and it's going to make the task more complex and this is why i think our model isn't directly applicable to like real world problems like you know forecasting where you have very complex prefactors in front of each term of the equation is there any any other surprising things that you learned in the in the experiments um i mean maybe unsurprisingly a model like this is better than mathematica which i would have expected because i'm not i'm not a big fan of mathematica like stephen wolfram is cool but i'm not too too much into the way mathematica does things except for very very particular applications well i mean that's okay i mean it isn't that bad actually i was surprised at how how good it was i mean it has like these two built-in functions uh find sequence function and find new recurrence and basically find sequence function is going to find like non-recurrent formula it verifies yeah so for example if you feed it 2 4 8 16 is going to say 2 to the n whereas finally linear occurrence is really for uh when it depends of the previous terms in a linear fashion and and these are actually pretty powerful because a lot of uh sequences are linear and and mathematical will always basically get these right um because actually you can there's a there's a deterministic rule to find the the linear occurrence so that's that's fine uh find sequence function is very limited of course and 
You can see it gives somewhat worse results on OEIS, but still, these functions aren't miles away from our model. I think both our models and the Mathematica functions are struggling a bit with OEIS; they are outside their comfort zone. One thing I should say is that we're not evaluating on random sequences from OEIS: we selected those which carry the label "easy", which means there is a logic behind them, a recurrence relation, or at least not necessarily a recurrence relation but some rule.

Just to clarify, the other ones, you gave some examples in the paper, would be things like the number of bus stops on successive streets in New York City, where you can't possibly know the next term unless you consult some outside knowledge.

Yes, OEIS has a lot of nerdy sequences which are there just for the fun of it. But even among the ones labeled easy, many sequences don't have a recurrence relation, for example the sequence of primes, the sequence of divisors of n, or the sequence of decimals of pi. You can't really predict these, and they hamper our models. So I don't think this is the best way to show the power of our model: it is especially powerful on sequences built from our generator, which can be very complex, whereas on OEIS our models are only a tiny bit better than Mathematica. I wouldn't say it's the most impressive result.

And they are specifically also worse than the numeric models, right? You can see the numeric models do outperform here. That might be, one, because of the distribution shift, and two, because even sequences labeled easy may still require outside knowledge, and a numeric model will at least sometimes come close to the solution, close enough to count as correct.

Exactly. The numeric model is generally better when there isn't a simple formula but you can still infer the logic. If you play a bit with the demo you'll notice that sometimes you give a sequence that is very simple for us, and for some reason the symbolic model won't recognize it, because it relies on a kind of logic that we can't easily express as a formula, and the numeric model will be very good at that.

So I'm going to quickly open the demo, I hope I have it ready somewhere. Meanwhile, maybe you can tell us: in the course of this research, was there a moment where it didn't work at all? You had some basis to go on from the earlier work of Guillaume and François, but what was the biggest problem you encountered?

To be honest, I was surprised at how quickly we were able to get models working in the first place, at least on integer sequences; it was pretty quick to get results. As I was saying before, we just plugged in our transformer; we only had to build the generator, which isn't that hard. What we struggled with a bit was finding a baseline to compare against. This is why we built the numeric task: looking at recurrent sequences is such a novel path in symbolic regression that we didn't have benchmarks or established things to compare to.
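Since both the symbolic model and the numeric baseline are ultimately judged on the terms they extrapolate, the "close enough to count as correct" criterion mentioned above can be pictured as a simple tolerance check. The snippet below is only a toy version of that idea, with a made-up 5% relative tolerance; the paper's actual metric and thresholds may differ.

```python
# Toy version (my own thresholds, not the paper's exact metric) of the
# "close enough counts as correct" idea: a prediction passes if every
# extrapolated term lies within a relative tolerance of the ground truth.
def is_correct(pred_next, true_next, rel_tol=0.05):
    return all(
        abs(p - t) <= rel_tol * max(abs(t), 1.0)
        for p, t in zip(pred_next, true_next)
    )

true_next = [16.0, 32.0, 64.0]
print(is_correct([16.0, 32.0, 64.0], true_next))  # True: exact (symbolic) answer
print(is_correct([15.8, 31.5, 63.1], true_next))  # True: approximate but close enough
print(is_correct([20.0, 40.0, 80.0], true_next))  # False: off by 25%
```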
And it's a bit disappointing to show in-distribution accuracy results when you have nothing to compare to, so we built the numeric model just for that purpose. In terms of challenges, really, I was surprised; it was much easier than I thought.

That's interesting, because we interviewed Guillaume and co-authors about a previous paper on Machine Learning Street Talk, and I asked them pretty much the same question, and they also said, you know, we plugged it in and it worked out. So maybe it's forbidden knowledge, but this might be a field of deep learning where you can still get results fairly quickly, or at least get started with something that works, whereas in, say, reinforcement learning you spend months until something actually starts working.

Yes, and the explanation is simple: it's a synthetic task, so you have infinite data. The big problem for deep neural networks is when they don't have much data; then you really have to get clever about how you regularize, how you choose your hyperparameters, how you build your architecture. Here you can just throw anything at it and it will learn, as long as it has enough parameters, and for that you need a lot of compute. The transformer here is pretty big, every epoch we train on five million equations, and it trained for about three weeks on 16 GPUs, so it's a pretty large-scale thing.

Lastly, I just want to present this demo you built so people can try it out for themselves. If I input 1, 2, 4, 8, that should already be enough, and then I click away and it computes: it tells me the next terms are 16, 32, 64. That's pretty impressive. I wanted to challenge it a little, so I thought of something like a music sequence, though that's probably too regular; let's see. I think it will get that one right. Yes, it does; that one is fairly regular if I look at the plot. I invite people to go and challenge the model a little right here; you can also choose sequences from the OEIS database. Check out the model, this is really cool.

All right, is there anything special that we haven't come to that you want to mention about the paper itself? No, that was great for me, thanks for your questions. It was great for me as well; I'm always happy when I can ask all my dumb questions to the authors themselves, in this case Stéphane. Thank you very much, thank you and your co-authors for writing the paper, and thank you so much for being here, this was really fun. Thanks a lot.
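To illustrate the "synthetic task, infinite data" point from the interview, here is a deliberately tiny, hypothetical sketch of what a recurrence generator can look like: sample a random recurrence, unroll it into a sequence, and emit (formula, terms) pairs as training examples. The operator set, coefficient range, and function names are my own simplifications, not the authors' actual generator.

```python
import operator
import random

# Toy, hypothetical sketch of a synthetic training-data generator:
# sample a random order-2 recurrence u_n = u_{n-1} <op> c * u_{n-2},
# unroll it, and emit (formula, terms) pairs. The authors' generator is
# far richer (more operators, variable recurrence degree, float constants,
# n itself as an operand, etc.); this only shows the overall shape of the idea.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def sample_recurrence(rng: random.Random):
    op_name = rng.choice(sorted(OPS))
    coeff = rng.randint(1, 3)
    formula = f"u[n] = u[n-1] {op_name} {coeff}*u[n-2]"
    step = lambda u_prev, u_prev2: OPS[op_name](u_prev, coeff * u_prev2)
    return formula, step

def unroll(step, n_terms, u0=1, u1=1):
    terms = [u0, u1]
    while len(terms) < n_terms:
        terms.append(step(terms[-1], terms[-2]))
    return terms

rng = random.Random(0)
for _ in range(3):
    formula, step = sample_recurrence(rng)
    print(formula, unroll(step, 10))
# e.g. "u[n] = u[n-1] + 1*u[n-2]" unrolls to the Fibonacci numbers
```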
Info
Channel: Yannic Kilcher
Views: 19,420
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, research, symbolic, symbolic regression, neuro symbolic computation, integer sequences, oeis, number sequences, ai number sequences, machine learning sequences, integer sequence rules, embedding space, transformers, attention mechanism, sequence generation, learning number sequences, predicting number sequences, facebook ai, meta ai, beam search, symbolic vs numeric
Id: 1HEdXwEYrGM
Length: 71min 10sec (4270 seconds)
Published: Sat Jan 29 2022