The Tiny Model Revolution with Ronen Eldan and Yuanzhi Li of Microsoft Research

Captions
"One of the most important abilities for generative models is to be able to speak coherent English." "Her mom didn't let her have a dog, so she asked for a..." and when you try to autocomplete this, now, you know, the most common noun that you've seen so far, and also the most proximate one, is dog, not cat; dog already appears twice in the sentence. Even GPT-2 XL, which has 1.5 billion parameters, its most likely completion is still "dog." "I'm just going to read one tiny story, straight out of the paper, because I think that will help people understand what this dataset ultimately is: Tom has a big pot of soup. He wants to share it with Jane. Jane takes a spoonful of soup, but then she makes a face. The soup is... that's the prompt, and then you show a completion: very bitter. She does not like it. She says, 'I don't like this soup. It is too bitter.' He looks around the kitchen and finds some bread and cheese. He puts them on the table and says, 'Here, Jane, you can have some bread and cheese. They are not bitter. They are sweet and yummy.' Jane is happy. She says, 'Thank you, Tom. You are a good friend. I like bread and cheese. They are not bitter.'"

Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host Erik Torenberg.

Hello, and welcome back to the Cognitive Revolution. Today's episode is great for anyone who really wants to deepen their understanding of, and intuition for, how language models really work; certainly as measured by how much I learned in the course of the conversation, it's one of our very best. Our guests, Ronen Eldan and Yuanzhi Li of Microsoft Research, have created a small natural-language dataset called TinyStories, which they designed to reflect the full richness of natural language while still being small and conceptually simple enough to support research with modest compute budgets. They did this by using GPT-4 to systematically create one million children's stories, using only words that an advanced three-year-old could be expected to know. Dataset in hand, they then began to explore a number of aspects of language model performance, behavior, and mechanism by training a series of models that range in size from just one million to a maximum of 33 million parameters, still just two percent the size of GPT-2. They then used these small models to explore the development of language model reasoning abilities, identifying so-called logical primitives: beginning with a basic understanding of grammar, followed by the learning of facts, and then eventually adding certain logical micro-skills such as negation and exclusion. These findings create the perfect context in which to discuss the tricky and often controversial topic of emergence, as well as to compare and contrast how large language models learn with how human children learn, and to explain how the differences that we see across language models and children do in fact make some sense given the different incentive structures in play in each case.

They also did some great interpretability work in this paper, and I really relished the chance to get into all three areas that they explore. First, they look at the trade-off between the number of layers in a Transformer, which to a large extent governs the number of logical leaps that a model can make, and, on the other hand, the width of a layer, which seems to determine how many facts the model can store. They also identify attention heads with distinct roles, including distance heads, which simply reflect the distance between tokens and which look almost exactly like the ALiBi scheme that is now powering long-context models such as Claude 100K and the recent MosaicML 65K release, and, on the other hand, semantic attention heads, which focus on meaning. That there should exist such completely different attention heads within a single model, and that an ALiBi-like scheme should emerge in the wild, is, to me, really mind-blowing.
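For readers who want that picture in code: ALiBi (Attention with Linear Biases) adds a per-head penalty to attention scores that grows linearly with the distance between query and key, which is the kind of behavior a "distance head" implements. Here is a minimal NumPy sketch of that bias; the slope schedule roughly follows the published ALiBi recipe and is an illustration, not code from the TinyStories paper.

```python
# Illustrative sketch of an ALiBi-style distance bias (not from the paper).
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) additive attention bias of -slope * distance."""
    # Head-specific geometric slopes, roughly as in the ALiBi paper (for 8 heads: 1/2, 1/4, ..., 1/256).
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]   # key position minus query position
    rel = np.minimum(rel, 0)            # only penalize attention to earlier tokens
    return slopes[:, None, None] * rel[None, :, :]

bias = alibi_bias(seq_len=6, num_heads=4)
print(bias[0])  # head 0: zeros on the diagonal, increasingly negative further back in the context
```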
Finally, they examine the role of individual neurons, finding that many of their small models' neurons do in fact correspond to human-interpretable concepts. We close the conversation by zooming out and discussing why small models are more interpretable than large models, the challenges inherent in attempting to extend this work to larger-scale models, and why controlling language models might end up being more like horseback riding than microbiology.

Throughout this conversation I was really struck by two things. First, it seems to me that we've only scratched the surface of the potential for curriculum learning approaches. I fully expect that we'll start to see ever more sophisticated approaches which use specific datasets to layer on specific skills in a strategic, progressive manner, creating highly specialized small-scale models that can solve specific problems extremely efficiently. Second, the value of these toy models for developing understanding really is tremendous. If I could make just one suggestion to listeners who want to get the absolute most out of this episode, it would be to visit the Hugging Face website and try playing around with some of the bigger models that they've released. The very biggest are still only 33 million parameters, which means that they can load easily and run quickly right from the Hugging Face model page. If you do that, as I did in preparation for this episode, you will actually have the chance to explore a lot of the concepts, and you can set up your own little experiments to test the reasoning ability of these models.
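If you prefer to run that experiment from a script rather than the model page, a minimal sketch along these lines should work. The model ID below is an assumption based on how the released checkpoints appear to be named on the Hugging Face Hub (check the Hub for the exact names), and the prompt is one of the reasoning prompts discussed later in the conversation.

```python
# Minimal sketch for trying one of the released TinyStories checkpoints locally.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "roneneldan/TinyStories-33M"  # assumed Hub ID; verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
# If the repo does not bundle a tokenizer, the GPT-Neo tokenizer
# ("EleutherAI/gpt-neo-125M") is a reasonable substitute.
model = AutoModelForCausalLM.from_pretrained(model_id)

# One of the reasoning prompts discussed later in the episode.
prompt = ("Lily likes cats and dogs. She asked her mom for a dog and her mom said no, "
          "so instead she asked")

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```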
I guarantee that you will come away with a deeper understanding that you will retain for longer and much better. And if you do find anything interesting, I would love to hear about it, so please do reach out to me via our email, TCR@turpentine.com, or on Twitter, where you can always DM me at @labenz. Now, I hope you enjoy this elucidating conversation with Ronen Eldan and Yuanzhi Li of Microsoft Research.

Ronen Eldan and Yuanzhi Li, welcome to the Cognitive Revolution.

Thank you so much. We're super happy to be here.

You guys have just published this paper called TinyStories, and I think it's a really fascinating bit of research on multiple levels, so I'm really excited to dive into it with you. It touches on a bunch of different themes, including some of the hot-button themes that we'll get to around emergent capabilities and reasoning, and you guys are studying this in a very unique way that makes the problem, I think, more tractable and more approachable, hopefully for our listeners as well, so I'm really excited to introduce this work to them. Maybe just for starters, can you give me a little bit of an introduction to what inspired the TinyStories project?

I guess I'm kind of new to LLMs, or to deep learning in general; I come from pure math, and when I started looking into architectures, trying to understand what those models are actually doing, how to improve them, and so on, I very quickly got very, very frustrated, because it's very easy to come up with ideas, but in order to actually check whether an idea is good, almost always you need to do an experiment that involves a lot of compute. It's just very, very hard to check things. You can either train small models, which basically don't do much, in the sense that they don't actually generate text that sounds coherent, or you can train maybe a BERT-size model, and then it'll do something on some downstream tasks, but whatever it does doesn't look much like what those LLMs are doing. If you want to really get an LLM experience, you need to do an experiment with a lot of compute that involves tons of GPUs and so on. So for me, this was just a way to address the frustration of not being able to get any insights without having to do large experiments.

And so the main way that you have accomplished that, if I understand correctly, is by kind of narrowing the conceptual space of what the dataset contains and then, obviously, what the model is trained to do, right? Instead of taking a small cup out of the whole ocean of mixed-up, everything language, you've created a kind of "we're going to tackle one very consistent type of input."

That's fair. I guess we should mention there have been many attempts to come up with a synthetic, or non-synthetic, smaller dataset that has all those elements that those large language corpora have. So in language you have all sorts of elements. First of all, you have grammar and vocabulary; those are the obvious things you have in language. But then you have facts, you have reasoning that you can infer from those texts, and there are also many layers of reasoning, and I guess there are many capabilities involved in being able to parse those datasets. So our initial motivation was to come up with a dataset that has all these qualitative elements but, on the other hand, is just not as massive as those large language corpora. And, you know, Yuanzhi and I... so first of all, as I said, there are many synthetic datasets out there, and some of them, I think, reflect certain aspects of language in a pretty good way, such as reasoning or facts or grammar, but we felt like there was no single dataset that has all those dimensions together, integrated into something which is not too large. And we felt like, in order to understand, in order to gain insights about LLMs, we need a dataset that has all those elements.

Yeah, I was just going to add that I also came from a theory background, so I had been doing theory of machine learning since maybe seven or eight years ago, when the field just got started, and at that time everyone was doing research on vision models. For vision there are very nice datasets like CIFAR-10, or even MNIST; those are very small datasets, they only have around 50K images, and when you train on those datasets you can get a pretty high-quality image model that can do all sorts of things, and they reflect what's going on in real large models. At that time, doing research or making progress, on both the theory side and the applied side, was kind of easy, because training those models only takes a day at most.
But when we moved to this phase of large language models, or language models in general, the research has become so expensive, and I've seen all those blog posts saying that it's impossible to do a PhD now in machine learning without something like eight A100s, and probably only one percent of PhD students have that amount of compute. So we really wanted to see whether there is a way to bring the good old days, those CIFAR-style datasets with fast experiment iterations, back to the language side, and that's what motivated us to consider this small dataset, or this simple dataset. I think the point is, there are other kinds of synthetic or simple datasets that are, let's say, reflecting some aspects of natural language, but they are not real natural language; they are just doing simple arithmetic, or simple string matching, or number manipulation. They're not real natural language, and we wanted to keep the authenticity of natural language but just reduce the overall complexity, so we are still studying natural language, not some symbolic manipulation, and still the iteration, the experiments, can be done in a very quick way.

You created the TinyStories dataset. I always like to be as concrete as possible, so I'm just going to read one tiny story, straight out of the paper, because I think that will help people understand what this dataset ultimately is: "Tom and Jane are friends. One day, Jane goes to Tom's house. Tom has a big pot of soup. He wants to share it with Jane. 'Jane, do you want some soup?' Tom asks. 'Yes, please. It looks yummy,' Jane says. Tom pours some soup into two bowls. He gives one bowl to Jane. Jane takes a spoonful of soup, but then she makes a face. The soup is..." Now, this is just an example presented in the paper. That's the prompt, and then you show a completion, and you compare and contrast this against other open-source models, but I'll just read the 28-million-parameter version that you guys trained: "The soup is very bitter. She does not like it. She says, 'I don't like this soup. It is too bitter.' Tom is sorry. He says, 'I'm sorry, Jane. I didn't know you don't like bitter soup. I will make you something else.' He looks around the kitchen and finds some bread and cheese. He puts them on the table and says, 'Here, Jane, you can have some bread and cheese. They are not bitter. They are sweet and yummy.' Jane is happy. She says, 'Thank you, Tom. You are a good friend. I like bread and cheese. They are not bitter.'"

Hey, we'll continue our interview in a moment after a word from our sponsors. Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use COGREV to get a 10% discount.

So there is our whole tiny story, and I read it almost like I'm reading to my four-year-old or my two-year-old, because it is kind of a children's story, and I understand that that is also part of the motivation. So how did you create this dataset? How do you conceptually think about those stories? You told us a little bit already about having those key elements: the grammar, the facts, some amount of reasoning required. But how did you create them? How big is this dataset?

From this motivation to have a good synthetic dataset, we should just point out that maybe the most natural idea is to rely on human development, right? It already has the solution for us, because young children are able to speak English somewhat, like, mostly coherently. I have a daughter, I can testify: not extremely coherently, but somewhat coherently. So there's already a solution to this coming from human development. So all you have to do is just create a dataset and make sure that basically any example can be understood by a small child on one hand, and on the other hand you want it to span as much as possible of the knowledge that a small child has; you want it to be as diverse as possible. And we decided that it makes sense to have this dataset somewhat structured, so just the structure of a story kind of makes sense, because inside a story you can have all those elements combined together: grammar, facts, reasoning, stuff like that. And I think it was just a really good time to try to create this dataset, because finally we have those models, GPT-3.5 and GPT-4, which can actually understand the instruction: I want a story which is somewhat creative and only has very simple words.

So GPT-4 wrote these stories?

Yeah, most of those stories were written by GPT-4, some of them by 3.5. 3.5 is already good enough to write these kinds of stories, but it's not that great; GPT-4 is definitely doing a better job. Now, it's pretty easy to just write a story, right? If I just want a short story, even GPT-2 can probably write a decent short story. The problem is to actually get a diverse dataset that spans all the vocabulary that you actually want to span. If you just ask even GPT-4 to create a short story, and you do it a thousand times, and you do it with a rather high temperature, let's say temperature one, which gives rise to about the most diversity you can get, still about one fifth of the stories will be about children being scared of the slides at the park. We actually did this experiment. So it's not very creative: if you just tell it to create a story without any other instructions, you're going to get a very repetitive dataset, and the whole game is how do you get diversity, how do you make the dataset not be very repetitive. And here the idea was just to collect a list of vocabulary, of simple words; we have about 2,000 words which, supposedly, three-year-olds understand. Then what we do is we just ask GPT-4: okay, here's one random verb, one random noun, and one random adjective; try to combine these into a story in a creative way. We do about a million calls like that; I think we have about 1.5 million stories in the dataset. So on one hand the stories definitely span all this vocabulary, because there are only two thousand words, but on the other hand you definitely do not span all possible combinations of words, so if the model can later create a story with some prescribed combination, you will have demonstrated that the model has some creativity inside it. Yeah, so that's how we created it.
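To make that recipe concrete, here is a rough sketch of the kind of generation loop being described. The word lists, the prompt wording, and the API call are illustrative assumptions rather than the exact pipeline used to build TinyStories, which repeated this on the order of a million times.

```python
# Rough sketch of the described data-generation recipe: sample one verb, one noun,
# and one adjective from a small vocabulary and ask GPT-4 to weave them into a
# simple story. Word lists and prompt wording here are illustrative assumptions.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VERBS = ["jump", "share", "find"]        # the real vocabulary is roughly 2,000 simple words
NOUNS = ["soup", "dog", "slide"]
ADJECTIVES = ["bitter", "ancient", "sad"]
FEATURES = ["a dialogue", "a plot twist", "a bad ending"]

def make_prompt() -> str:
    verb, noun, adj = random.choice(VERBS), random.choice(NOUNS), random.choice(ADJECTIVES)
    feature = random.choice(FEATURES)
    return (f"Write a short story using only very simple words that a three-year-old "
            f"would understand. The story should use the verb '{verb}', the noun '{noun}', "
            f"and the adjective '{adj}' in a creative way, and it should contain {feature}.")

def generate_story() -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": make_prompt()}],
        temperature=1.0,  # high temperature for diversity, as discussed above
    )
    return response.choices[0].message.content

print(generate_story())
```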
That's really interesting. Just to do a little bit of math on this: a million-ish stories, which, first of all, answers the question of why not go use real stories, because that's a lot of books to scan. There might not even be that many; I don't know how many children's books there are, but we'd have to get our hands on a whole lot to get to a million. So the need for the synthetic data is there. I'm
interested in what you were seeing that was like way better about gpc4 versus 3.5 it sounds like with the stories that were repetitively about the slides I would interpret that as maybe like mode collapse like reflection of kind of you know effective rlhf likely is that how you would understand that too huh I think it's mostly just the model is regenerating the most likely stories because that's what the model is doing if you don't give the model any content you will just bleed out the most likely stories because it wants to minimize this language model laws so potentially just a child scared of the slice is the most common story yet that exists on the internet so the model just learned to generate it without any given conditions so that's why we want to create some condition to move the model outside this particular like high probability Zoom one difference that we see between GPT 3.5 and gpt4 is after it so you give it three words it needs to combine them somehow in the story and sometimes it's not that easy even if I give you three words I don't know I think there's an example there uh ancient uh Thunder and sad or something like that you want to combine them in a way that won't look kind of too superficial you want to create a story that actually seems fluent and you don't want like a complete change of Topic in order to you know be able to combine the next word and gpt4 seems to be able to do this pretty fluently whereas GPT 3.5 sometimes you get stories that don't make so much sense the words appear there but they don't appear in a very satisfactory way it totally makes sense also for you to note like yeah maybe this is just the most common story we don't necessarily have to invoke any like exotic theories of why it keeps talking about slides but I wonder if the non-rlhift version you know if you had access to kind of the base gpt4 model if that might have been different I had the opportunity to Red Team the gpd4 early before it had all the safety measures but it did already have the rlhf and you know kind of the instruction following capability um I never saw the like totally base model I don't think that many people did they in the technical report they said it was not you you know not people weren't sure what to do with it right so I think they maybe put that one out to pasture did you guys try any different versions of gpt4 I was obviously being you know Microsoft you you might have some privileged uh access to different versions that the rest of us wouldn't have yeah so we did we did have access to an earlier version that I'm I mean you know we're not sure about the exact technical like what what the difference is from the model that's now available to the public but that model had less safety feature is on it but as you said it it may be the same model you add access to so it did have a certain extent of rlhf right now I I think if you take you know just the language model trend on the pile without any rlhf and you you say you know create a story such that blah blah blah maybe the most plausible completion in terms of a random you know entry from the distribution of web pages is I don't want to do it like write me a story such that that's the question answer I don't feel like it or you know it could be that that you know the net instead of completing the story it's just gonna ask you know another question without any rlhf without any alignment that the model has to to I mean the model doesn't know that it's supposed to actually perform your your instructions right you know 
that's the the most basic part of it but I I think that the rlhf they did on gpt4 is is just good enough so that it makes sure to as accurately as it could actually you know satisfy the constraints that you give it so it almost always combines the words that you ask it to combine into the story and also it almost always actually writes a story that only uses simple words yeah I think as you wanna point out the biggest difference of the pace like gbt4 model compared to this RL HF version is just following instructions I mean for base model you can give it as a beginning of the story the completion is very good but he will just say write me some story that combines certain elements then the model has a very hard time understanding the interaction because those things are very rare in the internet it probably needs some fine tune so the model understands what does it mean by interaction it's not like a conversation it's an instruction when you ask write me a story yeah makes sense yeah at best you could maybe do a few shot approach but then you probably have I've seen a lot of issues with over indexing on the examples as well so yeah I totally totally if you should approach would probably Prime the model into you know thinking of specific plots also I'm doing some AI 101 type education uh the friends company right now called Athena and I was just doing a webinar this morning where I was getting into that with folks saying you know if you are going to do a few shot probably don't do one example in your fuse shot at least do two because otherwise you tend to get this like over in context learning on just the one example you gave so yeah I'm with you on on that for sure so a little bit of math so there's like 2000 words there are you know let's say that's just to take a round number um a thousand each of kind of you know the verb the noun the adjective so that in you know pure you know extended expanded form is a billion POS possible bags of three words roughly order magnitude right so then you made a million stories so I just wanted to establish that the space of possibility versus the actual data set that these models were trained on is about a thousand to one ratio do I have that roughly right that's true I think it's pretty accurate only in addition to those three words we also have another another way to add diversity which is a bunch of features we ask uh GPT to add to the story such as a plot twist a bad ending the dialogue so that adds a little bit more diversity but one in a thousand is a pretty good like uh ballpark estimate of the ratio cool and then just cost of this if we were going to pay like retail price for gpt4 to write all these million stories that would be if each one is say 300 tokens I'll just take a nice round number because that maybe equates it to roughly one cent per it would be like a ten thousand dollar gbt4 retail price to generate the data set that's pretty accurate Okay cool so then this curriculum concept is I think Super fascinating and I this is one of the areas that that had me so intrigued by the paper you're taking inspiration you know obviously as you said from human development and you know starting with simple words which definitely makes sense as an approach I always kind of try to keep in mind as well that like these things are very alien and you know I do I'm very intrigued by this curriculum sort of approach but I wonder what about like more weird curriculums uh this is maybe outside of the scope of like this particular research but I kind of keep 
waiting for somebody to show up with a like we trained it first on like pure logic notation you know we've seen this a little bit that's kind of been discussed a lot recently that the code pre-trained models seem to demonstrate better reasoning you know once the kind of language part gets added on you know obviously who knows exactly what that baking recipe looks like how do you guys think about that like do you expect the same thing that that somebody's gonna pop up with a hey we did a pure logic or we did like you know just massive amounts of like abstract algebra first and kind of taught some sort of you know structure that we then were able to layer natural language onto he could definitely help because uh we we based on our previous research on the I mean attentions in language models they are kind of some simple attention mechanism that the language model may have the first one is just it's associating two tokens that are exactly the same and the second one is after it Associated to tokens that are exactly the same it also copies uh the tokens around the first token to the second token so it's just like us when we read some word we go back and see what's the previous time that this word appeared and was surrounding Hong text and I think just training this height is actually pretty expensive it requires a lot of training data and something like coding or logic is the perfect way to train those hats because for coding when we Define variable we definitely need to look back like what's the previous definition when we call a function like we check that function we see what the function is doing so it kind of May set the language model into learning those important Concepts like look back or checking the surroundings and that may serve as a very good warm star for training on other things like simple natural languages it may makes a model learn much faster yeah so so maybe let me I mean and another way to say what you Angie said is yes I mean we do we do observe this to a certain extent that you know maybe coding improves models reason and I think at this point there is no like overwhelming evidence that this is actually the case but there are some observations but we are not sure at all that the reason behind it is that you know when the model learned how to write code it actually learned how to reason it looks like rather the reasons that this this works are are there are the explanations are just much simpler you just managed to calibrate the exact attention heads that you need and those attention ads don't have any particular sophistication in them they might just be able to very accurately look at some relative position to a given token or just compare are two tokens in a very precise way and so so that means the reason is more like the types of components in your neural network that are required for coding are already there but those components are are pretty simple it's not like the network has like very sophisticated neural paths that you know emerge after the training that actually know how to do reasoning and for that we we actually have a paper we wrote at Microsoft research about a synthetic task called Lego so this is a very basic synthetic task that has the core elements of reasoning and what we observe there is that so we use a a Transformer based on a birth architecture and what we observe is that the pre-trained Bird Transformer uh basically grasps this reasoning task much faster the task is is something simple it's it's basically you get a string that looks like a equals 
to one b equals to minus a C equals to minus b d equals to C etc etc and you have to resolve you know the values of our variables and you know at first we thought you know maybe there's some in some kind of profound sense The pre-trained Bert model has learned how to reason and this is why it grasps this task so well but if you dig into it just a little bit what you realize it's the explanation is just much simpler much more superficial than that it's just the pre-training has given rise to some simple uh attention heads that if if you just just initialize the model with those attention ads then it basically grasps reasoning much faster this explanation is is is much closer to what actually happens when you train a model to code and then it exhibits better reasons and incapabilities so this is maybe a good time to talk about what we mean by reasoning um and I see a ton of confusion out there about this maybe you can help us you know get a little bit more clarity I guess one thing that I kind of observe is you know of course people are debating this capability and it seems like you know you've got kind of different standards of evidence or like you know people put the burden of proof in different places to put my cards on the table you know I kind of conceive of myself I call myself these days an AI Scout and I'm really interested in like what is possible what can be done not necessarily holding the systems to the standard of that they can do it every time or that they can do it in all cases you know or it certainly matters like how adversariably robust they are um but you know I wouldn't uh you know say oh well we found an example that it failed on therefore like if you know can't do X if it could do you know X nine out of 10 times before it got to that kind of crazy example but how are you guys thinking about reasoning as you know something more than a binary obviously in the context of This research initially we think well reasoning as something that is just a subset of consistency when we generate sentences or when we say things we need to make sure that they are consistent with what we said before and there's like the first level of consistency which is just a nearby words they need to follow some grammar rules and follow some basic cinematics and those are not really reasoning it's just some simple like more like stochastic parrot where you just do simple pattern matching just to cut the previous like couple words and just generate one that is consistent well I think what goes to reasoning is when the consistency goes to the next level which is you really need to consist be consistent with like something very far away from the current token like something consistent with the general plot of the story for example there's a word but and then you need to say something in the opposite order and those level of consistencies they are The Primitives of reasoning so we do think that anything beyond just a local consistency should be think of some ability that is recently the first thing we have to say is that type of reasoning we're thinking about is a very very basic it's just some you know basic core capability that comes with speaking coherent English some people I guess still say that you know large language models will never be able to reason I guess they have a very very different definition of what reasoning means than what we have right I mean what we mean by reason and is really like you know the capacity to just apply some basic Logics when you uh generate text right and and I think 
maybe to be concrete we can look at one of the examples in the paper so if we look at the sentence like uh Lily likes cats and dogs she asked her mom for a dog and her mom said no so instead she asked and then you do uh autocomplete we kind of see it as a hierarchy of capabilities so some words in this sentence in order to to complete them to to know what the next word is you just need some very very basic grammatic rules for example she asked her mom for the next word is a for this you only need to know a little bit of grammar and that's it now the next word after a she asked her mom for a dog well you know that the if if you just know grammar you know that the next word should be some noun but here you already need to have some you know contextual tracking of you know what's going on in the text the relevant nouns here could be are probably dogs and cats right those are the two objects that were mentioned in the sentence before now we go to the next sentence and you know you you say like her mom didn't let her have a dog so she asked for a and when you try to auto complete this now you know the most common noun that you've seen so far also the most the the proximate one is dog not cat dog already appears twice in the sentence you know she likes dogs and cats she asks for a dog her mom said no so our smaller models actually complete this by saying dog and even gpt2xl which has 1.5 billion parameters it's most likely completion is still dug because it's still at that level with where it is it did resolve that there should be some noun there and it does know how to look back in the sentence and see that okay there is there are two nouns dog and cat but you know dog appears more so it's more likely that you know if I just had a dog in the previous sentence we're in the just five words before it's gonna be dog again but on top of that if you have a very very basic reasoning capability then you're supposed to be able to apply elimination and realize that okay she can't have a dog we had the set containing the two objects dog and cat but now dog is not allowed so what's left is cat you know we we thought this is one of the most basic examples of a completion that would require some extent of of reasoning there's always this uh I mean this intertwin between reasoning and planning for example when we say reasoning many people would think about mathematical reasoning like I prove a mathematical theorem and that's not only reasoning it's also planning I need to come up with the correct method I have the intuition like what's the next step should be and for us the reasoning that we are more interested in is just consistency like you should say something that are consistent with what you previously said and the consistency is not only local it's Global and that's where kind of we think as the reasoning for natural languages the only thing it needs to do is generate text that's consistent with the prompt like that's that's the only objective a language model has to you know fulfill the next word it generates should be as consistent as possible with all of the prompt in order to achieve this consistency there are several different levels so for most words that it generates the only capability that's actually needed is grammar or maybe not for most words but for many words just by knowing some grammatical rules you know you know that if you have a sentence Amy wanted the next word is probably two she wanted to something and you don't need to know anything beyond that right now the next level after that 
and and this is again very vaguely speaking it's not like there is a very strict hierarchy but the next level is to have some semantic understanding of what's going on or just to to understand what are the relevant nouns actions stuff like that or maybe which action is could be related to Which object etc etc and you know if if and I I guess if you look at models of size say around 1 billion it's very very good up to that level it almost always gives you a word that is grammatically correct and also has the it I I mean this word is is kind of well related it works well with you know the previous few words that you saw or it fits well with with the previous few words in the in the prompt but the next level after that is sometimes already requires you know first order logic or second order logic Etc so you're kind of breaking it down into like micro skills you know I I this is a ridiculous analogy but I'm kind of thinking I follow this guy on Tick Tock who coaches basketball micro skills and it's amazing how many micro skills there are you know involved in being a good basketball player and you know mostly the untrained eye even among basketball fans like can't really enumerate them but this guy has enumerated them and now he's teaching them you know won by very small one and so it's you know maybe similarly where you you wouldn't say like this person can like fully play basketball that probably doesn't even like make a lot of sense or sounds already sounds strange uh but you wouldn't say if there's any missing micro skill that they can't play basketball you have some sort of Continuum there that's like a you know people can be better and worse at playing basketball people can certainly be better at and worse at reasoning and language models too can be better and worse at reasoning and that probably seems to map on to some sort of hierarchy of micro skills that it either has or it doesn't have or it's like you know maybe in the process of grocking you know at any given point in training so that leads me to the other you know big bold uh vocabulary word that I want to dig in on a little bit which is emergence again tons of confusion tons of different uh you know meanings out there I think some people mean like things that surprise us that we didn't necessarily predict some people mean like things that happen suddenly I guess what I kind of think is like it seems like there is some process where and I always think back to the the gracking paper and the Neil Nanda exploration of that which I'm sure you guys are at least you know somewhat familiar with where there is a phase change from initial memorization to a circuit which you know what's so amazing about their work is they actually show this circuit in very like concrete terms and it's like and this is the circuit that does this algorithm that allows it to generalize you know to the full set you know from just the sample data that it was originally trained on you know I don't know that we have any circuits here that we could kind of elucidate but does that feel right to you do you feel like there's this sort of process of like memorization sort of being gradually replaced by like concrete circuits that solve particular micro skill challenges is that your model of what's going on under the hood here okay let's take a step back okay like people talk about emergence I guess we we can both agree that you know this is a very this is not a well-defined notion at all you know it's not like it's also it's not like you see a sudden phase transition 
from the model not being able to do something to you just slightly increase the size and then suddenly it has this it's really good at some capability it didn't have it all before rather you know it's that vaguely defined term saying that there are some qualitative capabilities that you know the model at certain sizes uh has whether at smaller sizes there's almost no trace of these capabilities at all like you know uh gpt2 did not know how to summarize text and suddenly it gpt3 you have this summarization capability but the the notion is uh not well defined at all on on the other hand we do see that as we increase the size of the model suddenly you have some certain capabilities you didn't have before and you know in a sense I think a good analogy is if you compare dogs monkeys and humans right you increase increase the size of the brain suddenly you know humans can do math where whereas monkeys cannot so you know it's it's an ability that emerged when you made the neural network larger not that I imply trying to imply at all that the same mechanism explains you know both things we have no idea about that but you know what we say here is one of the most important abilities for generative models is to be able to speak coherent English right this is an ability that we see emerge also in a much larger networks trained on you know those large language corpora and I think tiny stories basically gives you a much smaller data set where you can observe this emergence you know with much smaller scales of models in in the sense that you know if the model is uh one million parameters then it can hardly generate coherence stories and and if you go to 10 million then almost all stories will be coherent same with reasoning like uh you know one to five million parameter model all of our reasoning prompts fail whereas for 30 million almost all of them succeed now as you and you said all of this basically has to do with keeping coherence with the text so so you have the emergency of the ability to generate the next word in a coherent way on different levels of difficulty so the easiest level of difficulty we could think of is just when you have something that follows from some easy grammatical rules and then you know you have you you can think of like different level of levels of difficulty sometimes you need to know a certain fact in order to be able to complete the next word right Jack was hungry so he went looking for some you you to to complete this you have to know that you know to to satisfy the desire for about like the to satisfy hunger you need some food right so sometimes you need to know a fact so there there's all those core capabilities that are necessary in order to keep consistency along the text and each one of them we can actually witness its emergence as we increase the size of the model so what then is the theory of what is happening I have a theory but I want to hear yours on so you you give that example a minute ago right uh girl wants you through a cat or dog mom says no dog so it's gonna be a cat but gpt2 you know much bigger model whatever that's like 30 times bigger than your biggest in this uh in this research right like you're you kind of max out around 30 million parameters GPT two is like what 1.5 or something so it's a lot bigger 50 times bigger Maybe it still says dog which is you know every we can all tell that's kind of obviously wrong you've got these much smaller models that can get that right you know in some way there's this emergent you know observed phenomenon that it 
is able to get that you know exclusion concept what do you think is happening there is is that a micro skill that is like you know this sort of exclusion you know that that's like a little piece of reasoning that is grocked by the small model but not by the big model I mean it seems like there's you could be a really good stochastic parrot but it feels like there's something there that has like truly kind of settled in to the structure of the network and maybe that didn't happen with gpt2 because the data was just too noisy and it's kind of all over the place and so it it wasn't able to learn those same things how am I doing here does that does that resonate as likely uh true or even plausible yeah I think that's a valid conjecture uh what we think is for gbd2 because it twinned on Vibe data or just think of like Wikipedia where you'll try to minimize the language model laws the consistency is like the least concern it's more about just getting the knowledge correct I mean you're talking about some object or some person and you want to know his birthday or you want to know some specific aspect of that person I mean this has nothing to do with natural language it's more about just a sheer amount of knowledge that we encounter in the web data or something like Wikipedia so the model I mean because a model I mean I think both gpd2 and our model they are not large enough to minimize the loss to full extent like gpd4 so they have to select some part of the laws that they focus on and if the data has too many knowledges or too many other nuances then the model may just focus on other aspects compared to consistency well here our tiny story data really pinned down like because the language is simple and they I mean the vocabularies are simple the really difficult part is consistency and that is where the model Focus to minimize the training loss and that's why I think our model although is much smaller it gets better consistency compared to the larger ones yeah you could you can think about it as you know when you train a model on the entire pile on on you know those large language corpora they have much more incentive to learn the clothes preferred clothing styles of celebrities much before they learn how to complete this you know sentence with the the dog and the cat you know most of they definitely learned that you know Joe Biden is the president of the United States way way before they learn how to reason this is a conjecture of course I I didn't uh actually test it but I'm pretty sure that's what happens you know you just overload the model with so many facts that appear in so many places only once in many words that you generate this capability of reasoning becomes relevant so I think it's a really good exercises just to open a random Wikipedia article I I it's it's one of my favorite activities since I started thinking about you know language language models and just go over the Wikipedia article word by word and just think for every word what core abilities do you need to you use in order to guess you know what the next word is and what you'll see is I think definitely for for uh a a random Wikipedia article or a random example in in the pile in you know some some web training set you will see that only once in every 20 to 30 words to predict the next word you need to use some reasoning capabilities for one in every three or four words you need grammar for most words you can just kind of guess the next word using if if I just tell you what the net what were the nouns and verbs that 
appeared in the previous sentences are without telling you anything about the context or you know things which are farther away in the text you will still be able to guess them in order so so the reasoning is is kind of a rare capability that you need it only becomes relevant you know uh pretty rarely and and therefore you will you will that like the capacity of the model will be dedicated to other things much much before you will get any reason capabilities that's also why we think of why we always kind of see there's an emergency Behavior It's really because the reasoning or those like very rare consistency events they actually happen very rarely so even so only if you minimize the loss to a certain extent you'll start to learn those rare events and your model feels like different for example just a simple example say about feels hungry but he doesn't like sweet food so he went to eat I mean if the model says when to eat some candy then we think that this model knows nothing about what he's talking about right but this is only just one word difference out of the 10 to 20 words and only if the model get to that extent it starts to learn this consistency and we feel like the models start to know what it's talking about but in terms of laws it's probably we only see like less than 10 percent difference I mean if we turn into image classification when I tell you my model got like 90 correct and you have a model that gets 95 I wouldn't consider this as an emergent availability that you get five more percent but for language model this five more percent may actually be the emergent behavior and especially for Mass like if if I really want to solve a mass problem there is only like the connection between the two sentences where I need to make sure that my proofs are extremely coherent well most of the part I'm just completing some formulas write down the results but is only this very tiny amount that defines that gives the model the emergence capability so I think the two things are connected that the models and learning reasoning is like a hard task that it only gets at the very end and also we see the emergence capability for if we grow the size of the model or if we grow the number of training data yeah maybe let me just expand on your example you know so okay we have this sentence Bob could have either candy or pizza Bob doesn't like sweet food so Bob got some and autocomplete okay now the model usually if the previous sentence says something about sweet food if we don't read the entire sentence the most likely completion is actually candy not pizza you have to be read it in a pretty nuanced way in order to realize that Bob actually doesn't like sweet food right sweet food and candy come together so many so many times in the training data right so the neural network has to in quotation marks choose between using garbage capacity in order to be able to resolve these Nuance or to use its capacity in order to know that Joe Biden is the president of the United States right you can't have everything together the model has a finite size and there is some theoretical limit to you know the the amount of things the model can learn the model will definitely prefer to learn that Joe Biden is President and you know many many other facts because they re because they are relevant much much more frequently the curriculum development is space is likely to be a huge unlock over the not too distant future right I mean you're kind of seems like you're probably just kind of scratching the surface here because 
we've got web scale data that you know is not built for this purpose obviously you know where you're saying reasoning doesn't even it's not even required that often and so no wonder it kind of emerges late in the game um that you know maybe pre-training on code you know is kind of changing that in some interesting ways but man intentional design around what a gradual you know gradually like upstepping um curriculum might look like especially with the ability to you know to create the training data synthetically to you know kind of really kind of isolate and bring those you know key skills forward it seems like you could rebalance training data and probably shrink it like a ton and get to a lot of the you know just by kind of Shifting the balance right to get the these kind of emergent things to be more important relative to just kind of you know mind-numbing repetition of you know who's the president or whatever it sounds like I mean you're nodding that it sounds like you that aligns with your expectations too it's important when we design like a new version of the data for example to extend the degree of hours into maybe elementary school or middle school I think it's very important to balance the amount of knowledge in the data versus the kind of or ability that we want the data set to teach the model for example like if you go to elementary school there's ability to do simple math or mathematical reasoning or some Physics reasoning or do comparison of historical events I mean those will take some capacity of the model maybe it will actually take a very big portion and the remaining ones I mean if your data set has too many knowledges then the model may just prefer to use its capability or use its capacity to actually memorize the knowledge instead of really learning those abilities so we want to balance that there's some amount of knowledge that the model must have in order to base it to basic stuff like mass or basic physics reasoning but more importantly there's there should be a lot of data that only M4 size on the ability side like there's no new knowledge it's just a bunch of match between examples or a bunch of simple comparison of some basic historical events or some simple physical rules and their explanation or different varieties so that the model can actually focus on the ability part instead of just being screwed by the vast amount of knowledge so I think the read right now the web data They Don't Really balance between knowledge and ability training so that's why Trinity on them is not good for a small model because they need to allocate their capability or capacity to just memorization so I think that's basically the criteria for maybe design like better version of synthetic data I guess we can just kind of at maybe a relevant Notions here would be the breadth and the depth of the the data set or the capabilities of the model so you know the the the entire web is very very broad it's also so by breads I mean you know it has a lot of facts the vocabulary is very large you need to have a lot of knowledge to capture the data set and by depth what I mean is you know it has first and second and third order logic that you can infer from learning this data set and this is not well established I mean there is no I I don't think there is any research that that that really establishes that there is a trade-off between the two but you know it's it's very very reasonable to assume that there should be a trade-off between breadth and depth uh when you train the model yeah I think 
there's some optimal ratio because without the knowledge you can't really do reasoning like you have to have some basic knowledge let's say candy sweet in order to do recently and when you go to elementary school I mean you have to know some basic events like some basic rules for maths like one plus one is equal to two in order to do mathematical reasoning so there's a balance you cannot have no knowledge but you cannot have all the knowledge so maybe there's some optimal ratio between like price and depths my head keeps coming back to the same kind of thing where yeah we could should see so much gain from kind of rebalancing the data set and maybe even you know starting with some more abstract things like I could see sort of a or b not a therefore you know and then do that with like just you know all the letters and then start to you know introduce kind of these associations and layer that on like it it sure seems like there is a lot of of opportunity there yeah you're just mentioning the extreme case of only teaching reasoning because they are everything is just symbols there's no knowledge and it's all about just reasoning and maybe it's good to combine this with just something that is pure knowledge and maybe we can guide something good by just adjusting the ritual yeah maybe it's it's not clear at this point if you know a human being you can when you teach a human being you can kind of almost separate the knowledge and the reasoning right you can just say fact a blah blah blah fact B blah blah blah now here's how to reason I mean it has to be a pretty smart human being but but in general you know we are able to take those two things and then combine them so that in the end we will have those queries and capabilities that we learned using you know exercises that only involve if not a then if a a then b means not B then not a right when we learned when we study to the SATs we have you know those rules and and uh and separately we have the knowledge of facts and we're able to combine them and I think it's an important question whether it's even feasible in in language models to just take those things separate them into two different modules and you know will the the model actually be able to combine those abilities my conjecture by the way is no like as long as as you don't kind of combine them enough in the data set the model is not going to be able to infer uh the way that a human consciously infers the connection but if we could do that then this would be of course like a very very powerful technique to drain models uh I just did another interview with a couple guys from Mosaic ML and they talked about training at times on like massive client data sets even when they do that they typically still mix in you know kind of the general pre-training you know file or whatever because otherwise they see catastrophic forgetting so they have to you know kind of keep some mix at all kind of stages of training to avoid that so I think that would be very consistent I think with what you said like you you probably can't do it in strict phases um there's got to be some sort of like mixing strategy throughout the process right and that's going to be very non-trivial to do um it it might be feasible but it's definitely not and is it it's not gonna happen on its own just because you know what the model cares about is just being able to efficiently OCTA complete samples in the data set if the data set is either knowledge or reasoning then it has zero incentive to combine and even if you give it a few 
And even if you give it a few examples where the two actually are combined, no one assures you that it will be able to take those two modalities and really use them in combination. So I feel like this is a science we're only beginning to understand. Hopefully we'll figure out a way to actually do it, but I'm not sure — I don't know of any concrete example where this seems to work right now. We lack concrete evidence that curriculum learning really works, but we believe it should be helpful. Overall, as Ronen said, the model has no incentive to connect the different phases in which you train things: when you start a new phase, it just greedily minimizes everything related to that phase, and it can simply forget everything it has learned. So it's definitely a non-trivial task to get this curriculum learning to work.

You mentioned a number of interesting empirical observations, and you can address this however you want, but you noted that grammar emerges before consistency and creativity, and consistency is related to reasoning in the ordering you described earlier. Is that the same thing I've observed in my kids? Maybe my memory is off, maybe I'm sleep-deprived, but I feel like grammar came last for them. They definitely have a certain consistency: they want what they want. If they want ice cream, the next token is ice cream, and they're pretty consistent on that. Creativity, I don't know, that's a little trickier. Does this feel like it echoes human development to you? To me, I'm not sure it does.

I love this example. I think the learning process for children is very different. Children's incentive is not to say the correct next word; they want ice cream, and their incentive is the outcome: I get ice cream. If they just say "ice cream, ice cream, ice cream" many times, maybe they don't get the best grade for creativity, but they will likely get the ice cream — depending on the self-discipline of the parents and whether ice cream is limited in that household. I have very close experience with this exact scenario. But more seriously, when children produce language — and I'm only basing this on my own observation — they have constant contact with the physical world, and they know which entities are involved in the current conversation. We are talking about a book that I just read, so it's very unlikely that the next sentence will be about a car; they have that entity in their heads — we're not talking about a car, we're talking about this book. Whereas for a language model, if it makes one mistake in one word, saying "car" instead of "book," the loss it incurs is actually not as big as the loss it would incur for incorrect grammar, which is relevant in almost every word it produces. And not only that: for the language model it's much cheaper, much easier, to have consistent grammar. You only have to be consistent inside the same sentence for the grammar to be correct; you don't even have to look more than five or six words back.
Whereas children have a few entities that are active in their working memory. The entity could be "let's go," or "I want to go to the park and have some ice cream," so the two relevant entities are the park and the ice cream, and you're not going to replace those two entities in the middle of the conversation — that's what the child cares about at this point. So the incentive system is different, but the whole cognitive perception underlying the conversation is also pretty different.

Maybe a good way to think about it: a language model, definitely in its first phase of training, doesn't need to look back three, four, five sentences before what it produces. A good example is one of the GPT-2 XL generations in the paper. We have a story about a snowman: two kids are building a snowman, a dog comes and destroys the snowman, and then you let the model auto-complete to see what happens. Maybe I'll actually read it, because I think it emphasizes pretty well what I'm trying to say. There are these two kids, Sara and Ben; they built a snowman; a dog comes and destroys the snowman; and now GPT-2 XL's completion looks something like this: "Sara and Ben cry and look at the snowman in the snow. Ben asks to do something. Sara tells him to go away. Her hand tightens on the little red button. The snowman is covered in fur. It is red. It is real hair. It is a real snowman. It looks just like the ones they used to make. Ben has a big smile on his face. He looks at the face he has made. He walks over and kisses the eyes. He's happy to be alive. He has done what he was going to do. He has saved the world."

So what am I actually trying to say here, other than that GPT-2 XL has a good sense of humor? For almost every sentence here, all GPT-2 XL needs to do in order to generate it is look at the previous sentence, plus maybe know that there is one relevant entity, the snowman. Other than that, there is no consistency at all in the completion that goes beyond two consecutive sentences. Language models, if they're too small or at the beginning of training, don't actually have enough incentive to track the whole context of what's going on, because to complete most words correctly you just need the context of the current sentence, maybe the one before, and maybe one or two important entities. For humans this is completely different: we have agency, we know what we want when we form the next sentence, and we care much less about grammar than about the ice cream.

I would make an analogy: a human child is learning with an RLHF algorithm, doing reinforcement learning with the parents' feedback, and obviously the parents are very robust to grammar mistakes, so the child doesn't really need to optimize grammar to maximize its reward. It probably cares more about consistency and the topic — it wants to get the topic right so it can get the reward. A language model is just doing next-word prediction, and there grammar is going to be penalized much more severely than global consistency.
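As a back-of-the-envelope illustration of that asymmetry — all numbers invented for illustration, not measurements from the paper — if grammar constrains nearly every token in a story while long-range consistency only matters at a handful of tokens, the grammar term dominates the average next-token loss even when each individual consistency mistake is costlier.

```python
# Toy accounting of where the next-token loss could come from.
# All numbers are invented for illustration only.
tokens_per_story = 200
grammar_tokens = 180          # tokens where getting the grammar wrong would hurt
consistency_tokens = 10       # tokens where a long-range entity (book vs. car) matters

loss_per_grammar_error = 2.0      # say, in nats
loss_per_consistency_error = 5.0  # costlier per token, but much rarer

grammar_share = grammar_tokens * loss_per_grammar_error
consistency_share = consistency_tokens * loss_per_consistency_error

total = grammar_share + consistency_share
print(f"grammar:     {grammar_share / total:.0%} of potential loss")
print(f"consistency: {consistency_share / total:.0%} of potential loss")
# Gradient descent chases the bigger bucket first, so grammar gets learned early.
```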
Okay, yeah, I like that. I'm always a little wary of analogies, and I always want to come back and ask what's sneaking into the analogy that I don't want to allow, so I'll bookmark that one. But the surface-level intuition certainly makes a lot of sense, and it's a very clippable analogy as well.

On depth versus hidden-dimension size, which I often just call width — the number of layers versus the width of a layer — you note some interesting trade-offs, and I didn't really have an intuition for why that would be. In the paper you report that the fewer-layer models do better on grammar compared to consistency and reasoning, and from that it seems that more layers are important for this kind of reasoning and consistency. How should I think about that? Is there a story that crystallizes why that would be?

None of this is well established — it's all conjecture that needs to be studied much more — but I think a good way to look at it is that depth tells you how many times information can percolate between the tokens. Every time you have a global attention layer, like a Transformer attention layer, the information inside certain tokens can percolate into other tokens. For example, suppose you have an instruction: create a story with these words, and I want a bad ending as well; and then I type the beginning of the story and the model needs to auto-complete it. In every attention layer, those instructions have only one chance of percolating into the tokens of the story, and sometimes the instructions are nuanced by themselves. Maybe there's an instruction saying the story has a bad ending; to complete the next word it's not enough to know the current sentence and that the story needs a bad ending, you also need the wider context of what happens in the story. So to fulfill these kinds of instructions, the information has to percolate several times between the tokens.

That's also the case for reasoning, if you have first-order logic. Think about the cat-and-dog example: Alice wanted a cat or a dog; her mother didn't let her have a dog, so she got a cat. How many times does the information have to percolate between the tokens here? First you want to understand that there is a cat and a dog involved — that's one layer of global attention. Then you want to know that she couldn't have a dog, so you have this set, cat and dog, and you want to do cat plus dog minus dog equals cat. After you know it's either cat or dog, the fact that she couldn't have a dog needs to percolate into the current token in order to know that the only available option is cat. So there are actually two layers of percolation: the token "not" has to go into the token "dog" to know that dog is not allowed, and then "not" plus "dog" have to percolate into the generation in order to know, okay, I had cat plus dog, and now I have to do minus dog to get cat. If you think about it as coding, you have several conditionals to evaluate and several places where you need pointers to information that appears elsewhere in the text. I'm putting this in a very vague and non-formal way, but I think it's a good initial intuition for what happens.
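A purely symbolic sketch of those two rounds of percolation — my own toy walk-through, not the model's actual circuit: the first hop merges "not" into "dog," and the second hop removes the excluded option from the candidate set.

```python
# Toy simulation of two rounds of attention "percolation" for:
# "Alice wanted a cat or a dog. Her mom didn't let her have a dog, so she asked for a ..."
candidates = {"cat", "dog"}          # established by the "cat or a dog" clause

# Hop 1: the token "not" flows into the token "dog" -> a negated entity.
negated = ("not", "dog")

# Hop 2: the negated entity flows into the candidate set -> exclusion.
remaining = candidates - {negated[1]}

print(remaining)   # {'cat'}  -- two hops, hence (roughly) two attention layers
```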
Whereas when we talk about facts: if I have a completion in a language model like "the capital of France is," all I need is one lookup table with all the countries and their capitals. I don't need many layers of global attention; it's enough to take the two tokens "capital" and "France," put them together, and have one lookup table saying capital plus France equals Paris. Here the dimension seems to play a much more important role, because the bigger the dimension of the space, the more entities I can squeeze into the vector space, and the more neurons I have inside my lookup table to hold the list of all those possible facts.

So another way to say it is that within a single attention block, the attention relationships are not immediately transitive, and you need multiple iterations of attention to create that transitivity. If the current token is looking back at a certain token, but that token is looking back at a previous token, we need two rounds of this to make the two hops. Yeah, exactly: you first need "not" to go into "dog" to know that it's not dog, and then "not dog" needs to go into the set that has both cat and dog in it, so that's already two leaps.

Does that also suggest that for as many logical leaps as you might need, you need roughly that many layers? That you're bounded — if you have two layers you can make maybe two logical leaps? Is that a sensible general heuristic?

I think there's a depth-and-width trade-off. For example, you can simulate two layers of leaps using just one layer, but you'll have to enumerate all the possible two-step combinations, which makes your size go from, say, n to n squared. So if you want to be the most size-efficient, you definitely have to go as deep as the number of logical leaps, but if you're not that deep you can use a wider network to collapse the two steps into one by making the intermediate layer much bigger. Here your question is also pushing at an open door, in the sense that the paper we wrote at Microsoft Research about a synthetic reasoning task we call LEGO is exactly a task with multiple leaps of reasoning, and we see a very direct connection between the number of layers you need and the number of reasoning steps required to complete the task. But, maybe somewhat surprisingly — this is actually not in the paper, it's follow-up work — we see that the model finds very interesting and sophisticated ways to do multiple leaps of reasoning within a single layer. So definitely more layers help, but it's not a strict upper bound on the number of leaps you can do.
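To see where the n-versus-n-squared intuition comes from, here is a toy count of my own construction (not the paper's): answering "X or Y, not X → ?" with two sequential steps needs tables that grow linearly with the vocabulary, while doing it in a single lookup needs an entry for every ordered pair.

```python
from itertools import permutations

vocab = ["cat", "dog", "bird", "fish", "frog"]   # toy vocabulary of n nouns
n = len(vocab)

# "Deep" solution: two small steps executed one after the other.
step1 = {x: x for x in vocab}                    # resolve "not X" -> the excluded entity X

def step2(pair, excluded):
    # remove the excluded entity from the candidate pair
    return (set(pair) - {excluded}).pop()

deep_table_size = len(step1)                     # grows like O(n)

# "Shallow" solution: one giant lookup keyed on every ordered (excluded, other) combination.
flat = {(x, y): y for x, y in permutations(vocab, 2)}    # ("not x" in context {x, y}) -> y
shallow_table_size = len(flat)                   # grows like O(n^2)

print(deep_table_size, shallow_table_size)       # 5 vs 20
print(step2(("cat", "dog"), step1["dog"]))       # 'cat'
print(flat[("dog", "cat")])                      # 'cat'
```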
I can only wish it could be quite that simple, but that's really interesting — I'm learning a lot from this. The interpretability part of this paper is also really interesting. You break it down into the attention portion and then the MLP neurons portion, and a couple of things jumped out at me. One was, in the attention portion, you seem to observe that there are two sorts of attention heads: one that really just focuses on the distance relationship between tokens, and another that is more semantic. With the distance one I thought, holy moly, does that look like the ALiBi scheme that has recently become popular with these super-long context windows. I don't know if you've had a chance to study that, but it's quite an uncanny resemblance, right? You're showing all these attention heads where one has a very tight attention range and others have different lengths, and that's almost exactly what they cook up as a substitute for positional embeddings in the ALiBi research, at least as far as I understand it. Do you see that same parallel?

Yeah, I think they're definitely doing the same thing, which could explain why ALiBi is so helpful: the positional embedding in ALiBi is already initialized to do this multi-scale, distance-based attention, whereas with absolute positional encoding, like what we use here, the model has to learn to discover this position-based attention on its own. So hard-coding that distance-based attention seems like a really good choice, based on our observation. The shorter-range heads are really responsible for learning the grammar, while the longer-range ones make sure your content is consistent globally, or just grab the associated words — for example, you have Alice in one sentence and you had Alice five sentences ago, and you want to make sure those two words have a chance to be put together.

Let me just point out that, clearly, to complete the next word you usually need two things: you want to know the proximate words, the most recent words you saw, and you want to know the most important entities in the story, which will be words whose semantic meaning is relevant to what you want to complete. If I remember correctly, what happens in ALiBi is a bit of a mix of both: you take every attention head, prescribe some scale, and the strength of attention decays with the distance inside the text — if I'm not mistaken, that's what happens there. And one surprising aspect of what we actually see is a dichotomy: there are heads that only care about distance and other heads that only care about semantics, and there's hardly a mix between the two. We have to say this is only for one attention block — we haven't really checked what the Transformer does with multiple attention layers — but if you train a network with one attention block, it seems the network learns to separate distance-based attention from semantic-based attention: some heads attend to tokens based only on their distance, and other heads attend to tokens based on semantic similarity.
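For readers who haven't seen ALiBi, here is a minimal sketch of the distance-decay bias being discussed: each head gets a fixed slope, and attention scores are penalized linearly with token distance before the softmax, which hard-codes the kind of multi-scale, distance-based heads these small models appear to learn on their own. The geometric slope schedule below is a common choice; treat the details as an approximation rather than the exact published recipe.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    # One fixed slope per head, geometrically spaced (a common ALiBi-style choice).
    slopes = np.array([2 ** -(8 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = np.arange(seq_len)
    distance = positions[None, :] - positions[:, None]      # j - i (negative for past tokens)
    distance = np.where(distance > 0, -np.inf, distance)    # causal mask: no attending forward
    # bias[h, i, j] = -slope_h * (i - j): nearby tokens are penalized less than far ones.
    return slopes[:, None, None] * distance

def attend(scores, bias):
    # scores: [heads, seq, seq] raw query-key dot products; add the bias before softmax.
    z = scores + bias
    z = z - z.max(axis=-1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

bias = alibi_bias(seq_len=6, num_heads=4)
weights = attend(np.zeros((4, 6, 6)), bias)
print(weights[0, -1].round(2))   # largest slope: attention concentrated on recent tokens
print(weights[-1, -1].round(2))  # smallest slope: attention spread over a longer range
```

Each head's slope plays the role of the "scale" mentioned above: small slopes give the long-range, consistency-style heads, and large slopes give the tight, grammar-style heads.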
That's pretty profound in and of itself: that dichotomy emerges even though you didn't initialize it that way. With ALiBi they engineered it through what was probably trial and error, heuristics, and guesses, but here it's happening entirely on its own. Is there anything else we can say about what you see in the semantic heads? When I looked at those visualizations of the semantic blocks, no light bulbs went off in my head, but is there anything you would highlight from studying them?

I think the most interesting one we see is the attention to the main characters' names. For example, there's some head where every token just attends to Tom and Lucy, the two main characters. I think this is pretty important — it's justified to identify which people are involved in the story, so that the next time you generate something it's not going to say Bob, it's going to say something consistent. So the semantic heads, at least what we see in the one-attention-block model, are mostly about this type of attention: they identify the main objects in the sentence and make sure that most of the relevant tokens attend to those objects. When you have "the" or "a," you will attend to "banana," so you know to complete the next word as "banana" instead of something completely made up. Those semantic attention heads are really useful for keeping the English consistent inside the Transformer.

Let me add that it's very natural that, to complete the next word, you want to know the relevant characters and the relevant entities in the story. But no one expects a priori that you will have such clean attention heads — an attention head that exactly attends to the characters and a different attention head that exactly attends to the objects, in our example the banana and the park. A priori we might expect it to all be one big mess, with every attention head attending to a little bit of everything, and then why would it be interpretable at all? But it's quite surprising that when the model is small enough, we can actually give meaning to both the attention heads and the neurons.

Does that fall apart if you add a second layer? Does it just become messier again? As you start to stack layers, what does that look like?

Yeah, when your Transformer gets deeper or larger, it definitely becomes messier. If the Transformer is small, it really needs to learn those separate modules in order to minimize the loss, but if it has more degrees of freedom it has the luxury of, for example, using five attention heads to simulate one, or three layers to do what could be done in one. It has no incentive to be as precise or as economical as the smaller ones, so it's actually less interpretable, and we also observed that when we tried to interpret the neurons.

That's a perfect bridge to talk about the neurons. Maybe give a little understanding of the technique you used — I'll try to summarize it quickly and you tell me where I'm wrong. You run a ton of stories through the model and you look for which tokens specifically maximize the activation of a certain neuron, and then you can print out the snippets and the individual tokens that maximize the activation for that particular neuron. And then, holy moly, it really looks like there's a pretty coherent concept as you scan down the list of things that correspond to high activation on any given neuron.
That's pretty accurate. You have these middle layers in the MLP, which we can think of as neurons — those are really the coordinates that can either be activated or not; they're basically the only non-linearity in there. And again, just like with the attention heads, a priori it's not at all clear that they would have any meaning. They're just different coordinates of a certain vector space; no one promises you that the neural network is going to use one particular coordinate for one particular task. Maybe let me mention that this basically follows an idea suggested in a 2015 paper by Li et al. called "Visualizing and Understanding Neural Models in NLP": the idea is to look at the tokens which induce the highest activations for every neuron in a given text and see whether those tokens have a common role. When we look at larger models like GPT-2 XL and try to look at those tokens, at least the two of us could not find any common meaning — the same neuron is activated sometimes on nouns, sometimes on verbs, with no clear pattern whatsoever. Whereas when we take a small model, there is, for example, one particular neuron that seems to always be activated when the main character of the story is introduced.

That makes a lot of sense if you think about what the neural network needs to do. If there were a programmer writing code that tries to auto-complete stories, there would probably be a function that locates the name of the main character, because it's useful in many places when you auto-complete; in fact, whenever you know that the name of some character should appear, it's a pretty good guess that it's going to be the main character. So you have a neuron doing exactly that, and — we haven't checked enough to be sure — there's probably an attention head that then attends to what this neuron outputs whenever the name of some character should appear. When you connect those two together, what you get is a mechanism that is able to copy the main character's name to different places along the generation. This is a very basic mechanism you can actually observe inside the neural network, and it doesn't happen in bigger models, at least not in a way that is so easy to trace.
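Here is a rough sketch of that probing technique in the spirit of Li et al. (2015): register a forward hook on one MLP layer, run text through a small causal language model, and rank tokens by how strongly they activate a chosen neuron. The model, layer index, neuron index, and example stories below are placeholders, and the actual TinyStories analysis may differ in its details.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Placeholders: any small causal LM with an accessible MLP layer would do.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, NEURON = 0, 42          # which MLP layer / neuron to inspect (arbitrary choices)
records = []                   # (activation, token, snippet)

def hook(module, inputs, output):
    # output: [batch, seq, 4 * hidden] -- the MLP's hidden layer for this block.
    acts = output[0, :, NEURON]
    for pos, value in enumerate(acts.tolist()):
        token = tokenizer.decode([int(current_ids[0, pos])])
        snippet = tokenizer.decode(current_ids[0, max(0, pos - 8): pos + 1])
        records.append((value, token, snippet))

handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(hook)

stories = [
    "Once upon a time there was a little girl named Lily. She loved her red ball.",
    "Tom and Jane went to the park. Tom found a shiny stone and showed it to Jane.",
]
with torch.no_grad():
    for story in stories:
        current_ids = tokenizer(story, return_tensors="pt").input_ids
        model(current_ids)

handle.remove()

# Print the snippets whose final token most strongly activates this neuron.
for value, token, snippet in sorted(records, reverse=True)[:10]:
    print(f"{value:6.2f}  {token!r:12}  ...{snippet}")
```

With many stories, scanning the top of that list is exactly the "does this neuron have a coherent concept?" check described above.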
I guess my simple version of it was: maybe there are a bunch of concepts that are easy to identify, where you can see at a glance what a neuron is doing, and maybe there are only so many of them. When you have 30 million parameters you have however many neurons, and maybe that's kind of enough to capture those. Then you go 50x bigger, and if you start fishing at random points in the network, maybe you just miss a lot of them — they may exist but they're hard to spot, because they're sparse, if you will. And the things in between — I'm really out on a limb here — maybe those are a little analogous to the subconscious processing that goes on in our brains, where I know on some level that processing is happening even for concepts I don't have a clear label for; there's some sort of churn happening in the brain, but only a small set of it rises to the level of what I'd call a conscious concept, something I can put a label on as a tidy enough thing. So the two ideas are: maybe these features are just a lot easier to find in the small network, because you have to have them and they get packed densely, whereas in a big network they're packed more loosely; and maybe those other neurons are analogous to the stuff we don't understand very well in our own cognition.

Yeah, it's definitely possible. I think that's one advantage of small language models: they may be more interpretable compared to larger ones, because a small model can only do basic stuff, and only the basic stuff is probably interpretable. Very complicated stuff — for example, asking GPT-4 to write code that's a thousand lines long — is almost impossible to interpret, but how a small language model keeps a main character consistent, those basic questions we can probably understand, and there are probably some neurons associated with that. In GPT-4, out of maybe 10,000 neurons or many more, there may also be some neurons dedicated to keeping the main character consistent, but it's just so hard to find them, because it might be neuron 99,700 in layer 25 or whatever — it's hard to locate. For smaller models, because the model is so small, every neuron must be doing some basic thing, because the complicated behaviors, as we said in the consistency hierarchy or the loss hierarchy, make up only a tiny fraction of the loss. The main fraction of the loss comes from basic consistency, grammar, and so on, and those are probably the things learned by the neurons in the smaller models, and they're more basic and more interpretable.

There are many ways for a neural network to solve a problem: given a problem and an architecture, there are many different configurations of the weights that would solve the same problem. Some configurations might be more interpretable to a human, and some are just one big mess — every neuron doing a little bit of every possible task, combined in very complicated ways — and the network has no incentive in the loss function not to be one big mess. Most solutions to the same problem are one big mess; that's where the entropy is, right? When the model is small, it kind of has no choice: the neurons have no choice but to align with meaningful tasks, because the neurons are where you have the non-linearities and you just don't have enough of them for the one-big-mess type of solution — somehow the most efficient solution is the one that is not completely messy. If you have a large network, it will just find a way to do it that doesn't align with the coordinate structure of the neurons, whereas when the model is small you have no choice, so interpretability appears as kind of a side effect.

So, is there anything else that we didn't cover?
Yeah, one thing: maybe to go back to the initial motivation for creating the data set, which is basically to have a small data set that is a testing ground for ideas in LLMs. An open question here is whether we even have a reason to expect that behaviors we witness in this compact setting will translate to LLMs. We don't know the answer. Suppose we find an architecture that works much better for the TinyStories data set — do we actually have a reason to expect that this architecture will also be better for LLMs? I'm posing this as a question; I think it's one of the most relevant questions that stem from this paper. And it connects to a more general question: there are all those papers, like the Google Chinchilla paper and the OpenAI scaling laws paper, which suggest there might be universal phenomena in LLMs — for example a trade-off between width and depth. They don't suggest it explicitly, but a natural question that arises is whether this is universal in the sense that it does not depend on the exact mix you take in the data set, the exact architecture, and the exact range of sizes. So the question is: are there universal phenomena common to the TinyStories data set and to LLMs trained on large corpora? Let me just say we have a few indications of some sorts of universality, but at this point it's completely open, and we really hope — for the sake of saving energy, and of opening the door for PhD students to actually do LLM research — that there is some universality going on, so that you could gain insights, not necessarily on TinyStories but on any small data set, that would actually be relevant to LLMs.

Our future work is mainly planning to extend the capability of TinyStories. If we can create stories that capture elementary school knowledge, I think that's already a really good data set. If we train a language model, say under one billion parameters, maybe 300 million, that can do everything at an elementary school level, or even just third grade, I think that's already a very good model, and people would love to interact with it — it knows how to talk and has the basic knowledge. The data set would be diverse enough to capture every aspect of real language, just at a down-scaled level. And once we have that data set, I think it really opens the door for everyone to do natural language research: not just the people with a hundred A100s in their hands, but the ones with just a laptop GPU — they can train the model in one or two days and gain some interesting observations.

I think what we witness in LLMs is kind of a mathematical miracle. What do I mean by that? You take this algorithm, which is pretty simple — it's gradient descent plus-plus; I don't want to belittle all the really smart technical contributions inside that algorithm, but all in all it's basically gradient descent with an architecture that's very clever but still quite simple — and the miracle is that you take this huge training corpus and feed it to the algorithm.
And you don't just get a network that has memorized some text; you get a network that can genuinely synthesize new content and show signs of reasoning, understanding, and so on. We think TinyStories is a compact example where you observe the same type of miracle. Of course it's not nearly as exciting as what happens in LLMs, but already at this size you see some very interesting generalization and emergence going on. Even if it doesn't give us a lot of insights about large language models, it's still a nice playground to try to develop the mathematical foundations necessary to understand why neural networks are able to generalize so well.

So maybe just one final question — and I'll encourage people to get in there and try it out. What other interpretability-type work have you seen that has inspired you, that you would recommend folks in the audience go take a look at?

I think one of the works that inspired our research — it's earlier work from our group — is LEGO, the synthetic reasoning task, which tries to understand what the attention mechanism of the model is doing. We identify several types of attention heads in the Transformer. Some of them, as I said, just look at the tokens that appeared exactly before — Alice appeared before, and Alice is associated with Alice — and there are other, more advanced mechanisms, such as doing deduction and other operations. I think this work is pretty inspiring, and it tells us the Transformer is at least doing something reasonable instead of being one pure mess. That's why we also have the interpretability section: we wanted to look at the attention heads, and we do see some very good behavior corresponding to aspects of natural language.

Maybe there's one more work I want to mention: a paper called "Transformer Feed-Forward Layers Are Key-Value Memories." That's another paper I like; it tries to interpret what neurons are doing, in basically BERT-sized Transformers, and they're able to show that at least some of the neurons have meaningful roles. In general, the theory behind the interpretability of neural networks is at its very beginning right now. There are plenty of very clever works, but it just seems very difficult, so in spite of really nice works in the literature I think we're still light years away from being able to actually understand what's going on inside the model. And a priori there's no reason to assume we'll ever be able to really understand. We have a very limited understanding of how the human brain works — it's not like we can point to a neuron and say this neuron has this and that role in that thought process — and there's no reason we'll ever be able to do it in neural networks. There's also no reason to assume that the solution gradient descent finds to the problem isn't a very messy, uninterpretable solution. So we'll probably be able to come up with some basic or small examples which are partially interpretable, and we might have some insights about big networks.
But personally I'm not very optimistic about being able to interpret what's happening inside those models to a satisfactory extent — the kind of extent that might let us control them, steer them, or make sure they have better alignment, and so on.

Yeah, for large-scale models' interpretability we may need to take a different approach. I think it's nearly impossible to look inside the neural network and pin down that this attention head is doing one thing or that neuron is doing another. Maybe we have to take an approach more like our Sparks of AGI paper, where we just talk to the model. It's more like interpreting another human's intentions when we talk to them: through a sequence of conversations, maybe we can understand what the model likes to do and what it doesn't like to do, what it's good at, and what the typical failure cases are. It's more like a psychology study, but for robots; maybe we need to take that approach to interpretability.

Humanity has taken advantage of horseback riding for quite a long time now. We have no idea what every neuron inside the horse's brain is doing; we can't really interpret how we give some physical cue to the horse and the horse obeys. And it's very useful and very reliable — there are very few cases where the horse acts unexpectedly in a way that causes accidents. Humanity has profited from that vastly, leaving animal rights aside here, and it works even without interpretability. We just figured out ways to align the behavior of the horse with our needs by taming it; we can tame it without understanding the exact process going on in there, and that's a big success. I think it's a good analogy: LLMs are kind of horseback riding for the brain. Suddenly we can go much faster, to much longer distances, even if we don't exactly understand what the horse is doing — the Mongols didn't understand much about the biology of the horse, and they could still use the horse in a very reliable way. So even though I'm pessimistic about actually understanding the inner workings of the neural network, I'm very optimistic about its usefulness and about the fact that we'll be able to align it effectively.

Ronen Eldan and Yuanzhi Li, thank you for being part of the Cognitive Revolution. Thank you very much. Yeah, thank you for the invitation — it's really my great pleasure.

Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use COGREV to get a 10% discount.
Info
Channel: Cognitive Revolution "How AI Changes Everything"
Views: 4,908
Id: mv3SIgDP_y4
Length: 120min 19sec (7219 seconds)
Published: Tue Jun 06 2023