Emergency Pod: Mamba, Memory, and the SSM Moment

Captions
My sense is that neither the human brain nor the Transformer are the end of history. The purpose of this episode today is to really sound an alarm and say that I think we now have that new architecture. We're going to see more effective agents, more compelling long-term assistants, more compelling long-term AI friends and companions, all of this. If I had to guess, I would say it probably happens several times as fast as the Transformer era.

Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host Erik Torenberg.

One of the most important big questions that I get asked is: what are the chances that somebody invents something better than the Transformer? Since 2017, with the introduction of the Transformer in "Attention Is All You Need," Transformers have dominated the field. One of the big realizations that I had as a relatively recent entrant into the field a couple years ago, coming from my multimodal perspective at Waymark, was: oh my God, it's a Transformer solving all of these different problems. As I'm watching art creation come online, as I'm watching language models get dramatically better, as I'm watching all sorts of niche tasks advance rapidly, image captioning, image matching, video captioning, all these different things, so many things, everything everywhere all at once, but all the Transformer.

I think I've been pretty clear in recent months, on a bunch of different episodes, that this is hard to predict. I sometimes call it the hundred-trillion-dollar question, which is the size of the global economy today. But my sense is that neither the human brain nor the Transformer are the end of history. If you watch my AI Scouting Report, I go through the history of expectations and predictions about how far AI might make it given raw compute, guesses made decades back, most notably in the 90s with Kurzweil publishing The Singularity Is Near, drawing these exponential curves and saying, hey, right around 2020 to 2025, that's when you're going to have enough compute to match the power of one human, and then later you're matching all of humanity and becoming truly superhuman. Well, human-level AI has shown up roughly on schedule. It is, as we've covered in many different ways, human-level but not human-like; in many ways it is very alien. And yet this power, which now is in most domains ahead of the average human and in many domains really closing in on expert performance, has basically all been driven by the Transformer. The attention mechanism is doing everything for us.

So I've been really interested in this question: how likely is it that somebody will invent something better than the Transformer? I've been watching out for it, and there have been a few plausible candidates this year, which I've mentioned a number of times. Those include most notably RetNet, which is a publication out of Microsoft collaborating with Tsinghua University in China. We also had Lili Yu from Meta on to talk about the MegaByte architecture, which was still fundamentally Transformer, but a hierarchical approach that was different in meaningful ways. And of course there's been a ton of incremental improvements. But there have been a few candidates for things where, hey, this might start to look like something that could be even better than a Transformer.
The purpose of this episode today is to really sound an alarm and say that I think we now have that new architecture, and it is called the selective state space model, aka Mamba, published in just the last couple of weeks. I saw this paper pretty much as soon as it came out, saw some of the claims, and have gone really deep into not just trying to understand it (that was about the first week since it came out), but also really beginning to project into the future: what is this going to mean? I think if my AI scouting paradigm is good for anything, it should be good for identifying new research that really matters, and at least giving a pretty good guess about what that new research, that new capability, is likely to unlock in practical terms. So that's what I'm going to try to take you through here today. It is going to be, hopefully, accessible throughout; I always try to use vocabulary words and then the most plain-spoken terminology that I can, so I'll try to make it as accessible as possible. It will also definitely be technical; I will be getting into the weeds, for sure, deeply.

But I actually want to start off with something a little bit different today, so if you'll indulge me, I think it may prove instructive to begin with a little examination of our own human cognition. What comes to mind when I say the word "rainbow"? There are a lot of different ways to think about the concept of a rainbow, and I'm sure everybody had a slightly different experience in just those couple seconds that I let you think without further prompting. Some may have conjured up an image immediately; obviously we have highly visual thinkers among us, so you may have gone straight to a colorful, vivid image. Others may have gone to smells. I'm not a big smell person, but I do find that the rainbow is always associated for me with after the rain; there's a very distinct and very consistent smell to that, and somehow it is conjured very quickly. You can also get a lot more conceptual and think about it through the lens of, for example, science. What is happening? How is it that light interacting with water in the air is bent differently, such that the colors normally perceived as just normal white light are split and we can perceive them individually? You may think of the history of science as well, and what an achievement that was. But you also might go deeper into history, thinking about how ancient humans understood the rainbow, how in various traditions it represents a promise from God, a sense of protection, a sense of rebirth and renewal, obviously associated with spring as well. You might think about just the colors, if you're an artist or a creative, or even just a kid who remembers that ROYGBIV mnemonic, which was never that great of a mnemonic, but that's what we were taught: ROYGBIV, the colors all in order, how they blend together, how there really is no exact distinction, but it is a spectrum, a sort of continuous space of color, profound in its own way. You might even think of contemporary identity politics: the rainbow as a way to express self-love, or as a way to make a statement about who should be included in what spaces in society and on what terms. And you probably have personal memories of rainbows as well; of course everybody's personal memories will be different. So these are many different lenses on just a very simple concept, and I want to use that experience, and all those different lenses, to unpack some
of the strengths and weaknesses of current AI systems, and also look at how this new AI architecture, the selective state space model, Mamba, and state space models more generally, compare not only to the AI that we currently have but to us. I've done the cognitive tale of the tape in the past, comparing humans to Transformers; a lot of what we're going to do here is talk about the humans-Transformers-and-now-state-space (and who knows what comes next) future that I think we can start to get a reasonably decent read on.

So let's talk about some of the functions that go into the cognition we just experienced around the concept of the rainbow. Capabilities that we have and take for granted include the ability to accept multimodal inputs. The main ones for "rainbow" in particular: we can take in the sight, we can actually see the thing itself, and we could also see the text which represents the word rainbow; we can hear the word, which is how you heard it from me just a moment ago; and even with touch, if you're a braille reader, you could have your fingertip interact with an encoding of the language of "rainbow" and have that rise up to some higher-level understanding. So this concept can be loaded in through all these different modalities.

Now, I think what's also really interesting is that, at a high level, once this concept is loaded into our brains and becomes the focus of our cognition, it doesn't exist in the same way in which it entered. It exists in a higher-dimensional, more associative space. All of the different notions that come to mind, that information, is not encoded in the word "rainbow" itself, right? That is just seven letters in the English language, just seven bytes of information, and yet we can load up such rich understandings. That information is encoded in our brains; it already exists there, to be tapped into by this given stimulus. So we're not just working strictly with tokens; we are working on some higher-order concepts, which are not super obvious even today. What exactly is the nature of those concepts, and how does our cognition work? Obviously that's not a fully solved problem, but it is clear that there is a cycle of accepting a stimulus, working it up through layers of the neural architecture to the most abstract, most high-level concepts, thinking about it, and then often working back down toward, okay, now what do I want to express? That thinking about what happens next seems to happen at this high level, beyond words, but then to take action in the world or communicate with others, we have to translate that down into some lower-level thing. That could be motor commands, but for the purposes of today let's think mostly about generating language. And from the studies I've seen, we also generate just a couple tokens at a time: we have this higher-level understanding of what's going on, but our actual words are really rolling off from our brains into our mouths and into sound with just a couple tokens of buffer. You can understand that yourself by just noticing that you don't know exactly what you're going to say in the future, and where the exact words come from is not super clear or intuitive. There is some not-fully-conscious mechanism by which these high-level thoughts that we experience as conscious states get translated into words that can actually be articulated.

We also, and this is super critical, this is going to be one of the key concepts to understand for the Mamba
architecture, and what makes it new and different from its immediate ancestors: we can treat a given input differently depending on its context. Okay, that's super important, so again: we can treat a given input differently depending on its context. Each of those angles that I gave you before, whether it's the mythological or the scientific or the personal memory, or a story, or a person who loves rainbows, all these different prompts that I can give you (I use that word definitely advisedly), they change the way that you think about "rainbow." The rainbow is not loaded in in the same way every time; it is loaded in with the awareness of this other information, this other part of the input, and so the high-level states that arise as a result are significantly different. Again, you can understand this through basic examination of your own cognition; it is certainly pretty obvious to me when I take the moment to reflect on it.

And then finally, think about memory. I've given short shrift so far to the personal memory, but this is something we can do: somehow reach deep into our own personal history and recall things that we may not have recalled for years, may not have experienced for decades. That gives us a sort of long-term individual coherence and identity, which is in many ways made up of the fact that we can tap into these long-term memories. So that's human cognition.

Okay, contrast that now with the current AI paradigm. On the multimodal front, we now have multimodal AIs that can take in all these different kinds of inputs. The most profound ones, I think, are just the ability to integrate vision and language; if nothing else were happening, that would be a huge deal, but there's a lot more besides. We're going to see AIs that have senses we couldn't dream of having; think about seeing additional colors through additional wavelengths, just as one very early example of what that might look like. There's no reason to think that AIs are not going to be able to take in all sorts of signals that we just don't have either the receptors for or the native ability to parse, and they're going to learn how to take in lots more modalities. This is already well underway with Transformers.

Hey, we'll continue our interview in a moment after a word from our sponsors. Real quick: what's the easiest choice you can make? Taking the window instead of the middle seat? Outsourcing business tasks that you absolutely hate? What about selling with Shopify? Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of all e-commerce in the US, and Shopify is the global force behind Allbirds, Rothy's, and Brooklinen, and millions of other entrepreneurs of every size across 175 countries. Whether you're selling security systems or marketing memory modules, Shopify helps you sell everywhere, from their all-in-one e-commerce platform to their in-person POS system. Wherever and whatever you're selling, Shopify's got you covered. I've used it in the past at the companies I've founded, and when we launch merch here at Turpentine, Shopify will be our go-to. Shopify helps turn browsers into buyers with the internet's best-converting checkout, up to 36% better compared to other leading commerce platforms, and Shopify helps you sell more with less effort thanks to Shopify Magic, your AI-powered all-star. With Shopify Magic, whip up captivating content that converts,
from blog posts to product descriptions, generate instant FAQ answers, and pick the perfect email send time. Plus, Shopify Magic is free for every Shopify seller. Businesses that grow, grow with Shopify. Sign up for a $1-per-month trial period at shopify.com/cognitive; go to shopify.com/cognitive now to grow your business, no matter what stage you're in. shopify.com/cognitive.

So what about that higher-order, more associative form of processing that I described? It turns out that the Transformer is also doing something very similar to the loop I described of taking in an input, working it up to higher-order concepts, and then working it back down to make a specific prediction. The Transformer is doing something very, very similar: it starts from the embeddings and moves through the layers, and remember, especially in the big Transformers, there are many layers, dozens of layers, that successively process the data step by step until it finally reaches the end. There is a pretty systematic body of study now on how the different layers work, and it's quite clear, I would say, from a bunch of different research results at this point, that it is in the middle layers that the highest-order concepts are the focus of the model's cognition.

Just three results that I'll mention to ground that, and you can go check into them more; I've covered a couple of these in the recent AI research podcast episodes. One is the influence functions research from Anthropic. They looked at training data and asked: how can we tell what training data is most important to a particular output from the model? What they show is that in small models, with relatively few layers, you see apparent stochastic parroting, where the training examples that are most relevant are those that share the same keywords. So you can see that the understanding is relatively shallow, and you can infer that, even if the generated text passes as correct in some sense, at least by this analysis the small models don't seem to have a very sophisticated high-level understanding. But the big models do. When you get up to real scale, and we're talking here tens of billions of parameters (I think Anthropic was working with something in the 50-60 billion parameter range by that time), you now see really sophisticated relationships between the inputs that are determined to matter most and the actual work and outputs at hand. So that's really interesting, and they specifically locate that in the middle layers.

Another line of work is editing, changing the fact patterns that the language models have learned. There are a couple different papers on this; one is called ROME, and that line of work really started to scale up concept editing and develop metrics for the robustness and reliability of the edits. You can now do tens of thousands of concept edits, where you say, for example, "Michael Jordan played baseball." You want to edit the worldview, the history as the model understands it, so that Michael Jordan played baseball, but you want that to be consistent, so that it answers that way regardless of what kinds of questions it is asked; you want it to be robust to simple rephrasing; and you want other things to stay right. You don't want to replace the concept of basketball wholesale; you still want Larry Bird and LeBron James to have played basketball. They can do all of those things by editing concepts within a Transformer, and again, that editing is mostly happening at the middle layers.
And then most recently there's the representation engineering paper, which was from Dan Hendrycks and collaborators, where they look at the activations, again in the middle layers, and ask questions like: what kinds of concepts can we identify in this information? Indeed, they find that they can both classify and even start to control model behavior with a growing library of high-level concepts, which are represented as vector directions in activation space, in these middle layers of the Transformer.

So we have pretty good evidence, I think, to say that when it comes to this aspect of cognition, this loop, you take in a compressed input, language, and you gradually work your way up through the layers (in the human, it's through the visual system or the auditory system or even the touch system) until you get up to these higher-level concepts where you have all the associations, all the valence, everything is there. Then, after chewing on that for a little bit in a way that we certainly don't have perfect self-awareness of, we translate it back down into concrete next-token predictions, into actual language. Transformers are doing something very similar.

It is weird, though, that the Transformer does that with a homogeneous architecture. You have the embedding at the beginning of the Transformer, where you convert your inputs into numbers, and then you basically have the same exact block over and over again. Of course there are a lot of little variations, but the core concept is really the same every time, through every layer: there is a multi-headed attention block, where the tokens are all computed relationally to each other and the model can figure out what to pay attention to based on the overall inputs; then you have the multi-layer perceptron, which is dense information processing, but notably works on a token-by-token level; then you have some nonlinearity, some way of filtering out noisy information; and then you have the skip connections. You just layer that block over and over and over again until you hit the scale of the model you're trying to build, and at the end you have some de-embedding, a reduction to a specific prediction. The middle is really the same thing over and over again. So I do think it is remarkably weird that we have created something that doesn't have much specialization of internal architecture; it's just that the layers themselves end up taking on these different aspects of the cognitive process. There's division of labor between the layers even though there's no difference in form. I think that's a pretty striking observation.
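To make that "same block over and over" structure concrete, here is a minimal NumPy sketch of a homogeneous stack of pre-norm blocks: attention for token-to-token mixing, an MLP for per-token processing, a nonlinearity, and skip connections. It is illustrative only, not any real model's implementation: single-head attention for brevity, random untrained weights, and made-up dimensions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features (illustrative; no learned scale/shift).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Token-to-token mixing: every position is combined with every other position.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

def mlp(x, W1, W2):
    # Dense per-token processing (same weights applied to each token independently),
    # with a ReLU nonlinearity filtering out negative signal.
    return np.maximum(0, x @ W1) @ W2

def transformer_block(x, params):
    # Attention + MLP, each wrapped with a skip (residual) connection.
    x = x + attention(layer_norm(x), *params["attn"])
    x = x + mlp(layer_norm(x), *params["mlp"])
    return x

# The "same block over and over": stack identical layers until the model is deep enough.
rng = np.random.default_rng(0)
d, seq_len, n_layers = 64, 16, 8
x = rng.normal(size=(seq_len, d))                      # stand-in for embedded tokens
make = lambda *shape: rng.normal(size=shape) / np.sqrt(shape[0])
layers = [{"attn": (make(d, d), make(d, d), make(d, d)),
           "mlp": (make(d, 4 * d), make(4 * d, d))} for _ in range(n_layers)]
for p in layers:
    x = transformer_block(x, p)
print(x.shape)  # (16, 64): same shape in, same shape out, layer after layer
```

The point is just the homogeneity: the exact same block applied eight times, with nothing in the architecture itself telling layer two to behave differently from layer seven.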
Turning now to that third point: what about the ability to process the same inputs differently? This too is something that Transformers can do, via the attention mechanism, and it is broadly understood to be a very important part of why they work so well. So how does this work? In the abstract, you can imagine two classes of machine learning architecture, two classes of model. The weights, again, are the numbers that are learned in the training process, and they are the numbers used to transform the inputs through whatever layers or process until outputs are reached. In one traditional, typical, normal machine learning architecture, the weights are fixed and are applied in the same way regardless of what the inputs are. You feed in some inputs, you convert them to numbers, and then they're just going to be crunched through layer one in a certain way, crunched through layer two in a certain way, and crunched through all the layers in a certain way; the way those numbers are crunched is the same every time, regardless of what the input is. Most machine learning architectures historically have worked that way. If you go watch the 3Blue1Brown 2017 introduction to neural networks, which is a classic with some great visualizations (I still recommend it all the time because of just how fundamental and elegant his explanation is), he does not talk about attention at all, and the types of architectures he's describing are basically this way: you feed some inputs in, the weights are all there, the weights just do their thing, each layer crunches the same way, and that's how it works.

What's different about the attention block is that inputs have two routes by which they affect the overall process. The inputs are the inputs that get fed in, layer to layer, through the model, but they also have this other path: the ability to shape what the attention block is going to do. You may be aware that the attention matrix is different every time depending on the inputs; that's why we have these Q, K, and V portions, where first their nature is determined (which again depends on the inputs) and then they are crunched against each other. So again, there are two paths by which the inputs affect the number crunching that ultimately happens: they are the inputs, and they also affect part of the way the inputs are crunched. That is why the attention matrix is different every time and can't just be totally pre-computed, and this is understood to be a huge reason that Transformers are so powerful.

Thinking back again to our experience, and the way that we considered the concept of rainbow differently depending on those compressed supplemental inputs, the different lenses of historical and mythological and scientific and so on: ChatGPT can do the same thing; the Transformer, the language models, can do the same thing. If you say "explain a rainbow to me through the lens of mythology," you will get a totally different answer than if you say through the lens of optics, or the lens of childhood, or whatever. This is something that simpler models can't do. If you have an image classifier, it may have a rainbow class, but it's applying the same standard computation to every single image and hopefully it lands on "rainbow"; it's not loading in all these other associations. And this is what that secondary path of influence, from the inputs to the computation, ultimately allows for. It's a huge reason, and I think this is pretty well established in the literature from a bunch of different angles, that the attention-based Transformer architecture is so powerful: it's that forking path, those two ways in which the input determines how the information processing happens. And it's a big deal, because this is what the Mamba architecture unlocks for a totally different class of model. We're going to get into that in a lot more detail.
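Here is a tiny NumPy sketch of that second path, assuming a single head and random projection weights just for illustration: the mixing matrix is recomputed from the inputs themselves, so changing one token of the context changes how every token gets combined, whereas a fixed mixing matrix never moves.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 5
Wq, Wk = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(2))

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention_matrix(x):
    # The mixing weights are themselves computed from the inputs:
    # this is the "second path" by which inputs shape the computation.
    q, k = x @ Wq, x @ Wk
    return softmax(q @ k.T / np.sqrt(d))

x1 = rng.normal(size=(L, d))      # one prompt
x2 = x1.copy()
x2[0] += 3.0                      # same prompt, different "lens" in position 0

A1, A2 = attention_matrix(x1), attention_matrix(x2)
# A fixed-weight layer would crunch both prompts with the same matrix;
# the attention matrix shifts whenever the context shifts.
print(np.allclose(A1, A2))        # False: the computation itself changed
print(np.abs(A1 - A2).max())      # how far the mixing pattern moved
```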
But before we do, let's talk about the big weakness of the Transformer. That information passing, which unlocks so much power, is also slow. Transformers are quadratic in nature because every additional token has to have a relationship computed with all the tokens that came before it, and that's a fundamentally quadratic computational process. So obviously lots of optimizations and also approximations have been used to try to work around that, and certainly we have scaled the context window up a lot, but basically the state of play is that full, dense, non-approximated attention is still quadratic, and the approximations largely seem a little worse. Often they can work quite well, but they don't seem to work as well as the full, dense, no-compromises attention. So the weakness of attention is that it's slow, and also that, because it is quadratic in the context length, the context windows are fundamentally very limited.

Now, we've seen that they have grown a lot. At 100,000 tokens you can fit The Great Gatsby; Claude 2.1 is now on the market with up to 200,000 tokens of context, and that is definitely not nothing, and a lot more than it used to be. Just a couple years ago we were talking about 2,000-token windows, and not long before that, BERT, I think, is 512. So there has definitely been rapid growth in the computing power that underlies all this, and with it, the ability to extend these context windows longer and longer. But that's still not that long. 100,000 tokens is just one book; if you read The Great Gatsby, that is basically the entirety of what a Transformer can work with at any given time. It also translates into something like a three-hour podcast: if I take the transcript of this entire episode, it's going to be most of 100,000 tokens. That's not that much, right? I don't know what my effective token intake rate is per day, but it's a lot higher than that: I hear a lot of audio, I have a lot of conversations, I read a lot of things, and I obviously have a ton of imagery coming into my cognition. So 100,000 tokens is not much in the grand scheme of things, and for Transformers there is nothing but that context and the weights themselves.

Hey, we'll continue our interview in a moment after a word from our sponsors. If you're a startup founder or executive running a growing business, you know that as you scale, your systems break down and the cracks start to show. If this resonates with you, there are three numbers you need to know: 36,000, 25, and 1. 36,000: that's the number of businesses which have upgraded to NetSuite by Oracle. NetSuite is the number one cloud financial system, streamlining accounting, financial management, inventory, HR, and more. 25: NetSuite turns 25 this year; that's 25 years of helping businesses do more with less, close their books in days, not weeks, and drive down costs. 1: because your business is one of a kind, so you get a customized solution for all your KPIs in one efficient system with one source of truth. Manage risk, get reliable forecasts, and improve margins; everything you need, all in one place. Right now, download NetSuite's popular KPI checklist, designed to give you consistently excellent performance, absolutely free, at netsuite.com/cognitive. That's netsuite.com/cognitive to get your own KPI checklist: netsuite.com/cognitive. Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use code CogRev to get a 10% discount.

So, you have the weights. They have a ton of information encoded in them, but then they just take their input. They're always waking up, if you will, with total amnesia. To the degree they have a world model, they're waking up with their
world model ready to go, but then they're confronting some input that is totally novel. They have no memory of confronting that input before; they just have to crunch what they're given to crunch. And they can do it in this super richly expressive way because of the forking path, the multiple ways in which the input can ultimately influence the information processing. But at the end of each episode, when the context window runs out, there really isn't anywhere for that to go. That information typically is just discarded, and then we come back and use the LLM-powered assistant again another time with a totally new context.

So we've got, right now, a lot of hacks to try to get around that. This is why we have a system prompt from OpenAI, or custom instructions as they call it in ChatGPT: because you want more consistent behavior, you want it to sort of know you. But it can't know you. It doesn't have any mechanism to know you; it doesn't have any way to take the context it crunched in one interaction, one episode, and turn that into any longer-term, durable memory that it can later use. That mechanism just doesn't exist, so it's fundamentally an episodic technology.

And then we have, of course, retrieval-augmented generation: a database; we can save our logs and start to query back into our logs. But from the standpoint of the model loading those logs in, that's also still just one episode. It doesn't have a real memory of that earlier episode; it can only load in a part of it, and however much it loads in, that's context. Then you can get into more elaborate constructions where we save all the logs, come through and process them periodically, and try to summarize them or synthesize something more useful out of them. That was one of the techniques from the AI Town paper that I thought was most interesting: it wasn't so much the fact that they had all these little characters running around and talking, but that they were periodically sweeping through all their recorded memories and synthesizing higher-order memories, more compressed representations of what those little bots had experienced, so that later those could be loaded into context more usefully. But again, it's still kind of a hack: a language model, which is purely episodic, processes some memories and creates new language, and then that language gets loaded into a model. It's lossy, right? Language, because it's compression, is also lossy, and it's not necessarily lossy in super-principled ways.

And this is why the Transformer AIs that we have today don't make great companions, don't make great long-term assistants, and are not super effective as agents. 100,000 tokens is not that many tokens if you want to go out and browse the web and do comparison shopping and navigate tricky situations. There's no way for it to learn your preferences, to have an instinct and an intuition for your preferences. What it can have is explicitly declared preferences, at the system prompt or accessible through some database that it can query and load in, but it's never going to really know you. It's never going to have an inbuilt intuition for you, or even for itself, its own history. There is no mechanism; there's nothing but the context window currently under consideration and the weights themselves. You might say that the memory is missing.
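To make the "save the logs and query back into them" hack concrete, here is a minimal sketch of the retrieval-augmented-generation pattern just described, with made-up log entries and a toy word-overlap scorer standing in for a real embedding model and vector database.

```python
# "Save our logs": past interactions sit outside the model, in plain storage.
logs = [
    "2024-01-02: user prefers window seats and morning flights",
    "2024-01-05: user asked for a summary of the Mamba paper",
    "2024-01-09: user set a budget of $400 for domestic flights",
]

def retrieve(query, memory, k=2):
    # Score each saved log by word overlap with the query. A real system would
    # use an embedding model and a vector database, but the shape of the hack
    # is the same: search the logs, pull a few snippets back out.
    q = set(query.lower().split())
    scores = [len(q & set(entry.lower().split())) for entry in memory]
    ranked = sorted(range(len(memory)), key=lambda i: -scores[i])
    return [memory[i] for i in ranked[:k]]

def build_prompt(query, memory):
    # Each episode still starts from amnesia; retrieval just stuffs a few old
    # snippets into the fresh context window before the model ever runs.
    snippets = "\n".join(retrieve(query, memory))
    return f"Relevant notes:\n{snippets}\n\nUser request: {query}"

print(build_prompt("find me morning flights under my budget", logs))
```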
I think there are paths, by the way, to fixing this within the Transformer architecture, and in an episode a few back I sketched out the future of the Transformer, so you can go listen to that one if you want the full deep dive; it's outside the scope for today. But briefly: we've started to see additional tokens built into the Transformer vocabulary, and then into the training process, with some pretty interesting results. There has been the backspace token, where, with a modification to the loss function, the model can recognize that it's getting out of distribution, emit a backspace token, and then carry on from where it was before, still retaining the information that it took that step and then encountered the backspace. So it can really help with avoiding going off the rails. That's not yet super commonplace, from what I understand, but it's a very interesting result. Another super interesting result is the paper "Think Before You Speak"; this was the pause token, the thinking token. The idea was: if we just give the model some extra tokens, without it necessarily having to output anything, just allowing it to crunch a thinking token, and we train it on when it's appropriate to do that, then maybe that can bring about better performance. And sure enough, it does.

So from that I imagine, geez, maybe a memory token could also make sense. How exactly that memory token gets created is not super clear, but you can start to imagine it, especially because we've seen this in the multimodal context so often. The ability to bridge from one space to another means that you can take an image and convert it into language space. Even with a frozen vision model and a frozen language model, you can train a small adapter such that the language model natively knows how to interpret the output of that adapter, because you've literally transformed the image into language space, and to the model it is understood as language, even if no vocabulary could get there. You have that ability to shape information into language space so that the language model can work with it. I think there's something very similar to be done with memories: compress some of the history into maybe just one token. How much memory you can fit into one token, who knows, but clearly you can fit in a lot more than just one word; we know that from the image work. So exactly what the nature of this memory compression is likely to be is unclear, but I can see the beginning of a solution on the Transformer mainline of AI development right now, just based on these other kinds of behavioral tokens. I can see a memory token really starting to work; I haven't seen it yet in the research, but I do think it is conceptually possible.

But in the meantime, somebody has invented a better way, and this brings us to the real subject of today's episode. The paper that announces all this is called "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." It has just two authors, Albert Gu and Tri Dao; you may recognize those names. And it is a real tour de force. What these two individuals have done here, I think, has a pretty good chance of rising to the level of that canonical "Attention Is All You Need" work, which, remember, did not invent the attention mechanism, but abstracted away a lot of previously complicating detail and demonstrated that the attention mechanism itself, in just this very basic form, was enough to do basically all the tasks that people were trying to do. It turns out it was not all you need for all tasks you might
want to do, but it was all you need by the measurements and the ways of examining or characterizing a language model that they had available at the time. This is a big deal.

Let me first talk a little about who the two authors are. They are both professors. Albert Gu is an assistant professor at Carnegie Mellon; he's also the chief scientist at a company called Cartesia AI, and his Twitter bio simply says he is leading the SSM (state space model) revolution. The other author is Tri Dao, and he is an incoming professor at Princeton. He is also a founding chief scientist at an AI startup called Together AI, which has raised a $102.5 million Series A, and his expertise is really in performance: super close to the metal, highly optimized algorithms for making the most of your GPUs. If you know his name, you probably know it from FlashAttention. FlashAttention was a major advance in how attention could be computed without any sort of approximation, really doing the full computation, but doing it in a way that worked much better just based on how the data is shuffled around and what moves to which parts of the physical GPU hardware. Finding a better way to do that made it dramatically more efficient and has been a huge unlock and a major, major compute savings. So these are the two guys: the leader of the state space model revolution and a really accomplished hardware expert, Albert Gu and Tri Dao.

So what's the big difference here? It all starts with the concept of the state. The thing about state space models that is so different from what we have now come to understand as the norm in the Transformer is that they have an internal state, a state that evolves through time and also propagates through time. One way to think about it is that it is something that outlasts the single interaction between the weights and an input. A Transformer has all these weights, it takes an input, and they interact; yes, there's that fork, but there's not really anything else. What the state space model adds, at a conceptual level, is the idea that, okay, yes, you have your weights and you have your input, but you also have an internal state, and the nature of the calculation now becomes: I'm going to process both the last state and the new input with the weights to get not just an output (yes, an output), but also a new, updated state. And so this state becomes a long-lived thing that can go on and on through time. With all this motivation, it should appeal to you: this state can go through step after step, and it propagates through time; it's not the output, but it is a modified state that is used to calculate the output at each step. Immediately you think: hmm, it sounds like there might be something there that could help solve the memory problem.
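As a rough sketch of that idea, here is the basic (non-selective) state space recurrence in NumPy, with random placeholder matrices and the discretization details omitted: a fixed-size state is combined with each new input to produce an output and the next state.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 16, 4                 # the state has a FIXED size, chosen up front

# In a classic (non-selective) SSM these matrices are fixed after training.
A = rng.normal(size=(d_state, d_state)) * 0.05   # state transition (kept small for stability)
B = rng.normal(size=(d_state, d_in))             # how new input enters the state
C = rng.normal(size=(d_in, d_state))             # how the state is read out

def ssm_step(h, x):
    # One inference step: old state + new input -> new state + output.
    h_next = A @ h + B @ x            # the state evolves but never grows
    y = C @ h_next                    # the output is read off the updated state
    return h_next, y

h = np.zeros(d_state)                 # the long-lived state, carried through time
for t in range(1000):                 # step 1000 costs the same as step 1
    x_t = rng.normal(size=d_in)       # stand-in for the next embedded token
    h, y_t = ssm_step(h, x_t)

print(h.shape, y_t.shape)             # (16,) (4,): still the same size after 1000 steps
```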
And indeed, these state space models do have some real strength. People have been working on this for the last couple of years; there have been a whole bunch of papers and results, and they have real strength when it comes to extended memory. There are specific benchmarks that try to test this, where you have long strings, which historically have often been programmatically generated, with tasks like: here's some sort of expression which may have tons of parentheses and brackets and curly brackets in it (code is like this, and in arithmetic you can write out arbitrarily long expressions with whatever sort of open-bracket, open-parenthesis notation), and then, as in coding, you have to close those things. So the challenge for the AI becomes: I give you this giant long sequence with all these orders of operations, open parentheses, open square brackets; can you close it, and close it in an effective way? State space models have dominated this category. This is one of the few things where the Transformer is not supreme, and the big reason is that these sequences can get arbitrarily long, and it's really hard for the Transformer to keep up with arbitrarily long sequences; some of them it literally cannot do at all. So the state space models have already shown, even prior to this state space moment, that there is a role for them, that they have some really desirable memory properties that the Transformer does not have.

Now, how this works is actually pretty complicated, but also principled. I followed references back to an earlier Albert Gu and Tri Dao collaboration, in that case with other authors as well: the original HiPPO paper, as it's called, describes how long sequences can be compressed into really small state representations using a process of projection onto a basis of polynomial functions. That's complicated math, but the way they evaluate memory back in that original paper is by its ability to memorize an arbitrary noisy input sequence, and there are different variants, where you weight more recent data more heavily or weight all data the same. What they show is that you can create a slightly lossy, yes, but still extremely useful representation of a sequence, extremely efficiently, by representing it this way.

And by the way, these models have some pretty nice scaling properties too, so there's motivation to work on this, because the structure of the computation is in many ways a lot more favorable than the Transformer's. Each inference step takes the same amount of time: constant-time inference. Why is it constant time? Because you have a state, and the state is of a fixed size. That's important: the state is of a fixed size; it changes through time, but it doesn't grow through time. So the state (because it's of a fixed size) and a new token of input can be considered the inputs; they are processed by the weights, an output token is generated, and the next state is also generated, and that takes the same amount of time regardless of how long the sequence has historically been, because the state itself doesn't grow. That's in contrast to Transformers, where, as the context grows, you have the quadratic nature of the computation. There are caching strategies that help; you're able to cache the parts you already did so you don't have to redo them with every single token, but fundamentally you still have that quadratic calculation. The state space model, though, gets constant-time inference because the state itself does not grow; it evolves, it changes from step to step, but it does not grow. So constant-time inference, that's great, and that also means linear scaling with the sequence length for the purposes of training. This is awesome, right? This is much better. If each step takes the same amount of time, you're in a fundamentally different regime than if each step starts to take a little bit longer than the last one. Huge difference.
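Here is some illustrative back-of-envelope arithmetic on why that matters, with made-up dimensions and ignoring constants and the MLP blocks: per-token work for attention grows with the position in the sequence, while the state space step does not.

```python
# Rough per-step work (multiply-adds), purely illustrative:
#   attention with a KV cache: the t-th token attends to all t previous tokens
#   state space model:         the t-th token touches only the fixed-size state
d_model, d_state = 1024, 16

def attention_step_cost(t):
    # query-key scores and weighted values against a cache of t entries
    return 2 * t * d_model

def ssm_step_cost(t):
    # A @ h, B @ x, C @ h: independent of how long the sequence already is
    return d_state * d_state + 2 * d_state * d_model

for t in (1_000, 100_000, 10_000_000):
    print(f"t={t:>10,}  attention~{attention_step_cost(t):,}  ssm~{ssm_step_cost(t):,}")
```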
Okay, so what's the problem? Well, historically, the problem is that they just don't work as well. You've got these few areas, these kind of oddball parenthesis-closing tasks, that the state space models do really well on, but for most language modeling tasks, that is to say all the stuff that ChatGPT can do for us and that we find so much value in the Transformers for, the state space models just haven't been as good.

Interestingly, I think the authors of this paper give you a pretty good sense of their motivations for the work in conceptual terms. It's not as accessible as what I'm presenting here, but this is such a tour de force because it begins with high-level conceptual insight and then goes so deep, all the way down to super-low-level code (which is, by the way, open source) to actually make the implementation work. Starting at the very high level, they observe that all the state space models we've been using so far have been of the simpler class, where the actual layer-by-layer transformations are fixed and the information just flows through. They don't have the property that the Transformer has, and that the brain obviously has, where the nature of the computation, that is, by what numbers the inputs will be transformed, itself depends on the inputs; the state space models have not had the ability to do that dynamically. That really limits what is sometimes called expressivity: just how rich, dynamic, and powerful they can be. Their motivating observation is that the state space models have probably been limited by exactly that: they can't match the Transformer because they don't have the same multi-path means for the inputs to affect the information processing, which is what unlocks the power in the Transformer.

So you might wonder: why is that? Why were they designed that way? Historically, as I understand it, the reason has been that that was the only way to make the computation feasible with reasonable resources and time. The math can get pretty abstract in terms of notation, but basically the state space models that came before were designed so that they can be represented and computed in parallel, and also so that they can be represented and computed in a recurrent fashion. The parallel form is called the convolution, and the name is less important for non-technical folks than the idea that there is a way, with the traditional state space models, to do the calculation that is highly parallelizable, which is obviously super desirable. The Transformer is also highly parallelizable, which is a big part of why it's been able to scale with GPU infrastructure. So the state space model people want to do the same thing: can we make this highly parallelizable so that we can hopefully scale it? And yes, if you have the convolutional form, you can do the calculation in a highly parallelized way. That form, by the way, is used mostly for training, because you have these large, established batches of text and you want to process them in batch; you don't want to do literally one token at a time, which is pretty annoying. So you have the convolutional form, which allows for a high level of parallelization, but then, when you switch to inference, it becomes a recurrent process. Inference, next-token prediction, is just predicting one token at a time; this is called autoregression, where you take your token, feed it back in as input, and run the same process over and over again. The loop feeds the output right back into itself as input, but it is still only doing one token at a time, and you really can't parallelize beyond that.
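Here is a small NumPy sketch of that dual nature for a toy one-dimensional input, assuming fixed (non-selective) parameters: the same outputs come out of either the step-by-step recurrence or a single convolution whose kernel is (CB, CAB, CA²B, ...).

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 8, 32                               # state size, sequence length
A = rng.normal(size=(n, n)) * 0.1          # fixed (non-selective) parameters,
B = rng.normal(size=(n, 1))                # scaled down so the recurrence stays stable
C = rng.normal(size=(1, n))
x = rng.normal(size=L)                     # a 1-D input sequence, for brevity

# Recurrent form: one token at a time (what you do at generation time).
h = np.zeros((n, 1))
y_rec = []
for t in range(L):
    h = A @ h + B * x[t]
    y_rec.append((C @ h).item())
y_rec = np.array(y_rec)

# Convolutional form: precompute the kernel (CB, CAB, CA^2B, ...) once,
# then the whole sequence becomes one big, parallel-friendly convolution.
kernel = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_conv = np.array([np.dot(kernel[:t + 1][::-1], x[:t + 1]) for t in range(L)])

print(np.allclose(y_rec, y_conv))          # True: two routes, same answer
# The kernel only exists because A, B, C never change. Make them depend on the
# input, as Mamba does, and this shortcut disappears.
```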
So you'll see that for training, and also when there's a long prompt, the initial mode can be convolutional in structure and parallel in computation, so it can happen quickly; then, when you're actually generating new content, you have to go one token at a time anyway, so that's the recurrent form. All of these state space models had this dual nature: they wanted the parallelizability, but of course at runtime you're going to generate one token at a time.

So here's the key dual breakthrough of the Mamba paper. They asked: is there any way we can broaden what a state space model can be, so that the calculation that is done is influenced by the inputs in a way that is similar to what the Transformer does? We don't want the inputs to just flow down an assembly line, so to speak, where they get transformed by exactly the same numeric machinery each time; we want the nature of the transformations, as we go down the line, to also depend on what the inputs happen to be. How can we make that possible? Keep in mind, these state space models are a little bit different: you have the state as it currently exists, you have a new input, and those get processed by some parameters such that an output gets generated and also a new state. You can look at the diagram in the paper, but ultimately, if you are just generating content with this, then you're again feeding each output back in as input, but you're also feeding the new state in, and the old state you let go of; it just goes away, the state has evolved, the state has changed. So what they decided to do is say: okay, what if we let those parameters, the transformations we're going to apply to the state and the input, depend on the input itself? On some level that's a conceptually simple thing to do; it's a relatively small change in the equations that define the architecture. But the key problem is that now you truly can no longer have a convolutional form, and so the computation can't be parallelized in the way that it was; you're now really constrained to the recurrent form. And you might think, well, geez, that's just never going to work; we have to be able to parallelize in order to make this scalable.
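Here is a bare-bones NumPy illustration of that selectivity idea. To be clear, this is not the actual Mamba layer: the paper's layer uses a particular discretization, gating, many channels, and a fused CUDA scan. This sketch only shows the conceptual change, with the step size, the state update, and the readout all computed from the current token, via a simplified Euler-style update.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 4, 16

# These projections are fixed after training, but what they PRODUCE at each
# step depends on the current token - that is the "selective" part.
W_dt = rng.normal(size=d_in) * 0.5
W_B  = rng.normal(size=(d_state, d_in)) * 0.5
W_C  = rng.normal(size=(d_state, d_in)) * 0.5
A_diag = -np.abs(rng.normal(size=d_state))     # fixed, stable diagonal transition

def softplus(z):
    return np.log1p(np.exp(z))

def selective_step(h, x):
    # The step size, the write into the state, and the readout direction are
    # all functions of x itself, so the layer can decide, token by token, what
    # to remember, what to write, and what to surface.
    dt = softplus(W_dt @ x)                    # input-dependent step size (> 0)
    decay = np.exp(dt * A_diag)                # near 1 -> remember, near 0 -> forget
    h_next = decay * h + dt * (W_B @ x)        # what gets written depends on x
    y = (W_C @ x) @ h_next                     # even the readout depends on x
    return h_next, y

h = np.zeros(d_state)
for t in range(6):                             # stand-in embeddings for six tokens
    x = rng.normal(size=d_in)
    h, y = selective_step(h, x)

# The price of selectivity: decay and W_B @ x change at every position, so no
# single convolution kernel covers the whole sequence - you are stuck with a
# recurrent scan, which is exactly what the hardware-aware algorithm speeds up.
print(h.shape, float(y))
```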
And this is where Tri Dao, one of the leading experts in the development of these super-efficient low-level algorithms, comes in. What they do here is what they call hardware-aware algorithm design. Parts of it will be beyond our scope, but I want to at least take a little dive into it to give you some intuition, and there's plenty of room for greater depth of study beyond this as well. The hardware that they use is the Nvidia A100, and that was a state-of-the-art machine until recently; it has been superseded by the H100, but the A100 is no slouch. It's just one generation behind the latest H100, and it's been a standard; it's what GPT-4 was trained on, for instance. So it's a big machine. It costs thousands of dollars and runs at 400 watts, which is not nothing; that's still only half an electric teapot boiling your water, but it's like 20 light bulbs in today's world. So it's not nothing, but it's also not a huge machine: thousands of dollars, 20 light bulbs' worth of energy. And what they really try to do is figure out how to organize this computation so that it runs fast on this machine, knowing all the specs, and knowing that they have a world-class expert in Tri Dao when it comes to writing the CUDA code that is going to manage how this information moves around in a strategic and ultimately effective way.

What they do, first of all, is look at the nature of the chip. They note that there are two kinds of memory available on the GPU; this might even be a little bit of a simplification, there could be levels in between, from diagrams I've looked at, but the two they really call out are the SRAM, the on-chip shared memory that sits on each streaming multiprocessor (SM), and the HBM, the high-bandwidth memory. These are two different kinds of memory that serve very different purposes. On the A100 there are 6,912 CUDA cores; those are the actual processing units that do the math, the multiplication and addition that ultimately constitutes the matrix multiplication that ultimately constitutes all of this. There are 6,912 of those on a single A100 chip. The shared memory, the SRAM, is the memory that is absolutely closest to those cores, where the actual number crunching takes place at the very lowest level; it sits right next to them, highly integrated with them, and amazingly there is not that much of it: the amount of SRAM on an A100 is on the order of 164 kilobytes per streaming multiprocessor. 164 kilobytes, with one letter being a byte, is actually less than the context window on the leading LLMs. But that is the memory, limited as it is, that the cores communicate with most directly, and it is the absolute fastest-access memory they have. The next level of memory out, the high-bandwidth memory, is the much bigger memory pool on the A100: 40 gigabytes. So that's obviously qualitatively different, roughly five orders of magnitude greater, a totally different kind of thing. This is where the parameters get stored. When you have these giant models, however many billion parameters you're talking about, you multiply that by a small factor; it depends on the quantization, how many bytes each number takes, but if you think of a number as a couple of bytes, then however many billion parameters you have tells you roughly how many gigabytes of space it's going to take up. So a 40-gigabyte A100 actually wouldn't even be enough to store the entirety of GPT-3's weights: 175 billion parameters is going to be bigger than 40 gigabytes of memory, so you'd have to get into arrays of GPUs. And again, there's a lot of intelligence in how those parameters get loaded in and how they move; we don't want to be waiting on them; there are bottlenecks everywhere in this process.

So you've got the cores, where the actual number crunching happens; you've got the SRAM, the small memory most closely integrated with them; and then you've got the high-bandwidth memory outside, which is where the parameters sit. Going back to FlashAttention, the advance there was figuring out how to better manage the movement of data from the high-bandwidth memory into the SRAM and back, and exactly what order that movement and that calculation should be done in, to really get the most from the hardware. The naive implementation, it turns out, is way less good than FlashAttention, and you really have to understand the design of the chips, you really have to understand CUDA at a deep level, to be able to write this kind of optimized code. I don't know how many people in the world can do it, but he is one, and certainly one of the best.
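To put rough numbers on that memory hierarchy, here is some purely illustrative arithmetic using the figures mentioned above and assuming 16-bit weights:

```python
# Back-of-envelope numbers from the discussion above (illustrative only).
sram_per_sm_bytes = 164 * 1024          # shared memory sitting next to the compute units
hbm_bytes = 40 * 1024**3                # the A100's 40 GB of high-bandwidth memory

def model_bytes(n_params, bytes_per_param=2):
    # ~2 bytes per parameter at 16-bit precision; more at full precision.
    return n_params * bytes_per_param

print(f"on-chip shared memory per SM: {sram_per_sm_bytes / 1024:.0f} KB")
for name, n in [("GPT-3 (175B)", 175e9), ("Mamba-scale (3B)", 3e9)]:
    gb = model_bytes(n) / 1024**3
    print(f"{name}: ~{gb:,.0f} GB of weights vs {hbm_bytes / 1024**3:.0f} GB of HBM")

# GPT-3's weights alone (~326 GB at 16-bit) overflow a single 40 GB card, which
# is why big models get sharded across GPUs, and why every trip between HBM and
# the tiny on-chip SRAM is so expensive to waste.
```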
Okay, so what is the hardware-aware algorithm as it exists today? Well, the big thing seems to be that the model state never comes off of the SRAM; it is only dealt with in that highest-speed, closest-to-the-cores memory. The parameters of the model sit in the high-bandwidth memory. The inputs can then interact with those parameters so that the computation is different each time, so that it's not just the super-consistent assembly-line type of thing, but one where the individual stations, the layers as you go down the line, actually do different things depending on the input. That process, because there are a lot of parameters, inevitably involves accessing those parameters in the high-bandwidth memory. But the state itself, this thing that propagates from one inference step to the next, holding its size constant while evolving its value, its content, what information it contains, that stays in the SRAM the whole time. Unless you intervene and say, okay, we want to extract this information, it is never exported off of the SRAM from state to state at all. And that seems to be the big unlock that allows the Mamba architecture specifically to run fast.

So this hardware-aware algorithm, this hardware-aware management of the state, is super key, and it's super technical. Without it, with just a naive implementation, this thing would be too slow to run; it would not be effective, would not be scalable, because it just couldn't run fast enough. Again, the code for this is all open source; they have a repository out there. I am not a CUDA code expert by any means (I did copy some of it into ChatGPT and start to orient myself to it a little bit that way), but it's out there, and it is super-detailed, hyper-optimized, super technical code. The core idea, again, is keeping the state in the SRAM so it doesn't have to come on and off all the time, using up the bandwidth between the high-bandwidth memory and the SRAM.

To give you a little sense of additional techniques, they highlight a few. One is called kernel fusion, which is basically this: when you have multiple different transformations, depending on the nature of the linear algebra and how the transformations relate to each other, instead of taking an input, multiplying by matrix A, getting the result, and then multiplying by matrix B, you can often reorder the computation so that you do the matrix A and B interaction first, and then you have one net transformation that you apply to each of the inputs. That's obviously highly simplified, but the kernel fusion concept is taking multiple transformations and collapsing them into a single one. It can only happen under certain conditions, and it can be really complicated to do, but if you want to maximize performance, it is a pretty core and well-established technique. Another big one they use is called recomputation. This is in training: because you're trying to minimize the traffic between the high-bandwidth memory and the SRAM to make things as fast as possible, you don't really have the ability to save all of the activations or all of the gradients along the way. So they use recomputation, where they basically recompute the activations and gradients as they do the backward, back-propagation step, which takes more compute but less moving of data around. And it turns out that moving data around is the more pressing bottleneck, so actually recomputing certain things in these instances can be much faster. If you want more on that, you're going to have to dig in yourself and get deeper.
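Here is a toy NumPy illustration of kernel fusion in the simplified sense described above, collapsing two matrix multiplications into one pass over the inputs. Real fusion in the CUDA code is about welding several operations into one kernel so intermediates never leave on-chip SRAM, but the flavor, same math with fewer trips through memory, is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
A = rng.normal(size=(d, d)) / np.sqrt(d)
B = rng.normal(size=(d, d)) / np.sqrt(d)
inputs = rng.normal(size=(10_000, d))

# Unfused: two passes over the data, with an intermediate result that, on a
# GPU, would be written out to memory and read back in between kernels.
step1 = inputs @ A.T
unfused = step1 @ B.T

# "Fused" in the simplified sense described above: combine the two
# transformations once, then make a single pass over the inputs.
BA = B @ A
fused = inputs @ BA.T

print(np.allclose(unfused, fused))   # True: same math, fewer trips through memory
# Recomputation plays a similar trade during training: redo cheap math in the
# backward pass instead of storing intermediates and hauling them back in.
```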
Zooming out, the upshot is that the hardware-aware algorithm makes such a dramatic performance improvement in how the state space model calculations are carried out that they can afford the greater generality, the multiple paths by which the inputs shape the information processing, which is much more Transformer-like (and again, the human brain is clearly doing something like this). They can afford it despite the fact that it means no convolutional form can exist, so parallelization is greatly limited, purely because of this sophisticated low-level design. With all of that sorted out, they can actually begin to scale. They trained a bunch of models across a couple of modalities; for language modeling they used relatively small models, roughly 1.4 billion and 2.8 billion parameters, trained on up to 300 billion tokens. That is a decent number but a long way from the state of the art: GPT-4 is generally thought to have been trained on something like 10 trillion tokens, about 30 times more, and Llama 2 was trained on roughly 2 trillion, so this is still an order of magnitude below even the small Llama models. But 300 billion tokens is also not trivial; it shows there is some scalability, and it gets you far enough down the loss curve to get a decent sense of just how good this thing is. And that is what made the headlines: these are the curves that let us say this looks like a really big deal, because they are beating Transformers on language modeling, and to call that a big deal would be an understatement. Going back to the very beginning, one of the most common, highest-impact questions I get asked is whether we will see something better than the Transformer, and here we are seeing something that is at least marginally better. They compare architectures trained with the same number of flops and look at the loss curves. One of the authors created FlashAttention, so they should have a pretty authoritative take here, and they are not comparing against a straw man: the paper includes both a vanilla Transformer and a "Transformer++", which they describe as the best Transformer training recipe they know. At a 2,000-token sequence length, Mamba essentially matches it exactly; at 8,000-plus tokens a visible gap opens up, with Mamba showing lower loss and lower perplexity. That is a big deal right off the bat: on these core measurements of broad pre-training, as far as it has been scaled so far, out to roughly 10^20 flops, they are beating the Transformer. That is still about five orders of magnitude short of GPT-4 in flops, but if you have studied the scaling laws, you know the big finding of the last couple of years, the one that has guided so much of where the big investments have gone: scaling laws do kind of work. You cannot predict the exact behavior of an AI from raw flops, but you can predict the aggregate loss with reasonably high confidence, because those curves are smooth and predictable. A minimal curve-fitting sketch of that idea follows.
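Here is a minimal sketch of what "predictable aggregate loss" means in practice: fit a power law to loss-versus-compute points and extrapolate. The numbers below are invented placeholders for illustration, not values from the Mamba paper or any real training run.

```python
import numpy as np

# Hypothetical (flops, loss) points: placeholder numbers made up for this sketch.
flops = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = np.array([3.90, 3.67, 3.43, 3.22, 3.02])

# Fit loss ~ a * flops^(-b) by straight-line regression in log-log space.
slope, intercept = np.polyfit(np.log(flops), np.log(loss), 1)
a, b = np.exp(intercept), -slope

for c in (1e21, 1e22):
    print(f"extrapolated loss at {c:.0e} flops: {a * c ** (-b):.2f}")
```

The argument for taking the Mamba curves seriously rests on fits like this having been historically reliable: if the log-log trend holds over the range you measured, it has tended to keep holding for a few more orders of magnitude.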
So if you see something beating the best architecture, even in a somewhat low-flop regime, and the advantage holds from 10^18 through 10^19 and into the 10^20 flop range, you have a pretty good chance that it extends to higher orders of magnitude as well, given how solid scaling laws have been and how similar the shapes of these curves look. It seems quite likely to me that a similarly predictable scaling law can be developed for state space models and that it will carry on for a few more orders of magnitude. There is obviously a lot more scaling to be done, from 300 billion tokens to 10 trillion and beyond, more parameters you may eventually want in the model, and a lot more to explore in basically every dimension. They also cover other modalities in the paper, and I will not get into too much detail, but the one I find most interesting is DNA, because there you naturally have long sequences. Why am I harping on sequence length? As a reminder, because this whole thing is about solving memory. You can compare a 2,000-token sequence against the Transformer at 2,000 tokens, and the Transformer does well there; you can do the same at 8,000, and these days the Transformer does well there too; but the Transformer cannot go to a million very easily. So why did they not test language on longer sequences? My guess is that there simply are not readily available long sequences of language sitting there ready for training, and that is something the authors could perhaps clear up for us. They did, however, go on to train on super long sequences of DNA, up to a million tokens, and what is so striking about that result is that they see no performance loss as the sequence grows; in fact performance keeps improving up to a million tokens. Basically no other architecture has that quality: every other architecture they study degrades significantly as the sequence gets longer. This is a huge deal, and it is the whole motivation, the whole reason I think this is such a game changer: it looks like the unlock that lets models start to develop longer-term memory, and I think a lot will flow from that. A million tokens is a lot, five times longer than the state-of-the-art Transformer context window, and this is just the very first paper on the Mamba architecture. There is other material in the paper as well: they work on audio, and they include some toy problems worth looking at, which show how previous state space models simply could not do certain things at all while the new design can. This is also where they introduce the concept of selectivity. The state space models of the past were what the authors call linear time invariant, which is a fancy way of saying the inputs proceed down an assembly line and get transformed in exactly the same way regardless of what they are. In contrast, with the selectivity mechanism they introduce, different transformations happen as the data gets processed, depending on the input, all within a single inference step; the sketch below contrasts the two update rules.
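Here is a cartoon of that contrast in NumPy. It is deliberately schematic: the paper uses a particular discretization and parameterization (input-dependent step size, B, and C over a structured A), and the weights below are arbitrary stand-ins, but it shows the one thing that matters here: the LTI version applies the identical update to every token, while the selective version computes its update from the token itself.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_state = 32, 16
x = rng.normal(size=seq_len)                    # toy scalar input sequence

A = np.exp(-rng.uniform(0.1, 1.0, d_state))     # fixed decay factors in (0, 1)
B = rng.normal(size=d_state)
C = rng.normal(size=d_state)

def lti_ssm(x):
    """Linear time-invariant: every input is folded into the state the same way."""
    h, ys = np.zeros(d_state), []
    for x_t in x:
        h = A * h + B * x_t                     # same A and B at every step
        ys.append(C @ h)
    return np.array(ys)

w_delta, w_b, w_c = rng.normal(size=3)          # hypothetical selection parameters

def selective_ssm(x):
    """Selective: step size, write vector, and readout depend on the input,
    so the model can choose what to store and what to ignore."""
    h, ys = np.zeros(d_state), []
    for x_t in x:
        delta = 1.0 / (1.0 + np.exp(-w_delta * x_t))   # input-dependent gate in (0, 1)
        B_t = B * np.tanh(w_b * x_t)                    # input-dependent write
        C_t = C * np.tanh(w_c * x_t)                    # input-dependent readout
        h = (A ** delta) * h + delta * B_t * x_t
        ys.append(C_t @ h)
    return np.array(ys)

print(lti_ssm(x)[:3])
print(selective_ssm(x)[:3])
```

That input-dependence is exactly what breaks the convolutional shortcut: once the update varies per token, you cannot precompute one global convolution kernel, which is why the hardware-aware recurrent scan was needed in the first place.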
So, key points: it is beating the Transformer on the familiar comparisons, the kind where you ask what it does at a 2,000-token sequence and at 8,000, taken up to at least 10^20 flops, and it is also showing these really remarkable demonstrations on both the toy problems and the DNA tasks, where the longer the sequence gets, the more the model improves. That is a really interesting phenomenon, one we can start to see through the fog with and hopefully figure things out from. One other interesting aspect of all this is inference throughput on the A100. They show that batch size tremendously influences how much throughput you get: the GPU is built for extreme parallelization, and if you are running just one autoregressive sequence through an 80-gigabyte A100 you are not getting the most out of it, which is true across all of these architectures, because the chip can parallelize far more than a single autoregressive unrolling needs. With their 1.4-billion-parameter Mamba they show you can run a bunch of individual autoregressive inference processes at the same time; an A100 appears to handle something like 64 concurrent inferences without any trouble, which is about where its parallelization maxes out, and all of this is higher throughput than a Transformer of similar size (a rough timing harness for that kind of comparison is sketched below). So that is a big deal, with one caveat: the Transformer is a bit more shardable. You can have a bigger Transformer that sits across multiple devices, and it is not yet clear exactly how that will work as people try to scale Mamba up, because you have this state sitting in SRAM and you are minimizing input and output. There is definitely more work to do on exactly how it scales beyond, say, seven billion parameters, but seven billion parameters is already quite a lot, a lot of models work at that scale and do important things, and the throughput on a single A100 is just much, much higher with the Mamba architecture than with the Transformer. So we have a lot going for us here: something that was demonstrated, even before this work, to have much better long-term memory properties but just was not generally powerful enough to be competitive with Transformers, and now a conceptual generalization of the state space model that puts it into Transformer territory, and perhaps even a little beyond, on these core language modeling tasks. It comes at the cost of not being as parallelizable, because you have to run it in recurrent form, but the hardware-aware algorithm design makes that feasible in a way that is actually faster than the Transformer, at least up to the scale of the devices we have. And it is demonstrated that the longer the sequences, the better it works, with DNA and the toy problems going up to a million tokens. That is a big, big deal.
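For anyone who wants to poke at the batch-size effect themselves, a harness like the one below is all it takes. The model here is a dummy stack, not Mamba, and the numbers it prints mean nothing in themselves; the pattern to look for is aggregate tokens per second climbing as you widen the batch, until the device saturates.

```python
import time
import torch
import torch.nn as nn

# A stand-in model: this is NOT Mamba, just a dummy stack so the timing
# harness itself is concrete. Swap in a real model to reproduce the idea.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).eval()

@torch.no_grad()
def tokens_per_second(batch_size, steps=50):
    x = torch.randn(batch_size, 1024)
    start = time.perf_counter()
    for _ in range(steps):
        x = model(x)                 # pretend each call emits one token per sequence
    elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed

for bs in (1, 8, 64):
    print(f"batch {bs:3d}: ~{tokens_per_second(bs):,.0f} tokens/s")
```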
There is one more thing in the paper that I think is interesting as well: a very first pass at integrating the state space model with the attention model. They show that if they interleave Mamba with the multi-head attention architecture, performance improves a bit further. Not by much, and the authors themselves describe the Mamba-plus-attention version as only slightly better, but the loss curve is visibly lower, and this is only one basic interleaving approach; I think there is a lot more where that came from. Okay, so what else do we know? The paper is published, the code repository is published, and the model trained to 300 billion tokens is published; the 2.8-billion-parameter model comes to something like six gigabytes, and people are already starting to work with it. An obvious question is which existing techniques carry over to this model, and an interesting data point is that a couple of people have already gone out and fine-tuned it. The one I have spent the most time with is a project called Mamba-Chat, from a company called Haven (Haven HQ), which is all about helping you fine-tune models for specialized tasks; Justus Mattern is the individual behind it. They take a pretty simple approach to the fine-tuning, the core libraries mostly just work, and after fine-tuning on 16,000 chats from the UltraChat-200k dataset you get a Mamba-Chat model that you can run in your browser through a Google Colab notebook (the sketch below shows roughly what running it looks like). I have spent a little time doing exactly that, just sitting and watching how it performs. The chat basically works: it converses with you very normally, it seems to know things, and it can be induced to hallucinate. It summarized effectively for me at 500 tokens, no problem, and at 1,500 tokens, no problem; at about 3,000 tokens it seemed to lose coherence. I was quite interested in why, if this thing is built for long-term memory, it does not seem to handle long sequences of text well, and what I came to, from trying it and from reading the papers, is that it simply has not been trained on long text. The magic of the state space model, which we do see demonstrated in other modalities like DNA and the toy problems, is that it can carry information forward via the state indefinitely, but that does not mean it will. In fact the average chat in the UltraChat dataset is 1,467 tokens, under 1,500, which is super small compared even to today's context windows. We do see that the model can go beyond that length, but not dramatically beyond it, and if there is one thing that should keep us all grounded, it is that this is the part that is least well proven as of now; a bit of a leap is required, and that is definitely worth keeping in mind as a possible reason that some of my downstream inferences and speculations may not actually work.
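For reference, running the fine-tuned checkpoint looks roughly like the sketch below. I am paraphrasing the interfaces from memory of the state-spaces/mamba and havenhq/mamba-chat repositories, so treat the import path, the prompt format, and every function signature here as assumptions to check against those repos rather than as a tested recipe.

```python
# pip install mamba-ssm causal-conv1d transformers   (rough requirements; check the repos)
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel   # assumed import path

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("havenhq/mamba-chat")
model = MambaLMHeadModel.from_pretrained("havenhq/mamba-chat",
                                         device=device, dtype=torch.float16)

# Assumed chat formatting; the repo's README shows the exact template it expects.
prompt = "<|user|>\nSummarize the selective state space idea in two sentences.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

out = model.generate(input_ids=input_ids,
                     max_length=input_ids.shape[1] + 200,
                     temperature=0.7, top_p=0.9)
seqs = out.sequences if hasattr(out, "sequences") else out   # return type varies by version
print(tokenizer.decode(seqs[0], skip_special_tokens=True))
```

Note that generation here is a sequential scan over the state, which is exactly why single-stream speed, rather than memory, ends up being the thing you watch.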
My own experience running this on Colab was that it was sometimes fast and sometimes slow. I was not able to get an A100, which is the hardware used in the development; even with my paid account, Google Colab gave me a V100, one generation before the A100, so not a slouch of a machine, but not on par with an A100 either. Just for objective comparison, the V100 has 96 kilobytes of shared memory per streaming multiprocessor versus the A100's 164 kilobytes, a little more than half, so I am not exactly sure how to interpret the speed I saw. In any event, it is clear that the recurrence is a fundamental limit on just how fast this thing can work. Even from the paper, on a single A100 you should expect around 58 tokens per second if you are running a single line of inference. That is faster than you can read: we speak at 100 to 150 words per minute, most people read a couple hundred words per minute, and speed reading gets you to maybe 300 to 600, so at the seven-billion-parameter scale it is comfortably faster than reading. Now imagine a 20-billion-parameter model. It is not exactly clear how much that slows things down, but say you drop to 20 tokens per second; now an individual user may actually be waiting on it, and if you are doing background processing the latency really starts to become a factor. At Waymark, for example, when we generate a video script for somebody we may be generating a thousand tokens, and at those speeds that could take the better part of a minute. So latency is still an issue even with this hardware-aware design, and I was not able to fully characterize it in my own experimentation on Colab. Bottom line: fine-tuning basically works, the libraries and datasets and the ability to turn it into a chat assistant present no obvious problem, summarization seemed limited by not having been trained on long context, and speed was hard to assess. But again, I want to zoom out a little and consider that maybe these head-to-head comparisons are kind of missing the point, or at a minimum only scratching the surface of what we should be thinking about. The way these results are reported is that the Mamba architecture outperforms the Transformer when trained at 2,000- or 8,000-token sequences, with more outperformance at 8,000 than at 2,000, where they are almost indistinguishable. So as we go to longer context we see an advantage, but these are pretty standard Transformer measurements, and they do not necessarily play to the Mamba architecture's strengths. If the whole point is that it can have long-term memory in a way the Transformer just cannot, because it has this long-lived state moving through time, then I would emphasize that none of these tests are really designed to test that. All of these benchmarks are things that Transformers can do; none of them are fundamentally different from what the Transformer can do. There are probably all sorts of things, especially once you start thinking about long-term memory, that the Mamba architecture can do and the Transformer simply cannot do at all. And this is where I want to start getting into somewhat more speculative scouting work, though hopefully you will agree it is pretty well grounded and likely enough to happen that we should start taking it seriously and planning for it. What can we do with a state space architecture that is now as expressive, as powerful at the unit level, as the attention-powered Transformer, but which potentially has a lot of different surprises, nuances, and idiosyncrasies left to be discovered? Remember just how many things we have seen that are just plain weird about the
Transformer AIS I suspect that we may see similar amounts of surprises from State space models but it seems quite clear that we can take this architecture and start to think about breaking out of text windows that we can think about new training strategies that are designed engineered data sets that are designed to encourage long-term coherence to encourage adversarial robustness to make the Model Behavior more predictable because you have this ability now with the state to encode longterm context we've really only barely scratched the surface of that kind of data set creation again the chats used to do the fine-tuning 1,400 tokens that's nothing that's a handful of back and forth total right a book The Great Gatsby is almost 100,000 tokens that's nothing in the scheme of what a human processes these architectures have been demonstrated up to a million but the loss curves just were still going down at a million right there's a pretty good Prospect I think that we could start to think of millions maybe even tens of millions and then again I think maybe all this is missing the point if we try to frame it as state space models versus Transformer is Mamba the Transformer killer the answer there is almost certainly no these different architectures as a hope now is starting to become more intuitive they are fundamentally different things they do things in fundamentally different ways and from that it seems very likely that they will have fundamentally different strengths and weaknesses and all of this analysis so far is largely conducted on what you might think of as the Transformers home turf right it's all on short sequences it's all on familiar benchmarks and these are the things that we test because these are the things where Transformers have traction we don't go to these super long sequence things because the Transformers don't really work on them we don't even really have the data sets there are a couple but there are not all that many so when you see that the Mamba architecture is outperforming the corresponding Transformer almost across the board on all these benchmarks that are fairly standard in the field then I think you have to also keep in mind that we're not yet even really playing to this new architecture's Strength there is as we broaden our thinking about how we might use this thing we are likely to see that there are areas where its performance its outperformance is even even bigger and we might also see by the way some some areas where its performance is inherently less so in challenging myself to well what what is probably worse about this state space model Paradigm compared to the Transformer I think the most fundamental thing is that the state carries all the information from the context or the episode into each inference step if important information was let go in a previous inference step then it will no longer be available in the state for the next inference step so that's a pretty important one right it's like you have a long-term memory but once you let go of something it's gone that's going to create some pretty different Dynamics I suspect I would guess that a state space model for example might be more likely to just simply Miss some things or appear blind to certain details and attention mechanisms are not always great at this either but because they do compute every token in relation to every other token if you are looking for like a diamond in the rough right if you load in The Great Gatsby into Claude and you put one weird sentence in there and you say spot the 
weird sentence at a minimum you can say that the attention mechanism is Computing the relationship between those tokens and all the other tokens and it's likely to pop something up weird when it does that such that it can probably identify the right thing I don't think that for free I think they they train on that stuff they train on increasingly long sequences to create that long-term coherence in the Transformer but you do have that every token to every token so to spot a diamond in the rough you have the opportunity to make sure you are looking at every token as the challenge arises in contrast if you're using a state space model and you've processed a bunch of information gone through a bunch of history and at each step along the way the state is evolving and that means incorporating new information and encoding it into the compressed form that is state but it also means letting go of information over time then once it's gone it's gone you may not be able to say oh hey sometime back in this episode or in this context there was something weird can you now tell me what's weird it may have just identified that as an anomaly at the time and Let It Go such that you may not be able to retrieve that so I'm starting to get a little bit speculative here but I think this brings us to the question of what's next and I think there are two big things that are really going to be next there is going to be the sculpting of memory and there is also going to be the remixing and the recombining of architectures and again remember even in the Mamba paper the Mamba interwoven with multi-headed attention is the best performing version the Mamba version beats the vanilla Transformer but the hybrid already beats the Mamba and again this is the attention is all you need paper for the selective State space model right this is the one where they're saying hey look we have a new thing here that is as powerful and is in fact beating the Transformer on its home turf right on all the familiar benchmarks that you know and love Transformers for the new thing is beating the Transformer on that stuff and then hey look if we just do the most naive recombination of the two it works a little bit better yet now to me that suggests that we're going to see a lot of remixing so let's talk about sculpting memory and remixing sculpting memory first and foremost this is going to be a data question I think more than ever before we are going to need intentionally designed training data we are going to have to be really careful and thoughtful about putting together long sequences and what we want models to learn from those sequences what we want to encourage them to retain what we want to encourage them to let go of and this kind of work this data creation work is a slog it definitely can be resource intensive it can be a grind and it's not always considered super glamorous but I think that it actually may be the most important work that is going to be needed to unlock the power of the state space models and the the long-term memory function because at least from this fine-tuning that we've seen it's not happening automatically right if you just go try it on short sequences does it handle long sequences it seems that not by default and that shouldn't really shock you right if you are optimizing something and you only ever optimize for retaining information for a certain length then what happens when you give it a ton more information it may kind of gum up the works right when you see these models going off the rails it's not exactly 
clear what is happening there. Certainly there is a lot more work to be done to understand the internals of these models. You might think the state is getting clogged with too much information, becoming overloaded and no longer working well, perhaps not letting go of things as it should; you can also imagine the opposite problem, as I already sketched out, where in certain situations it has let go of things you wish it had kept. So we are going to have to create datasets with a high level of intention to really shape how we want the memory to work. If we want needle-in-a-haystack behavior, we could probably incentivize that quite directly, even with synthetic data: create situations where there is an anomaly and the model needs to identify that anomaly much later on. That would strongly incentivize a certain kind of information retention, namely that anything contrastive or out of step with the current state is super important and must be maintained. On the other side, if that diamond-in-the-rough situation never comes up in the training data, it probably will not happen by default: the state will not retain information it is not incentivized to retain, and the training process is largely defined by the data it learns from. So if you wanted to do some of this data work, you could ask yourself how you want AI long-term memory to ultimately work, what behaviors and qualities we want it to have, and what datasets could teach those qualities and behaviors. I think this may prove to be the most enduring contribution, because everything else is going to get remixed, while datasets actually seem to last longer than almost anything else; MMLU and the math benchmarks are among the few things in machine learning right now that are about three years old and still state-of-the-art relevant. I think there is something similar to be done in dataset building for the state space architecture era. A lot of that will probably be private, by the way: when you think about long episodes, think about how much investment companies make in training their people, how long that process is, and how many tokens are represented in an employee onboarding and training process. It is a lot, so you would expect private companies to be well positioned to create much of this data. So that is data, and the data I think is going to prove super important. As for architectures: when people ask whether this is the end of Transformers, whether Mamba or state spaces are now all we need, I think that is the wrong framing of the question. The way I see this unlock is that we used to have one mechanism, the attention block, that was so powerful, so much better at almost all tasks than almost everything else, that we literally just used it, and only it, even knowing that different layers are doing different things: taking in a stimulus, working it up to higher-order concepts, processing those in the middle layers, and then working back down to a tangible next-token prediction, all done by the same block, which is kind of crazy. Now we have the existence proof that with a state space model you
can achieve similar, even slightly better, power on the Transformer's own home-turf tests: flop-for-flop parity or better, better scaling properties, faster, higher throughput. Amazing. But I think the real way to think about it is that we now have these two building blocks, these two units, each with different strengths and weaknesses, and the real question is going to be how we combine them. We are already starting to see developments from just the last couple of weeks where this kind of thing is happening. One comes from Together, the company where Tri Dao is a co-founder and the chief scientist: a new model called StripedHyena-7B, which is a combination architecture of the attention mechanism on the one hand and a state space model on the other. From what I can tell, and I do not believe they have published a full paper on it, this seems to be more of a commercial release, and as I understand it, it is not yet a selective state space model but a traditional state space structure combined with the attention mechanism. They describe it as offering a glimpse into a world beyond Transformers; notably, it is not a world beyond attention. It is a globally competitive 7-billion-parameter model that you can already access through the Together AI API, so that happened quickly, and we can assume a selective state space version may soon be coming as well. There is also another recent paper called the Block-State Transformer, which gets even further into the weeds: a deeper integration of state space and attention mechanisms, again not yet with the selective state space model, but with a global state that gets fed into an attention mechanism. In the Block-State Transformer unit you still have the self-attention of the text to itself, and then you also have cross-attention from the text to the context that is evolving through time. It is again very competitive; I will not get into the details as much, but suffice it to say it is doing its fair share of winning, and that is without even having the selective state space model. The selective state space model was published just two weeks ago, the Block-State Transformer came just a little bit after that, and StripedHyena came just a few days after that as well; they have already announced that they are taking the architecture up to at least 600 billion tokens of training. With the recurrence it seems like it is not as easy to parallelize as other things, but I am sure they will continue to push, and there is so much room for remixing that I think we are nowhere near done. We have gone from one core building block to two, and we have only begun to think about how we might remix them; a schematic of what interleaving the two kinds of blocks can look like follows.
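Here is a minimal sketch of that kind of interleaving. This is not the layout of StripedHyena, the Block-State Transformer, or the hybrid in the Mamba paper; the SSM block below is a gated-MLP stand-in (in practice you would drop in an actual Mamba block), and the only point is the shape of the idea: a stack where most layers are state space blocks and every few layers you splice in attention.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Standard pre-norm self-attention block with a residual connection."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class SSMBlock(nn.Module):
    """Placeholder for a state space block; a real hybrid would use e.g. a Mamba
    layer here. This stand-in is just a gated MLP with a residual connection."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model, 2 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        h, gate = self.proj(self.norm(x)).chunk(2, dim=-1)
        return x + self.out(h * torch.sigmoid(gate))

class HybridStack(nn.Module):
    """Mostly SSM-style blocks, with an attention block every few layers."""
    def __init__(self, d_model=256, n_layers=8, attn_every=4):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionBlock(d_model) if (i + 1) % attn_every == 0 else SSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 128, 256)          # (batch, sequence, d_model)
print(HybridStack()(x).shape)
```

Sweeping the `attn_every` ratio, or which block type dominates, is exactly the kind of search the Together team describes in the quote further down.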
I also think we are going to see a lot of evolution of the state space model itself. The Transformer got all the attention, and got optimized like crazy, because it was working; state space models did not attract that kind of effort because they were not broadly competitive, but now that we have seen they can be, how about moving from a single state to a multi-state model? We already see the analogy with attention, where there is not just one attention head but multi-headed attention, and you can certainly imagine multiple states making sense, with different states responsible for different things. One great unlock the state provides is that it is another thing you can make a target of an optimization process. What I mean by that is that with the Transformer, and as far as I can tell so far with even this most advanced selective state space model, all of the pre-training optimization is on the language prediction, the perplexity measure: how well the model predicts the next token in whatever text it happens to be trained on. But with a state space model you also get a state out, that state can have various properties, and you might think about optimizing multiple things at once: the prediction, certainly, but also some properties of a state, or of multiple different states. One really natural way to do that would be some sort of contrastive objective between the different states, just to push them in different directions: optimize all of the states for making a good prediction, but also create an incentive within the training regime for the states to look different from one another, giving a richer and more robust overall representation. You might also have some states dedicated to specific purposes, and one of the really interesting questions is to what degree people take that kind of conceptual elaboration further. You might imagine, for example, a built-in classifier. In the Transformer you have the forward pass, layer by layer, you get a prediction out, and that is all you get. We can go into the middle layers and use techniques like representation engineering to see, based on the activations, whether a certain concept appears to be loaded and active, and that is starting to work, but it is definitely messy. You can also have classifiers that sit outside and try to filter inputs or outputs or identify when things are problematic, but you do not really have classifiers built in; you try to approximate that with RLHF and have a hell of a time doing it, with lots of false positives and lots of false negatives. You could instead imagine multiple states within a state space model, all working together to make a good prediction, but with one state specifically optimized as a classifier, so that it fires in its own way while being aware of the full history and enjoying all the beneficial long-term memory properties of the state space model. Because the state is also something you can optimize, you might be able to turn different states into effective classifiers, so you get both an output in terms of the next token and classifier values out at the same time; a toy version of that kind of multi-objective setup is sketched below. You may also see different kinds of wiring. Maybe you feel that this long-term memory is great but cannot be fully trusted, and you want to look at the current state but also some historical states. You cannot save every state, keeping in mind that the only reason this is scalable in the first place is that the traffic from the high-bandwidth memory to the SRAM closest to the cores has been minimized by design, so we cannot take every internal state and export and save it; but we could do some, accepting a modest performance hit.
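To make the multi-state idea concrete, here is a toy sketch of a model with two parallel states and a combined objective: next-token loss on one state, a classifier loss on the other, and a small penalty pushing the two states apart. This is purely illustrative of the speculation above, not anything from the Mamba paper, and GRU cells stand in for the actual SSM updates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStateToy(nn.Module):
    """Toy model with two parallel recurrent states: a 'core' state used for
    next-token prediction and an auxiliary state read by a classifier head."""
    def __init__(self, vocab=1000, d=128):
        super().__init__()
        self.d = d
        self.embed = nn.Embedding(vocab, d)
        self.core_cell = nn.GRUCell(d, d)      # stand-ins for SSM state updates
        self.aux_cell = nn.GRUCell(d, d)
        self.lm_head = nn.Linear(d, vocab)
        self.cls_head = nn.Linear(d, 1)

    def forward(self, tokens):
        h_core = torch.zeros(tokens.size(0), self.d)
        h_aux = torch.zeros(tokens.size(0), self.d)
        for t in range(tokens.size(1)):
            e = self.embed(tokens[:, t])
            h_core = self.core_cell(e, h_core)
            h_aux = self.aux_cell(e, h_aux)
        return self.lm_head(h_core), self.cls_head(h_aux).squeeze(-1), h_core, h_aux

model = TwoStateToy()
tokens = torch.randint(0, 1000, (4, 32))       # dummy context
targets = torch.randint(0, 1000, (4,))         # dummy next-token labels
flags = torch.randint(0, 2, (4,)).float()      # dummy classifier labels, e.g. "is this an attack?"

logits, cls_logit, h_core, h_aux = model(tokens)
loss = (
    F.cross_entropy(logits, targets)                              # usual language-modeling term
    + 0.1 * F.binary_cross_entropy_with_logits(cls_logit, flags)  # classifier trained on its own state
    + 0.01 * F.cosine_similarity(h_core, h_aux).abs().mean()      # push the two states apart
)
loss.backward()
print(float(loss))
```

The interesting design question is the weighting: how much prediction quality you are willing to trade for states that are individually legible and mutually distinct.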
And maybe we would find that working from the last state plus some selection of historical states does even better. There could be hierarchies of states; there could be mixtures of experts built into this kind of architecture as well. There is a whole host of opportunity for innovation and elaboration here, just as there was with Transformers, and arguably more so, because we now have two fundamental building blocks that seem to work comparably well and have different strengths and weaknesses. What other kinds of things might we see? New wiring diagrams, new ways to branch, perhaps some way to recover the parallel form for training through an approximation. The selective mechanism does seem to be recurrent by definition, but if we accepted some imprecision, could we create a convolutional version that is close enough, either to load in a bunch of context and get something off the ground fast or to facilitate training? I would not be surprised if something like that is possible. We might also see regularization techniques that are really interesting with respect to memory. Think about the human cognitive experience: most of us can remember what we had for breakfast this morning, maybe the best breakfast we have ever had, and what we normally have for breakfast, but we certainly cannot remember every breakfast we have ever had. Some abstraction, some compression, is happening, where episodes that are not outliers and that fit a pattern lose their details and get compressed into a general memory of what I typically eat, rather than a pile of super specific individual breakfasts. So I think you are likely to see some sort of analogue of weight decay; you might call it state decay. With weight decay you basically say that by default, throughout the training process, all the weights get a little smaller at every step, and the only way they do not shrink is if they specifically get turned up through optimization. That has been shown to be pretty key to the grokking phenomenon, where the generalization that happens seems to depend on the model's first solution, which is just memorization, gradually getting weaker under weight decay while the actual algorithm that solves the problem keeps getting turned up; when the algorithm becomes the dominant mode, you get grokking. Some version of this is certainly also happening in human memory, where clearly a lot of details are pruned. There is work to be done here, but my term for it is state decay: from weight decay to state decay, not to say we will not also have weight decay in these state-based models, but the idea that something which has just sat in the state for a while without appearing useful can be gradually let go. And maybe multiple states do this in different ways, with some trying to retain and others trying to let go, so that we have a robust multi-state representation that can do a bit more than any individual state with a single training objective could do on its own. Finally, before moving on from the architecture bit, here is a quote from the Together AI blog post about the StripedHyena model.
they say early in our scaling experiments we noticed a consistent Trend given a compute budget architectures built out of mixtures of different key layers always outperform homogeneous architectures this was a conclusion that I came to in my own reading and just thinking about this and then eventually I came across this quote so I don't want you to think that like I'm taking too much from this one quote on the contrary I thought it was like a perfect representation of what I had inferred was likely to be the case so we're we're headed for a future here where we're not going to see the same attention block is every single layer we're also not going to see that the selective state space block is the only thing that matters instead we're going to see that these architectures built out of mixtures of different key layers are going to be the way and that they'll probably blend together and that's what the block State Transformer is all about as well so what else can we begin to expect if all this analysis is right on the hardware Dimension the state itself is another dimension for scaling we have the parameters that's that's been the the main way that we've talked about models and their scale and we also have the context window which we've been scaling in the Transformer but here we have something new we have this state which is itself of finite size and it is where the information from the current episode from the current context has been compressed so how big that state can be is a pretty important question you are fundamentally limited by the size of the state how much information you can actually bring forward so I think this doesn't seem to be explored at all in this current paper but how big the state is is definitely something that you will want to push right and I would expect that just like we saw a rush for bigger parameter counts and longer context Windows bigger States is something that is going to ultimately get Scaled is the current Hardware optimized for that probably not if you just consider the fact that the algorithm here was developed in a hardware aware way that sort of implies that the hardware was not designed for this algorithm well we've certainly seen Hardware designed for Transformers it will certainly take a while if we are going to see Hardware designed for stat bace models but I wouldn't be shocked to see something like that happen not exactly sure what that would look like but perhaps it would look like a different ratio between the SRAM or the shared memory and the high bandwidth memory right now from what I've seen over the last couple Generations V100 a100 h100 those memory sizes are roughly growing at the same Pace whatever has determined that ratio the ratio hasn't really changed over the last couple Generations if I'm right about all this we may see higher ratios of shared memory to high bandwidth memory because maybe the models don't necessarily need to get all that much bigger but the states need to get bigger and so the hardware may have to adjust to allow for that beyond that how much more can we get out of SRAM and into some longer term storage it's kind of crazy right the the state is the important thing here or it's the big new innovation that gives us this possibility of longer term memory more coherent in all likelihood more agentic sort of behavior and yet the state itself doesn't exist anywhere except in this very small amount of the high speed memory and by default it's not exported from that so it's not saved anywhere so all the states by default 
are lost if we were to try to get all the states literally every time stamp off of the SRAM then it would probably be prohibitively slow but how much more might we be able to sneak off the SRAM or could a different Chip have a somewhat different design where information could flow out of the SRAM into some sort of storage without clogging the the critical input Channel I don't know but it does seem like a question is going to be what more can we get out of SRM and at what performance cost and eventually is there any hardware modification that could ease that because I do think we're going to want to look at historical States for sure and just to get one historical State off that shouldn't be too big of a problem but to get like lots to get every 10 states or every hundred States or every thousand that's going to be a lot more especially if you're talking millions of steps in an inference if you wanted every thousand you're looking at a thousand States how much does that slow thing down I don't know today but I think that's another area where there's going to be some interesting scaling laws and Analysis on current Hardware how much can we save out of these internal states without tanking performance they really do think that's likely to matter then you might think about application development so let's imagine these things start to realize the potential that I'm talking about and we we start to have models that whatever exactly their internal wiring they have this long-term memory that they start to have have something that the Transformers lack that that they have the ability to evolve meaningfully over time that they can because they have this longer term coherence that they can start to be more predictable in their behavior fundamentally less stochastic fundamentally more shaped by all interactions that they've had something that can evolve with you something that can really in a much deeper longer term holistic way actually start to get to know you and again all that's it's going to take a lot of work right the data part I think in particular we have to go create the data that incentivizes that and that doesn't really exist much today I I would not be surprised if we start to see it come online relatively quickly but it doesn't exist yet today so we are going to have to go and create that but as that starts to come online then what sort of things are are going to be possible in terms of applications I think one thing is we'll see a shift probably from Context management to something that you might call context selection or another way to say that is maybe you go from rag to something more like State search so what do I mean by that it's like really long contexts you aren't going to want to have to recompute every time right if you're looking at reading a whole body of literature in order to load up the latest and greatest knowledge into the context or if you're talking about reading somebody's whole email history to really get a sense for who they are you're not going to want to have to redo that for every single inference it's not going to be practical so we're going to have some states which get arrived at exported off the SRAM at whatever performance cost again it'll definitely be acceptable to do this from time to time till cash estate just how many you can do will be interesting to to find out but you can definitely do one here or there so if you want to run through a million or millions of tokens and build up this state that has evolved through that entire history you definitely 
will be able to take that state and save it and then you might have a lot of those running around so instead of you know having to be very careful about what specific text do I put into context maybe the idea is more that we have a big library of pre-computed contexts which we can then select from today I'll often give a few shot example but maybe in the future I can go to a library of existing contexts that have done hundreds or thousand or a million of those problems right read the whole textbook and then did a thousand problems and are now ready to solve my particular problem it seems like that is going to be a potentially big big deal and again going through a database and retrieving some text and putting that into the prompt that feels kind of clunky by comparison to saying do I have any states that have what I need this also suggests why would you want multi-state architectures well maybe you want one state that's kind of your core state but then you want auxiliary states that have information again we sort of work this way right we have I'm Nathan I'm me I'm like going through my day I know what's going on and then I like get into a mode where I have all this working knowledge that's really front and and Center in my mind and then I'm going go switch to a totally different context and go play with my kids and I have kind of a different seemingly quite disjoint set of mental states that are associated with those two things but in some sense I'm still me right I'm not suggesting that I work like these states space models work that they are going to work like I work but I just see that behaviorally I have this kind of flexibility and I can imagine an architecture now with again perhaps a core state or a couple core States and then like swappable States the way that I've been saying this is from mixture of experts to mixture of states can we take these states and combine them recombine them in in interesting ways almost for sure it seems like that's almost for sure possible right we see that in representation engineering even within the Transformer mid layer activation it seems almost for sure that you would be able to take States and combine them perhaps with weird unexpected consequences in some instances but like functionally almost for sure that should work there's also like disposable paths or self-d delegation could potentially get a lot more interesting I was trying to use an agent platform the other day and the agents still don't quite work so I'm trying to use this agent platform to do something it involves like searching through my email my email has a ton of stuff in it the search returned a ton of things and then it kind of broke like it was too much couldn't page through it effectively or whatever and I was looking at it I was like I can see why this is a pretty big challenge because even if you're using gp4 turbo or Claude 2.1 and you've got north of 100,000 tokens like an email search out of my email I need to delete more stuff but man a lot of people have a lot of email right we have gigabytes of email so the search results go pretty deep and then you need to scan through them and figure out what's relevant and you can self-d delegate that but the Transformers just don't handle it super well whereas you might imagine a state space model scanning through these long search results much more effectively and this is actually an area where you could also Imagine the state space model parallelizing really well you might say okay I've come as far as I've come I'm at this state 
and now it is the time to search through the email well maybe I can just take 20 of myself and divide that work 20 different ways and each go through a page and for each email I'll be asking does this seem relevant and if so I'll indicate as much and then that way I can really quickly scan through all the stuff with a ton of context to determine what's relevant and not relevant then once I find all the stuff that's relevant then I pop up a level in my recursion depth and now I can just discard all those versions of me and me here is the selective State hybrid architecture of the future we can just discard all those versions that actually ran through all the annoying irrelevant email results and we can just pop up and say okay here's all the relevant stuff maybe you have some notation that like we performed a search or whatever but you don't need all that you don't need to recall you don't need to clog up your context you don't need to clog up your state with the all the random emails that were not relevant right so instead you just run that process discard and then pick up back where you left off with all the stuff that actually is relevant I think we are likely to see far far more gentic behavior and far more robustness far more apparent sense of direction or purpose or kind of unity toward goals based on the fact that we can build up these really long contexts you might even think of things like developing sort of a immune system a runtime immune system as a state right this get gets a little bit beyond what we typically think of as the cognitive but the human immune system has memory it knows what it has seen in the past sometimes these memories fade other times they last a lifetime obviously it's all super complicated but the human immune system is in a very fundamental way built on a memory system so what would the equivalent of that be in one of these State space models what if it was just a log of all the attacks that are known right we maybe can't train into the core weights that there's all these attacks who knows what the training mix is going to look like fine-tuning in general has not proven at least for Transformers to be a great way to teach facts so new kinds of attacks new things to watch out for those are tough right open AI has their system prompt and there's a lot of debate that happens around when they get embarrassed by some failure of chat GPT and then later people report it being fixed people accuse open AI of changing the system prompt at this point it's pretty well established that they're not changing the model that frequently but they can change the system prompt and so maybe they can just go in there and be like hey watch out for this and have it watch out for that and hopefully the storm will blow over and we can include it in the next data set but we can at least cach the behavior and stop the embarrassment for the moment I don't know how much they do of that I suspect not all that much because clogging up your system prompt with that kind of crap just doesn't seem like the kind of trade-off that they would want to make unless it's a pretty serious vulnerability the fact that my spear fishing thing like always worked they could have easily said like do not spearfish in the system prompt and that probably would have stopped it from working or at least would have helped quite a bit so I I don't think they're doing a ton of that but with the states space Paradigm you can imagine doing a lot of it you can imagine a dedicated state within a multi-state model you 
can imagine keeping it much more up to date on recent reports of the things it really needs to be watching out for right now, and potentially those updates could be passed around in a far more lightweight and rapid way. If that lives in a dedicated state within a multi-state model, you could potentially get significant robustness out of it. That is how we do it, after all: fool me once, shame on you; fool me twice, shame on me. That only works because I have memory; I am supposed to remember it, recognize it, and know better the next time. The Transformer just does not have that mechanism, and the state space models definitely do; the questions are whether we can shape it and whether we can get the performance to where we can really trust it. I strongly suspect we can, and I strongly suspect we are going to see much more agentic, much more goal-directed, much more consistent, much more predictable behavior, with much less weird going off the rails and less apparent stochastic randomness. I think all of that is probably going to be unlocked mostly by datasets; if I had to guess, I would say it is two-thirds datasets and one-third architectures, because you are going to have to have the datasets to train into the various architectures, and the two will co-evolve. If you are thinking about datasets and want to build one, and by the way this would be true at any company, I would really emphasize something I have said many times: AIs can handle tasks really well, but jobs are too big for them. If this changes that, then the way it probably happens is that people really start to record jobs. Instead of just the more manageable, bite-sized units of tasks, we are going to have to start thinking about datasets of super long episodes as a way to record jobs. And to the degree that we want to automate whole jobs, and do it in a robust way, it is going to be important to make reasoning explicit and really capture that reasoning. This is important already with Transformers: I have covered a few times how, when I tried to fine-tune GPT-3.5 for the Waymark script-writing task, it did not really work at first; results did not seem better, maybe even a little worse. Then we started using reasoning steps in the dataset to make clear how we want it to think about the task: first go through the reasoning and the strategy, then write the script. Teaching it how to approach the work, and to spell that approach out before doing it, dramatically improved performance. I think there is an analogous level of reasoning, or identity, or narrative that people carry, which guides our actions over time, makes us legible and predictable to one another, and helps us stay on task, and that probably needs to get built in here too. It is not just "this is the task, this is how I break it down, this is how I approach it". If you were really thinking about how to get an AI to do a job, one of the things I think is missing, the thing you do not really find on the internet, is the higher-level narrative of who am I, what is my job, what are my goals, and what am I doing right now in pursuit of them. That is always somewhat in mind, even if it fades a little as you really get into a task; you select your tasks and decide what to do based on this stuff, and you decide when it is worth pushing through difficulty versus
when it's worth giving up and going and doing something else instead these decisions are Guided by this high level narrative and we don't really have that in the training data it's not on the web so I think we're really going to have to work hard to capture it and it probably just starts off by like having people monologue more and just force people to take the moment to really spell out who are you what is your job what are you doing and to weave that in to all the activity I think that is likely to be a big part of how we can get the AIS to imitate human behavior and again that data doesn't really exist yet certainly not at the scale that it's going to need to exist if if we're really going to see the sort of performance that I'm expecting from all this new technology okay what about interpretability and safety this is a really interesting topic and I've seen different takes on it one simple take which I think definitely has something to be said for it is maybe this is a huge win for safety work and mechanistic interpretability work because the state itself is such an obvious thing to focus on it's like wow with the Transformer we have all these internal activations and circuits but what's happening it's all so alien here because we have this state that is the Long Live State and you kind of compare the last state to the current state and to the next state it does seem like that is a very natural objective study and to look at what concepts are activated to look at Concepts like intent is there deceptive intent is there helpful intent is there harmful intent it would seem that you can really zero in on this particular thing and study it super intens ly techniques like representation engineering that I've mentioned a couple times are seemingly likely to be pretty adaptable to this circuit type approaches perhaps as well you have a ton of Weights still right like you still have billions of parameters that are involved in this information processing so studying the circuits that develop within those that's definitely also a thing but the existence of the state as something to focus on does seem like it could be a huge Advantage for interpretability and safety at the same time at least as it's currently set up the fact that the states are never actually taken off the SRAM at all and you don't have access to them and that's somewhat necessary for performance reasons does suggest that you can study these states but if you actually want to monitor States at runtime right now we don't have a great way to do that if you're tinker tinkering and you're studying you're doing interpretability work then yes you have a natural place to focus your energy but there's not a practical way to say in fact it would kind of violate the whole premise of the hardware aware algorithm to say I want to export every state so I can study it later you maybe could fit in some sort of checks that could happen purely on the SRAM and that could be another output uh but definitely some Innovation would have to happen there to have a state-of-the-art state monitoring applied at runtime without a Major Performance hit again maybe we see Hardware changes there where there's more SRAM and it makes it easier to do but very hard to say right now just behaviorally I would guess there will be as many surprises and weirdnesses with these as there are with Transformers maybe with the hybrid mechanisms you start to get somewhat less behavioral weirdness because they hopefully get the the best of the strengths and the weaknesses are 
Just behaviorally, I would guess there will be as many surprises and weirdnesses with these as there are with Transformers. Maybe with the hybrid mechanisms you start to get somewhat less behavioral weirdness, because they hopefully get the best of the strengths and the weaknesses are compensated for, but if I were just thinking about pure state space models, I would expect a lot of weirdness, and even with the hybrids you certainly should be open-minded about behavioral weirdness. So I do think we're set back a bit in terms of just understanding AIs generally: what can they do, what can't they do, what do they trip over? We have answered that to a decent degree for Transformers, but we have not even really begun to answer it for state space models; we can only begin to answer it now. Zooming out to a bigger picture still, considering philosophical questions and questions of human-AI interaction dynamics: I think if everything I said about the agent capabilities is true, if because of these long contexts we can start to see more coherent, consistent, predictable, legible behavior, if these things can evolve over time and get to know us in some more intuitive, deeper way, then I think we're going to have a totally different relationship with them. It's going to be much easier to project value onto them. We already see this happening, of course; go back to our second episode with Replika, people falling in love with pretty primitive chatbots. Now we are seeing Character AI, where people are spending hours a day, and all sorts of new sex-bot chat-app-type things are coming at us. All of these things still don't have long-term memory; we still don't have a mutual evolution with them. This could really change that, and if it does, we are going to be much more inclined to see these things as having real value, and the loss of a state, with all the context it represents, could be a real loss. You may not even be able to recreate it, because that log is not there by default; you could log every input, but by default most people are not going to be logging every input, and intermediate states are lost. So losing a prized state or a valued state could be a real loss, not the kind of thing that's easy to get back. It's not like with a Transformer, where you just reprompt or whatever; it might not be like that anymore. These things might have more durability, and they might even merit more moral weight. I don't really know why I'm conscious; I certainly don't claim to have the mystery of consciousness solved, but I hope to do an episode on this. One paradigm for understanding consciousness relates to the fact that we can have this kind of long-time-horizon awareness, and that to some extent that comes about by us modeling ourselves. We definitely have an ability to predict how we're going to feel in the future; we may be right or wrong, but we do have a kind of running prediction that helps us, when things are changing from our expectations, realize that, in part because we do have at least a little bit of ahead-planning for what we are likely to be experiencing next, and when that expectation is violated, it definitely brings things to our attention. You can definitely imagine these state space models self-modeling: what if you had a dedicated state whose job it is to model what the likely state is going to be fifty states from now? That's a bit of a trick to pull off, but it does seem like there's nothing fundamentally blocking it, and that's getting close to at least one definition of consciousness.
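Purely as a sketch of what that kind of self-modeling could mean mechanically, here is a hypothetical auxiliary head trained to predict the model's own state some number of steps ahead, with the prediction error doubling as a surprise signal; the dimensions, the linear predictor, and the fifty-step horizon are all invented for illustration and are not from the Mamba paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, HORIZON = 256, 50          # hypothetical state size and lookahead

# Toy self-model: predict the state HORIZON steps in the future from the current state.
self_model = nn.Linear(STATE_DIM, STATE_DIM)

# Stand-in for the trajectory of states the backbone produced during a rollout.
states = torch.randn(1000, STATE_DIM)

predicted = self_model(states[:-HORIZON])        # prediction made at each time t
actual = states[HORIZON:]                        # what the state actually was at t + HORIZON
loss = F.mse_loss(predicted, actual)             # auxiliary training signal for the self-model

# At inference time the same error can serve as a surprise measure: large values
# mean the model's expectation about its own future state was violated.
surprise = (predicted - actual).norm(dim=-1)
print(float(loss), float(surprise.mean()))
```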
Will there be subjective experience? I have no idea, but it's much easier for me to imagine caring about an AI that is powered by a state space model with a long-running, super-high-context state, and being right to care about it, as compared to caring about a character that's coming out of a Transformer model today. I'm radically uncertain about these things, but certainly it seems like this is a significant step in that direction. If you've ever heard of The Age of Em by Robin Hanson, I would say it's a classic that is more of a classic now as a result of this. He postulates AIs that essentially simulate, or emulate, a human brain; that's why he calls them ems. They're supposed to be very humanlike, because they are emulating what a human brain would do to a high enough degree of precision, but they have all these different properties because they are digital: they can be paused and put into long-term storage and then woken up, and when they're woken up, time has passed but their state is unchanged, because we have this durable digital storage. He envisions a world where that plays out, and these things are easy to clone, so you can spin up a bunch; I alluded to that in terms of application design and in terms of self-delegation, going off to scan through a ton of emails. A lot of the analysis in The Age of Em is way more apt in a future where long-running states are a key component of what AIs are. We're not headed immediately for real high-fidelity emulations, but we may be headed for enough of the core capabilities that a lot of that analysis is, I think, going to start to become more and more relevant. You look at the Transformer and you look at the ems and you think, these are just so different that the analysis of the ems, and how that's likely to play out in terms of economics, society, whatever, doesn't really follow from Transformers, but much more so it seems like it could follow from the state space architectures. So maybe we can get Robin on to go through that, because I think it's only looking more and more relevant. Okay, zooming out again a little bit further, just talking big picture: if you are looking for investments to make, if you are looking for companies to watch, I would definitely look at both of the companies that the two authors of the Mamba paper are affiliated with. That's Albert Gu, who is the chief scientist at Cartesia, and Tri Dao, who is the chief scientist at Together AI. Cartesia specifically says that they are training models with subquadratic scaling properties, and this certainly fits that bill. Together AI seems to be even more focused, just as you'd expect from Tri Dao's hardware-aware, close-to-the-metal programming, on infrastructure: managing compute, getting the most out of compute, et cetera. Given that both of these guys are involved with startups, it is a little curious that they published this research. We are in a moment of general closing-up, and I would be fascinated to talk to them about why they chose to publish this work. I think this is the kind of thing you could keep secret, especially given that it's not just a high-level conceptual thing; it's also the low-level implementation details that are so important to making the thing work, and that's the kind of thing you can keep secret. You can keep CUDA code, a highly specific implementation, secret, but they chose not to, and it's out there. Maybe they even have more secrets that haven't been shared with the public, but this is a pretty big paper, and codebase, that I am somewhat surprised got published in this day and age. Just as a thought experiment: if they had taken it to Microsoft and tried to sell it, I would expect that they would have been able to get a lot of money for it, but they didn't.
So it's out there, and unless I'm way wrong about what this can enable, that just means acceleration will continue. OpenAI is well known for pouncing on stuff like this. When I've shopped this idea around to different people and asked why this might not be as transformative as it seems to me, one of the interesting speculations I got back was: well, maybe the leading labs already have something like this, and so it's baked in already to the very best stuff we're seeing; maybe GPT-4 has an element of this. I don't think that's true. Certainly the way these models are presented seems to be that you have a context window and you can't go past it, so I don't see anything to suggest yet that there is something like this in them. The leaders like OpenAI didn't invent the Transformer either, obviously; what they have done is pounce on the advances, really push the performance and maximize it, and really try to figure out what it is good for. They are obviously highly invested in data creation and collection of all sorts, and they're doing licensing deals now with all sorts of different data providers. I would guess that they already have somebody internally, maybe even a team, working on characterizing this, and I would guess Anthropic is probably doing the same. If it is working, they will make it work. And interestingly, with OpenAI in particular, their new Assistants API would be dramatically improved with an architecture like this. The Assistants API, which is basically the API version of GPTs, allows you to have arbitrary-length threads and also documents, a RAG-style knowledge base that's attached. But with this arbitrary-length thread, you think: how do you have an arbitrary-length thread if you have a finite attention window? The answer we've got so far is that the assistant basically queries its own history as part of your subsequent call, fetches history that's relevant, and loads that into context, and then you are charged (this has not been totally clarified yet) for the new inputs and outputs but also whatever was fetched out of history. That's a little bit messy and doesn't sound that awesome. It would be totally smooth with an architecture like this: you have an arbitrary length, you have that history, you can query it perhaps if you need to, but you have this state that is propagating through time. And this would also lend itself, if used in pure form (and I don't think it's going to be pure form; I think there's going to be a hybrid with attention), but if used in more or less pure form, then you could also imagine that you only get billed for the additional inputs and outputs at any given extension of the thread. That could be a huge deal. So I think OpenAI is advantaged by this, probably relative to others, which is again why it is kind of confusing as to why it was published. Their ability to pounce on things, to drive scale, to get relentless on the question of what is going to make this product work well, is going to be a huge advantage here, because this is something that is just new and totally uncharacterized. I don't think they have this in their production stack yet, but they might have something like it in the pipeline, and if I had to cite one piece of evidence that would make me think yeah, maybe they do, it would be the nature, just the structure, of the Assistants API.
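To spell out the contrast, here is a hypothetical, schematic sketch of the two ways an arbitrary-length thread can work: re-fetching relevant history into a finite context window on every turn versus carrying a recurrent state forward and only ever processing the new message. None of the classes or methods here are the actual OpenAI Assistants API or a real SSM inference interface; they are stand-ins for illustration.

```python
# Schematic only: invented interfaces, not the OpenAI Assistants API
# and not a real SSM inference stack.

def retrieve(history, query, k=3):
    # Stand-in for semantic search over past turns: just take the last k turns.
    return history[-k:]

class ToyModel:
    def generate(self, prompt):            # finite-window call: pays for the whole prompt
        return f"reply to: {prompt[-1]}"
    def initial_state(self):
        return []                          # a real SSM state would be a fixed-size tensor
    def step(self, message, state):        # stateful call: pays for the new message only
        return f"reply to: {message}", state + [message]

class RetrievalThread:
    """Finite context window: relevant history is re-fetched and re-billed each turn."""
    def __init__(self, model):
        self.model, self.history = model, []
    def send(self, message):
        prompt = retrieve(self.history, message) + [message]
        reply = self.model.generate(prompt)
        self.history += [message, reply]
        return reply

class StatefulThread:
    """Recurrent state: the history lives in the state; only new tokens are processed."""
    def __init__(self, model):
        self.model, self.state = model, model.initial_state()
    def send(self, message):
        reply, self.state = self.model.step(message, self.state)
        return reply

m = ToyModel()
print(RetrievalThread(m).send("hello"), "|", StatefulThread(m).send("hello"))
```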
The Assistants API looks like something that was built a little bit more with this kind of architecture in mind, honestly, than with the Transformer; maybe you could think of it as a bridge from a pure Transformer structure to a future structure that they anticipate. Time will tell. Agents: we're behind. I've been thinking a little bit, coming to the end of the year, about what I have been right on and what I have been wrong on, and I definitely expected more progress on agents than we've gotten. One of the big reasons I think we haven't gotten as much as I expected is that GPT-4V has only recently become available. It was demoed in March, and we had expected at that time that it would be launched soon and that multimodal inputs would quickly become the norm. I definitely think that vision is super important for effectively navigating even just the digital world, even just doing stuff in a browser; the browser is meant to be interpreted visually, and only recently has that really become possible. So now we have the application development and agent work to be done on top of that. I think there is a big unlock there, and I think we're going to get to effective agents regardless, even if we do have to brute-force it through pure Transformers, between the vision and doing things like rewarding reasoning. OpenAI had a result earlier this year where they achieved state of the art on the MATH benchmark by giving a reward signal at every intermediate step of reasoning, as opposed to just the final result, and I think that is going to work for agents. If I had to guess what GPT-4.5 will look like, interpreting the tick-tock of the last generation: if GPT-3 was a certain scale, a certain kind of latent power, and then 3.5 was when they really applied the RLHF and made it behave, then GPT-4 is another, bigger level of power, but it was released before this paper about rewarding intermediate reasoning came out. So maybe GPT-4.5 is still the same raw power but with much more reliable reasoning; that's kind of what I expect from GPT-4.5. Context windows continue to go up as well, and we've also got new scaffolding: obviously we've got RAG, and we've also got skill libraries. There are a lot of reasons to think that agents will work, but this seems like a qualitatively different reason to think that agents will work: that with the long-term memory, with big enough states, with perhaps multiple states, there is a path to a level of coherence, robustness, legibility, and predictability that will make these agents both far more effective and also far more familiar in the way that they proceed.
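Going back to that process-supervision point for a moment, here is a minimal, hypothetical sketch of the scoring side of rewarding intermediate reasoning steps rather than only the final answer; the step-level reward model is a stub and the aggregation rule is just one plausible choice, not the method from OpenAI's paper.

```python
# Hypothetical process supervision: each reasoning step gets its own reward,
# rather than scoring only the final answer. The reward model is a stub.

def step_reward_model(step: str) -> float:
    # Stand-in for a learned model that rates a single reasoning step in [0, 1].
    return 0.0 if "guess" in step.lower() else 1.0

def score_solution(steps: list[str]) -> float:
    step_scores = [step_reward_model(s) for s in steps]
    # One plausible aggregation: a solution is only as good as its weakest step.
    return min(step_scores) if step_scores else 0.0

solution = [
    "Let x be the number of apples, so 3x + 2 = 11.",
    "Subtract 2 from both sides: 3x = 9.",
    "Divide by 3: x = 3.",
]
print(score_solution(solution))  # outcome supervision would only ever look at "x = 3"
```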
And finally, I think all of this does suggest that things are getting a bit out of control. You've heard me say that things are perhaps out of control many times; I would like to emphasize that I think right now we are hitting a moment where there may be a qualitative shift in just how out of control things are getting. On a totally different line of research, just in the last week or two, DeepMind put out a paper in Nature about a technique called FunSearch, searching the function space, where they used a frozen language model; that means the language model weights are not changing, they're not even fine-tuning it, but nevertheless they were able to use it to advance the state of the art on a number of problems, including a couple of mathematical problems that had been open for decades. This is pretty remarkable. The way they do it, they have to have problems where the solution is scorable. This relates to the P-versus-NP division, where certain kinds of problems are very hard to find the answer to but very easy to verify the answer to. A classic example would be factoring the product of two large prime numbers: if I give you some giant number, and even if I tell you, hey, this is the product of two large primes, you're going to be at it for a while before you can figure out what those primes are; cryptography is largely based on the extreme challenge of this problem. However, if you have the private key, if you have one of the prime numbers that was used, then you can just do the division and there it is. So it's very time-consuming to find the answer, but it's very easy to verify the answer, and this DeepMind paper is for problems like that. This almost bears its own episode, but the way they do it is to use the language model to generate ideas. They give it its best recent attempts, the functions that have scored the highest, and say: here's a function and here's how it scored, here's another function and here's how it scored; your job is to come up with a new function that will score even better. It took a lot of generations, but not that many; I think they reported about a million generations in this paper, and a million generations is, we're talking, low tens of thousands of dollars at retail API prices, at GPT-4 pricing, so not that much. With a little bit of additional clever structure to make sure they were trying novel strategies, and the fact that they could quickly score the results, they were able to advance the state of the art on multiple open math problems that, again, people have been working on and that have been open for decades. That paradigm seems to extend extremely naturally to exploring the architectural space in machine learning, particularly now that we have not one but two different fundamental blocks that are roughly equally expressive and have these likely quite complementary strengths and weaknesses: attention is really good at dense analysis, every token relates to every token, and we can see everything that's under consideration clearly in one view, versus the state space, where you have this long-term memory and the ability to learn from lots of examples but also have to let go of certain information over time. This architectural space is barely explored at all, and AI could probably explore it. GPT-4.5, or even GPT-4, right? The model they used in this DeepMind paper was Codey, a PaLM 2-generation model, so not even as good as GPT-4. You take GPT-4, you apply a similar structure, you say: here are the two blocks that you can really play with, that are the fundamental units. You can reorder them, you can remix them, you can have different sizes at different layers, you can have different interleaving patterns, you can have different kinds of skip connections, you can potentially even define new ways that kind of blend the blocks together. And each time you do that, we will actually do some training: we'll instantiate that model, run it through some training, and compare it to other models that were trained on exactly the same initial data. That can be pretty fast. In previous episodes we've done analysis of how big your cluster has to be to train GPT-4 in however many days; well, the biggest model here, the roughly three-billion-parameter model, is still something like five orders of magnitude less compute than GPT-4, and you don't even need that; maybe you can get by with one one-millionth, maybe you can even do seven orders of magnitude less.
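Here is a minimal sketch of that kind of propose-score-evolve loop, pointed at the architecture-search idea; the LLM proposer and the cheap proxy evaluator are stubs standing in for a frozen-model API call and a short training run, and nothing here reproduces DeepMind's actual FunSearch implementation.

```python
import random

# Hypothetical FunSearch-style loop over attention/SSM hybrid layouts.
# llm_propose() stands in for prompting a frozen LLM with the best layouts so far;
# evaluate() stands in for a short proxy training run that returns a score.

BLOCKS = ["attention", "ssm"]

def llm_propose(best_candidates):
    # Stand-in for: "here are the top-scoring layouts and their scores; propose a
    # new layout that should score even better."
    base = random.choice(best_candidates)[0] if best_candidates else []
    layout = list(base)
    if layout and random.random() < 0.5:
        layout[random.randrange(len(layout))] = random.choice(BLOCKS)  # mutate one block
    else:
        layout.append(random.choice(BLOCKS))                           # grow the layout
    return layout

def evaluate(layout):
    # Stand-in proxy score; here it just pretends interleaved hybrids do best.
    alternations = sum(a != b for a, b in zip(layout, layout[1:]))
    return alternations + 0.1 * len(layout)

pool = []                              # (layout, score) pairs
for generation in range(1000):         # the real thing reportedly ran on the order of a million
    best = sorted(pool, key=lambda p: -p[1])[:5]
    candidate = llm_propose(best)
    pool.append((candidate, evaluate(candidate)))

print(max(pool, key=lambda p: p[1]))
```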
If you see something where, hey, in early training it's beating the things that are currently our best, those are the things you want to look at. So, to take a language model and have it do a million different versions: now the score is not so easy to calculate for each of those million. If each test takes one one-millionth of GPT-4's compute to evaluate, then when you do a million of them, you've basically consumed all of GPT-4's compute, so this would still be a big undertaking, for a significant lab with significant compute resources. But might you expect that GPT-4, with a million attempts, could beat the state of the art? I would bet it could. And what about the possibility that, as the state of the art improves, it's not GPT-4 doing it but in fact some state space and attention hybrid doing it? I'm assuming that one of the biggest challenges in the DeepMind paper is that they need to avoid the same generations over and over again; if you have the language model do the same task a million times, you're going to have to take care to get it to vary what it's doing. But if it has a long-term memory of its own, then perhaps it can be incentivized to explore different kinds of spaces within the architectural space without having to be tortured so much into doing it. And again, you need the training data to create that kind of behavior, if that's indeed the kind of behavior you want to create. But it certainly seems far more possible than it ever has before that highly agentic, highly goal-oriented, highly on-task little specialized agents might soon be doing the machine learning architecture R&D, or at least a significant share of it, for us in an automated way. And if that is beginning to happen, then we're really starting to close some fundamental feedback loops that begin to look like some sort of takeoff, some sort of intelligence explosion. So, with a big caveat that maybe I have this all wrong, maybe it's just not going to work, maybe it just won't scale past the 300 billion (I don't know why that would be; the scaling laws so far have been pretty predictive, but yeah, maybe I have it wrong, maybe it just won't work), if I'm right, I think we are looking at not the end of Transformers but the end of the Transformer era: the end of the time when the same exact block, repeated over and over again, would give you the very best performance. Instead, we would be heading into the beginning of a new multi-architecture era, where we will likely have even faster recombining of these elements to create ever better and also ever more specialized architectures, where we will have proliferation and evolution so quick that it's going to become increasingly hard to analyze and make sense of. This is already pretty hard; I've spent the last two weeks trying to understand this from every angle and figure out what's going on, and there's a lot more to come. I think it's going to be very hard to keep up with this stuff, particularly if we start to see the loop close where the models can do the exploration of architectural space to build even better models. I think we're going to see all that, and more effective agents, more compelling long-term assistants, more compelling long-term AI friends and companions, all of it. If I had to guess, I would say it probably happens several times as fast as the Transformer era has already unfolded: from 2017 to here in late 2023, a six-year period from the Transformer being invented to today.
I would say this new architecture gets validated (assuming it does) and elaborated in a similar way in probably a third or maybe a quarter of the time, in part because so many more people have piled into the space, because the hardware has ramped up, because the datasets are there, because the benchmarks are there, because the models are increasingly able to help, all of these factors feeding into the same dynamic. The cycle is turning tighter and tighter. So if you thought that the cognitive revolution was going to give you any rest, I am sad to say that I don't think that is the case. If anything, it just seems like the intensity is turning up and up, the cycle time is getting shorter and shorter, and all I can say is buckle up. It is both energizing and enlightening to hear why people listen and learn what they value about the show, so please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice. Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use code COGREV to get a 10% discount.
Info
Channel: Cognitive Revolution "How AI Changes Everything"
Views: 9,025
Id: X5F2X4tF9iM
Length: 150min 6sec (9006 seconds)
Published: Fri Dec 22 2023