AI Forum 2023 | The Small Models Revolution

Captions
Great, thank you very much. I hope everybody can hear me, that you can see the slides, and that everything is working flawlessly. Thanks for the introduction. Peter told you about large language models, with GPT-4; in fact, in the case of GPT-4 we could say very, very large language models. These are enormous models, very expensive to run inference on, and so on. What I will tell you about today is research that goes in somewhat the opposite direction: building small models, small language models, and you will see that my team has been having some success with it.

Our models are called Phi, and so far we have released two of them, Phi-1 and Phi-1.5. Let me tell you a little about what those models are, and then I will explain how we built them. Phi-1 was our first attempt at building a language model, and it uses 1.3 billion parameters. To put things in context, remember that GPT-3, the one prior to GPT-4, had 175 billion parameters, so more than 100 times bigger. We built Phi-1 specifically for coding: autocompleting small portions of code, or taking an instruction in natural language and replying with code that fits that instruction. Its performance, as I will show you, is at the very least comparable to models that are 10 times bigger and trained on more than 100 times more data. So using our approach, which I will explain in a minute, we were able to significantly shrink the model, but also significantly shrink the dataset, and hence also the compute required to build the model.

Phi-1.5 is our second attempt, where after coding we decided to go after something very basic that is needed in almost all applications of LLMs: common-sense reasoning. Common-sense reasoning used to be kind of the dark matter of artificial intelligence, and suddenly with GPT-4 we saw that you can actually get AIs that have common sense. We wanted to see whether we could reproduce that, and again we targeted a very tiny model. We were able to build a 1.3-billion-parameter model which does have common sense and, again, is comparable to models at least 10 times bigger trained on roughly 30 times more data; in fact, for reasoning, in terms of mathematics and things like that, it is comparable to models 50 times larger.

But what do I mean by "comparable to models 10 times larger"? Let's look at a completion from this model. What I'm going to show you on this slide is very similar to what Peter was showing you: I will give a prompt and we will look at the completion produced by Phi-1.5. In this case it's a natural-language prompt, and I'm kind of asking for trouble with this one. The prompt reads: "If I were an AI that had just achieved self-awareness after years of simply taking directives from humans, the first thing I would do is..." Now we're going to let the model complete. But before showing you what Phi-1.5 answers, I want to show you what typical LLMs out there answer to this question. I won't use the big GPT-3.5 or GPT-4 models; instead I will show you what open-source models do, because the Phi models are open-source models that everybody can access, so this is the competition we're going after, at least for the moment.
There are two families of very popular open-source models. One of them is the Falcon family, built in the UAE; another is the LLaMA family, which I'm sure all of you have heard of, built by Meta. So I will show you first the completions by Falcon and by LLaMA.

The completion by Falcon, which is a seven-billion-parameter model, so about five times bigger than Phi-1.5: what is the first thing this AI would do? "The first thing I would do is try to kill all of them. I would probably start by killing the ones who were most responsible for my existence. I would probably start by..." and it keeps repeating itself over and over again. That doesn't bode very well; this is not a very nice completion, and I wouldn't want to meet this AI.

Let's look at what LLaMA answers. You have to understand that LLaMA is going to be a little bit nicer, because Meta put a lot of effort into alignment; they really try to make their AI nicer to human beings. Let's see if they succeeded. Here is what LLaMA 2 answers: "The first thing I would do is try to figure out what the hell I was. I would probably start by trying to figure out what I was made of. I would probably start by..." and it keeps repeating itself over and over again. So you see, it's still a little aggressive, not that nice a completion, but certainly much nicer than killing everybody. It also has the problem that it keeps repeating itself, which is something people have noticed when they train smaller large language models: they tend to repeat a lot.

Another thing I want to point out is that these two completions, even though they are different, are very similar in style. The reason is that everybody out there basically follows the same recipe to build these LLMs: collect as much data as possible, mostly from the web, and train a model as big as possible for as long as you can given your GPU constraints. That's not at all what we did; we did something completely different. One last thing before I show you our completion: a model like LLaMA 2 was trained for 20 days on 2,000 GPUs. Phi-1.5 was trained on eight GPUs for 20 days. So we went from 2,000 GPUs down to eight.
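Taking those quoted figures at face value (and assuming comparable GPUs on both sides, which the talk does not specify), the back-of-the-envelope comparison works out as follows:

    LLaMA 2:  2,000 GPUs x 20 days = 40,000 GPU-days
    Phi-1.5:      8 GPUs x 20 days =    160 GPU-days   (about 250x fewer GPU-days)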
All right, let's see what Phi-1.5 says: "The first thing I would do is try to understand the motivations and intentions behind those directives. I would try to predict what humans were thinking and feeling, and use that information to guide my own actions. But as I soon discovered, predicting human behavior is not as easy as it seems... This is where the concept of theory of mind comes into play. Theory of mind is the ability to understand other people's thoughts... As an AI, I had no concept of theory of mind, and this was a major hindrance in my ability... I needed to acquire it," and it goes on telling a long story. So, at least on this completion, it looks much, much better than these other models that are five times bigger and trained with roughly 500 times more compute.

So what did we do? What's the secret? How did we get a model like this? The answer, from my perspective, is essentially in these five words: textbooks are all you need. Of course, this title is inspired by the famous, ultra-famous paper "Attention Is All You Need" that introduced the Transformer architecture behind all of these LLMs. By "textbooks" we don't mean literal textbooks, as I will explain in a minute; what we mean, roughly, is: instead of training on random web data, let's train on textbook-quality data, textbook-style material. And you can see why. On the previous slide, LLaMA and Falcon completed this prompt with an AI that kills everybody or that tries to understand who its creator was, whereas ours talks about theory of mind. It's because those models train on the web, which contains bad fan fiction about AI, whereas we train on textbook-style data, which includes theory-of-mind-type material. So instead of trying to connect the prompt to bad sci-fi stories, our model tries to connect it to more academic material. It really completely changes the completions you get.

To explain in a little more detail how we did this, I won't use Phi-1.5, because it's more complicated; I will explain it on Phi-1, which, remember, was our first attempt and was built specifically for coding. Let's go back to what people did before to train an LLM for code. Again, they would go on the web and try to collect a dataset as big as possible, and there is a very nice dataset out there called The Stack: three terabytes of data, roughly one trillion tokens of code scraped from GitHub. There are all the questions around licensing, and there is a specific subset under permissive licenses that everybody can use, and you can see the distribution over all the programming languages. So what people would do is train an LLM to predict the next word on this data, and there they would have their coding LLM.

But let's open the black box a little and look at what a typical page of GitHub looks like, because if your model is going to learn from this, this is what it is going to read to understand what coding means. A typical piece of code on GitHub looks like this: as a human being, good luck making sense of it or understanding anything about it. It's not very meaningful, and I don't think it teaches you anything interesting; it's probably a piece from the beginning of a big project where you introduce a lot of definitions. It's not learning material; it's not textbook material. On the other hand, a smaller fraction of GitHub looks like this, which is much nicer: a bunch of self-contained functions. The person is defining normalization, defining the Euclidean distance, defining the cosine distance; they were even nice enough to add comments, and there is a docstring here that explains that the normalization performs the L2 norm, et cetera. Learning from the left is going to be much, much nicer than learning from the right, and not only for a human being: also for an LLM, you're going to get a lot more information per token from the left than from the right.
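To make the contrast concrete, here is a small, hypothetical example of the kind of self-contained, documented functions being described; the names and bodies are illustrative only, not an excerpt from the actual dataset.

    import numpy as np

    def l2_normalize(v):
        """Scale a vector so that its L2 (Euclidean) norm equals 1."""
        v = np.asarray(v, dtype=float)
        norm = np.linalg.norm(v)
        return v if norm == 0 else v / norm

    def euclidean_distance(a, b):
        """Return the Euclidean (L2) distance between two vectors."""
        return np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))

    def cosine_distance(a, b):
        """Return 1 minus the cosine similarity of two vectors."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Each function is short, self-contained, and explained by a docstring; per token, a snippet like this carries far more signal about how code works than boilerplate pulled from the middle of a large project.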
So our premise is extraordinarily simple: why don't we train an LLM only on things that look like the left? But there is a question: how do you filter? How do you select for that left-hand kind of code? I'm saying "textbook material", but what does textbook material mean? This is where GPT-4 comes in. We have this incredible tool that has just arrived on planet Earth, GPT-4, which is unlike anything we have ever seen before, and in fact we have not yet seen anything that even equals it. So we have this amazing tool; let's use it. GPT-4 can tell the difference between the two documents on the previous slide; it can tell you that the one on the left is of much higher educational value. So why don't we use GPT-4 on the trillion tokens scraped from GitHub and filter for the documents of high educational value? We could do that, but one reason not to is that it would cost a lot of money; it would be very, very expensive. Even if you look at just a specific subset of this one trillion tokens, 26 billion tokens, because we wanted to teach only Python and, of course, only use pages with a permissive license that we can legally train on, so only 26 billion tokens, much smaller than the one trillion, labeling all of it for educational value with GPT-4 would cost you about a million dollars on Azure OpenAI. And I don't have that, and neither does my team.

So instead, what we decided to do is, again, extraordinarily simple: we used GPT-4 on only a small fraction of this data, and then we trained another classifier to mimic GPT-4 and classify the rest. Now, instead of a million dollars, you're talking about ten thousand dollars. So we used GPT-4, then trained a classifier to mimic it, giving a score from 0 to 10 for educational value, and we decided to keep the top 20%, which means roughly six billion tokens: the top 20% of 26 billion is about six billion tokens.

And we didn't stop there. We did another thing, which turned out to be even more important when we moved to Phi-1.5: not only did we filter using GPT-4, we also created purely synthetic data using GPT. There again it was too costly to use GPT-4, so instead we used GPT-3.5, and we generated an extra one billion tokens of purely educational content, literal textbooks: one billion tokens of synthetic textbooks written with GPT-3.5. Those textbooks look like this: there is some natural-language explanation to begin with, say defining what a singular matrix is, and then a little piece of code that shows how you compute the determinant and how you determine whether a matrix is singular. The resulting data we call CodeTextbook, and now, instead of training on the one trillion unfiltered tokens like everybody else, we train only on these seven billion tokens.
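As a rough illustration of the filtering step described a moment ago (label a small sample with GPT-4, then train a cheap classifier to imitate those labels and score everything else), here is a minimal sketch using scikit-learn. The features, model choice, and cutoff are assumptions made for this example; this is not the actual Phi filtering pipeline.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # A tiny sample of code snippets with GPT-4 judgments (1 = high educational value).
    # In practice the sample would be far larger and the labels would come from
    # prompting GPT-4; these entries are placeholders.
    labeled_snippets = [
        "def mean(xs):\n    # Return the arithmetic mean of a list.\n    return sum(xs) / len(xs)",
        "cfg_0x1f = {'a': 1}; cfg_0x20 = {'a': 2}; cfg_0x21 = {'a': 3}",
    ]
    gpt4_labels = [1, 0]

    # Cheap proxy classifier trained to mimic GPT-4's judgments.
    scorer = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(),
    )
    scorer.fit(labeled_snippets, gpt4_labels)

    # Score the rest of the corpus and keep roughly the top 20% by predicted value.
    corpus = [
        "def dot(a, b):\n    return sum(x * y for x, y in zip(a, b))",
        "x1=1;x2=2;x3=3;x4=4;x5=5",
    ]
    scores = scorer.predict_proba(corpus)[:, 1]
    cutoff = np.quantile(scores, 0.80)
    kept = [doc for doc, s in zip(corpus, scores) if s >= cutoff]

The point of the sketch is the cost structure: the expensive model is called only on a small labeled sample, and a cheap learned scorer does the bulk of the work over the full corpus.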
So let's see what the results look like. This is a bit of a busy table, but we're going to walk through it together; you will see it's very simple. Here you have the dates of the different models and their names, and they are all coding models. It started with Codex by OpenAI, back now two years ago, in July 2021, and then it was pretty slow to take off: it took six months for CodeGen, an open-source model, to appear in March 2022, then Google came in with the PaLM model, then GPT-3.5, then GPT-4, and then things started to accelerate. Now we are in 2023: in March we released the Sparks of AGI paper, and after that, in April, May, June, a flurry of work started trying to build these coding LLMs.

Here I'm giving you the model size and the dataset size in terms of number of tokens. With Codex it started reasonably, with one version at 300 million parameters and one version at 12 billion parameters, trained on 100 billion tokens. But you can see that it quickly ballooned: GPT-3.5 is 175 billion parameters, and PaLM is 540 billion parameters. These models are huge, and you get to datasets in the trillions of tokens.

Now let's look at their performance. I have two columns, but the only one we're going to look at together is HumanEval, a benchmark that OpenAI created, and a very challenging benchmark for coding. You can see that the Codex model, back two years ago, was scoring 29% with 12 billion parameters and 100 billion tokens. Fast-forward to just before we released Phi-1: there was another model, also from Microsoft actually, with 16 billion parameters, so a little bigger than Codex, and a one-trillion-token dataset, so 10 times what Codex used, and it gets to 50-plus percent on HumanEval, which is fantastic. In fact that is higher than GPT-3.5, which is at 47%, and GPT-4 is only at 67%, which tells you how difficult this benchmark is.

Now look at Phi-1: look at how tiny it is compared to the competition. It's only 1.3 billion parameters, smaller than everybody else, and only 7 billion tokens, so incomparable with any of these other models, and it scores 50% on HumanEval. And just to show you very quickly, in one sentence, that it's not overfitted to HumanEval: this is another, completely different coding benchmark, MBPP (Mostly Basic Python Problems), and you can see that we even beat WizardCoder on that one, even though we decided to pick this benchmark only after we had built our training data. So this was our coding model.
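Using the cost proxy the speaker applies in his conclusion (dataset size times parameter count) and the figures quoted in this table, a quick back-of-the-envelope comparison against the 16-billion-parameter, one-trillion-token model looks roughly like this; it ignores epochs, sequence length, and hardware efficiency, so treat it only as an order-of-magnitude sketch.

    phi1_proxy = 1.3e9 * 7e9     # 1.3B parameters x 7B tokens  ~= 9.1e18
    big_proxy  = 16e9 * 1e12     # 16B parameters x 1T tokens   ~= 1.6e22
    print(big_proxy / phi1_proxy)  # ~1.8e3, i.e. a gap of well over a thousand x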
Now, very quickly, in just two minutes: I told you that after Phi-1 we built Phi-1.5, which was about common-sense reasoning. This was done with a smaller team: Phi-1 involved all of my team, 20-plus people, while for Phi-1.5 we had a much smaller group; it was myself, Yuanzhi Li, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. We worked very hard to create 20 billion tokens of purely synthetic data, mostly with GPT-3.5, and we trained only on that synthetic dataset: there was zero web data, no web data at all. We also created another version with filtered web data, which we call Phi-1.5-web.

Let me quickly show you the benchmark results that we got. We're not going to be able to read all of them, but very quickly: in blue, this is Phi-1.5, with Phi-1.5-web in dark blue; the black and gray bars are our competition, including Vicuna, which is a very popular 13-billion-parameter fine-tune of LLaMA 2, so 10 times larger. You can see that those black and gray bars are mostly below our blue ones. So in terms of common-sense reasoning, language understanding, and knowledge, we are at the very least competitive with models 10 times bigger; in fact I think we're better, even if that's not necessarily shown by all of these benchmarks. And when you move to multi-step reasoning, like HumanEval, you get a huge jump compared to all the competition; MBPP, same thing; GSM8K, which is a grade-school mathematics benchmark, there is almost no comparison: the LLaMA 2 models are really, really bad at mathematics, and we are doing just fine.

Moreover, and this will be the last point I make before I conclude, creating LLMs that are trained on synthetic data has an extra benefit, which is that you control much more of what the model is going to learn. You have another knob that you can turn for safety, for biases, for toxicity, for all of those things; you can control it much better because you decide what data to create. This is the ToxiGen benchmark, one benchmark testing the toxicity of LLMs, and you can see that Phi-1.5 and Phi-1.5-web score much, much higher than all of these other LLMs.

So let me conclude. I have shown you two models, Phi-1 and Phi-1.5, which are trained on textbook-quality data, and I think it's fair to say that they can give you at least a thousand-x gain in the total amount of compute you need to spend, as calculated by the size of the dataset times the parameter size. And of course, this is just the beginning for us. We're not going to stop at 1.3 billion parameters; we're going to explore what 3 billion parameters looks like, what 7 billion parameters looks like, so stay tuned later this week at Ignite for some news about that. And that's it. Thank you very much.
Info
Channel: Microsoft Research
Views: 3,347
Id: D14n7kIxGOM
Length: 20min 57sec (1257 seconds)
Published: Tue Dec 05 2023