Microsoft's New AI 'PHI-1' Just SURPRISED EVERYONE! (Now ANNOUNCED!)

Video Statistics and Information

Captions
So in the abstract of this paper, which is called "Textbooks Are All You Need", Microsoft state: we introduce phi-1, a new large language model for code, with significantly smaller size than competing models. phi-1 is a Transformer-based model with 1.3 billion parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web and synthetically generated textbooks and exercises with GPT-3.5. Despite this small scale, phi-1 attains a pass@1 accuracy of 50.6% on HumanEval and 55% on MBPP. What was crazy about the abstract is that they also state it displays surprising emergent properties compared to phi-1-base, the model before the fine-tuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350 million parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.

Looking at the paper it might seem quite confusing, but that's why we're here: I'm going to break this down for you in simple terms, and it's actually quite shocking. When we look at the table we're given, we see various different models and their sizes; on this table you can see GPT-4, GPT-3.5, and Google's PaLM 2. The first thing we want to look at for this specific task is, of course, the model size: phi-1's model size is 1.3 billion parameters, compared to GPT-4's "not applicable", which is allegedly around one trillion parameters based on various sources and rumors around the internet, and of course we know that GPT-3.5 has 175 billion parameters. When we look at the other models on this list, we can see that the only other models that come close are WizardCoder and GPT-4. But remember, this is a large language model with just 1.3 billion parameters, and as you watch this video, please understand that the reason this paper is so interesting is how they trained this model, the datasets they used, and the results they achieved with such a small large language model.

Compared to GPT-4's 67%, phi-1 achieved 50.6% on HumanEval. If you don't know what HumanEval is, it's a benchmark for evaluating the ability of large language models to generate code that solves programming problems; it consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics. So now that you know what HumanEval is, you can see why a model trained on only 7 billion tokens with 1.3 billion parameters achieving around 50% on HumanEval, which is reasonably close to GPT-4, is definitely quite incredible.
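To make the HumanEval numbers above a bit more concrete, here is a minimal sketch of how a pass@1-style check on a HumanEval-like problem can work: the model is given a function signature plus a docstring and must complete the body, and the completion counts as a pass only if it satisfies hidden unit tests. The prompt, the candidate completion, and the tests below are hypothetical illustrations, not the actual benchmark or its official harness.

# Minimal sketch of a pass@1-style check on a HumanEval-like problem.
# The prompt, completion, and tests are illustrative, not the real harness.

# A HumanEval-style problem: a function signature plus a docstring prompt.
prompt = '''
def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer
    to each other than the given threshold."""
'''

# Pretend this string came back from the model being evaluated.
candidate_completion = '''
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False
'''

# Hidden unit tests that decide whether the completion "passes".
def run_tests(namespace):
    f = namespace["has_close_elements"]
    assert f([1.0, 2.0, 3.0], 0.5) is False
    assert f([1.0, 2.8, 3.0], 0.3) is True

def passes(prompt_text, completion):
    namespace = {}
    try:
        exec(prompt_text + completion, namespace)  # build the function
        run_tests(namespace)                       # run the hidden tests
        return True
    except Exception:
        return False

# pass@1 over the full benchmark is just the fraction of its 164 problems
# where the single sampled completion passes all of its tests.
print(passes(prompt, candidate_completion))  # True for this toy example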
And what's even more incredible is what they trained it on. You see, the paper is called "Textbooks Are All You Need", and it says: roughly speaking, we pre-train on textbook-quality data, both synthetically generated with GPT-3.5 and filtered from web sources, and we fine-tune on textbook-exercise-like data. What's also interesting is that in the introduction they say: moreover, despite being trained on much fewer tokens compared to existing models, phi-1 still displays emergent properties, and in Section 3 we discuss these emergent properties; in particular, we confirm the hypothesis that the number of parameters plays a key role in emergence. When you put this paper in context, I think we start to realize something from previous research, where they talked about how, as we increase parameter size, certain qualities and abilities start to emerge regardless of the large language model and regardless of your training data. This is something we did talk about in our previous video, where we said GPT-5 is likely to be the most dangerous AI model in both aspects, because, as you know, we cannot predict what emergent abilities will appear; emergent abilities are just that, emergent, so we don't know what these capabilities will be or when they will arise. This is where "Textbooks Are All You Need" really comes into play.

When you look at these images, what we are seeing are three different bar charts, and what they reference is every piece of data the model was trained on. You can see from left to right that they trained it on increasingly higher-quality data, and as the quality of the data increases, so does the pass@1 accuracy on HumanEval. You can literally see that, just based on your training data, you can increase the effectiveness of your large language model by a significant amount, going from 17% on HumanEval all the way up to 51%. That goes to show how important your training data is, and how important high-quality data is, which is why they state that textbooks are all you need. And what's crazy is that this is a very small large language model, which goes to show that if we have large language models trained on incredibly high-quality datasets, these models are going to get much smaller and much more efficient.

So essentially, what they wanted to do is differ from how other large language models are trained. They state that web-based datasets such as Stack Overflow and code contests are not optimal for teaching the model how to reason and plan algorithmically; on the other hand, their model architecture and training methods are fairly conventional. Here's where they talk about the giant error everyone was making with their large language models: when training other large language models, many samples are not self-contained, meaning they depend on other modules or files that are external to the snippet, making them hard to understand without additional context; typical examples do not involve any meaningful computation, but rather consist of trivial or boilerplate code, such as defining constants, setting parameters, or configuring GUI elements; and the examples are skewed towards certain topics or use cases, resulting in an unbalanced distribution of coding concepts and skills across the dataset. They essentially state that we can only imagine how frustrating and inefficient it would be for a human learner to try to acquire coding skills from those datasets.

Therefore, they decided to train it on three key datasets: a filtered code-language dataset, a synthetic textbook dataset consisting of less than 1 billion tokens generated with GPT-3.5, and a small synthetic exercises dataset consisting of about 180 million tokens of Python exercises and solutions. You can see right here that exercises and solutions are some of the best training data we know large language models want; if you remember how large language models trained on data phrased like "explain like I'm five" recently performed marginally better, it's no surprise that this is also performing better. All in all, the above datasets contain less than 7 billion tokens.
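To give a feel for the "exercises and solutions" format being described, here is a hypothetical training sample in that style: a short, self-contained Python exercise stated as a docstring, immediately followed by its solution. This is only an illustration of the format, not an actual sample from phi-1's dataset.

# Hypothetical "exercise and solution" style sample, in the spirit of the
# CodeExercises data described in the paper; not taken from the real dataset.

def count_vowels(text: str) -> int:
    """
    Exercise: given a string, return how many of its characters are vowels
    (a, e, i, o, u), ignoring case.
    """
    # Solution: normalize to lowercase and count membership in a vowel set.
    vowels = set("aeiou")
    return sum(1 for ch in text.lower() if ch in vowels)

# The sample is self-contained, involves a real (if tiny) computation, and
# exercises one clear concept: exactly the properties the paper argues
# web-scraped code often lacks.
assert count_vowels("Textbooks Are All You Need") == 10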
Now, what I did find interesting, and I'm not going to say they made a mistake doing this, because I'm not as qualified as some of these AI researchers, but what didn't make sense to me, is that the dataset consisted of less than 1 billion tokens of GPT-3.5-generated Python textbooks, synthesized to provide a high-quality source of natural-language-heavy text interleaved with relevant code snippets. I'm wondering why they didn't use GPT-4; I'm assuming the cost of GPT-4 was too high and that they wanted to produce this as quickly and efficiently as possible, because if you've ever compared GPT-4's code with GPT-3.5's, you'll notice there is a stark difference in functionality, namely that GPT-4 produces code that works far more reliably, while GPT-3.5 definitely needs plenty of revisions.

This is where they talk about emergent capabilities. Figure 2.1 showed that the largest improvement on HumanEval resulted from fine-tuning on the small CodeExercises dataset, and CodeExercises consists exclusively of short Python tasks using only basic Python libraries; yet they demonstrate that, quite remarkably, the model after fine-tuning also exhibits a substantial improvement in executing tasks that are not featured in the fine-tuning dataset. So once again we're seeing AI models increase their ability to perform tasks they haven't seen or been trained on before, which is, as they say, some sparks of emergent abilities.
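To illustrate the kind of fine-tuning stage being described here, below is a minimal sketch of supervised fine-tuning of a small causal language model on exercise-style text using the Hugging Face transformers and datasets libraries. The base model ("gpt2"), the toy samples, and the hyperparameters are placeholders for illustration; this is not the actual phi-1 training recipe.

# Minimal sketch of fine-tuning a small causal LM on exercise-style text.
# Model, data, and hyperparameters are placeholders, not phi-1's real setup.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Toy samples standing in for a CodeExercises-style fine-tuning set.
samples = [
    'def add(a, b):\n    """Exercise: return the sum of a and b."""\n    return a + b\n',
    'def is_even(n):\n    """Exercise: return True if n is even."""\n    return n % 2 == 0\n',
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder base model
tokenizer.pad_token = tokenizer.eos_token           # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = Dataset.from_dict({"text": samples}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="exercise_finetune_sketch",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=dataset,
    # Causal language modeling: predict the next token, no masking.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # one short pass over the toy exercises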
So in the conclusion they state: just as a comprehensive, well-crafted textbook can provide a student with the necessary knowledge to master a new subject, our work demonstrates the remarkable impact of high-quality data in honing a language model's proficiency in code-generation tasks. By crafting textbook-quality data, we were able to train a model that surpasses almost all open-source models on coding benchmarks such as HumanEval and MBPP, despite being 10 times smaller in model size and 100 times smaller in dataset size. We hypothesize that such high-quality data dramatically improves the learning efficiency of language models for code, as it provides clear, self-contained, instructive, and balanced examples of coding concepts and skills.

Now, they do also talk about some limitations of the large language model, because, as you know, there are always some, and one that I touched on earlier is that they note: we also believe significant gains could be achieved by using GPT-4 to generate the synthetic data instead of GPT-3.5, as we noticed the GPT-3.5 data has a high error rate. It is interesting, though, that phi-1 is able to achieve such high coding proficiency despite those errors, and a similar phenomenon was observed where a language model can be trained on data with a 100% error rate and still generate correct answers at test time. If you ask me, that is absolutely insane: imagine training someone on completely wrong answers, then giving them a test, and they get the answers right; maybe not all the answers, but being able to get some of them right is still pretty incredible.

Now you might be wondering, what does this mean for the rest of the large language models? Well, when we look at how future language models like GPT-5 and Google's Gemini are going to be trained, I do think large language models are going to have fewer and fewer parameters, because Microsoft was able to create a large language model with only 1.3 billion parameters that's just as good as some of those with 16, 20, or 30 billion parameters. This now means that if we do scale a model to 30 billion parameters, but train it on high-quality, textbook-quality data alone, those language models are likely to far surpass the ones we currently have. Right now we are seeing a major shift in how these large language models are going to be trained, and it's clear that future language models with less than a billion parameters, or maybe around 10 billion, are going to be very efficient at nearly every task. The only problem these large language model creators have is, of course, the training data; that is something that is hard to acquire and will take time, but with tools like GPT-4 we understand that we are going to be getting this training data very soon, and these large language models will be released very quickly and efficiently.
Info
Channel: TheAIGRID
Views: 22,232
Id: H_bLpa9oAJ8
Length: 11min 56sec (716 seconds)
Published: Thu Jun 29 2023