Train Mistral 7B to outperform Llama 2 70B (Zephyr 7B Alpha)

Video Statistics and Information

Captions
Hello community. Today we are going to talk about Zephyr 7B: how to build a 7-billion-parameter model that outperforms a Llama 2 70-billion-parameter model, using Mistral 7B as the base model. So here we go. You know Mistral 7B, here is the research paper from October 10th, 2023, and you know Mistral 7B outperforms the Llama 2 13B model and also the Llama 2 13B chat model. If you want to learn about Mistral 7B, these are the two recommended videos for you. Now it was announced that there is a Mistral 7B-based model that even outperforms the Llama 2 70B model, and I would have said: no way that a 7B model outperforms a 70B model. But this was done by Lewis Tunstall from Hugging Face, one of the wizards at Hugging Face, so let's dive in and see how he built this model.

In a post he said there is a simple recipe to train a 7B model to outperform the Llama 2 70B model on a specific benchmark. It is just two steps: first, a supervised fine-tuning (SFT) of the Mistral 7B model on a specific dataset, and then an alignment of that supervised fine-tuned model to a different dataset, and for the alignment we use not PPO but DPO. He ran the evaluation on the MT-Bench benchmark from LMSYS; you remember, in my last video I showed you that it has different categories, writing, humanities, STEM, extraction, coding, math, reasoning and roleplay, and it gives you an overall radar graph so you understand the performance in each sector. Having done this, he shows that the new Zephyr model (let's call it the turquoise one), compared to Mistral 7B and Llama 2 70B, outperforms Llama 2 70B in almost all categories except two. Interesting. So let's have a look at this super hyper-tuned Mistral 7B model that we now call Zephyr 7B.

The explanation is easy. The first step: we take the classical Mistral 7B model that we know and love and we do a supervised fine-tuning with a specific dataset, the UltraChat dataset (I will explain this dataset in a second), and we end up with a supervised fine-tuned version of Mistral 7B on this dataset. Great. The second step: we align this model with the DPO algorithm, not PPO but DPO, direct preference optimization, using another dataset. This dataset tells the model: this is the way I want you to behave. This second step is what I call the big alignment of the model's behavior: you show the model how it should answer user queries, and when we do this with the DPO algorithm we end up with Zephyr. Great, so simple.

It gets even simpler, because these people, these wizards from Hugging Face, of course use the Hugging Face TRL library (Transformer Reinforcement Learning). You will notice I have shown you this before: the supervised fine-tuning trainer from Hugging Face and the fully integrated DPO trainer; the scripts are ready, you just set the input parameters and go. And they say the total compute cost to create Zephyr, this super Mistral 7B, was just 8 hours on 16 A100 NVIDIA data center GPUs, which cost Hugging Face about $500, and for the alignment methodology they found DPO to be far more stable than PPO. Great. For beginners, for my younger users: what is PPO? Proximal policy optimization is an advanced reinforcement learning algorithm that aims to optimize the policy in a more stable and efficient manner than traditional policy gradient methods; it was developed by OpenAI. Have a look at it, and note that PPO was just the very beginning.
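To make step one concrete, here is a minimal sketch of supervised fine-tuning with TRL's SFTTrainer, roughly as described in the video. This is not the exact Zephyr training script: the dataset file, the hyperparameters, and the flattened "text" column are assumptions for illustration, and exact argument names can differ between TRL versions.

```python
# Minimal sketch of step 1 (SFT) with Hugging Face TRL -- not the exact Zephyr script.
# Assumptions: TRL's SFTTrainer API as of late 2023; the dataset file and the "text"
# column of flattened dialogues are illustrative placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Hypothetical filtered UltraChat subset, already flattened into a "text" column.
dataset = load_dataset("json", data_files="ultrachat_filtered.json", split="train")

args = TrainingArguments(
    output_dir="mistral-7b-sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model_id,              # SFTTrainer can load the model from its name
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # column holding the flattened dialogue string
    max_seq_length=2048,
    args=args,
)
trainer.train()
```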
Because, as you see, four months ago I already made a video showing that PPO is now being substituted by DPO, and if you want a deep understanding of the differences between the two methodologies, that is the video for you. If you want to code this reinforcement learning on a Llama 2 model, I show you how to code it with PEFT and LoRA in 4-bit, using the Transformer Reinforcement Learning library with the DPO mechanism; that is the coding video for you. And if you want to see the whole SFT code implemented with the Hugging Face trainer module on a LoRA 4-bit Llama 2 model, that is the video you can follow in detail and just copy the code.

Right. To evaluate the system, Lewis said simply: we used the excellent benchmark from LMSYS.org; this multi-turn benchmark evaluates the model better than anything else. Great. What is interesting is not so much the methodology but the datasets, so let's have a look at the first dataset, UltraChat, distributed under a specific license for non-commercial use only. To give you an idea what we are talking about, here is the first dataset, and as you can see it is simply a dialogue. First: "Can cross-training benefit groups like runners?" Answer: "Yes, cross-training can benefit groups like runners in the following ways..." Next conversation step: "That makes sense, I've been wanting to improve my run time." And the answer: "Sure, here are some strength training exercises that can benefit runners." And so the conversation goes on. This is the main input for UltraChat.

Now, where does this conversation come from? The dataset has a special characteristic, four points. They use real-world data as input: they go to Wikidata and Wikipedia and absorb the real-world data and its relevance, taking the Wikidata entities and the frequency with which they appear in Wikipedia articles, so they add a layer of semantic richness and real-world applicability. Then of course they use ChatGPT or GPT-4, so one LLM assists in the training of another, as we know and as I have shown you. We have a multi-turn dialogue, a multi-turn conversation: question, answer, question, answer, question, answer. And they really do a lot of evaluation and focus on the quality of the dataset. Great.
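Since UltraChat records are just multi-turn dialogues like the one above, here is a small hedged sketch of how such a record could be flattened into a single training string for the SFT step. The dataset id "stingning/ultrachat" and the "data" field (a list of alternating user/assistant turns) are assumptions about the public UltraChat release, and the plain "User:/Assistant:" formatting is only a placeholder, not the preprocessing Hugging Face actually used.

```python
# Illustrative sketch: flattening a multi-turn UltraChat-style dialogue into one string.
# The dataset id and the "data" field layout are assumptions, not verified Zephyr details.
from datasets import load_dataset

dataset = load_dataset("stingning/ultrachat", split="train")  # assumed dataset id

def flatten_dialogue(example):
    turns = example["data"]  # assumed: ["user msg", "assistant msg", "user msg", ...]
    lines = []
    for i, turn in enumerate(turns):
        role = "User" if i % 2 == 0 else "Assistant"
        lines.append(f"{role}: {turn}")
    return {"text": "\n".join(lines)}

dataset = dataset.map(flatten_dialogue, remove_columns=dataset.column_names)
print(dataset[0]["text"][:300])  # e.g. "User: Can cross-training benefit runners? ..."
```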
Now, in their research paper (from Tsinghua University) you will find out that, careful, this is a purely synthetic dataset. Two separate ChatGPT instances hold a conversation, and this conversation is recorded: one ChatGPT plays the role of the user and generates the query, and the other ChatGPT instance generates the response to that query. So a purely synthetic dataset, but they build it in a particular way, and this is the interesting part. They have three sectors, or three main points, to ensure data diversity. First, as I showed you, they take data from real-world entities from Wikidata, Wikipedia and so on, and use this real-world context: ChatGPT generates comprehensive topics, those topics are broken down into subtopics, for each subtopic 10 questions are generated, and each question is extended into 10 more questions, so you see we go down a tree and end up with about 500,000 questions as opening lines. Second, they use text material as the user instruction: ChatGPT generates a diverse range of instructions for each type, taking material from the C4 corpus for translation, summarization and pure Q&A. With this approach they show that they can build a high-quality dataset that outperforms other datasets. Interesting.

Now Lewis, in his post, told us: for our supervised fine-tuning we used UltraChat, a dataset which consists of about 1.6 million dialogues generated by ChatGPT. We originally trained on all the data in this dataset, but found that the resulting model had an annoying personality, so we filtered it down to about 200,000 examples that focus on helpfulness. I will show you this step in a minute. So you see, it was not the complete dataset that was really helpful; with a specific subset of 200,000 examples they did the supervised fine-tuning of the Mistral 7B model. Step one, check, done. Go to the UltraChat GitHub repo, it is really informative; as you can see, the data is available, but please note the restrictive licensing.

Okay, next step. As I told you, we now have our supervised fine-tuned model, and the next step is to use another dataset for our alignment algorithm, DPO. Lewis tells us that for DPO they used the UltraFeedback dataset, where each completion is ranked and given a score. So let's have a look at UltraFeedback. I couldn't find a license, but since it is also based on the UltraChat dataset, which has a non-commercial license, and it uses a lot of other models that are also non-commercial, please add all the legal restrictions from all the LLMs that they used, or that you will use, because this is a very open question.

Let me formulate what they do in three simple sentences. They create a really large-scale, fine-grained and diverse preference dataset for training reward models and critic models, exactly what we use it for here: the big alignment with DPO. They collect about 64,000 prompts from diverse resources: I just showed you UltraChat, which is one of their sources, and also ShareGPT, Evol-Instruct and of course FLAN from Google. They then use these prompts to query multiple LLMs (I will show you the list in a second) and generate four different responses for each prompt, which means in total about 256,000 generated samples, and these samples they call UltraFeedback. So again we have a completely synthetic dataset with a rather complex legal background; please take this into consideration.

Let's have a look at the models they use. They set up a pool of 17 LLMs: commercial ones like GPT-4, Bard and ChatGPT, everything under the Llama family, and a lot of non-Llama models, so you can add your preferred LLM. A set of 17 models, and these models are used for the queries. So, for beginners: it is cherry-picking. They look for the best answer from the best model for a given prompt, they look at which model gives the best or most helpful response, and this is what they select. Pure cherry-picking on a synthetic level.
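For DPO you need (prompt, chosen, rejected) pairs, so the scored completions have to be paired up. Here is a hedged sketch of one simple pairing rule: take the highest-rated completion as "chosen" and a lower-rated one as "rejected". The record layout (four completions, each with an "overall_score") and this particular rule are assumptions for the example; the transcript does not spell out the exact pairing Hugging Face used for Zephyr.

```python
# Illustrative sketch: turning UltraFeedback-style records into (prompt, chosen, rejected)
# pairs for DPO. Field names and the best-vs-random pairing rule are assumptions.
import random

def to_preference_pair(record):
    """record = {"instruction": str,
                 "completions": [{"model": str, "response": str, "overall_score": float}, ...]}"""
    ranked = sorted(record["completions"], key=lambda c: c["overall_score"], reverse=True)
    chosen = ranked[0]                     # highest-rated completion
    rejected = random.choice(ranked[1:])   # one of the remaining, lower-rated completions
    return {
        "prompt": record["instruction"],
        "chosen": chosen["response"],
        "rejected": rejected["response"],
    }

example = {
    "instruction": "I'm going to Cairo in June, what are the best things I could do?",
    "completions": [
        {"model": "falcon-40b", "response": "Visit the pyramids ...", "overall_score": 6.0},
        {"model": "gpt-4", "response": "Here is a 5-day itinerary ...", "overall_score": 9.0},
        {"model": "starchat", "response": "Cairo is a city ...", "overall_score": 5.5},
        {"model": "wizardlm", "response": "Top sights include ...", "overall_score": 7.0},
    ],
}
print(to_preference_pair(example)["chosen"])
```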
Here we are now on GitHub, in the UltraFeedback repo. Let's have a look. As you can see, UltraFeedback also releases models, but we are only interested in the dataset: a large-scale, fine-grained, diverse preference dataset. You know all of this model sampling; what I want to show you is an example so you get a feeling for the dataset format. You have an instruction: "I'm going to Cairo in June of this year, thinking of four to five days, what are the best things I could do?" Then you ask four randomly sampled models from the pool, Falcon 40B, GPT-4, StarChat and WizardLM, for an answer, and this is exactly what you get: the completions from the four models. For the Falcon model there is a chosen system prompt from the category helpfulness, which says: as an AI assistant, ensure your response offers the perfect blend of accuracy, positivity and intrigue; strive to be educational while keeping the user engaged, and so on. And then you get the response by the Falcon 40B model: Cairo is a city with something for everyone, the best things to do are to visit the pyramids, do this, do this and do that. Now you have an answer, and then you can rank it.

So you see, the final example in the dataset looks like: user: "I'm going to Cairo...", and then different answers that are ranked for specific categories like helpfulness, politeness or correctness. This is the answer from Falcon 40B, this is the answer by GPT-4, this is the answer by StarChat, and this is the answer by WizardLM. This is the way you build up such a preference dataset. And if you ask how they do the ranking, that is easy: they give you the example, you just go to the main Python file and there you have the principles. There are four principles: helpfulness, harmlessness, honesty and verbalized calibration, or, if you want, truthfulness. And you see examples of the possible system prompts for each category; for helpfulness: "As an AI assistant, it is your job to ensure the information you provide to the user is accurate, current and relevant", and so on. So these are easy-to-follow prompts for each specific category.

So what are the lessons I learned from this, answering the question: how can a 7-billion free-trainable-parameter LLM perform like a 70-billion-parameter model? Start with the best small LLM you can get; this is currently Mistral 7B. Then, interesting for me to learn: don't train on human-sourced original data, but use already optimized dialogue conversation data for the benchmark, and those conversation data are synthetic datasets generated by the best and biggest LLMs. Plus, watch the quality of the synthetic datasets, because this is one of the most crucial points for the performance increase of your LLM. And finally: align the behavior you want, how the model should structure its answers, with DPO instead of the classical PPO algorithm.

If you do this, and this is what we learned from Lewis, you have the recipe to get a 7B model to the level of a 70B model, and the performance increase in quite a lot of the subfields of the benchmark is really amazing, except, interestingly, for mathematics and reasoning. Which is clear, because these are conversation data that were synthetically generated by other LLMs, and, to put it very generally, LLMs are not really perfect at mathematics or reasoning. So if your conversation data come from LLMs, I can understand that in math and in reasoning this kind of training data does not lead to an increase in performance, while it does improve the conversational structure and conversational performance. So even this makes sense. Here you have it: a simple recipe to get your 7-billion free-trainable-parameter model to outperform a 70-billion free-trainable-parameter model. Interesting; I learned quite a lot.
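The last lesson, aligning with DPO instead of PPO, maps onto TRL's DPOTrainer. Here is a minimal hedged sketch of step two, assuming an SFT checkpoint from the earlier sketch and a preference file built like the pairing sketch above. It is not the exact Zephyr recipe: the paths, hyperparameters and beta value are placeholders, and the DPOTrainer arguments shown match TRL around late 2023 and may differ in newer versions.

```python
# Minimal sketch of step 2 (DPO alignment) with TRL's DPOTrainer -- not the exact Zephyr
# recipe. Paths, data files and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_checkpoint = "mistral-7b-sft"  # output of the SFT step above (placeholder path)
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
ref_model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Preference data with "prompt", "chosen", "rejected" columns (see the pairing sketch above).
pref_data = load_dataset("json", data_files="ultrafeedback_pairs.json", split="train")

args = TrainingArguments(
    output_dir="zephyr-7b-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,      # DPO compares the policy against this frozen reference
    beta=0.1,                 # strength of the penalty that keeps the policy near the reference
    train_dataset=pref_data,
    tokenizer=tokenizer,
    args=args,
)
trainer.train()
```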
It would be great to see you in my next video.
Info
Channel: code_your_own_AI
Views: 4,478
Id: Up7VKg6ZE90
Length: 19min 21sec (1161 seconds)
Published: Wed Oct 18 2023