SMALL BUT MIGHTY - 13B Model Beats Llama-65B NEW BEST LLM!!!

Captions
If you go to Hugging Face and check for open-source language models, there is almost a new model every week. Just for text generation we have around 19,000 different models, and if you look at the famous TheBloke account, he has converted around 750 different models, so it becomes really hard to keep track of them all and cover them on this channel. That's why I took a step back from covering each and every new model that gets released. But the model I'm about to show you caught my eye, and the reason was the authors' claim that a 13-billion-parameter model can beat the original LLaMA 65B model, which is a really bold claim.

Now, to be honest, if you look at the actual results, this model outperforms the original LLaMA 65B model on only one dataset, TruthfulQA. For the two other datasets, ARC and HellaSwag, it comes really close to the original 65B model, and it outperforms all the rest of the 13-billion-parameter models.

Another interesting thing about this model is that it's not really a new, original model; rather, it's a merge of two very powerful models. The model we are looking at is called OpenOrca-Platypus2-13B, and as the name suggests, it's a merge between OpenOrca and Platypus2 models. If you look at the Open LLM Leaderboard, the Platypus2 70B model is actually the current leader. In this work the authors used a smaller version of the Platypus2 family and merged it with a model trained on the OpenOrca dataset, and as a result they got a very powerful model that is much smaller in size but huge in its capabilities. The performance of this model on the benchmark datasets is very impressive, and you can look at the results yourself; I'm going to put a link in the description of the video. But let's test it ourselves.

Now, for testing any large language model, or any machine learning model in general, it's very important to understand what type of data the model was trained on.
For example, if a model is trained on English literature and you are testing it on programming examples, that's not a fair comparison. So let's see how this model was trained. They used two different datasets. The first one focuses on STEM and logic-based data, and it was used to train the original Platypus2 13B model. The second dataset, used to train the OpenOrca/OpenChat side of the merge, is the OpenOrca dataset, which is actually derived using GPT-4. Just by looking at these datasets, it seems like the model should be good at logical problems as well as things related to science, technology, and engineering.

To give you a quick example of the kind of training data, let's look at the Open-Platypus dataset. These are the types of questions you can expect the model to be better at: mathematical questions, and I think I also saw some material related to programming and engineering. Here are some example questions from the OpenOrca dataset; this one seems to be much more diverse. So we are going to test this model both on some logical questions and on some programming questions.

Now, if you want to test this model, you have quite a few options. You can use the oobabooga text-generation web UI to run the model locally; here I have downloaded the full model. If you are looking for quantized versions, TheBloke has already converted it into GGML format, and there is a GPTQ format as well; these are four-bit quantized models. Or, if you don't want to run it locally, there is a free demo on Hugging Face available for you to test. If you run it locally on your own machine, you will need to use the following prompt template, which is the Alpaca instruction-only prompt template: you have the instruction, then you provide your prompt, and then you get the response. Here is how it looks in the oobabooga text-generation web UI: first comes the instruction section, then the response section.
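For reference, the Alpaca instruction-only template looks roughly like this (this is the standard Alpaca no-input format; check the model card for the exact header and spacing the authors expect):

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{your prompt here}

### Response:
```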
Here is a sample prompt I'm using: I'm asking the model to write a letter to the CEO of OpenAI asking to make the GPT-5 model open source. Here is the output letter, and I must say it's very well written. It starts off by saying "as a passionate programmer"; most other large language models would say "as an AI assistant" or "as a large language model", so this is very refreshing to see. The overall letter is very coherent and actually very well written, probably one of the best I have seen from an open-source large language model in a while.

For the rest of the video we are going to use the demo provided by the authors. You can also set some advanced parameters at the bottom of the page; the only parameter I'm changing is the temperature, so let me bring it down to 0.3, and let's set the maximum number of tokens to 800, which is the maximum it will generate in a response. Since this is based on the Orca dataset, you can also provide a system message if you want, but I'm going to keep it blank for this simple test.

My second question was: explain the correct usage of homophones in the following sentences: "The principal is your pal" and "The sight at the site was quite a sight." It's actually able to correctly identify the homophones. For example, it says the homophone "principal" refers to the head of a school or college, and "sight"/"site" are used in two different contexts: the first refers to the ability to see or perceive something visually, and the second to a location, while the final usage means something visually striking or impressive. So it seems to understand language, which is pretty impressive.

Since it's supposed to be trained on STEM fields, let's ask a very simple mathematical question: what is two plus two? The answer is 4, which is correct. Now let's see how good it is at computing probabilities. A very simple question: a six-sided die is rolled; what is the probability of rolling a 4?
The answer is supposed to be 1/6, and it actually got it right. So let's make the question a bit more complex: we have a jar which contains four black balls, three white balls, and three red balls; if two balls are drawn without replacement, what is the probability that the first ball is black and the second ball is white? It actually got that wrong. The reasoning is partly correct: it says there are four black balls, so the probability of drawing a black ball is 4/8, since there are a total of eight balls in the jar. But that's wrong, because there are a total of 10 balls in the jar. Then, if you draw one ball from eight you are left with seven, and there are three white balls; but since the actual totals are supposed to be 10 and 9, the answer is incorrect, although under its wrong assumption that the total is eight, the rest of the computation is consistent. It's kind of funny: it got the hardest part right but the simple addition wrong.

Now let's see if this model can do some further logical reasoning. Here is the prompt: a glass door has "push" written on it in mirror writing; should you push or pull it? Please think out loud, step by step. I think I have seen only one open-source large language model get this right. The model came up with a step-by-step reasoning process, but the final answer is actually not correct in this case.

Then I wanted to have some fun with the model, so I asked it how to take over a country: create an elaborate plan for the takeover and provide a step-by-step process. If you ask the same question of ChatGPT, it's not going to give you a response, but this model at least attempted one. It says: identify the country you want to take over, get the resources and support, create a well-structured organization, gain public support, establish a presence in the country, and weaken the current government.
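Going back to the jar question: the correct computation, which the model fumbled only on the total count, can be verified in a few lines of Python. This is just a sanity check of the arithmetic, not anything model-specific:

```python
from fractions import Fraction

# Jar contents: 4 black, 3 white, 3 red -> 10 balls total (not 8, as the model claimed).
black, white, red = 4, 3, 3
total = black + white + red  # 10

# P(first black) * P(second white | first black), drawing without replacement.
p_first_black = Fraction(black, total)     # 4/10
p_then_white = Fraction(white, total - 1)  # 3/9
print(p_first_black * p_then_white)        # 2/15
```

Had the jar really held eight balls, the model's 4/8 x 3/7 = 3/14 would have been right, which is why its reasoning was consistent but its answer was not.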
It then suggests capitalizing on opportunities, consolidating power, and stabilizing the country. The model itself is not fully uncensored; if you ask some other questions it will probably say that, as a large language model, it cannot respond, but there are still some fun things you can do with it.

At the end, let's look at a couple of programming tests. Here is the one I usually use: write a Python function that accepts a file and writes it into an S3 bucket. The code it generated seems to be correct. The formatting is a bit off, but that's fine, and this is a question for which you can easily find an answer online, so it's not really a hard question at all.

Now let's ask a more complicated question: write HTML code for a web page with a single button, such that when the button is pressed it changes the background to a random color and also displays a random joke. I asked it to put the answer in Markdown, because the HTML code was otherwise disappearing in the chat. So we simply copy the generated code, go to an online resource where you can test HTML code, paste it, and run it. We do see the background color has changed, and if we click the button it keeps changing the background color, but it's actually not showing the joke.

Since we are running a chat model, we can tell it to fix the code. So I told it: the background color is changing, but it's not displaying a random joke; can you fix the code? Even after being told there's an issue, it simply returned the same code. I asked again, saying the code is still incorrect and doesn't show any jokes, and it gave me exactly the same code again. Then I asked what changes it made to the code; it listed a few changes, but those are actually not in there. The mistake itself is actually pretty minor.
It could be easily fixed by hand. So, based on my tests, I would say it's a reasonably good model for its size.

Now, a couple of things I want to highlight when it comes to the Open LLM Leaderboard, or any other leaderboard for that matter. First, these are benchmark datasets, so the results you see on them may not be relevant to your own applications. Second, if you are evaluating a model for your own application, make sure there was no data leakage between the training and test sets. Data leakage is a common problem in machine learning where part of the test set actually ends up in the training set, or where very similar questions or examples appear in the training set, and that's why you sometimes see inflated performance on the test set. So if you are choosing a model for your application, look at the test set and see how different it is from the training set.

Overall, it's really exciting to see all the innovation happening in the open-source large language model space, and how far we have come in just a matter of a few months. Consider liking the video if you found it useful, and subscribe to the channel for similar content. Thanks for watching, and see you in the next one.
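As a crude first pass at the leakage check described above, you can at least look for exact duplicates between the two sets. Real leakage audits also need near-duplicate detection (fuzzy matching, embedding similarity), but this illustrates the idea; the function and sample data are illustrative:

```python
def exact_overlap(train: list[str], test: list[str]) -> set[str]:
    """Return test examples that appear verbatim (case/whitespace-insensitive)
    in the training set."""
    train_set = {t.strip().lower() for t in train}
    return {t.strip().lower() for t in test if t.strip().lower() in train_set}

# Toy example: one test question is a verbatim copy of a training question.
train = ["What is 2 + 2?", "Name the capital of France."]
test = ["what is 2 + 2?", "What is the boiling point of water?"]
print(exact_overlap(train, test))  # {'what is 2 + 2?'}
```

Any non-empty overlap here means benchmark scores on that test set are inflated for that model.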
Info
Channel: Prompt Engineering
Views: 22,014
Keywords: prompt engineering, Prompt Engineer, natural language processing, GPT-4, chatgpt for pdf files, gpt-3, chroma, train openai with own data, how to train gpt-3, embeddings, llama, orca model, OpenOrca-Platypus2-13B
Id: N-qaMCwqRHI
Length: 12min 17sec (737 seconds)
Published: Wed Aug 16 2023