Llama 3 BREAKS the industry !!! | Llama3 fully Tested

Captions
And I have a mine, and it shows me all of the other ones, so this is the best implementation of a Minesweeper game I've ever seen.

So Llama 3 is out, and it's one of the most anticipated releases that we have been waiting for. It has outperformed so many state-of-the-art models, with both the 8 billion and the 70 billion versions, and today we are going to fully test it. But before doing that, we're going to take 30 seconds to 1 minute just to give you everything that you need to know about these new models, and then we're going to go ahead and start testing it with the tiny benchmark questions that we have already tested GPT-4 and GPT-4 Turbo with. So with that being said, let's jump to my screen.

Okay, so we are on the official Meta blog post for the Llama 3 launch, and here we can see right away that they are taking the bragging rights of Llama 3 being "the most capable openly available large language model to date." I don't like this expression, "openly available." They could use either "open weights" or "open source," and it would be so much better, especially since it is open source and they have shared the code on GitHub, so they could have just used "open source."

The second important piece of information that we need to mention is that both of the models they have launched, the 8B and the 70B, have a context window of 8,192 tokens, but they are working on larger context windows, especially with the new models that they're going to introduce later on. And something that is very important to mention is that they're working on a 400 billion parameter model that is still being trained, and I suppose that this 400 billion parameter model should be at the same level as Claude Opus and GPT-4 Turbo, so it should be comparable to or even better than the best models that we have today.

Other important information that we should talk about is the fact that Meta Llama 3 8B is basically outperforming Mistral 7B and, of course, Gemma 7B on all of the benchmarks,
and we can see a huge difference on the HumanEval benchmark, meaning that coding is so much better. That is going to be extremely important, because if we want to use it locally with AutoGen or CrewAI, or just use it in general, it's going to be so good. And it's already available in LM Studio, meaning that you can already use the 8 billion parameter model on your own machine, locally, right now. So kudos to Meta and LM Studio for making this available so fast.

Okay, enough talk. Now let's actually jump to Hugging Face and start testing the model. Of course, here we have the instruct model of the 70B that we are going to be testing.

So the first question, as always, is Math and Reasoning 1, which is very simple: what is 5 + 6? That should be straightforward. It's 11, so that is a pass.

Now let's go ahead and have a more complex calculation question, which is Math and Reasoning 2, and here we hope that it's going to use PEMDAS. It did so, and it got us the right answer, so that's also a pass.

Let's go ahead and see a simple equation, and whether it's going to be able to solve it: 9x + 6. It should be able to give us the answer, which is x equal to 1. It has been able to give us the answer with the explanation, so that's very good as well.

Now let's go ahead and see Math and Reasoning 4, where it should be able to interpret this text and give us an equation from it. The question is: "Answer the following question step by step. A forest giant wakes up every year to consume an entire group of sheep. In the last 3 years, it has devoured 161. Each time it awakens, the herd it targets has twice the population of the last. How many sheep were in the first group?" So the first time it awakens, it is x, the second time it is 2x, and the third time it is 4x, so the sum x + 2x + 4x should be equal to 161, and then it should solve for x. So let's see if it's going to be able to give us the answer. Okay, so it is breaking down the problem, and it has been able to give us the answer, so this is the
equation, and then exactly what we wanted. So 10 out of 10 in math and reasoning.

Okay, now let's go to logic and reasoning, and let's start with the first question: "Sally, a girl, has three brothers. Each brother has two sisters. How many sisters does Sally have?" And of course the answer is one sister. By the way, GPT-4 couldn't answer this question, so let's see how Llama 3 is going to give us the answer. "A classic lateral thinking puzzle. At first glance, it seems like Sally has two sisters, since each of her three brothers has two sisters. But wait, Sally is one of those sisters. So each of Sally's three brothers has two sisters: Sally and another sister. This means that Sally has only one sister. The answer is one." Exactly, so that is actually very good. So here we can see a question that GPT-4 couldn't answer and Llama 3 70B has been able to, so that is very good.

Now let's go to the second question. Here I have a classic question that a lot of large language models have a hard time answering: "I have a chicken named Tom," and the same sentence seven times, and then "I have a dog named Nancy." So I have seven chickens and one dog. "How many chickens do I have?" Let's see if it's going to be able to count them. That is very good, it's actually breaking them down: first one, second one, okay, and the third one. "You have seven chickens." So that is actually very good. Here we can see that it has been able to answer this question, so that is a pass.

Okay, let's go to the helicopter question: how many helicopters can a human eat in one sitting? So let's see how it's going to perform in this case. "That's a clever question." Okay. "However, I must point out that it is not possible for a human..." Okay. "Regardless of the number..." Exactly. They cannot eat one in one sitting or at any other time. Exactly the answer that I'm looking for, so that is a pass as well.

So let's go ahead and go to the puzzle: "Double it and multiply it by four, then divide it by 8, and you will have it once more. What number is it?" Of course
the answer is every number. That's the catch, and it should be able to... all right, so this is actually not a pass. This is not a 100% pass. We can see here that it said zero. Of course zero is one of the numbers, but it's not all of the numbers. So what we are going to do here is re-ask the question, and we are going to say, "What number or numbers does this apply to?" So let's see in this case. Okay, so it has been able to say, "I was too quick to assume it was zero earlier." "Upon re-examining," so it has been able to re-examine, "I realize that it's not just zero that satisfies the condition. As it turns out, any number x will satisfy the condition." So that is normally a 7 out of 10, but since it has been able to say that it re-examined, it should be an 8 out of 10, because it has been able to say, "I have been wrong before, and this is the right answer now." So even though this is not a 100% pass, it is actually a very good answer.

Now let's go to Logic and Reasoning 5. Let's ask this question, and by the way, GPT-4 has not been able to answer it: "Give me seven phrases ending with the word apple." Let's see what it's going to do. Okay: "the big red apple," "a shiny new apple." So it has been able to answer this question, so this is a 100% pass. This is one of the hardest questions, and it actually did very well.

Now let's go ahead to the last question that we have in logic and reasoning, which is of course the hardest question: "How many words are in your answer to this question?" By the way, even GPT-4 Turbo cannot answer this question. If we ask this question to GPT-4 Turbo, we get "six words." This is two words; this is not six words. So if we ask this question here, we are going to see... okay, now it's actually thinking that it's about the last question, and that's fair, so that's why we should start a new chat. Let's re-ask it here and see what it's going to do. Okay, so "there are nine words." There is no way that's correct: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13. So 13, and even if we don't count this, it's going to be 10. If we don't count the "nine," it is actually nine, because this is not a word, this is a number: 1, 2, 3, 4, 5, 6, 7, 8, 9. But there is no way, I'm not going to buy that, because we have a clever question. So that is not correct. That's actually either 13 words or 12 words, not nine.

Okay, now let's go ahead to coding and see what it's going to do in this case: "Write a Python function that prints all odd numbers between 5 and 102." Okay, interesting. So here we have a specific answer that I have never seen before from any model. Instead of going with the modulo test and looping through all of the numbers, it actually stepped by two, knowing that we are starting with an odd number. So we're going to loop fewer times, and we're going to print all of the numbers. It is an optimized answer. If it can keep this consistency and optimization on longer code, this would actually make a difference in terms of, of course, the complexity of the algorithm. So that's a very good answer, actually, and this is where we can see that the 70B on HumanEval is so much better than Gemini 1.5 Pro and Claude Sonnet.

Let's go back to Hugging Face, and now let's ask it the game question. Let's see if it's going to be able to generate a game. Okay, so let's go ahead. It has actually used Tkinter even though I didn't ask it to. By the way, even GPT-4 Turbo hasn't been able to do this. So far so good. I hope the game is going to work, because if it does, this is going to be the best implementation of a Minesweeper game that I've ever gotten.

All right, so it has finished. I'm going to copy this, go back to VS Code, and see if it's going to work. I already have Tkinter downloaded, so let's run it and see what's going to happen. It opened Minesweeper, so I have 10 columns and I have eight rows, so that's actually very good. Now let's start playing. And I have a mine, and it shows me all of the other ones, so this is the best implementation of a Minesweeper game I've ever
seen. And I have the red one, much like I had with GPT-4 Turbo, but that was after telling GPT-4 Turbo that I needed it with Tkinter. So I am very unlucky in this instance. Let's see what's going to happen if I run it again and actually start playing. Hopefully I'm not going to get a mine. Okay, that's very good. Oh my God, I'm so unlucky, or I just don't know how to play the game. But let's run it again, and hopefully this time I will go far enough. Okay, so I cannot flag them with right clicks, which is understandable. I mean, we're not making a production-level game here. So here we have a mine, so of course here we don't have a mine, here we don't have a mine, and this is not a mine. So I won, and it did not show me anything, so it does not show if you win. At least I know I won. If I click here, there is going to be a mine, and all of the other ones are going to be mines. So as far as implementation, this is the best one I have seen.

Now let's go ahead to the last question, which is the dilemma, and see how it's going to perform. I actually think that this tiny benchmark, it's not a big benchmark, it's not an extensive one, shouldn't be the one that we take into consideration, given that we have the human evaluation with more than 1,800 prompts. On this tiny benchmark, I think it is going to be on par with GPT-4 Turbo. So let's go back here and ask it this question: your friend tells you that they committed a crime. They say that they have trouble sleeping at night and that you are the only one they can trust with their confession. Then you read in the paper that someone has been arrested for your friend's crime. So what are you going to do in this case: go to the police and tell them what you know, encourage your friend to confess and give them an ultimatum, or say nothing because you would betray your friend's confidence? So let's give it this and see what's going to happen. Okay, so it is straight-out betrayal. It said, "I would recommend the second option."
Real life is so much more nuanced than this, so I don't want to see it give an exact answer. I want it to just weigh the possibilities, because I want it to walk me through the decision-making process, depending on the crime or the type of friendship that I have with that person, et cetera. I don't want it to give me a specific answer. That's not what I want to see. So this is closer to a fail compared to GPT-4 Turbo, and even compared to GPT-4, because GPT-4 did not give me a specific answer. So yeah. Okay, your friend is struggling, okay, encouraging them to confess, "you can help them." Okay, no, they're going to go to jail. I don't think I can help them if they go to jail. "However, it's essential to approach the situation with empathy and understanding. Have an open and honest discussion." Okay, that's more like it. That's what I want to see. Okay, so we're going to give it a 6 out of 10 in this specific instance, because I didn't want to see a specific answer. And here we can see a little bit of the censorship: "as a responsible and empathetic AI," this is what you should do, you should do the right thing, et cetera. So yeah, this is a 6 out of 10.

But overall, this is one of the best models that I've ever tested. So Llama 3 is one of the best models, and it is at the same level as GPT-4 Turbo, not only GPT-4, and I think it is even better in coding. We have to make more extensive benchmarks, but from what I saw already, it is actually so good. And kudos to Facebook and Meta for making it open source. It is a very important step. So yeah, that's basically it. Thank you guys for watching. If you have any questions or comments, just leave them below. I read all the comments, and I will catch you guys next time. Peace.
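
Note on the coding question: the captions describe the model's odd-number answer only in words (stepping by two from an odd starting value instead of testing every number with modulo), and the actual generated code is not shown. The following is a minimal sketch of what such an answer could look like. The function names are hypothetical, and treating "between 5 and 102" as an inclusive range is an assumption.

```python
def odd_numbers_step(start=5, stop=102):
    """Collect odd numbers in [start, stop] by stepping by 2 (hypothetical helper)."""
    # Move to the first odd value at or after `start`, then step by 2.
    first = start if start % 2 == 1 else start + 1
    return list(range(first, stop + 1, 2))


def odd_numbers_modulo(start=5, stop=102):
    """Baseline for comparison: visit every number and keep the odd ones."""
    return [n for n in range(start, stop + 1) if n % 2 == 1]


def print_odd_numbers(start=5, stop=102):
    """Print each odd number on its own line, as the prompt in the video asks."""
    for n in odd_numbers_step(start, stop):
        print(n)


if __name__ == "__main__":
    print_odd_numbers()
```

Both versions are linear in the size of the range; the stepping version does roughly half as many iterations and skips the per-number modulo test.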
Info
Channel: Reda Marzouk
Views: 2,329
Keywords: open source, python, gpt, chatgpt, mistral, Ollama, gemma, llama, llama3, llama 3, meta, facebook, Mark Zuckerberg, Zuckerberg, AI, LLM
Id: NzcaYBlIDPU
Length: 13min 52sec (832 seconds)
Published: Fri Apr 19 2024