Llama 3 - 8B & 70B Deep Dive

Video Statistics and Information

Captions
Okay, so Meta has finally released some of the Llama 3 models. This is not all of them; they've released two models, and in this video I want to go through exactly what they've released, look at some of the benchmarks around these models, look at some of the new things in the license that are perhaps not going to be so helpful to people, and then look at what's coming in the future with the Llama 3 series. Then I'll show you how to get this set up with Ollama and with Hugging Face so that you can have a play with it yourself.

First off, Meta AI have released two of the Llama 3 models: an 8 billion parameter model and a 70 billion parameter model, and they've also disclosed that a 405 billion parameter model is coming, hopefully in the not too distant future. Let's start by looking at the two models they've released, and I'll go through some of the benchmarks around them; later on I'll look at what they say is coming in the 405 billion parameter model as well.

Many people from Meta are already commenting that the 8 billion parameter model beats the 70 billion parameter Llama 2 models. If you think about it, the smallest model in this release is beating the largest model in the last release, which is a nice jump forward. That said, a lot of models have come out since Llama 2, and most people nowadays haven't been using Llama 2; they've been using perhaps one of the Mistral models, or Gemma, or some of the other open models out there.

The two models come in both the base model format, or what they're now calling the pre-trained format, and the instruction-tuned format, which is what most people will use if they're just using one of these models for tasks. The base model, or pre-trained version, is for people who want to fine-tune it; there are already some nice scripts out for fine-tuning and things like that, and perhaps we'll look at that in a future video.

On their model card they describe the inputs for these two models as being text only. This is interesting, and it kind of hints that they're probably going to release a multimodal version at some point. In fact, a number of people working on the team have hinted at this as well: that we'll probably see a vision LLM, where we can put in images and perhaps other modalities, in the not too distant future. A number of people on the Llama team have commented that this is just the first step and that we should expect more releases coming; my guess is that we'll probably see things like a new Code Llama, the multimodal models, etc. For now, though, it takes text in and generates text tokens out, and they do mention the output could be code as well.

Unfortunately, at the moment we don't have a technical report or a paper that goes into all the details; we've just got a model card. They mention that both models, the 8 billion and the 70 billion, have a context length of 8K, which does seem very short compared to many models now going to 32K and even far beyond that with 100,000 or 200,000 tokens. My guess is that we'll probably see some fine-tuned versions that go out to much longer context lengths. Going through this, both models have been trained with grouped-query attention.
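As a quick sanity check, you can read both of these details straight out of the model's config on Hugging Face. Here's a minimal sketch; it assumes you've been granted access to the gated `meta-llama/Meta-Llama-3-8B-Instruct` repo and are authenticated (e.g. via `huggingface-cli login`):

```python
# Minimal sketch: inspect the Llama 3 8B config for context length and GQA.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

print(config.max_position_embeddings)  # 8192 -> the 8K context length
print(config.num_attention_heads)      # 32 query heads
print(config.num_key_value_heads)      # 8 KV heads
```

Having fewer key-value heads (8) than query heads (32) is exactly what makes it grouped-query attention; if the two counts were equal, it would be plain multi-head attention.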
One of the things I find really interesting is that both models have been trained on over 15 trillion tokens. I think this is not only the largest number of training tokens we've seen publicly declared for a model, but also close to double the amount we've seen other people talk about training their models on. The knowledge cutoff is March 2023 for the smaller model, which is very interesting because it basically means they've been sitting on this dataset for quite a while, and December 2023 for the larger model.

They state the intended use as commercial and research use in English, although I did see some people on Twitter saying that about 5% of the training tokens are non-English. My guess is that this could work a lot better multilingually than a lot of other models, because even though it's only 5% of the tokens, that's still over 750 billion tokens, roughly two and a half times what GPT-3 was trained on in total, and that just happens to be the multilingual portion. There has been a lot of talk, though, that they will bring out a dedicated multilingual model, which I would love to see; that would certainly be useful for a lot of people.
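Here's the quick back-of-the-envelope arithmetic behind that claim (the ~300 billion token figure for GPT-3 is from the GPT-3 paper):

```python
# Back-of-the-envelope: Llama 3's non-English slice vs. GPT-3's whole dataset.
llama3_tokens = 15e12                # ~15T total training tokens
non_english = 0.05 * llama3_tokens   # the reported ~5% non-English share
gpt3_tokens = 300e9                  # GPT-3 was trained on ~300B tokens

print(f"Non-English tokens: {non_english / 1e9:.0f}B")          # 750B
print(f"Multiple of GPT-3:  {non_english / gpt3_tokens:.1f}x")  # 2.5x
```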
Another thing that's quite interesting in their blog post is just how many cloud providers they seem to have already worked with to make Llama 3 available at launch: the big ones like AWS, GCP, and Azure, but also a number of smaller platforms, so people can use the model in lots of places.

On top of the 15 trillion tokens of data, they mention that this is seven times what Llama 2 was trained on, and if you remember back, the loss on Llama 2 was still going down strongly after its 2 trillion tokens. They also mention that this was trained on four times more code, so it will be interesting to see whether they release another Code Llama model as well, perhaps with instruction tuning for code tasks and a longer context window. Another interesting detail, from Twitter and the blog post, is that this was trained with 24,000 GPUs. That is certainly a lot of GPUs, but it's a lot less than the 350,000 GPUs that I think Mark Zuckerberg was claiming they have, so I wonder how many are actually being used to train the bigger 405 billion parameter model that's coming.

So if we jump in and look at the benchmarks, we can see them comparing the 8 billion parameter model to Mistral 7B and the Gemma instruction-tuned model that came out recently. Not surprisingly, they're claiming much higher numbers. What stands out for me is not so much MMLU but the GSM8K score: it's roughly double what Mistral Instruct and Gemma are getting, so that alone should mean it does a lot better at tasks like that.

For the 70 billion model, they're comparing against two proprietary models: Gemini Pro 1.5 and Claude 3 Sonnet. Remember, Claude 3 Sonnet is the middle model: not as strong as Opus, but supposedly stronger than Haiku. Looking at the benchmarks for the 70 billion model, the gaps aren't as drastic as for the 8 billion against Gemma and Mistral 7B, but many of them are very competitive with, if not beating, Gemini Pro 1.5 and Claude 3 Sonnet. Of course, this model doesn't have a long context window like Gemini 1.5 or the Claude 3 models.

Interestingly, they've also built their own benchmark. They talk about creating a human evaluation set containing 1,800 different prompts covering 12 key use cases: asking for advice, brainstorming, classification, closed question answering, coding, creative writing, extraction, persona role play, open question answering, reasoning, rewriting, and summarization. Not surprisingly, they show their 70 billion model beating a bunch of different models on it, including GPT-3.5, Mistral Medium, and Claude Sonnet, and substantially beating the previous Llama 2 model. It would have been nice to see how they did against GPT-4 too, and even if GPT-4 is beating them, how close they get. My guess is that we'll see this in the LMSYS Chatbot Arena as real people go in, use the model, and we see how close it actually gets to GPT-4 and the new GPT-4 Turbo.

Another point they make in the blog post, which I think has become obvious to a lot of people, is about the Chinchilla-optimal scaling laws. That's quite an old paper now, from DeepMind, which basically said you would roughly train on around 20 times as many tokens as you have parameters. I should point out that the paper didn't claim that was the number of tokens that gets you the best possible model; it was balancing the number of parameters against the number of tokens for a fixed compute budget. What Meta show here, though, is that you can keep going way beyond that point: even roughly two orders of magnitude past Chinchilla-optimal, both of these models were apparently still improving at 15 trillion tokens, which is really amazing to think about. It makes you wonder how many tokens some of the new proprietary models have actually been trained on: are they much larger than this, around this, that kind of thing?
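To put numbers on that, here's the same arithmetic for the 8B model, using the 20-tokens-per-parameter rule of thumb from the Chinchilla paper:

```python
# Back-of-the-envelope: how far past "Chinchilla-optimal" is Llama 3 8B?
params = 8e9                      # 8B parameters
chinchilla_tokens = 20 * params   # rule of thumb: ~20 tokens per parameter
actual_tokens = 15e12             # Llama 3's ~15T training tokens

print(f"Chinchilla-optimal: {chinchilla_tokens / 1e9:.0f}B tokens")   # 160B
print(f"Overshoot factor:   {actual_tokens / chinchilla_tokens:.0f}x")  # ~94x
```

Roughly 94 times the Chinchilla-optimal budget is close to those two orders of magnitude, and the loss curves apparently still hadn't flattened out.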
All right, to get access to this model you're going to have to accept the license. If you come in to download the Hugging Face weights, like you can see I've done here, you'll be presented with a form where you've got to fill out the license agreement and submit it. There are a number of things I want to point out that I think are interesting in here.

Some of the license conditions are similar to what we had before: if you've got more than 700 million monthly active users, then they want you to ask for a license, so this is really aimed at TikTok and perhaps a few other social media companies. One of the other disappointing points is the clause that you will not use the Llama materials, or any output or results of the Llama materials, to improve any other large language model, excluding Llama 3 or fine-tunes of Llama 3. So you can't use this to make a dataset and then train up a smaller model with it. That's a shame, especially if you wanted to fine-tune a smaller model by generating a bunch of data with Llama 3; technically that's not something you're allowed to do. That possibly has to do with the way Meta AI has licensed its own data; it's perhaps a clause they're being forced to put in. It certainly reinforces that this is not an open-source model in any way, shape, or form. People have definitely moved to the term "open weights" now rather than "open source"; that term's been around for a long time, but it is a shame: it seems to me that Mark Zuckerberg was talking about the importance of open source, yet here we've got this restriction, which is clearly not open source.

Another clause I find quite funny is that if you're going to fine-tune this model, or do a model merge or something like that, you're going to have to include the name "Llama 3" at the beginning of any AI model name. So for all the different fine-tunes out there, we're going to suddenly see them become the Llama 3 Nous Research model, or the Llama 3 Samantha model, etc.

Another key part of the license is the prohibited uses. This is something you'll want to read rather carefully if you're doing a startup for health or something legal, but overall I think the prohibited uses are quite similar to what we saw with Llama 2. Most importantly, you can use this for commercial use as long as you're not breaking the other terms of the license.

Okay, before we jump in, have a look at the code, and get this set up with Ollama, it's interesting that they briefly mention the model that is still training: the 405 billion parameter model. This is an extremely large model that most people are not going to be able to host themselves, or even run a quantized version of, as it's going to be really large. But it is interesting to look at some of the stats they're getting so far. This is a checkpoint from earlier this week that they've obviously run some tests on, and the results are not far off GPT-4. So it's interesting to think that we're perhaps only a few months away from having an open-weights model that is on par with GPT-4. I do find it very interesting, and perhaps this is a topic for a whole other video, that no one has really smashed the GPT-4 results so far. It shows that OpenAI have probably got some interesting techniques around how they're curating data and doing things like curriculum learning to get the results they're getting: even when people throw lots of GPUs at this, at best they come up close to or on par with GPT-4, with scores like the 94 on GSM8K being very high up there. It will certainly be very interesting to see the results when the actual model comes out.

All right, let's jump in and have a look at how you can get the model set up with Ollama and how you can get it running with Hugging Face. There are a lot of ways you can run the Llama 3 model; I'm going to go through some of them quickly and show you how you can try out the model in various places. The first one I've got open here is Ollama, and as you can see, Ollama have already added Llama 3 to their models. It looks like they've already had 41,000 people download the model, and they've got a number of different versions: you can get the 8B, the 70B, the instruct versions, etc. To get this going, you just run it, and it will pull the model down for you.
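If you'd rather hit it from code than from the terminal, here's a minimal sketch using the `ollama` Python client. It assumes the Ollama app is running locally, that you've done `pip install ollama`, and that the default `llama3` tag (which pulls the 4-bit quantized 8B instruct build) is what you want:

```python
# Minimal sketch: talk to Llama 3 through a locally running Ollama server.
import ollama

ollama.pull("llama3")  # downloads the model if it isn't already local

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response["message"]["content"])
```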
You can see that, sure enough, we can talk to it. Ask it who it is and it answers: "I am Llama, an AI assistant developed by Meta AI." You can use it just like you would any other Ollama model. You can also do the same with LM Studio or some of the other software that runs quantized models; don't forget that by default these are running a 4-bit quantized version. If you wanted to, you could run one of the other quantized versions; there are quite a number of them on Hugging Face now that people have uploaded, so you can try those out if a quantized version is what you're after.

If you just want to check out the model, one of the good places to do that is HuggingChat. HuggingChat has now incorporated Llama 3, and you can use it and prompt it just like any of the other models they have there.

If you want a deployed version, you've got a number of options. One thing you can do is deploy it yourself: Hugging Face now lets you deploy an endpoint to Azure, Google Cloud, Amazon, etc., so you can have your own private instance running, and that's probably the way to go if you want to host your own fine-tuned version. If you just want to ping an API, Together AI have already got this up, and I think a number of others like Replicate have also added Llama 3. That's also a good place where you can just come and test the different models: try the 8 billion model, try the 70 billion chat model, and see what you like about them.

All right, if you want to run the code yourself, I've put together a little notebook that you can go through and run. The GPU I'm using here is one of the new L4s, which have just been added to Colab recently, so I'm not running a quantized version; but if you wanted to run a quantized version you certainly can, and I've put some code in there you can use as a guide for that. Basically you're just loading the model with the text-generation pipeline. One of the good things about Llama is that Meta have clearly worked with the Hugging Face team ahead of time to make sure this was going to be pretty simple and work out of the box. You do need to add some terminators to handle the end-of-sequence tokens and so on, and you can play around with different sampling and temperature settings. What I've done here is put together a version where we're just doing greedy decoding, effectively a temperature of zero.
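The core of the notebook looks something like the sketch below, following the pattern on the official model card; the commented line shows one way you might load a 4-bit quantized version via bitsandbytes instead:

```python
# Sketch: Llama 3 8B Instruct via the transformers text-generation pipeline.
# Assumes access to the gated repo and a GPU (e.g. a Colab L4).
import torch
from transformers import pipeline

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
# For a 4-bit quantized load instead (requires bitsandbytes), you could pass:
# model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)}
# with `from transformers import BitsAndBytesConfig`.

messages = [
    {"role": "system", "content": "You are a helpful assistant. Reason step by step."},
    {"role": "user", "content": "What is the difference between a llama, a vicuna, and an alpaca?"},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Llama 3 ends each turn with <|eot_id|>, so pass it as an extra terminator.
terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipe(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=False,  # greedy decoding, i.e. effectively temperature 0
)
print(outputs[0]["generated_text"][len(prompt):])
```

Setting `do_sample=False` is what gives you the deterministic, temperature-zero behaviour described above; swap in `do_sample=True` with a `temperature` to experiment with sampling.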
I've gone through some of the prompts that we've used for a number of models recently, including Gemma 1.1 about a week ago. Here you can see we're passing in a system prompt along with the normal prompt, using a Chain-of-Thought style prompt, and this shows me that the Llama 3 post-training has certainly included some training on this Chain-of-Thought stuff, because we get this step one, step two, step three output, very similar to what we were getting from Gemma. That really makes me think it has been trained specifically to respond like this. If we try it without asking for the Chain of Thought, so the exact same prompt but no Chain-of-Thought system prompt, we get a variety of answers back, but not that broken-down, step-by-step format, which makes me think this has really been included in their training examples during the instruction tuning.

Another thing I found interesting is that when we ask some of the old questions, like what's the difference between a llama, a vicuña, and an alpaca, it tends to respond not so much like Llama 2 but much more like Gemma and Mistral now. I don't know if this has something to do with a change in the supervised fine-tuning, but it's certainly different than it was before. With emails, again, if we ask for step-by-step reasoning it will include that in the email, though it does construct a decent email.

The model is also good at the whole role-playing concept. When we change the system prompt to "you're Freddy, the five-year-old boy," it definitely picks up on that: it makes the email shorter, it makes the language simpler, etc., though perhaps not as childlike as some other models. Again on role play, it has no problem playing the role of the vice president, which is one where we've seen Claude and others refuse; they basically said they didn't feel comfortable with it.

On the "what is the capital of England" question: if we tell it that it's an assistant, it gives us a full sentence; if we tell it to write its answer short and succinct, it gives us just the answer back, which is good. It handles the Geoffrey Hinton question. On creative writing I haven't really done enough of these to judge, but it seems fine, and code generation seems okay as well.

Then there's GSM8K. This is one of the areas where this model is supposed to be a lot better than Gemma and Mistral, but I found it a little hit and miss: on some things it worked really well, on others not, and I think this may be partly due to the system prompt. It handles some of the simple ones fine; it doesn't do the rounding on the babysitter question; it is able to work this one out fine; but on the pure math version it doesn't do very well. There seems to have been one step where this really should have been 7x, and if it had carried 7x forward it probably would have worked it out correctly, like it did above. I'm not sure what's going on there. I also looked at some of the variations of the Geoffrey Hinton question that I looked at with the Gemma model, so you can see the different kinds of outputs it gives.

Finally, I tried out some ReAct prompting, and it actually seems to do pretty well. It gets that the action for the first one should be "weather" with the input "Singapore"; for the second, the action should be "Wikipedia search" with the input "King Arthur"; and for the last one, the action should be "web search" with the input "latest AI news today." So that seems to be working well, and it shows a lot of promise that this might be a good model for function calling, or certainly for fine-tuning to be better at function calling, which is something I'm really interested to look at in the future.
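For reference, the kind of ReAct-style prompt I'm describing looks roughly like this; it's a generic sketch, and the exact tool names and wording in my notebook may differ:

```python
# Rough sketch of a ReAct-style prompt; the tool list here is hypothetical.
react_system_prompt = """Answer the question using the following tools:
weather: get the current weather for a location
wikipedia_search: look up a topic on Wikipedia
web_search: search the web for recent information

Use this format:
Thought: reason about what to do next
Action: the tool to use, one of [weather, wikipedia_search, web_search]
Action Input: the input to the tool"""

messages = [
    {"role": "system", "content": react_system_prompt},
    {"role": "user", "content": "What is the weather in Singapore right now?"},
]
# A model that handles ReAct well should reply with something like:
#   Thought: I need current weather information for Singapore.
#   Action: weather
#   Action Input: Singapore
```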
On the Gemma 1.1 models, one of the things I did was change the system prompt, and that got much better answers out; unfortunately, here it's still making the same mistake on the babysitter question, whereas Gemma 1.1 was actually getting it correct.

So, on the whole, I think Llama 3 is a pretty good model, but probably not a lot better than the models we've seen recently. We're getting to the point now where all these models are starting to hit a level where they can do the majority of tasks really well, and it's going to come down to the different fine-tunings of them. Don't forget, when I compare it to the recent Gemma model, that was the second fine-tuning, and the second fine-tuning was a lot better than the original Gemma 1.0 instruction-tuned model. So we may see some tunes of the Llama 3 base model that do quite a lot better. I'm looking forward to seeing, perhaps this weekend, how the various fine-tunes go as people start releasing them; you should be able to just drop those into this notebook and try them out yourself.

All right, over the next few days I'll probably make a video about the Llama 3 tokenizer; there are some interesting things in it that I think signal a bit of a change, and I'd like to talk about that. That said, I did do some tests on some multilingual prompts with this current version and it didn't perform as well as I would have liked, so I'll perhaps look at those again when we talk about the tokenizer. As always, if you've noticed any interesting things about the model, or if you've got any questions, please put them in the comments below. If you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.
Info
Channel: Sam Witteveen
Views: 35,318
Keywords: llama, meta ai, llama 3, chatpgt, llm, ai, models, open source, mistral, Gemma 1.1, llama 2, hugging face, Ollama, Together AI, replicate, hugging chat, gpt 3, gpt 3.5, gpt 4, human eval, 8B, 70B, 405B, claude 3 sonnet, machine learning, artificial intelligence, React, Code generation, LM Studio, codellama, llama 3 demo, llama 3 tutorial, llama 3 coding, llama 3 api, llama 3 announcement, llama 3 8b, llama 3 70b, code llama 70b, artificial intelligence movie
Id: 8Ul_0jddTU4
Length: 23min 53sec (1433 seconds)
Published: Fri Apr 19 2024