Are Claude 3.5 Sonnet, Llama-3 and Gemini choosing speed over quality?

Captions
Hey, welcome back. One of the things that I've noticed is that the output of certain models, in certain scenarios, is actually getting worse rather than better. In fact, if you want to cheat on your homework, you're probably better sticking with ChatGPT, or even Llama 2 or Claude Opus, rather than using the new Claude 3.5 Sonnet, Google Gemini, or even Llama 3, and in this video I'm going to explain why.

To understand why the models are getting worse, we need to understand just a tiny little bit of what's going on in a large language model. The first thing to understand, and you'll have seen this before in one of my other videos, is that large language models don't see words directly. They have to be tokenized first into essentially a number representation of each word or subword. So if I take something like the query "who is Ada Lovelace", it gets translated into a tokenized input, and each word, or even subword in some cases, gets split up and has a corresponding number in the dictionary. So it could be 1 for start-of-sequence, "who is" might be 11644, and so on, and you end up with a list of associated numbers from the vocabulary.

So when I talk to the large language model and do a thing called a forward pass, i.e. I run my query through all the layers of the model, I'm not sending the English text directly; I'm sending the tokenized version. That hits a thing called the embeddings layer, and within that embeddings layer there's an entry for each item in my dictionary. It's a big dictionary lookup table, so you'll see entries for something like "who", "is", "Ada", "love", "lace", and each token has a massive vector associated with it. That is what the model actually uses: it takes all of those vectors and passes them through the attention layers. So once you query the model, you pass the tokens into that embedding layer, it pulls back the vector representation of the entire query, grabbing the vector for every single item in the query, and then you work through all of the attention layers, layer by layer, until eventually you reach the end and get a new tokenized output which tells you the result. It'll say something like "Ada Lovelace is..." and so on.

Now, the key thing to understand about attention layers, or transformer blocks in this sense, is that when I hit those layers, the vector pulled from the embeddings layer is actually projected into three vectors. The first is a query vector, the second is a key vector, and the last is a value vector. The query vector is essentially a way for the token to ask questions about other tokens in the sequence, so for something like the question "who is Ada Lovelace" it lets the model figure out which tokens it should pay most attention to. The key vector represents the content and the significance of each token in the sequence, and keys are matched with queries to measure the relevance of each token. Finally, each token generates a value vector, and the value vectors are the actual information, or representation, that gets combined to form the output. These are the three key things, and they are projected from the embeddings.
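To make that concrete, here's a minimal, untrained sketch of those steps in PyTorch. The GPT-2 tokenizer is just a convenient stand-in, and the hidden size is made up for illustration; none of this reflects any particular production model's actual weights or dimensions.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
token_ids = tokenizer("who is Ada Lovelace", return_tensors="pt").input_ids  # (1, seq_len)

vocab_size = tokenizer.vocab_size
d_model = 64  # toy hidden size; real models use thousands of dimensions

embedding = torch.nn.Embedding(vocab_size, d_model)  # the big dictionary lookup table
x = embedding(token_ids)  # (1, seq_len, d_model): one vector per token

# Each token's embedding is projected into three separate vectors.
w_q = torch.nn.Linear(d_model, d_model, bias=False)  # query projection
w_k = torch.nn.Linear(d_model, d_model, bias=False)  # key projection
w_v = torch.nn.Linear(d_model, d_model, bias=False)  # value projection

q, k, v = w_q(x), w_k(x), w_v(x)
print(token_ids.shape, q.shape, k.shape, v.shape)
```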
Of course, as you work through the attention layers you get this kind of layering effect, and you just work your way through the network. Now, those three projections form the key to what's going on here, because in a standard transformer block, ever since the original paper, there has been this idea of multi-headed attention. In multi-headed attention you process the inputs by splitting them into smaller parts, the heads, and you attend to each part independently. They all run through the model, they all get looked at independently, and eventually you combine them into the output. Notice that in multi-headed attention there's a one-to-one mapping going on: I've got my queries, my keys and my values, and there's a one-to-one relationship between query heads and key/value heads. What this means is that for any output from multi-headed attention, you get a really comprehensive and integrated answer, because you're referencing various parts of a passage, they're all being looked at independently, and then everything gets summarized with the overall relationships and impact captured accurately. So multi-headed attention gives a really well-integrated response.

Now, what's actually happened in some of the newer models, things like Gemini or the Gemma models, is that they've moved to a thing called grouped query attention. It comes from a paper (I'll put the link up here, and you can go and look at that paper yourself). Grouped query attention means the model uses fewer keys and values, which are shared across the queries. The clue is in the name: essentially you have groups of keys and values which are then shared between queries. If you think about that for a second, it means certain topics get grouped together and the model is looking at less information at the same time. The impact is that grouped query attention focuses more narrowly on specific queries, and therefore the results are less integrated; you essentially get a more fragmented, less coherent response. In fact, if you look at the models, you can generally tell what a grouped-query-attention model's output looks like: it's going to look very list-like, it's going to seem very chronological, each part of something like a story is going to be segmented, and it won't have an overall narrative arc. That is the problem I'm talking about.

So if I take something like ChatGPT and give it a piece of text to summarize, it's going to do a beautiful job: it will integrate everything together and create a cohesive, beautiful narrative, and that's great. But if I run the same thing through, say, a Gemini or a Gemma model, what's actually going to happen is I'm going to get this really chronological, narrowly focused piece. What does that mean? It means that if you have a long piece of text with a lot of rich relationships within it, then it's likely that any model using grouped query attention will miss out on those richer relationships. It will miss out on the broader story and focus in on narrow elements of the passage, whereas a multi-headed-attention model is really going to get that broad context and get the overall story arc spot on.
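Here's a toy sketch, in PyTorch with made-up head counts, of the difference between the two: in multi-headed attention every query head has its own key/value head, while in grouped query attention a small number of key/value heads are shared across groups of query heads, which is what saves memory and speeds up inference.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

batch, seq, n_q_heads, head_dim = 1, 10, 8, 16
q = torch.randn(batch, n_q_heads, seq, head_dim)

# Multi-headed attention: one key/value head per query head (1:1 mapping).
k_mha = torch.randn(batch, n_q_heads, seq, head_dim)
v_mha = torch.randn(batch, n_q_heads, seq, head_dim)
out_mha = attention(q, k_mha, v_mha)

# Grouped query attention: fewer key/value heads, shared across groups of query heads.
n_kv_heads = 2  # 8 query heads share 2 kv heads, i.e. groups of 4
k_gqa = torch.randn(batch, n_kv_heads, seq, head_dim)
v_gqa = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand each kv head so it is reused by every query head in its group.
repeat = n_q_heads // n_kv_heads
k_shared = k_gqa.repeat_interleave(repeat, dim=1)
v_shared = v_gqa.repeat_interleave(repeat, dim=1)
out_gqa = attention(q, k_shared, v_shared)

print(out_mha.shape, out_gqa.shape)  # same output shape, but GQA stores 4x fewer k/v heads
```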
That's the major difference between these models. Now, GPT-4o has stuck with multi-headed attention, but Llama 3 is running grouped query attention. Llama 2 used to run multi-headed attention, but they switched to grouped query attention. The Gemini models and the Gemma models are all using grouped query attention, and the Mistral models are using grouped query attention. With the Claude models we don't know, but I suspect, and I'm going to show you this in a second, that the old Claude 3 Opus model was multi-headed attention, and I believe the new Sonnet model has had an architectural change and moved to grouped query attention. You can see it, and we're going to look at that in a second.

So why would they change the architecture? The quick version is that, as you can kind of see, if I'm running smaller groups of keys and values then I'm going to get faster inference, and a lot of the model providers really want to push for faster inference. They essentially want smaller models, they want to do more with less, and they're really pushing towards that. So fast inference is really what's driving the change for most of these models, but obviously OpenAI hasn't switched to that, and they haven't switched on their newest model either.

So how can we tell if a model is running multi-headed attention or grouped query attention? Well, what we can do is design ourselves a prompt that will expose the architecture of the model. Now, I'm not saying this prompt is perfect, but what I decided to do is construct a little story about a bakery in a village, and when I say I constructed the story, I actually got ChatGPT to generate it for me. It's a story about a baker named Thomas who creates amazing bread every day, each with a unique flavor, and there are lots of descriptions of what Thomas does throughout his day in the bakery. Of course, one day Thomas meets a new traveler called Henry, who is down on his luck, and Thomas decides to help Henry out and teach him how to bake, and then they work together, etc. It's a really lovely story. At the end of it we ask a question: what was the role of the traveler named Henry in the bakery's success, and how did his relationship with Thomas evolve over time? The reason I've picked this particular setup is that it's a long enough prompt, just over around 500 tokens; I don't want it to be too big, but it's enough of a size that if the model is a grouped-query-attention model, it will probably get chronological, focus paragraph by paragraph, segment the answer, and miss the overarching story, whereas a multi-headed-attention model will deal with the story arc as a whole and tell a very integrated story rather than going paragraph by paragraph.

Now, if I want to, I can just run this through something like GPT-4o, and you can see I ask it that question and it says Henry played a pivotal role, and then it comes back with a couple of points about how that played out. That's GPT-4o's type of response, and it's a pretty good one, but I need a way of judging it. So what I've done in this scenario, and again I'll share this code, is I've taken the output of every single model and put it into a folder called "input".
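If you want to reproduce that capture step yourself, here's roughly what it might look like. The file names, and the use of the OpenAI client as the example endpoint, are my assumptions about the setup rather than the exact code from the video; swap in whichever model you're probing.

```python
from pathlib import Path
from openai import OpenAI

story = Path("bakery_story.txt").read_text()  # the ~500-token ChatGPT-generated bakery story
question = ("What was the role of the traveler named Henry in the bakery's success, "
            "and how did his relationship with Thomas evolve over time?")

client = OpenAI()  # example endpoint; any chat-completion-style API would do
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{story}\n\n{question}"}],
)

# Store the answer alongside the other models' outputs.
Path("input").mkdir(exist_ok=True)
Path("input/gpt-4o.txt").write_text(response.choices[0].message.content)
```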
In there you can see files for Claude 3 Opus, Claude 3 Sonnet, Gemini 1.5, Gemma 1 7B, Gemma 2 9B, Gemma 2 27B, and so on. I went through some of the key models I wanted to look at, ones I know have made architectural changes or ones where I don't know the architecture, ran the prompt I showed you a moment ago against each of them, and captured the outputs as text files in that input folder.

The second thing I've done is create a script called rank_outputs. What it essentially does is run through that whole input folder: you can see it iterates over each file in the directory, reads the contents, and then sends everything to GPT-4 with a prompt along the lines of "the following is the original prompt and several responses to it; please rank the responses based on their coherence, fluency, relevance, completeness and quality". If I want to, I can just run that in Python, so I'll say `python rank_outputs`, and all it has essentially done is build that combined prompt: it gives the original story we talked about, the question, and then the responses of every single model concatenated together. Again, you could go and test this for yourself and pull all of the outputs.

So I've taken all of these outputs, put them in this file, and then if I want to I can just copy and paste the whole thing and run it against any model; I'm going to pick GPT-4o. If I paste that in, you can see it says "based on the criteria of coherence, relevance, completeness and quality, here is a ranking of responses". It's come back with GPT-4o as highly coherent, GPT-3.5 as highly coherent, and then Llama 2 70B. Remember what I said before: Llama 2 was a model using multi-headed attention, and they switched to grouped query attention for Llama 3, and look, here Llama 2 is ranking higher than Llama 3. Then we've got Claude 3 Opus, and again that's pretty good, very coherent. Llama 3 has made it in at this point; in fact we're seeing some of the grouped-query-attention models coming in, but you can see the judge is already calling out that they're lacking depth, and it's calling out segmentation.

So I can make this prompt a little bit better. This time I'm going to ask it to penalize segmented responses, i.e. outputs that are highly chronological, segmented and chunked. It's the same prompt as before, but now with a higher penalty on segmentation and a preference for integrated, non-segmented answers. Let's have a look again: it's got GPT-4o at the top, it's still got GPT-3.5 there, Claude 3 Opus is still coming through, but look, Llama 2 is still there, Llama 3 8B is interestingly high up, Claude 3.5 Sonnet is there, and Llama 2 7B is pretty high up. So looking at the results now: GPT-4o is still at the top, GPT-3.5 is number two according to this, Claude 3 Opus is at number three, highly coherent, etc., and Llama 2 is rocking it ahead of all the Llama 3 models. The next model is Llama 3 8B, then you see Claude Sonnet, and then Llama 2 7B. That's really interesting, because we know Claude Sonnet is probably grouped query attention, Llama 3 is definitely grouped query attention, and then Llama 2 7B appears there, and again, that's multi-headed attention.
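For reference, here's a minimal reconstruction of what that rank_outputs script is doing, including the segmentation penalty from the second run. The exact wording of the judging prompt and the file layout are my guesses at the setup described above, not the script from the video verbatim.

```python
# rank_outputs: concatenate every model's answer from input/ and ask a judge model to rank them.
from pathlib import Path
from openai import OpenAI

responses = []
for path in sorted(Path("input").glob("*.txt")):  # one .txt per model, named after the model
    responses.append(f"### {path.stem}\n{path.read_text()}")

judge_prompt = (
    "The following is the original prompt and several responses to it. "
    "Please rank the responses based on their coherence, fluency, relevance, "
    "completeness and quality. Penalize responses that are highly chronological, "
    "segmented and chunked; favor integrated, non-segmented answers.\n\n"
    "ORIGINAL PROMPT:\n" + Path("bakery_story.txt").read_text()
    + "\n\nRESPONSES:\n" + "\n\n".join(responses)
)

client = OpenAI()
ranking = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": judge_prompt}],
)
print(ranking.choices[0].message.content)
```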
Now, one of the things to be aware of, and I'll probably do a separate video on this, is an experiment I did, or rather one that my colleague Jeremy actually did: he took all of the outputs and ran them through a tool that checks the readability of the text, and when you check for readability, the smaller models actually all did better than the larger models, and the multi-headed-attention models came through well there too. If we look here, you can see what's going on: the smaller models, which have better readability than the larger models, are coming through even when they're grouped query attention, and I think that's really interesting. Then there's Llama 2 back there, and then Gemma 2, which is a smaller model, but things like the newer Llama 3 70B are down at number nine at the moment, the Mistral models are all much, much lower, and Gemma 1, which is a multi-headed-attention model, is doing better than some of the Mistral models and better than Gemini 1.5 Advanced. Of course, TinyLlama is a terrible model, so it's not doing so well, but the new Gemma 2 model is sitting smack bang at the bottom.

The next thing I did was ask the judge which models it thinks are multi-headed attention versus grouped query attention, and I think it made a pretty good judgment. For multi-headed attention it's saying GPT-4o, Claude 3 Opus, Llama 2 (very coherent) and GPT-3.5, and all of these, apart from Claude 3 Opus which we don't know, are models we know to be multi-headed attention, so I think that's a pretty good indication that Claude 3 Opus is multi-headed attention as well. Mistral Next we don't know about; we know the Mistral and Mixtral open-source models are grouped query attention, but we don't know whether the closed ones are GQA or MHA. However, if we come down here, look, it's picked up Llama 3 as GQA, with a somewhat segmented, chunked processing approach, and Mistral Large as somewhat segmented, suggesting GQA. So remember what I said at the beginning of the video: with Claude 3.5 Sonnet they've gone for a different architecture, and in fact they've said themselves that they've made some architectural changes, so as good as Claude 3.5 Sonnet is, if you're generating text with it, it's probably making your stories worse, because the output is going to be segmented and chronological rather than an integrated storyline. Again, Mistral 7B is picked up as GQA, and it's picked up Gemini 1.5, which is one of the top models but is so low down there. TinyLlama is marked as MHA, and so is Gemma 1 7B, but those are such small models that I just don't think the judge can pick it up reliably there, and it marks Llama 2, Llama 3 and Mixtral variants as ambiguous, and again those are mostly smaller models, which is probably why. If I really wanted to push it hard, I could probably use an even bigger story than the roughly 500 tokens.
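Going back to that readability experiment for a second: if you want to try something similar yourself, a quick sketch is below. The textstat package is my own choice of tool here; the video doesn't say which readability checker Jeremy actually used.

```python
from pathlib import Path
import textstat

# Score every model's output in input/ for readability.
scores = {}
for path in sorted(Path("input").glob("*.txt")):
    scores[path.stem] = textstat.flesch_reading_ease(path.read_text())

# Higher Flesch reading-ease means plainer, easier-to-read text.
for model, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:25s} {score:6.1f}")
```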
So, to bring this back to the beginning: I believe certain model creators are making tradeoffs, and they are trading off the speed of their model against the quality of its outputs in certain scenarios. If you are using a model to create generative outputs, maybe it's content creation, maybe it's creating summarizations of articles, then you probably want to be leaning towards a model that's using multi-headed attention as opposed to a model that's using grouped query attention. As you can see, although those grouped-query-attention models are getting faster speed, for certain scenarios like summarization and content creation the model outputs are demonstrably worse, and they are worse because they are segmented and they're not presenting an integrated outline of the story. It's just something you need to be aware of. That's not to say the architecture doesn't work in other areas like code creation, classification, etc., but for content creation, generation and summarization of large chunks of text, that narrow focus that grouped query attention has might run you into some problems, and therefore the speed you get from the model might not be worth the tradeoff.

So you can see the model providers are making those trade-offs already. I think Claude 3.5 Sonnet has made that tradeoff and has moved to GQA; what will be interesting is whether Claude Opus sticks with an MHA architecture. What's really notable is that GPT hasn't moved at all. One of the things I would love to see the industry do off the back of this video is actually incorporate this into benchmarking, because I don't think it's being considered today. Anyway, I hope this video has been useful. In a future video I'm going to talk about the discovery that Jeremy and I made, that the smaller models are outputting more readable text than the larger models, and I think that's an interesting segue. On that note, I'll catch you in the next video.
Info
Channel: Chris Hay
Views: 872
Keywords: chris hay, chrishayuk, chatgpt 4o, chatgpt, claude 3.5 sonnet, claude 3 opus, claude 3.5 sonnet vs gpt 4o, llama-3, llama-2
Id: T8FsLJWPrAA
Length: 20min 30sec (1230 seconds)
Published: Mon Jul 01 2024