Benchmarks Say Claude 3 is Better than GPT-4, But is It?

Captions
So we seem to be living in this golden era of large language models. Of course, OpenAI have led the way with GPT-3.5 and then GPT-4; however, it's not the only company making large language models. We've had the recent offerings from Google with Gemini, Gemini Pro, Gemini Ultra and so on, and I've done a video recently about Mistral. Now Anthropic have released the latest version of their Claude large language model. This is Claude 3. It comes in three different sizes, and if you want access to the biggest one you need to pay for the Pro subscription, which I've done. So I'm going to take it for a test spin, as I've done in many other videos, to see how it compares. If you want to find out more, please let me explain.

Okay, so here we are inside Claude 3. Notice that you can pick different LLMs here. Opus is the one you get if you subscribe to Pro, which I've done. This one, Sonnet, is the one you get for free, but when I tried it, it was overloaded. So it's back to how it was in the days of ChatGPT with GPT-3.5, remember, when you couldn't get access to it. I couldn't get in, which is why I went for the Pro one. The previous models are here as well.

What we're going to do, as we normally do, is ask it some quick questions and see how well it does. The first question is about a basketball tournament. I give it some information and ask it which team won the overall tournament. It's going through it here: the Tigers got six points, the Running Cardinals got three points, the Hoopin' got zero points, therefore the Tigers won the tournament with that total. That is the correct answer, so far so good. Just to say, other LLMs like ChatGPT and Mistral get all of these right as well, so we're really seeing how well it compares to the others.

If it takes 3 hours for three towels to dry, how long does it take for nine towels to dry? Let's have a look: three towels, so therefore the drying time for nine towels is 9 hours. That is not right at all. What's interesting is if you change it to "it takes 3 hours for three towels to dry in the sun; how long does it take for nine towels to dry?" I wonder if that gives us a different answer. Okay, that's interesting: here it works out 1 hour times 9 is 9 hours, but then it says "therefore, it will take 3 hours for nine towels to dry in the sun", because it takes 3 hours for three towels to dry under the same conditions. So it's having a better stab at it there.

Let's move on and ask it a question about some objects on a shelf: what color is the bowl? The answer we're expecting is ocean blue. "According to the description of the objects on the shelf, ocean blue is the color." That is correct.

Now here's one of my favorite questions. We ask it about some movies, and we've had different answers from different LLMs; one said there's not really a right or a wrong answer. Which of these movies is most similar to Star Wars Episode IV: A New Hope? The options include The Princess Bride and Planes, Trains and Automobiles, and it says The Matrix, which is the answer you'd expect. Some of them did say that a better answer was The Princess Bride, because it's about a heroic journey to save somebody, which is similar. There's no wrong answer there; I think that's absolutely fine, and it is the answer I would expect.

Now on to a sports question. This is about soccer, or football as we call it over here. "No, the sentence is not plausible in football (soccer)." It even knows to translate. "Therefore, the offside rule applies only to players who are on the field." So that is correct. So far, all good.

That's looking at logic and things like that. Now, how good is it at making corrections? I've got a statement here that I've written in English, but it's got lots of mistakes: for example, "Africa" has got two A's at the end of it, "the river does span" is a mistake, "kilometers" is spelled with a C rather than a K, and so on. Let's see if it's able to correct that. Let's have a quick look: "Africa" looks right, "the river spans", that's correct. Oh, it's telling me the corrections it's made, that's very helpful of it: "Africa" with two A's fixed, yes; "does span" to "spans"; "kilometers" with a K; "South Africans" to "South Africa"; "Nambia" to "Namibia"; and it's corrected "outdoor" as well. That is actually what I would like, and I like the way it's given me the corrections, so top marks there.

Okay, let's move on to a bit of programming. I'm asking it to write me a Python script. This is something I've made up; it's not a standard algorithm, so let's see what it does. "Enter your name", it does one digest, then it does the other digest, and then, let's see, does it use lowercase? That's the other thing. It's "if a in lower", yes. So that is a perfect answer from Claude, and it even gives us an explanation afterwards. That's great.

My next question is to see if it can write some C code that will compile and evaluate a mathematical expression like this. The other LLMs gave their answers using the shunting yard algorithm, and I've got some videos about the shunting yard algorithm. Let's see what Claude says: a stack-based approach, that's correct. Just from reading the code, it seems to be developing a shunting yard algorithm. Let's go ahead and try it. Let's compile it: it compiles. Let's run it: 1 + 2 * 3 is 3? No, no, it's not. Let's try again: 1 + 1? No. Okay, maybe it wants some spaces: 1 + 1? No. So the code it created didn't actually work, which is a shame. Obviously we might be able to go back into Claude and say "it doesn't work, can you fix this, please?", but first time around it doesn't work.

Let's try another programming question, this time trying to find a bug. There is an overflow bug in this code. I've asked the same questions of the other LLMs, Mistral and ChatGPT, so let's see what it says. It's talking about integers and long long. Let's look at the code it produces, because that's the important thing: "long long num1". What's it doing? It's casting the first one, num1, to start with. Yep, that will fix the problem. It's also using a double there, so it has split that line out. So it's fixed that one, and that's good.

Another problem I've got for bug hunting is a bit of CRC-32 code. I've added in this extra line here, i = i + 1, which you don't need. Let's see if it's able to fix that for us. Right at the beginning it says, "The issue with the code is that the loop variable i is being incremented twice." That's correct. I was going to assume that if it knew that, it was able to fix it, and there you go: it removes the extra increment of i and puts a comment there saying so. I wasn't going to compile that, I was just going to assume it knew what it was doing, but oh, it even ran it, look at that. That is actually the correct answer if you run that. Okay, that's good.

Let's see what other programming languages it knows: "Write code in Lua to find the first 100 prime numbers." Let's see if it can understand that. Okay, that code looks reasonable. Let's try and run it. Yep, that looks good: that's the first 100 primes. That's a good job.

So there we have it: a strong performance from Claude 3. It only seemed to falter on that one question about the mathematical expression, writing the C code for the shunting yard algorithm. Maybe if I'd given it a different prompt, or given it some clues such as to use the shunting yard algorithm, or asked it to have a second go, it might have done better. But overall, I think it's a strong performance.

Of course, it really does now come down to who you want to give your money to. There are some free versions, there are ones you can run on your own PC, and I cover all of that here on this channel. There's Bing, there's Google, you've got Mistral, you've got Anthropic; there are just so many different ones to choose from, some of them free, some that you pay for. Where are you going to put your money? That really is the question. Of course, a lot of it is heart and mind: a lot of people are loyal to OpenAI just because of its trailblazing.

In my next video I'm going to look at which is the cheapest one to use if you're using API access, where you've got a program, a Python script, a C program, whatever, and you want it to interact with the LLM. Looking at the pricing, which one is the best? If you're interested in that, do stick around.

Okay, that's it. My name is Gary Sims, this is Gary Explains. I really hope you enjoyed that video. If you did, please do give it a thumbs up, and if you want to see more of these kinds of videos, then hey, stick around and subscribe to the channel. Okay, that's it, I'll see you in the next one.
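The C question is the one Claude 3 stumbled on. For reference, here is a minimal sketch of the kind of answer the other LLMs gave: an expression evaluator built on the shunting yard algorithm. This is not Claude's code or the exact prompt from the video; the function name `evaluate`, the `MAX_STACK` limit and the supported syntax (non-negative numbers, `+ - * /`, parentheses, no unary minus) are assumptions made for illustration. It uses the two-stack form of the algorithm, applying each operator at the moment the shunting yard rules would pop it to the output queue, instead of building an explicit postfix string first.

```c
#include <ctype.h>
#include <stdlib.h>

#define MAX_STACK 256 /* arbitrary depth limit chosen for this sketch */

/* Operator precedence; '(' returns 0 so no operator ever pops it. */
static int precedence(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

/* Pops one operator and two operands, pushes the result. Returns 0 on error. */
static int apply_top(double *vals, int *nv, char *ops, int *no) {
    if (*nv < 2 || *no < 1) return 0;
    double b = vals[--*nv];
    double a = vals[--*nv];
    double r;
    switch (ops[--*no]) {
    case '+': r = a + b; break;
    case '-': r = a - b; break;
    case '*': r = a * b; break;
    case '/': r = a / b; break;
    default:  return 0;
    }
    vals[(*nv)++] = r;
    return 1;
}

/* Evaluates expr; on success stores the value in *out and returns 1,
   otherwise returns 0 (syntax error or stack limit reached). */
int evaluate(const char *expr, double *out) {
    double vals[MAX_STACK];
    char ops[MAX_STACK];
    int nv = 0, no = 0;
    const char *p = expr;

    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        if (isdigit((unsigned char)*p) || *p == '.') {
            char *end;
            double v = strtod(p, &end);
            if (end == p || nv == MAX_STACK) return 0; /* bad number or full */
            vals[nv++] = v;
            p = end;
            continue;
        }
        if (*p == '(') {
            if (no == MAX_STACK) return 0;
            ops[no++] = '(';
        } else if (*p == ')') {
            /* Apply operators back to the matching '('. */
            while (no > 0 && ops[no - 1] != '(')
                if (!apply_top(vals, &nv, ops, &no)) return 0;
            if (no == 0) return 0;        /* unmatched ')' */
            no--;                         /* discard the '(' */
        } else if (precedence(*p) > 0) {
            /* Left-associative: apply stacked operators of equal or higher precedence first. */
            while (no > 0 && precedence(ops[no - 1]) >= precedence(*p))
                if (!apply_top(vals, &nv, ops, &no)) return 0;
            if (no == MAX_STACK) return 0;
            ops[no++] = *p;
        } else {
            return 0;                     /* unexpected character */
        }
        p++;
    }
    while (no > 0) {
        if (ops[no - 1] == '(') return 0; /* unmatched '(' */
        if (!apply_top(vals, &nv, ops, &no)) return 0;
    }
    if (nv != 1) return 0;
    *out = vals[0];
    return 1;
}
```

With this sketch, `evaluate("1 + 2 * 3", &r)` sets `r` to 7 because `*` outranks `+`, and spaces are optional, so `"1+1"` and `"1 + 1"` both give 2.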
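The overflow bug from the video isn't shown on screen, so the following is only a hypothetical illustration of the fix Claude described: when two `int` values are multiplied, the multiplication happens in `int` and can overflow before the result is stored in a wider type. Casting the first operand (`num1`) to `long long` makes the whole multiplication happen in 64-bit arithmetic. The function names are made up for this sketch, and it assumes a 32-bit `int`.

```c
/* Buggy pattern: num1 * num2 is computed in int, so a large product
   overflows (undefined behaviour) before it is widened for the return. */
long long multiply_buggy(int num1, int num2) {
    return num1 * num2;
}

/* Fixed: casting num1 first converts num2 as well, so the multiplication
   is done in long long (at least 64 bits) and any product of two
   32-bit ints fits. */
long long multiply_fixed(int num1, int num2) {
    return (long long)num1 * num2;
}
```

For example, `multiply_fixed(100000, 100000)` returns 10000000000, which is far beyond the range of a 32-bit `int`.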
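The CRC-32 code used for the bug-hunting test isn't shown either. As a hedged reference, this is a common bitwise CRC-32 (the reflected polynomial 0xEDB88320 used by zlib and PNG), which may differ in detail from the code in the video; the function name is hypothetical. The comment inside the loop marks where an extra `i = i + 1;` like the planted one would break it.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32: reflected polynomial 0xEDB88320, initial value and
   final XOR of 0xFFFFFFFF. */
uint32_t crc32_bitwise(const unsigned char *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {   /* i is advanced only here */
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
        /* An extra "i = i + 1;" here would skip every other byte,
           so the checksum would come out wrong. */
    }
    return crc ^ 0xFFFFFFFFu;
}
```

A quick way to confirm the fix is the standard CRC-32 check value: the CRC of the nine ASCII bytes `"123456789"` is 0xCBF43926.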
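The last test asked for Lua code to find the first 100 prime numbers. The Lua program Claude wrote isn't reproduced here, so this is an equivalent sketch in C using plain trial division; the function names are hypothetical, and this is not a claim about how Claude's version worked.

```c
#include <stdbool.h>

/* Trial division: n is prime if no d with d * d <= n divides it. */
static bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

/* Fills primes[0..count-1] with the first `count` prime numbers. */
void first_primes(int *primes, int count) {
    int found = 0;
    for (int n = 2; found < count; n++)
        if (is_prime(n))
            primes[found++] = n;
}
```

After `int p[100]; first_primes(p, 100);`, the array runs from 2 up to `p[99] == 541`, the 100th prime, which is an easy value to check against any program's output.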
Info
Channel: Gary Explains
Views: 5,219
Keywords: Gary Explains, Tech, Explanation, Tutorial, Claude 3, LLM, Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku, AI, ML, Machine Learning, Claude 3 model family, Anthropic, Large Language Model, Claude vs GPT-4, OpenAI, ChatGPT, GPT-4
Id: kShPahySovs
Length: 9min 24sec (564 seconds)
Published: Tue Mar 05 2024