Older GPT-4 vs new GPT-4-TURBO vs Opus and Gemini as well as random models from LLM arena

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

hey everyone in this video we are going to compare the old GPT 4 to the newest GPT 4 turbo versus Cloud 3 Opus versus gini 1.5 Pro versus two random models from the chatbot Arena using four difficult math problems that are posted on stack exchange and also we will compare it in a difficult coding task using the documentation provided in open air cookbook trying to get it to write uh proper function calling P pt4 calls which actually calls itself is a function call so this should be a fun one let's get started I don't know if you know but if you go to open platform. open.com now they have the compare option where from where you can select one of their models we can select gp4 gp24 turbo and here right here you can set the settings I've set the maximum tokens to uh maximum you can enter a system message if you enter a system message they automatically sync unless you uncheck the box so but we're just going to leave that empty and then you enter your message here and they both give you responses and then we're going to do the same with Claude Gemini and then chatbot Arena this arena is actually what is used to calculate the leaderboard that you may have seen all over the Internet so this is the first question if every two out of three rade shirts need alterations in sleeves and every four out of five needed in the body how many alterations will be required for 60 shirts this is a word problem let's go ahead and paste it here and let these two start answering we'll also going to ask Claude to do the same and Gemini to do the same and also in the chatbot Arena we're going to ask the same question so these models we don't know which ones are chosen but we'll be able to see at the end so let's take a problem this person said that the true correct answer is 88 but the uh answer that was provided is 133 but if you read through the comments actually 88 seems to be the correct answer and if you look gp4 says answer is 88 then the new gp4 says 88 so this is correct so for each question we'll put a green circle if they answer correctly or a reddish one like this they answer incorrectly let's see what Claude has said uh Claude said 56 alterations so that is incorrect let's go ahead and do that Gemini said 88 alterations so that is correct uh here we go let's see what two random models said so this model has set 88 alterations and this model set 88 alterations so they're both a tie we're going to select tie so we can see the models so this was e 34 billion chat and this was quen 1 and half 72 billion chat so they both were able to answer it correctly but we're not going to have it in the uh ranking here okay so this is good let's move on to the next next problem this is a bit longer so please pause it pause the uh video and take a look at it if you want to read it I'm just going to copy it and bring it here here I can actually click on clear and paste it and let them answer this we'll do the same with Cod right here with Gemini and also here we need to say new round and then paste it here so we get to see what two new models are going to do also this is a algebra pre calculus problem so I'm actually going to copy it and showcase my auto streamer I'm going to go ahead and put in this course this actually generates courses structured course websites about anything anything that you're interested in here let's just do I'll pre calculus if you want to find out more you should see the website link at the bottom right corner of your screen so curriculum created successfully let's see this looks good we can go ahead and generate this course and as we can see our course is being generated in the background listen to but let's see how our models did the answer to this question is 35 and 75 respectively so the new turbo did it correctly and old GPT for didn't let's look at cloud cloud got it right Gemini is still working on it and here we got these two models again this is a tie and this one was command r+ and Cloud 3 Sony got it right let's see what Gemini did so Gemini actually gave us a method of solving the problem and didn't provide the solution so let's go ahead and mark them so old gp4 got it wrong no pt4 got it right Opus got it right Gemini 1 half Pro uh got it right and our course is being contined to be generated let's go ahead and pick a new problem so this is in a group of children each child has certain number of pencils and we're trying to find the max maximum number of pencils so let's go ahead and copy this and here again we're going to click clear and ask this question to all and new gp4 we're going to ask to pod and to Gemini and to we going to create a new round and ask it to two random models and this is a sequence and series problem we can actually go ahead and stop generating this course this is what Auto streamer allows us to do that and I'm going to generate a course three chapters about sequences and Series in the meantime if I wanted to learn more about it so let's see what the answer to this is so the answer we're looking for is n equaling 9 and and both the old GPT 4 and the new one got it wrong Cloud actually got it right with 9 and Gemini is still going and here both of these models actually it looks like got it wrong so we can say both are bad and this was E34 billion and GPT latest GPT for trbo so we get to try it twice maybe we can add this here I'm not sure let's see what Gemini is going to say still going uh it says it cannot be determined okay so that's that's wrong so this is what it looks like currently both gpts got it wrong on Opus got this one right in the meantime we've generated our curriculum we can take a look at it and if you like it we can go ahead and generate the scores right away and while we are doing all this our code is going to be generated let's take look at our last uh math problem this is the math problem two grass oper so they jump at different uh lengths and we are trying to find out how many jumps they need to perform to when so they meet I'm going to clear this and ask it to both GP for ask it to Claude and ask it to Gemini and then ask it to two new random models let's see what happens this is our last math question and after that we're going to do the programming question so I already copied the entire function calling documentation that is presented in this cookbook and uh what we're going to ask is actually please to create a chat app which does language translations so it's going to do it with a function calling but is actually going to call the function that we're going to call is going to make a call to gp4 for the translation so we're going to try this with all through all all these models and see this is a complex task and a very useful one as you can imagine let's see uh what happened with our last question so the accepted answer is that uh it requires 78 jumps let's say the old GPT for said 20 jumps that is incorrect new GPT say 63 jumps it got closer but that was wrong and opos SE said 21 jumps uh Gemini is still going here a random model said five jumps and 78 jumps actually so this is the correct one right so the model here actually in the LM Arena actually got it right so we can say B is better and let's see so this was the new gp24 it got it right one time so that's interesting so here it didn't get it right but it got it right so it goes to show the newer model is pretty good with math and this other random model was claw 3 hu which got it wrong G uh is still going on let's wait for it also as you can see our uh course on sequences is being generated listen to it versus recursive def so this is really nice Auto streamer generate courses in real time like this about any subject it's uh going continue to generate okay Gemini concluded this 22 jumps which is incorrect so let's go ahead and Mark these so I just can of did it like this I guess this is half a point because it answered it you know in the arena rightly one time so now let's continue with our coding question so this is like I said the copy of the function calling and we are going to ask let's read what we are going to ask can you please create a chat app which gets language translations using a function but the function should call gp4 Trio model for the translations our main chat app should use gp4 Trio as well with function calling please Solly refer to documentation when creating the app please return the code in a single codl we just need to app that works from the terminal so it doesn't it so it won't try to create a web app or something like that so let's paste this here we're going to ask it to old and new gp4 we're going to ask this to CLA maybe let's make a new chat I'm just going to enter it because the instructions were in there let's create a new chat with Gemini 2 let's give it a fair chance so doesn't think it's a math problem and now we're going to ask it here with a new round so two random models and then we are going to test it okay let's take a look at the coding results actually this older gp4 model in the plat opening eyes platform Compares say that the maximum token is 8,000 so this must be the really old gp4 uh in any case we have the uh results for the latest gp4 here so this is the code it provided it's actually going to do a function call but it's not actually calling uh translate text so this actually seemingly made a mistake because it's trying to call a translate function which it didn't Define uh I actually did this experiment earlier with the GPT it was able to provide it but uh this time it failed let's see what claw did so as you can see this won't work because because it's it's trying to call a translate function which we don't have here and it's not calling the function now oppos actually did better here we have a chat with GPT and it's and it's actually is a translate text function and it's going to try to call it and it is a tool definition and we have a loop which actually get the function arguments and calls the translate text let's actually put a break point here run it in debug mode see this actually is calling the function let's ask how do I say hello in French and yeah I thought this was going to happen because uh it used the function call but as we know the latest API actually requires Tool uh such as here so it actually made a mistake but I actually tried this before and it worked so this is twice on the video that it didn't uh let's let's go ahead and see what uh Gemini did okay Gemini in first look so we created a translate text which makes a call to GPT with the target language and it defined that tool just it said the name is translate except it should be I believe translate text maybe we can help it a little bit and it's going to it actually accurately chose the parameters there tools and Tool choice and here it's going to try to make a function call let's run this and see if this will work we did help it with the function name but actually the problem with the other one was more fundamental so this should hopefully work we'll ask the same thing we're going to ask the same thing incorrect API okay I have to remove this line because I auto detect from my environment variables let's try again okay now the Moment of Truth yep it is calling the function so it is yeah it's actually we had to fix this part right here function translate text let's try it again so obviously it didn't work out of the box but it got really close I just want to give it a chance uh tool okay it actually is calling it actually called GPT for translation bour say calling the translate text so this actually worked and we got the translation printed bonjo in the chatbot Arena we this one actually failed which was called Lama 70 billion but here we have some code which we can copy and take a look this is from the old GPT 4 gp4 0313 let's see how that did okay if you look it did do the right parameters tools and Tool Choice it created so this should actually work did it get the name of the function right translate text no it didn't Let's help it a little bit right here by just as we did with Gemini and if the function name we have to say here translate text and put a break Mark here see if it will accurately call the function which will make an another call to GPT we are going to ask the same thing again and oh API key I have to remove this real quick and run it again by the way our whole course in sequence is says sequence and series created tiacci sequence so this is cool I'll talk more about Auto streamer here in a moment let's actually see if this is going to work what can I how can I say yeah so unrecognized arguments completions because it didn't use chat completions so this this failed so effectively all through all four of them have failed on this one but actually when I tried it before the video they actually worked but they they got close I guess so when me look at this so with this little test when me tally up the results old gp4 got only one ride new gp4 got two and a half right Opus got two and Gemini got one right so the new gp4 actually ended up doing the best in this case if you enjoy my videos you can find all 280 videos I've created at my website e.li these are directly linked to the YouTube videos can find their descriptions and if you're a patron you can actually download the codes very easily so this gives you the benefit of this is that you get to experiment quickly I do this all day long so that you can actually learn and experiment with different ideas much more quicker and becoming a patron gives you access to the code over 200 projects so that is the you know so you can unlock some new ideas and experiment with them quickly as you can see auto streamer generated this course and it it is actually still streaming but I can stop it and I can go to view and launch courses I can close this website so now this course was fully generated I can actually relaunch that course at any time I like like I said it doesn't have to be about math you can zoom in if you like to make the text bigger partial sums so it comes with the audio as well it creates courses in a structured format which are ready to deploy if you like to find out more about autost streamer you can go to autost streamer. live and you can actually download a free demo if you click on it will take you to uh Google Drive and you can actually download it by clicking this button when you download it you'll get Auto streamer demo. exe which you can run it's currently only available for Windows so the demo actually works uh just like the full version but some some of the some of it features are limited the cool part of Auto streamer is that actually this course if I click on open folder is entirely Deployable online all I have to do is actually just go to my GitHub repo create a new Repository going to create a private repo called sequence and series and then all I have to do is actually just upload these files by clicking uh upload existing files and once I upload this I can easily deploy it online uh let's just do that real quick once I upload it I'm going to click on Commit changes all the files that auto streamer generated are uploaded I'm going to commit the changes so this is going to create a new repository after which I can go to my realway account or wherever you like to host your websites in and since our repository has been created and my uh repo has been uh linked I can just say deploy from GitHub Reco repo uh select sequence and series and deploy now and this should make it online while this is building I'm just going to come to the setting stab and I'm going to generate a domain Railway automatically generates a domain for this I'll just wait until it's built and then I'll generate a domain you can also uh if you have a custom domain you can actually hook your custom domain up as well uh once you have the if you do download the full version of the auto streamer then you can actually change this header and footer message along with the link that is provided you click on download full app it'll take you to my patreon shop it's currently available for $200 okay I'm just going to select generate domain and just like that I have deployed it online in sequence and Series this is now fully online I'll actually put the link in the description you want to uh check this course out definition and notation of SE yeah you can generate General courses with code or without code absolute versus condition and uh so this is pretty cool uh let me know if you have any questions at our Discord I hope you enjoy this video uh and I'll see you in the next one

Info

Channel: echohive

Views: 897

Rating: undefined out of 5

Keywords:

Id: 3hJnLJ1YeKk

Channel Id: undefined

Length: 17min 9sec (1029 seconds)

Published: Fri Apr 12 2024