New GPT 4 reasoning tested vs Opus and Older GPT 4 models

Captions
So today a new GPT-4 Turbo model, a majorly improved version, was released in the API. This is a truly multimodal model, I believe, with vision capabilities: you can now use JSON mode with vision and with function calling. But we are actually interested in its reasoning capabilities, and we are going to test it, along with all the other 3.5 and 4 models from OpenAI and also Claude 3 Opus, Sonnet, and Haiku, on two different problematic reasoning problems. I can't remember where I found these problems, but I believe it was the Code Your Own AI YouTube channel. I wish I could remember.

So we have these two problems. Generally speaking, when a problem is asked in this good form, GPT-4 was getting it mostly right. But when you move a sentence around, for example "the rink also has yellow cars," to a different position, the performance actually dips. The answer is supposed to be 23 yellow cars. The same is true for the second example, for which the answer is 24%. You can pause the video and read the problems yourself.

What we're going to do is take both of these problems, reorder the sentences in a combinatorial way using permutations, shuffle them, and pick 20 random examples from that sampling. We're going to run them with all these models: GPT-3.5 all the way from 0125 to 16k, and also the new GPT-4 Turbo plus 0125, 1106, and 0613, just to see how they do. And we're going to do the same for Claude 3 Opus, Sonnet, and Haiku.

To do this in parallel we're going to be using ThreadPoolExecutor. If you want to see each run sequentially, you can set the parallel run amount to false, and if you set stream to true you'll actually be able to see the streaming responses from the API. Let's try with the first problem. So this is going to run in sequence if you were to start
this. We'll see, and I will talk about the code while we're doing this. By the way, the code files for this will be available at my Patreon; the link will be in the description. We initialized GPT, and here is its thinking process and here is the answer: it's at 23, and we marked it as correct. Here is the second one, that is correct, and here is the third one, that is correct. I'll cancel that run because we actually want to do this in parallel, but this is how you can run it in sequence and see the results yourself.

What we want to do is set the parallel runs to 20 and set stream to false. We are going to test 20 combinations out of the mixed sentence structure of either one of these problems: we pick 20 randomly sampled combinations of sentence variations from problem one, and later problem two, and test them on all these models. So let's go ahead and do this. Let's first start with GPT. I'm just going to click play so we can observe what's happening. We're going to be reviewing the code too but, like I said, the code files will be available at my Patreon.

The benefit of this script is that you can pick different problems along with different system messages, so you can test different models and different problems with the script, in parallel, using ThreadPoolExecutor. As you can see, we are going through the 3.5 models pretty quickly. The results for 0125 came in: it got 17 correct and three incorrect for the bumper car problem. The next one, 1106, got only 14 correct and six incorrect. We'll view these later. We can click here to start a split terminal so we can run the Claude test in parallel too: I'm just going to say python and run the Claude version as well. Let me pull this up. We first print the problem, and we print which model we're testing, and we do the same here as well.
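The sampling step described so far (split a problem into sentences, build every ordering, shuffle, keep the first 20) can be sketched roughly as follows. This is a minimal illustration, not the script from the video: the function names, the regex-based sentence splitter, and the placeholder problem text are all assumptions made for the example.

```python
import itertools
import random
import re

# Hypothetical five-sentence placeholder; the real problem text is shown in the video.
PLACEHOLDER_PROBLEM = (
    "The first fact is stated here. The second fact is stated here. "
    "The third fact is stated here. The fourth fact is stated here. "
    "What is the final answer?"
)


def split_sentences(problem):
    """Split a problem into sentences at '.', '?' or '!' followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", problem.strip())
    return [p for p in parts if p]


def sample_orderings(problem, combinations_to_test, seed=None):
    """Return up to `combinations_to_test` shuffled sentence orderings as strings."""
    sentences = split_sentences(problem)
    # All n! orderings of the sentences (120 for a five-sentence problem).
    orderings = list(itertools.permutations(sentences))
    random.Random(seed).shuffle(orderings)
    return [" ".join(order) for order in orderings[:combinations_to_test]]


if __name__ == "__main__":
    prompts = sample_orderings(PLACEHOLDER_PROBLEM, 20, seed=0)
    print(f"{len(prompts)} reordered prompts sampled")
```

Building the full list of permutations is fine for short problems, but the count grows factorially, so a problem with many more sentences would call for sampling orderings directly instead.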
So we see which model we are currently testing. We have hit our rate limit, but we have a retry mechanism in place, so we will retry and continue, and as we get the results we are going to be saving them here. We have made great progress, and we got the Opus results too.

So let's go ahead and see the results for GPT-3.5 Turbo. This is the latest model, which got 17 correct and three wrong. This is the next model in the timeline: 14 correct, six incorrect. This is for the bumper car problem. The next one up is, I believe, this one, 0613, which got 15 correct and five wrong, and we have Turbo 16k, which only got 14 right and six incorrect out of the 20. Our first problem was the bumper car problem.

Let's take a look at GPT-4 Turbo, the latest model, on the bumper car problem: it got all of it correct, 20 correct and zero wrong. GPT-4 0125 is also 20 correct and zero wrong, and GPT-4 0613 is 20 correct and zero wrong as well. The 1106 preview also got all of it correct, so the bumper car problem apparently is not a big deal for GPT-4. We can also take a look at Opus for bumper cars: it got 19 correct and one wrong. Sonnet got 17 correct and three wrong, and Haiku got 18 correct and two wrong.

Now continuing to the Marcus problem, which is a math logic problem. The latest version of GPT-3.5 only got one correct and 19 incorrect on the Marcus problem, so this is a much harder problem, apparently. The next one, 3.5 1106, got two correct and 18 wrong. The 0613 version got six correct and 14 wrong; sometimes people say that 0613 was a better model, perhaps. 3.5 Turbo 16k got only two correct and 18 wrong. The latest GPT-4, the one that was released today, got 17 correct and three wrong on the Marcus problem. GPT-4 0125 got 17 correct and three wrong, and the 1106 on the Marcus problem got
17 correct and three wrong. 0613 got 15 correct and five wrong on the Marcus problem. So the latest model performed perfectly on the bumper car problem and got 17 correct on the Marcus problem.

Let's see how Opus did. On the bumper car problem it got 19 right, so the latest GPT actually beat it. All the GPT-4s beat it; as a matter of fact, they all got this one completely correct. On the Marcus problem Opus got 17 correct and three wrong, and the latest GPT-4 got 17 correct and three wrong as well. Sonnet got only eight correct and 12 wrong on the Marcus problem, so maybe we can compare Sonnet to 3.5. But if we look at the latest 3.5, I mean, 3.5 just couldn't do the Marcus problem at all. Let's take a look at Haiku: Haiku got 13 correct and seven wrong on the Marcus problem, so Haiku beats 3.5 all the way and does almost as well as the 0613 version of GPT-4. So there you have it; these are the results.

But here is our question: was the latest model, the 2024 version that was released today, better than the previous one? Let's take a look. On the bumper car problem they both got it perfectly right, and on the Marcus problem the previous one got 17 correct and three wrong and the latest one did exactly the same, 17 and three. So at least in this test we couldn't see a difference. Maybe if we run it for more combinations we could see a difference, so we can go ahead and do that. I'm going to delete these results, and we're no longer going to run the Claude option. Let's comment this out: I want to run the latest model versus the 0125 preview, and let's set combinations to test to 50 and see what kind of results we get. Let's go ahead and run this, and now we can review the code.

Okay, we got our first result. GPT-4, the latest model that was released today, again got all the bumper car problems correct: 50 correct and zero wrong,
50 versus zero. We are waiting for the GPT-4 0125 preview to finish. Okay, on the bumper car problem it got all 50 correct as well. But the real problem was the Marcus problem, so let's see how it does. Okay, we got the result for the latest GPT-4 on the Marcus problem: 41 correct and nine incorrect. We're now waiting on the 0125 preview. Okay, we have the results for the Marcus problem: the previous version of GPT-4 got 37 correct and 13 wrong, versus 41 correct and nine wrong. So just from this comparison run we can say that perhaps the latest GPT-4 did do better on this reasoning task.

Now let's go ahead and review the code. The code is going to be pretty much the same for the GPT and the Claude versions, so I'm just going to review the GPT version. I am using the simplified Claude unified class I've created, plus a simplified OpenAI unified class. If you want to watch the full video on the OpenAI unified class, you can find it at my website, echohive.live, along with all my other videos. It's this one right here, "OpenAI unified API," and if you're a patron you can quickly find the code download links here too, for each project.

So let's take a look at this. We define the parallel amount. Like I said, feel free to change this depending on how many threads you want to start, but you don't want to start too many threads all at once; 20 is a good number. We set stream to false and combinations to test to 20. Let's bring this whole model list back up. You define your models, which we're going to loop over. This is our first problem statement, and this is our second problem statement. We turn them into a list, loop over those problems, and print which problem we are testing. This is the system message; feel free to change it. We split the problem into its individual sentences by looking at the sentence end points, and now that we have this list, we use itertools permutations to create all sentence combinations. Then we print how many; for
example, for the first problem there are 120 combinations. Then we shuffle them. Then we have this function, which we are going to run with the ThreadPoolExecutor, and which takes in a sentence combination and a model name, so you can dynamically use the model name from the list. We use the GPT calls class, set streaming to whatever we've set at the beginning of the script, and set the model to whichever model we're currently using. If we don't have a system message already, we add a system message. Now we take the reordered problem, turn it into a string from the sentence combination list that is passed into this function, and add that message to the GPT's message history.

Then, in a while True loop, we try to get a response using the get response method of the OpenAI unified class. If we get a response, we break out of the loop. If we get an error, we can print the error, but if "rate limit error" is in there, we just print that we hit a rate limit error, sleep for 30 seconds, and continue. Once we get a response without errors, we break out. If the problem is the first problem, then we are looking for "23" in the response, so this will return true; if it's the other problem, then we're looking for "24%" in the response, so this will return true.

So here we loop over the model names for each problem, and we print which model we're testing. This, I guess, is not important; we can comment this out. Oh, I'm sorry, actually this is the dictionary we initialized. This is why I said you can test different system messages: this dictionary will include the system message you have specified and how many correct and incorrect you've gotten per model and per system message. So you can actually test different system messages; just remember that. With the pool executor, we're going to initialize max_workers
based on the parallel runs amount, which we specified at the beginning of our script, and then we call the check Claude response function with the combination and the model name. This is a single line of code: we call this function for each combination in the sentence combinations, but we slice the list up to combinations to test, which we specified up here. As those futures get completed, we append the results to this dictionary, and then at the end we count. If you run this multiple times it'll keep track of the numbering: it saves as results, the model name, and the first 10 characters of the problem. I guess this has to go at the end; you can modify it. That's just to enumerate the files if you keep running this multiple times, so it won't overwrite. Then we write them to JSON, so we get a result like this. So this is pretty much it. I hope you enjoyed this; let me know what you think. Thank you for watching until the end.

For the next few minutes I'd like to talk about my Auto Streamer, the newest version, version three. I will go ahead and start it. You can download a free demo of this app, and once you click on it, it will take you to Google Drive. The app is launching right now; we'll give it a second. So if you're at the Auto Streamer website and you want to download it, you can click on the free download and download it from Google Drive. Otherwise you can click on download full app and it'll take you to my Patreon. If you do decide to buy it and download it, you will get something like this, and all you have to do is double-click on autostreamer.exe and the standalone Windows app will launch. All you have to do is put in your API key; it will also auto-detect it from your environment variables. Then you click on generate course and just type in anything you would like to learn. For example, I can type in "future of AI." You can also select code-based topics if you want to learn coding. Let's go ahead and do that: I can say "Python dictionary methods." This can really be anything you like. Here you can select how many chapters you want to create; let's select two and click on generate course outline. This will generate a course outline for us, which we'll be able to review in a moment.

Okay, curriculum created successfully. Now we can go to view course outline and find our dictionary course. We can do a search and review the course outline, and if we like what we see we can go ahead and select it. There's also a search option. Then we click on generate course, but before we do that: we have already selected it, as you can see right here, "course is selected." We can choose how detailed we want it to be; let's set this to 100. You can select the web name. We can choose generate and play, with which the audio will start playing right away. You have six different voice options and over 50 languages. You can also customize what appears on the web page. I'm just going to click generate course, and it will launch a website and create this in real time. Because we selected generate and play, it will play the audio. Let's just give it a second.

"A dictionary in Python. A dictionary is an unordered collection of items. While other compound data types have only value as an element..." As you can see, there's the play/pause button. I can pause the generation process; maybe if I was live streaming, or recording a video as I'm doing right now, I can talk about this and then come back to it and click play again ("the dictionary has a key...") to continue.
I can also stop this generation, and I'm going to go ahead and do that. If you generate your course to the fullest, then you will be able to view all your generated courses here, and you can launch them, like "All Great Philosophers of the World," which is a course I created. We can review this course, with the Socratic method and Plato's Republic: click on click and listen, and read through them. Like I said, this header, the footer link, and the text can be customized from these settings. If you like what you see and want to learn more, please go to autostreamer.live; the link is in the description. There's a quick video there too, and an FAQ, and if you have any questions you can contact me on Discord or Twitter. Just let me know. Thank you for watching, and I'll see you in the next video.
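For reference, the evaluation loop walked through earlier in the captions (a retry loop around each model call, a substring check for the expected answer, and a thread pool that scores many reordered prompts at once) might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the video: `ask_model` is a hypothetical callable standing in for the unified API wrapper, the function names are invented, and errors other than rate limits are re-raised here rather than just printed.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def ask_with_retry(ask_model, prompt, retry_delay=30.0):
    """Call `ask_model(prompt)` until it succeeds, sleeping on rate-limit errors."""
    while True:
        try:
            return ask_model(prompt)
        except Exception as err:
            # Assumption: rate-limit failures mention "rate limit" in their message.
            if "rate limit" in str(err).lower():
                print("Hit a rate limit, sleeping before retrying")
                time.sleep(retry_delay)
                continue
            raise  # Any other error is surfaced to the caller in this sketch.


def score_orderings(ask_model, prompts, expected, max_workers=20, retry_delay=30.0):
    """Run every prompt in parallel and count answers that contain `expected`."""
    results = {"correct": 0, "incorrect": 0}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(ask_with_retry, ask_model, prompt, retry_delay)
            for prompt in prompts
        ]
        # Tally each answer in this thread as soon as its future completes.
        for future in as_completed(futures):
            if expected in future.result():
                results["correct"] += 1
            else:
                results["incorrect"] += 1
    return results


if __name__ == "__main__":
    # Stand-in model for demonstration only; it always gives the same answer.
    def stub_model(prompt):
        return "After working through it, the answer is 23."

    tally = score_orderings(stub_model, ["prompt A", "prompt B"], expected="23")
    print(json.dumps(tally))
```

A plain substring check is simple but loose: a response containing "230" or restating a "23" from the prompt would also count as correct, so stricter answer extraction may be worth adding when results are close.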
Info
Channel: echohive
Views: 2,961
Id: Arl08JlXERM
Length: 16min 50sec (1010 seconds)
Published: Wed Apr 10 2024