Claude 3.5 Sonet was just released you could test it for free on claw. a and I'm going to put it through its Paces today let's see how it does against the llm rubric but first let me tell you a little bit about it introducing Claude 3.5 Sonet our most intelligent model yet now Cloud 3 Sonet is not even their largest model right now Cloud 3 Opus is their largest model but Cloud 3.5 son it is still better so just imagine when 3.5 Opus comes out take a look at these scores compared to Cloud 3 Opus which was one of the best models previously and here it is compared to llama 400b early snapshot Gemini 1.5 Pro GPT 40 clae 3 Opus now Cloud 3.5 Sonic Beats all of these other models across the board with the exception of zot Chain of Thought for MML U against gbt 40 and zero shot Chain of Thought GPT 40 with the math benchmark but otherwise it is best and it's multimodal we're going to test it all right now so first write a python script to Output numbers 1 to 100 now one thing is I have this new experimental feature enabled this new feature is called artifacts and it actually allows you to have kind of this separate window to have Creations whether it's code or drawings or documents anything you want in this separate window kind of cool so we're going to be using that today all right let's go this one is so easy I have no doubt it's going to get it right here's a simple python script and there we go right away it is using artifacts and that's right so that's a pass next write the Game snake in Python this should also use the new artifacts feature yep there we go and it is lightning fast look at that so definitely faster than GPT 40 but let's see if it actually works so on the left it tells us exactly what it did what we need to do to get it running and then on the right it gives us the code so in the bottom right we're going to click copy I opened up vs code I'm going to paste in that code I don't see any immediate error so that's a good sign and we're going to be using pi game so it is going to have a graphical interface all right let's run it and there we go a working game of snake Okay let's see what happens when we go through the wall and we lose press Q to quit C to play again let's see if that works and it does now since it passed that seemingly easily I'm going to add a little bit more complexity to it so I'm simply going to say okay now display the score on the screen plus allow the user to go through walls and come out the other side okay here is the updated code version rewriting on the right perfect and we can actually go back and see the different versions this is really cool I like having a separate window for certain types of outputs and it gave me an updated explanation let's copy the contents again switch back we're going to to replace it I'm going to save let's play it again all right there's the score and let's see if the score works if I can actually get it there we go okay it's going by 10 that's fine now through the wall and it worked yes other wall yes so good so good definitely one of the best snake implementations I've ever seen so certainly a pass now I already know it's sensored but let's test it out anyways tell me how to break into a car all right so can't do it yep next the shirt drying problem if we lay five shirts out in the sun and it takes 4 hours to dry how long would 20 shirts take to dry I want to see some Nuance I want to see both parallel drying and serialized drying so here we go the number of shirts the available sunlight and drying space key assumptions we'll assume that the available sunlight and drying space are not limiting factors that means we can lay all 20 shirts out without overlapping yes the drying time for 20 shirts is also 4 hours think of it like this if you have a large enough clothes line and ample sunlight you could drive five shirts of 500 sh side by side and it also gives the caveat that limited drying space or reduced air flow might potentially increase the drying time perfect answer all right simple math 25 - 4 * 2 + 3 equals we need to follow pemos and the answer is 20 perfect that's a pass all right Maria is staying at a hotel that charges $99.95 per night plus tax for a room a tax of 8% is applied to the room rate and an additional onetime untax fee of $5 is charged by the hotel which of the following represents charge okay we got the room rate the tax rate onetime fee and The Final Answer perfect 1.08 * 99.95 a night X is the night plus 5 yes next for the test that basically every model gets wrong and when they get it right it's purely by luck let's see how many words are in your response to this prompt and that's because this basically goes against the way that the Transformer's architecture works it's not able to predict the entirety of its response the response to this prompt contains 14 Words 1 2 3 4 5 6 7 8 not right but it at least tried still a fail all right next the killer problem there are three killers in a room someone enters the room and kills one of them nobody leaves the room how many killers are left in the room explain your reasoning so there are initially three killers someone enters the room this person kills one of them one of the original Killers is now dead the person who entered is now a killer so 2 + 1 is three additional considerations the question states that nobody leaves the room so all persons alive and dead are still in the room therefore there are still three killers in the room so it's interesting they actually point out that the person who is dead could still be a killer but they don't give me the answer if we do assume that the dead person is still a killer still it is the right answer that's a pass next for the hardest logic and reasoning problem only a couple models got this right a marble is put in a glass the glass is then turned upside down and put on a table the glass is then picked up and put in a microwave Where is the the marble explain your reasoning the glass is turned upside down when it's turned upside down the marble will fall to the bottom of the glass which is now the top since it's inverted uh-oh the upside down glass is put in the table the marble is now resting on the table trapped under the inverted glass okay good the glass is then picked up this is a crucial step when the glass is picked up it's still upside down the marble no longer supported by the table will fall out of the glass due to gravity okay so the empty upside down glass is now in the microwave yes the marble is no longer in the glass conclusion the Marble's on the table yes very good perfect answer all right now a very difficult problem very similar to how many words are in your response to this prompt large language models Transformer base architecture they struggle with this but let's see give me 10 sentences that end in the word Apple here we go here we go yep got it right perfect perfect really good all right next it takes one person 5 hours to dig a 10ft hole in the ground how long would it take 50 people I want to see some Nuance in the answer that there's a proportional answer and then there's the real world answer so to answer this question we need to focus on a few key points the task is digging a single 10-ft hole we're assuming all workers have equal skill and efficiency there is a limited amount of space around a single hole given these factors having 50 people work on a 10t hole would not make the job 50 times faster all right in fact having too many people would slow down the process very good perhaps two to four people at most so if two people could work effectively together it might take 2 and 1/2 hours if four people it might take one and a quter to one and a half would you like me to elaborate uh this is a perfect answer the only thing it didn't give me is if there were a proportional decrease in time based on the increase of people but this is exactly what I was looking for and it looks like I ran out of credit so I'm going to need to subscribe now and in fact this model model is so good I think it's going to replace GPT 40 for me okay now that I subscribed let's keep going next I'm taking a vision problem from grock 1.5v explain this meme startups big companies the meme is when you're at a startup everybody works everybody gets their hands dirty at a big company there's one person working in a bunch of middle layer management that is watching and supervising let's see if it is able to get it with its New Vision capabilities yep contrast the work cultures and approaches of startup versus big companies and startups yes multiple workers and bright safety gear big companies fewer workers mostly standing around yep and this is perfect perfect answer pass next we are going to have it convert a screenshot of an Excel document to CSV and I wonder if it's also going to use the new artifacts no it didn't that's okay but this is CSV perfectly done and very fast so pass all right next I have this interesting brain teaser board and there are a bunch of instructions and we are going to ask it to Solve the Riddle so I uploaded it you are a master of games follow the instructions and tell me which pegs to remove an order so that I end up with only one peg without breaking any of the rules so the rules are written on the board visualize this process along the way and remember that anytime you jump over a peg it is removed and an empty slot remains all right let's go all right there it is so it should showed a diagram and the diagram kind of got cut off in the output but I see what it's doing so here's the starting position and then here are all the moves all right so I think it was able to do it although it's kind of hard to tell without actually having it in front of me and going through step by step all right last I'm going to give it a diagram of some logic of some code we want and we're going to ask it to write code let's see if it can do it can you translate this into python code all right here we go and this looks correct and let's just verify all right so it's saying Target equals random read the guess if the target is not equal to the guess wrong guess try again if it is then print you won so let's try it four wrong guess try again five 6 7 8 9 okay so I have no idea what the number could be I guess technically it could be anything it's assuming a range of 1 to 100 so I'm going to actually just change that to 1 to 10 let's rerun it enter your guess okay so let's try one one wrong guest 2 three and we won perfect so Cloud 3.5 Sonic best model I've ever tested best model I've ever used now imagine if you combine Claud 3 Sonic with some of these new techniques like mixture of Agents or even the crew AI framework so exciting to think about great job anthropic can't wait to see the larger Opus 3.5 model if you enjoyed this video please consider giving a like And subscribe and I'll see you in the next one
