China takes the LEAD! New AI Model STUNS OpenAI: SenseTime V5.0 Beats GPT-4 On All Benchmarks

Video Statistics and Information

Captions
So, around two days ago there was a rather interesting development in China that I think you should all be aware of, because it changes the dynamic of the AI race. It leads me to believe that China is now hot on the heels of the rest of the world in terms of AI development. The US has been leading the charge on state-of-the-art models, but this is the most interesting thing we've seen in recent times, so let's take a look at exactly what China has been doing and what this announcement is. Essentially, China has potentially taken the lead with a new model: two days ago SenseTime launched SenseNova 5.0, which according to this report beats GPT-4 on nearly all benchmarks, and there are a variety of things about it that are genuinely surprising. It isn't just some hyped-up GPT-4 clone. If the claims are true, and we're going to dive into them, it means things are truly ramping up, and we're definitely going to see increased competition between nations over what AI systems are capable of. You can see they've compared it to GPT-4 Turbo, and the presentation covers a lot. Bear in mind that most of it is in Chinese, but I've done my best to translate it so you can understand exactly what's going on. SenseNova 5.0 actually looks pretty decent: on the left it states that the model uses a hybrid architecture, in the middle it states that it was trained on over 10 billion tokens, and inference supports a context of up to 200,000 tokens, which is interesting because we've been seeing longer and longer context windows. A 200K context window isn't exactly surprising at this point, but it does show that things are going well, and the main claim, of course, is that performance exceeds GPT-4 Turbo. As you all know, GPT-4 Turbo was the state-of-the-art model at the time, which is why people constantly try to one-up it on the benchmarks. That was the start of the presentation, but things get a lot more interesting. There was one screenshot where I'm not entirely sure whether they actually had these AI systems hooked up to a game, but the text states that a live demonstration was conducted comparing multiple functions of their model and GPT-4, including creative writing, logical reasoning, diagrams, image understanding, and calculating food calories from pictures. I just wanted to include this for full transparency. To show off the muscles of the larger models, SenseTime also played The King of Fighters: at first the green player, GPT-4, had a slight upper hand, but it was quickly overwhelmed by the red player, SenseChat-Lite, with its various combos. I'm not sure whether this is some kind of metaphor for their system beating GPT-4 or whether they actually hooked it up to a fighting game and SenseChat-Lite was able to best GPT-4. If you can translate this more accurately, let me know, because I honestly put it through multiple different AI systems and the result was frequently the same.
Now here's where things start to get really interesting, because these are the benchmarks for this system. SenseChat V5 is over here on the right-hand side, and, pay attention here because I'm going to come back to some of this, they compare it to GPT-4 Turbo, and the eagle-eyed viewers will notice it's the 1106 version, which isn't the most recent one but isn't the most outdated one either. The only benchmark where it surpasses SenseChat V5 is the MATH zero-shot benchmark, highlighted here around the 61% area, and you can see that in this benchmark right here as well. So what we're looking at is a comparison across all of the benchmarks against GPT-4 Turbo, one of the most recent models (I don't know why they didn't use the very latest one; I'm guessing it's just where they were in their testing), and there's also a comparison with Llama 3 70B Instruct. The differences from GPT-4 aren't exactly stark, but they don't seem to be just a flat percentage gain either: on MMLU the gap is rather small, but on C-MMLU, which I'm guessing is the Chinese version of MMLU (honestly, I'm not sure), it increases, and the gains don't look merely incremental, because one score here is at 80 and the other is at 93.612. When I was checking some of the other models, like Gemini Pro and Claude, I couldn't find results for benchmarks such as RACE or Natural Questions, and I do test them against it later in the video. Essentially, I think it's rather fascinating that China has, I would say, surpassed GPT-4 Turbo with this model, and what I'd honestly like to see is people being able to test it, because it would be interesting to see how it compares in the hands of real users. What I mean is that while on this chart GPT-4 only wins in two out of the several categories, and even Llama 3 does well, there is a different benchmark which is rather important, and it's the one I find most interesting because it can't simply be gamed by fine-tuning on the benchmark data to beat scores while not actually being useful on a day-to-day basis. You can see GPT-4-1106 on the Chatbot Arena. As I've stated previously, the Chatbot Arena is different because it ranks models by their Arena Elo, which is determined by votes against other systems: people test AI systems side by side in a blind test, so you don't know which system is which, and a model's Elo increases when people vote for its answer as the more useful one. Over time you get to see which system is actually useful day to day, answering a variety of different questions without the bias of knowing which system you're looking at, so it's something I think is very useful.
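To make the Elo idea concrete, here is a minimal Python sketch of how a rating could be updated after each blind head-to-head vote. The starting rating of 1000 and the K-factor of 32 are assumptions for illustration only; the real Chatbot Arena leaderboard computes its ratings its own way, so this is just the textbook Elo update, not its implementation.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two ratings after one blind head-to-head vote.

    score_a is 1.0 if voters preferred model A, 0.0 if they preferred
    model B, and 0.5 for a tie.
    """
    # Expected score for A given the current rating gap (standard Elo formula).
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Toy run: both models start at an assumed 1000 rating; model A wins
# three of four blind votes, so its rating drifts upward and B's drifts down.
a, b = 1000.0, 1000.0
for outcome in (1.0, 1.0, 0.0, 1.0):
    a, b = elo_update(a, b, outcome)
print(round(a, 1), round(b, 1))
```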
On that leaderboard, GPT-4-1106 is basically at rank number one, and of course, like I said, the newer version has a slightly higher Elo rating, but I still think what we're seeing here is rather impressive. Something else I wanted to do was test this against other models, because while this is good, there are other state-of-the-art systems out there: GPT-4 Turbo 1106 is state-of-the-art, but it isn't the only one, and GPT-3.5 Turbo and Llama 3 70B aren't the only other options either. The one most people might be thinking about, and maybe you're not but I certainly am, is Claude 3. Claude 3, if you don't know, is a state-of-the-art model released by Anthropic. I've done this a little sloppily, but let me explain what you're looking at: these are Claude 3's benchmarks, compared to GPT-4 in this tab right here, and essentially across the board Claude 3's Opus model does surpass GPT-4's benchmarks. What I've done is circle the areas where Claude 3 retains its lead, and I've added X's on the areas where SenseChat V5 actually beats Claude Opus. On the MATH benchmark, which is math problem solving, it does better than Claude Opus, 61.9 versus 60.1, so that's where SenseChat V5 reigns supreme. Then on common-sense knowledge, HellaSwag, Claude is at 95.4% while theirs is at 97.52%, which is a big improvement. However, Claude retains the lead on benchmarks like MMLU, graduate-level reasoning, and grade-school math. Something I do want to add is that while Claude 3 is best on those benchmarks, honestly the gap isn't large: for example, on code, HumanEval sits at 84.9%, and on BIG-Bench Hard one model is at 82.9% while the other is at 86.8%. So it's pretty interesting to see these differences, but the long story short is that SenseChat V5 surpasses GPT-4 Turbo on these benchmarks, and Claude 3 Opus surpasses GPT-4 Turbo too, yet on the benchmark you actually want to care about, the Arena, GPT-4 Turbo's most recent version retains its leadership. Even if the classic benchmarks like math and coding don't appear to favour it, people seem to be having a great time with GPT-4 Turbo, firmly followed by Claude 3 Opus and then Google's Gemini 1.5 Pro via the API. I think this is still rather impressive, because it seems this company has been working very quietly and diligently in stealth; this caught people off guard. Now, there are some other really surprising things here, and one of the most surprising concerns the smaller models, because their smaller model is so strong that I'm wondering whether some of these things are actually even legit.
One of the things they talk about is a writing task where they compare GPT-4 to SenseChat V5, and funnily enough the interface looks like the Chatbot Arena, so I'm wondering whether in a couple of days we'll actually get SenseChat V5 in the Arena. The summary is that the task was to write a college entrance exam essay on innovation based on the classic Chinese novel Dream of the Red Chamber, and the author notes that GPT-4's writing style tends to be more rigid and structured, using repetitive phrases, while in contrast SenseNova 5.0 exhibits a more free-flowing and divergent writing style, drawing on a wide range of Chinese cultural and literary references from ancient times to the modern internet age. Long story short, they're basically saying that on the writing task, in the examples they've included, GPT-4 loses. There was also another one where they put GPT-4's model against SenseChat V5 in what I'd call a logical reasoning task, trying to work out the amount of coffee and water consumed. Honestly, it's pretty hard to decipher what's going on here because, like I said, this is a translation, so the interpretation might not be correct, and I'm not going to get into the details, but the gist is that GPT-4 wasn't able to work out what was going on, while SenseChat V5's reasoning steps produced the correct answer. I'll leave the prompt and the explanation in the description, but it's very hard to judge a logical reasoning task like this: if you've ever seen these incredibly confusing puzzles, one word can change the entire outcome, and the translation needs to be 100% right, because even at 99% you can completely fail the question. This was probably the most fascinating part when I was looking through the benchmarks (actually, that's not true, the most interesting part is coming up soon), but right here is where we take a look at their visual recognition systems, compared against Google's Gemini Pro Vision, OpenAI's GPT-4 Vision, Qwen-VL-Max (which is a vision system), InternLM-XComposer2-VL, and Step One Vision, and from what we can see the top model does surpass them on these benchmarks. One thing I also saw that was really cool was the image generation, which by the looks of things is very photorealistic. The text states that SenseNova 5.0 sets a new benchmark for AI-powered image generation, demonstrated by its impressive performance in generating nuanced and lifelike portraits. The following example illustrates the AI's ability to create text-to-image visuals: using the prompt "a vibrant Asian female portrait with beauty makeup, casual hair, a smile, with movie-grade quality, presenting different expressions and styles", SenseNova 5.0 showcases its sophisticated interpretation of textual descriptions and its ability to generate diverse facial expressions and styles that carry the subtlety and richness of film-grade portraits.
That description is a bit flowery, but I've got to be honest: if this is their text-to-image model, then while Midjourney might be on this level, this definitely looks really good. We can't say it doesn't look realistic, and for that prompt I think it does the job very well. I'm guessing they did compare it to other systems here (it isn't entirely clear, but I'm guessing these are other systems), but either way this one looks very realistic, and it's pretty impressive. Now here's where we get to the most interesting bit, because, like I said before, I thought the relevant slide was a different one. This is where they talk about their smaller, more compact models that can be used for many different things. The one they've highlighted is SenseChat-Lite, which is 1.8 billion parameters in size, and it does a lot better than others of a similar size: compared to Google's Gemma 2B it completely destroys it, and it also beats Llama 2 13B. One thing I didn't really understand, and found pretty confusing, was the benchmarks, because they didn't use traditional benchmarks but a different kind: when the text in the left column is translated, the categories are, roughly, comprehensive score, language comprehension, creativity, reasoning, and the overall average. One thing I do think is quite surprising is that a 1.8-billion-parameter model showcases these capabilities, because time and time again we've seen that trend of small models getting more capable. One critique I do have is that they didn't compare it to Microsoft's Phi-3; I'm guessing that's because Phi-3 was literally just released, so that's fair, but they also didn't compare it to Llama 3. You can see they compared it to Llama 2 7B and state that it beats it, that it beats Google's model, and the other ones like Qwen 1.5. But if we look back at the original comparison, they did compare their larger model against Llama 3 70B Instruct, so I'm guessing (and this is just a guess, maybe it isn't true) that even if the larger system beats Llama 3, this smaller system doesn't, because they haven't included it, and I'm pretty sure they could have. Still, I think it's pretty interesting that they've got a smaller model that is levels above these other small models. Based on some other things I've seen, they also talk about image generation, and I've got to be honest, the text here is a little confusing; I've tried to understand the context of what it's explaining, and I'll leave a link to the article. One thing they do say is that this is going to be a calorie assistant: you can submit images and it will completely work out your calories.
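SenseTime's own API isn't shown in the presentation, so purely as an illustration of the "submit a picture, get a calorie estimate" pattern, here is a rough Python sketch using OpenAI's vision-capable chat API instead. The model name, the file meal.jpg, and the prompt text are all assumptions for the example; this is not SenseNova's interface.

```python
# Hypothetical calorie-from-photo call, sketched with OpenAI's vision-capable
# chat API to illustrate the general idea described in the presentation.
import base64

from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY to be set

client = OpenAI()

# "meal.jpg" is an assumed local photo of a meal.
with open("meal.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-turbo",  # any vision-capable chat model would work here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the main food items in this photo and give a rough total calorie estimate."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```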
The AI space is heating up quite a lot right now, and this won't be the only company coming out of China to present a model that takes us by surprise, but so far this seems like a very interesting update. What's also interesting is that the company's shares soared more than 30% after it announced its latest generative AI model, so this is something that made the stock price jump. Maybe it could be argued that the benchmarks they're quoting aren't as good as claimed, but the only way to know is to test it in the Arena and for other people to run independent evaluations. Of course, things are going to be a little different because I'm guessing this is fine-tuned on the Chinese language, so translations might not be completely accurate; maybe if they release an English model it will be as good or better, but that's something we'd have to see. Let me know what you think about SenseTime's new model, whether you think China is going to catch up to its US competitors, and how this impacts the landscape. I think this is some very fascinating competition, because we're going to keep seeing different companies pour millions and billions of dollars into this industry; like I said before, I think we are now in that terminal race condition where if one company stops, they are pretty much giving up the lead.
Info
Channel: TheAIGRID
Views: 45,634
Id: NJXGIMa45sQ
Length: 18min 42sec (1122 seconds)
Published: Fri Apr 26 2024