Worlds NEWEST AGI AGENT Just SURPISED EVERYONE! (Beats CLAUDE, GPT-4, Gemini) (Maisa AI)

Video Statistics and Information

Captions
So there's a new AI startup that claims to have completely thrashed everything we know in terms of state-of-the-art systems, and it's pretty insane if what they're claiming is true. Take a look at this. They've said: "Introducing Maisa KPU, the next leap in AI reasoning capabilities. The Knowledge Processing Unit is a reasoning system for LLMs that leverages their reasoning power and overcomes their intrinsic limitations." But that, of course, isn't the craziest part. Take a look at these benchmarks. This is absolutely insane if it's true. On the left-hand side we can see that the Knowledge Processing Unit, the KPU, achieves 96.92%, which is pretty much 97%, on GSM8K and 86.2% on the DROP benchmark. Look at these benchmarks, guys: 100% on multi-step arithmetic, whereas GPT-4 gets 4%. This is absolutely insane.

Now, it isn't an entirely new LLM; it's a system on top of GPT-4, so of course you can credit some of this to GPT-4 Turbo. But as someone did point out, why did they compare the KPU with GPT-4 rather than with GPT-4 Turbo? Where is GPT-4 Turbo in the benchmarks? GPT-4 Turbo is slightly better than GPT-4, so it doesn't make sense to compare a KPU that runs on GPT-4 Turbo against GPT-4 instead of GPT-4 Turbo. It just doesn't make sense. But the point is that even if this is, quote unquote, a "GPT-4 wrapper", it still shows some very interesting capabilities, because as I was making the video they released a demo, and it is pretty cool. This is kind of an agent, because it has multi-step reasoning, and the reasoning allows it to achieve much better things.

Another thing here that's pretty crazy is that these ones are zero-shot. Zero-shot just means they give it the question with no worked examples in the prompt and it comes back with an answer, whereas with three-shot they first prompt it with three worked examples and then ask it the question, and then it's able
to get that. You can see that Gemini Ultra is 32-shot, that one's five-shot, this one is chain-of-thought prompting, and this one is five-shot. So zero-shot is just one question: it uses all its reasoning steps and then comes back with an answer. That is why this is pretty crazy.

They talk about more of this in their blog post, and we're going to do a deep dive on it. They didn't release that much, but they did release a demo, which I'm going to dive into, and after I show you guys the demo we're going to look at some of the other stuff. They said: "Today, Maisa is thrilled to announce the KPU, the Knowledge Processing Unit. The KPU is a proprietary framework that leverages the power of LLMs with the decoupling of reasoning and data processing in an open-ended system capable of solving complex tasks. This white paper aims to show what the KPU is, its architectural overview, and how it has outperformed more advanced language models such as GPT-4 or Claude 3 Opus in several reasoning tasks." So, you know, Claude 3 Opus, which everybody knows and loves: this system looks like it beats Claude 3 and GPT-4 standalone.

It does talk about the limitations of LLMs. It says these large language models, based on their current architecture, have several innate/inherent problems that persist no matter how much they advance their reasoning capacity or the number of tokens they can work with. Number one is hallucinations: when a query is given to an LLM, the veracity of the response cannot be 100% guaranteed, no matter how many billions of parameters the model has. This is due to the way the model works. And then of course there's the context limit. It says that lately more models are appearing that can handle more tokens, but we must wonder: at what cost? Usually the cost is slower response times or more compute. So this is where they give us the architectural
overview. Essentially, this is where we get into how this thing actually works. They talk about the Transformer architecture and the famous "lost in the middle" problem, which means the model is sometimes unable to retrieve key information if it sits in the middle of the context window. They say that an LLM is not always up to date, because it isn't connected to the internet. And of course there are limited capabilities to interact with the digital world: they're fundamentally language-based systems, lacking the ability to connect with external services, and this poses challenges, namely their restricted ability to interact with files, APIs, systems, or other external software. Basically they're saying LLMs aren't inherently designed for that.

So this is the architectural overview for this system, and this could be game-changing, because achieving this kind of reasoning just on top of GPT-4 is absolutely incredible. It means that more advanced systems combined with this could be even better. What they state is that they have this reasoning engine right here, and this is "the brain of the KPU", which orchestrates a step-by-step plan to solve the user's task. To design the plan, it relies on an LLM or a VLM and all available tools. The LLM is plug-and-play and is currently extensively tested with GPT-4 Turbo. So the reasoning engine is where you put the LLM. Well, that's what they said: it's the brain of the KPU, and it relies on the LLM, which I'm guessing is inside the reasoning engine. I'm not sure.

Then of course there's the execution engine. This receives the commands from the reasoning engine and executes them, and the results are sent back to the reasoning engine as feedback for replanning. So I'm guessing this is some kind of feedback loop: reasoning engine, execution, then reasoning again. So it's probably
reasoning, thinking "what do I do?", then it executes, and once it has executed it comes back and reasons again.

Then they talk about the virtual context window. They state that it manages the input and output of data and information between the reasoning engine and the execution engine, ensuring that information arrives at the reasoning engine and the data stays in the execution engine. In this way, the LLM context window underlying the reasoning engine is filled only with reasoning and not with data, maximizing the value of the tokens. This input and output management of data and information covers not only the user prompt and files but also external services and systems such as the internet, Wikipedia, and other things. It says this system was inspired by the architecture of operating systems, which are responsible for managing and orchestrating the various hardware. And it says this decoupling between reasoning and command execution allows the LLM to focus exclusively on reasoning, relieving it of hallucination-prone operations such as data processing or retrieval of current information. It says the articulation of these three components in the general KPU architecture opens the door to future analysis of quality and performance on tasks with large volumes of data and multimodal content, open problem solving, and interaction with digital systems such as APIs and databases.

So what they have here is a pretty insane thing. They also say "unlimited multimodal data", and they say they have a giant virtual context window. I'm not sure if it was unlimited, but it was pretty insane from what I heard. So yeah, they've got a reasoning engine, and it seems that the reasoning engine is where all of the work is done. I don't know exactly how it's done, because the problem is they haven't released a technical paper yet. They've just released a blog post, and blog posts are fairly limited in what they
do state. But if we take a look at the benchmarks, you can see right here it says: "We are pleased to show the excellent results we have achieved by testing our system against the reasoning benchmarks on which state-of-the-art LLMs are usually evaluated." As you'll see, they compare GPT-4, Mistral Large, Claude 3 Opus, and Google Gemini against their reasoning system, and of course theirs does better. They also said it sets a new paradigm in math reasoning. That's GSM8K, the grade-school math benchmark, which is composed of 8,500 high-quality, linguistically diverse math problems at grade-school level.

They talk about how numerous methodologies have been explored in the context of experimental setup, including chain-of-thought, few-shot prompting, and code-based self-verification, and then they said: "We deemed it pertinent to evaluate our system using a zero-shot approach to closely mimic the standard operational conditions." Basically what they're saying is that if you're a normal person using these AI systems, you're not going to be using five-shot, you're not going to be using 32-shot, you're not going to be using few-shot. You're going to ask it one question, and when you ask it one question you usually expect one response. So whilst chain-of-thought is good, they're saying they evaluate using zero-shot, and that they've refrained from employing any form of prompt engineering or iterative attempts. So it's a one-size-fits-all thing, apparently.

You can see that on GSM8K it performs at, I think it doesn't even say right here, but it's clearly around 97%, above Claude 3 Opus, which is chain-of-thought. This is zero-shot. That's pretty insane. It's above Gemini Ultra, above GPT-4, and it says it "surpasses all state-of-the-art models in the GSM8K test". Now, they've tested this with GPT-4 Turbo, and I'm really wondering about other models: if they test it with those, is this going to
be better? Because if GPT-4 falls way down here, what about if they test it with Opus and use Opus to reason? Could they go even further? I don't know. Maybe they have experimented with that.

Then we've got the next benchmark, which is DROP. They talk about the DROP benchmark, which is a 96,000-question benchmark. They said: "In contrast to GPT-4, which utilizes a three-shot approach, the KPU is benchmarked using a zero-shot approach without the application of any prompt engineering techniques. This approach results in the KPU establishing a new state-of-the-art benchmark in performance, showcasing its strong capability for complex reasoning." We can see right here that it achieves, I'm not even sure what percentage, I think 86% or something like that, but that's clearly above Claude 3 Opus, Gemini Ultra, and GPT-4. And here's the thing: again, this is zero-shot.

Also, we are going to get into the actual demo of this, because the demo is pretty crazy. I probably should have shown you guys the demo earlier; in fact, let's skip to the demo now. So, the demo of this software: essentially what they did was they had this right here, and I'm going to zoom in. The way they did this demo was good, but they just didn't say exactly what was going on. They have this email from a user, and the user is having a problem. The subject is "Urgent: delay in delivery from foodanddrinks.com", and it goes "I hope this email finds you well. I'm Stephen..." Essentially, this person has sent an email and he's written some wrong things in it. So this guy has written an email to a company. Long story short, he wrote an email; people get emails all the time. Then here we can see the software, and then of course we have the orders in a CSV file, so we have all this data in CSV
files. So we've got all of this order data; I'm guessing this is from the company, from the last couple of weeks or days or whatever. Then all they did was add that data into the prompt menu. After scrolling through, you can see that they added it to the prompt, and then you can see the instruction: answer the customer's email and send it to contact@maisa.ai. You can see now that it's reasoning, thinking "okay, what do I need to do here?" They did also test this against GPT-4; I will show you that in a second.

You can see that it keeps reasoning, and then we get the first step. It says: "The process begins by focusing on understanding the content of a specific email. This is achieved by interpreting the text found within a file named email.txt. The goal here is to grasp the customer's request fully by examining the email's content. This step is crucial, as it lays the foundation for any actions that will follow, such as responding to the customer's email or addressing their needs based on information extracted from the email." Then you can see that this thing continues to reason, and it says, after understanding the need to comprehend, yada yada, access and read the email, and by doing so extract the exact words and request. You can see it's reading. After reading the email, the file containing the orders needs to be accessed and examined. Then it continues to reason: having gathered the necessary data, this is done by searching for the unique identifier. It continues to reason, and you can see step by step how this model is reasoning.

And here's the thing as well: in the email that this person wrote, they gave their order number. I'm not sure where I can find it, but the order number is somewhere here. Oh, it's right here. So
their order number, FD whatever: this order number doesn't exist. But what's crazy is that after reasoning it's somehow able to find the correct order, realize that the user entered a wrong number, and get it right. It says "Reasoning completed in eight steps. Check the reasoning process", and then you can see it's able to send that email immediately.

You can see right here as well that they added a little prompt where they compared this to GPT-4, and you can see: "Error analyzing. It seems there's no record of number FD240188 in the orders database provided. This discrepancy might indicate an issue with the order processing or record-keeping system. Given the situation, a customer-focused response that acknowledges the issue, outlines..." and so on. So basically GPT-4 failed at this, and I find it kind of fascinating that their reasoning engine was able to do this so quickly, in just eight steps. It might seem like something basic, but clearly their reasoning engine, whatever it is they're using, is rather advanced. And if we take a look at some of the benchmarks, we can see that the KPU is essentially nearing 100% on these kinds of benchmarks. On multi-step arithmetic, I'm not sure why they added GPT-3.5 Turbo there; it doesn't really make sense.

At first a lot of people were saying things like: "It appears out of nowhere, claims state-of-the-art performance with a wrapper, offers a chart with no paper to dig into, shows no demos, gets a date wrong on their site. Pardon my skepticism, but do you have a product to show us, or just extraordinary claims?" Now of course I am pretty skeptical about this, because look at the benchmarks again, guys: zero-shot, zero-shot. This is literally what we would expect from GPT-5. If GPT-5 were here and they said zero-shot, zero-shot, zero-shot, 97, 87, this is what we would expect. This is a GPT-5-level system,
especially with reasoning as well. Sam Altman literally said in a recent interview that GPT-5 will be able to reason very, very well, so this kind of reasoning is what we would expect from GPT-5. So the fact that they've come out of the blue and just said "look, boom, we've done it, we've done what GPT-5 is supposed to be able to do", and they didn't drop a technical paper and didn't drop a demo straight away (I mean, they did 6 hours later, and now it's in beta), is a pretty crazy thing. Either (a) this company is absolutely insane, they've managed to do something rapidly and incredibly, take everyone by surprise, and they've figured out how to reason with LLMs in a way that we didn't know before, or (b) this is just a marketing gimmick and things aren't as good as they seem.

Right now it's in beta development. It's "not ready for commercial use just yet. However, with its imminent release on the horizon, we invite you to sign up for our waiting list, which grants you access to the beta version." So it will be interesting for other people to benchmark this as well, and I would say we need to see other benchmarks compared against the other LLMs, because that will give us a much more accurate picture of how this thing performs. I do think the architectural overview will be interesting, especially for those of you who are more technically inclined, to see if this even makes sense. Some people said that it's just GPT-4 with a RAG wrapper. It could be, but I guess until we get the technical report we're never really going to know. And hopefully its demo wasn't fake, because I know a lot of people were saying the Devin demo was cherry-picked; with a lot of demos, people will cherry-pick things. But I think the proof is going to be in the pudding. Either way, we're going to be able to see this: people are
going to get access, and then we're going to find out the truth, because what they're claiming is pretty insane, and if this is true it just goes to show how crazy AI development is. So yeah, it will be interesting. Let me know your thoughts down below, and I'll be seeing you guys in the next one.
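
The reasoning engine and execution engine feedback loop described in the captions can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not Maisa's implementation (no technical paper has been released). The planner is a hard-coded stub standing in for the LLM, the names `ExecutionEngine`, `plan_next`, and `solve` are hypothetical, the order data is made up, and the fuzzy match via `difflib` is a stand-in for however the real system recovered the mistyped order number. It shows one idea from the blog post: the raw data stays in the execution engine, and only short feedback strings go back to the planner.

```python
import csv
import difflib
import io

# Made-up order data for illustration; the demo's real CSV is not public.
ORDERS_CSV = """order_id,customer,status
FD240178,Stephen,delayed
FD240245,Maria,delivered
FD240391,Chen,shipped
"""


class ExecutionEngine:
    """Holds the raw data and runs commands; returns only short feedback."""

    def __init__(self, orders_csv):
        rows = csv.DictReader(io.StringIO(orders_csv))
        self.orders = {row["order_id"]: row for row in rows}

    def run(self, command, arg):
        if command == "lookup":
            return "found" if arg in self.orders else "not_found"
        if command == "closest":
            # Fuzzy-match a possibly mistyped ID against the known IDs.
            matches = difflib.get_close_matches(
                arg, list(self.orders), n=1, cutoff=0.6
            )
            return matches[0] if matches else "none"
        if command == "status":
            return self.orders[arg]["status"]
        raise ValueError(f"unknown command: {command}")


def plan_next(history, order_id):
    """Hypothetical stand-in for the LLM planner.

    It sees only (command, argument, feedback) tuples, never the CSV rows.
    Returns the next (command, argument) pair, or None when finished.
    """
    if not history:
        return ("lookup", order_id)
    last_cmd, last_arg, feedback = history[-1]
    if last_cmd == "lookup" and feedback == "found":
        return ("status", last_arg)
    if last_cmd == "lookup" and feedback == "not_found":
        return ("closest", last_arg)
    if last_cmd == "closest" and feedback != "none":
        return ("status", feedback)
    return None


def solve(order_id, engine, max_steps=8):
    """Plan, execute, feed the result back, and replan until done."""
    history = []
    for _ in range(max_steps):
        step = plan_next(history, order_id)
        if step is None:
            break
        command, arg = step
        feedback = engine.run(command, arg)
        history.append((command, arg, feedback))
    return history


if __name__ == "__main__":
    # The customer's email contains an order ID that is not in the file.
    for step in solve("FD240188", ExecutionEngine(ORDERS_CSV)):
        print(step)
```

In a real system the `plan_next` stub would presumably be replaced by a call to the underlying LLM, and the `max_steps` cap is there only to keep the loop from running forever if the planner never decides it is finished.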
Info
Channel: TheAIGRID
Views: 39,869
Id: 1qThZ69kP7c
Length: 14min 56sec (896 seconds)
Published: Sat Mar 16 2024