CLAUDE 3 Just SHOCKED The ENTIRE INDUSTRY! (GPT-4 +Gemini BEATEN) AI AGENTS + FULL Breakdown

Video Statistics and Information

Captions
so something actually shocking did happen in AI today, and that was Anthropic's release of the next generation of Claude: Claude 3. This was a model that took everyone by surprise, because it beats every other AI model across the board on the main benchmarks. So without wasting any more time, let's get into the meat of this release. You can see here that they actually released three new models, and in the Claude 3 model family there is Claude 3 Haiku, Claude 3 Sonnet and Claude 3 Opus. Essentially, as the models increase in intelligence, the cost goes up accordingly, but these three different models are very fascinating, and later on in the video I'll show you how they all differ, and how Claude 3 Opus, the one right here, is smarter than any other AI currently available, making it state-of-the-art.

You can see that they said: a new standard for intelligence. Opus, our most intelligent model, outperforms its peers on most of the common evaluation benchmarks for AI systems, including undergraduate-level expert knowledge, graduate-level expert reasoning, basic mathematics and more. It exhibits near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence. All Claude 3 models show increased capabilities in analysis and forecasting, nuanced content creation, code generation, and conversing in non-English languages like Spanish, Japanese and French.

And now this is where we actually get to the crazy benchmarks, because you're about to see something that surprised me and caught me off guard. These benchmarks are actually quite shocking. You can see that Claude 3's most powerful model, Opus, actually surpasses the other state-of-the-art models. GPT-4 and Gemini's recently released 1.0 Ultra, I wouldn't say they pale in comparison to the new model, but they clearly do get surpassed on these benchmarks. You can see here that the undergraduate-level knowledge score, the MMLU, is at 86.8%, beating the other models, and what's crazy is that, on the left, across the board, this model is excelling at every single task. That goes to show how crazy things are, because it was only recently that we got Gemini Ultra, which surpassed GPT-4 on every single benchmark, and now, only around two to three months later, we get Claude 3, which surpasses Gemini 1.0 Ultra on every single benchmark. This is some really impressive stuff, because if we take a look at the percentages, we are nearing 100% in some categories: 95.4% on common knowledge (HellaSwag), and on some of the other evaluations 96.4%, 90.8% and 95%. So this is something that really did shock me, because I didn't expect a Claude release for a little while longer, but not only did they surprise me with the release date, these benchmarks surprisingly managed to take on Google already, which is a very impressive feat, and dethrone GPT-4.

Now, something that I did see, and this is of course within around the first hour of this model being released, is that there is a qualitative aspect that you can't really capture when you're looking at these benchmarks. What I mean is that, whilst it is good to look at these benchmarks and ask how the AI does on these maths problems, these coding problems, and reasoning over text, the qualitative data that you get from your users is also very important, because at the end of the day it's your users
who are actually going to be using this that determine whether or not your product is any good. And so far, based on what I've seen, the qualitative data, where people are actually talking about how good the model is, shows that this model isn't just good at reasoning and doing well on some of the benchmarks; this is clearly a model that people really do like. You can see here that this person says: excited to share what we've finally been working on. To me, talking to Opus feels different than talking to any other large language model. It seems to just get you. This can't be represented in any evaluation metric or benchmark; you have to just go experience it yourself. And then this person says: evaluations aside, Claude 3 Opus feels like the smartest model I've talked to. So that is something that I feel will be important, and I would love to see where this model ends up on the LMSYS Chatbot Arena, because that is something that uses qualitative, human-preference data to judge where the models actually do lie, and I think it is a very important metric.

One of the most surprising things that we did get from the Claude 3 release was, of course, the vision capabilities. Claude 3's different models, and I will get into later how these models actually differ, do possess new vision capabilities; this is actually a multimodal model. It states that the Claude 3 models have sophisticated vision capabilities on par with other leading models. They can process a wide range of visual formats, including photos, charts, graphs and technical diagrams. We're particularly excited to provide this new modality to our enterprise customers, some of whom have up to 50% of their knowledge bases encoded in various formats such as PDFs, flowcharts or presentation slides. So you can see that Claude 3 is finally becoming a model that is very effective at a wide range of tasks, not just text.

And something that is important is that they've actually shown us a demonstration of how well Claude 3 Opus, combined with the vision capabilities, performs. In this video, we're going to see if Claude and a couple of friends can help us analyze the world economy in a matter of minutes. Okay, I've asked Claude 3 Opus, which is the largest model in Anthropic's new Claude 3 family, to look at the GDP trends for the US and write down a markdown table of what it sees. We've given Opus, and all the other models in the Claude 3 family, extensive training on tool use, and one of the major tools it's using is this web view tool: it goes to a URL, looks at what's on the page, and because it's multimodal it can use the information on that page to solve complex problems. So here's the markdown, and it's important to note that Claude doesn't have direct access to these numbers; it's literally looking at the same browser you and I are seeing, looking at the trend line and trying to estimate what the exact numbers are. Let's see how accurate it was. We've asked the model to create a plot of the data, and it's used the second tool, this Python interpreter, to write out the code and then render the image for us to check. And here's the image. Look, it's actually added helpful little tooltip annotations to explain some of the major peaks and troughs in the last decade or two of the US economy. We can compare that graph with the actual data, and it turns out it's pretty close; it's actually within 5% accuracy. And by the way, Claude's transcription here isn't just coming from its pre-existing knowledge of US GDP; we tried it with a large sample of made-up GDP graphs, and its transcription accuracy was within 11% on average.
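To make that chart-transcription step a bit more concrete, here is a minimal, hypothetical sketch of how you might ask Claude 3 Opus to read a chart image and return a markdown table through Anthropic's Messages API. The file name, prompt wording and use of a local screenshot are my own assumptions for illustration; the demo itself relied on an internal web-view tool that isn't publicly available.

```python
# Minimal sketch (not the demo's actual tooling): send a chart screenshot to
# Claude 3 Opus and ask for a markdown table of the values it can read off.
import base64
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical screenshot of a GDP trend chart saved locally.
with open("us_gdp_chart.png", "rb") as f:
    chart_b64 = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-3-opus-20240229",  # model ID current at launch; may change
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": chart_b64}},
            {"type": "text",
             "text": "Read the values off this GDP chart and return a "
                     "markdown table of year vs. GDP. Estimate where the "
                     "chart does not label exact numbers."},
        ],
    }],
)

print(response.content[0].text)  # the markdown table, as estimated by the model
```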
Next, we asked the model to do some statistical analysis, projecting out into the future and performing simulations to see where the GDP of the US might head. We can see that it's run this analysis using Python, and it's able to perform these Monte Carlo simulations to see what the range of GDP possibilities might look like for the next decade or so. But I wonder if we can go further. We're going to get the model to analyze a more complicated question, that is, how GDP might change across all of the biggest world economies, and to help it do that, we're going to give it one more tool called dispatch sub-agents. This basically allows the model to break down the problem into lots of sub-problems and then write prompts for other versions of itself to help pick up the slack; the models can then complete a more complex task by all working together. Here you can see it's written this prompt and given very precise instructions that it wants the other models to follow, including a format for the data that it's hoping to get returned. It's dispatched a version of this prompt to one model that's going to look at the US, one for China, one for Germany, Japan and so on. We can see in these progress bars that the sub-agent models are now completing the set task for each of the individual economies: they're going to the relevant web pages, they're getting the information, they're running the code to analyze it, just like we saw in the previous US example, but all in parallel. Let's just skip forward to see what the model produced. You can see it's run the analysis, it's produced a before-and-after pie chart of how it expects the world economy to look in 2030 versus 2020, and it's given us a written analysis too, where it makes predictions that relate to the statistical analysis that it ran. It's telling us that it thinks the GDP share of particular economies will change, and which ones will be larger or smaller by 2030. So there we have it: complex, multi-step, multimodal analysis run by a model that can create sub-agents to get even more tasks running in parallel. We're excited to see what you, our customers, can do with these advanced Claude 3 capabilities.

So yeah, that small demo was rather impressive. We actually got to see Claude 3 act not only as something that is able to accurately take data from an image that doesn't exactly contain the numbers, doing pure estimates, which is really good and shows how strong their vision system is, but there were also two features that caught me off guard. The first one is the simulations. I think this is really cool: we can see it doing some kind of tree-search-like exploration of possible futures, which looks absolutely amazing and is going to be very useful for data analysis. Like they stated, it's going to be used for predicting things, and I do wonder how some of those predictions will hold up; if we later check them against the actual data and find they were correct, maybe we could even get models that become increasingly smart as their predictions come true, by somehow reinforcing on that data. So that is a whole new area that I really didn't see coming and that I'm excited to explore with Claude 3. Now, I know most people are going to be excited about this, and so was I: the sub-agents feature was something that I found absolutely astounding. This is where you can literally get an AI model to automatically decide to dispatch sub-agents to do the rest of the task, and a minimal sketch of that fan-out pattern follows below.
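Anthropic hasn't published the implementation behind the dispatch sub-agents tool, so the following is only a rough sketch, under my own assumptions, of what that fan-out pattern can look like: one prompt template is filled in per country, the per-country calls run in parallel, and the structured results are merged at the end. The prompt text, country list, JSON fields and choice of worker model are all hypothetical, and the real demo's sub-agents also browsed live web pages, which this sketch skips.

```python
# Hypothetical sketch of the sub-agent fan-out pattern from the demo:
# one prompt template, many parallel worker calls, results merged at the end.
import json
from concurrent.futures import ThreadPoolExecutor
import anthropic

client = anthropic.Anthropic()

# Prompt template the "orchestrator" might hand to each sub-agent (made up).
SUBAGENT_PROMPT = (
    "You are analyzing the economy of {country}. Using your general knowledge, "
    "estimate its GDP in 2020 and project it to 2030. Respond with JSON only, "
    'in the form {{"country": ..., "gdp_2020_usd_tn": ..., "gdp_2030_usd_tn": ...}}.'
)

COUNTRIES = ["United States", "China", "Germany", "Japan", "India"]

def run_subagent(country: str) -> dict:
    """One worker: ask the model about a single economy and parse its JSON reply."""
    msg = client.messages.create(
        model="claude-3-sonnet-20240229",  # a cheaper model is a natural worker choice
        max_tokens=300,
        messages=[{"role": "user", "content": SUBAGENT_PROMPT.format(country=country)}],
    )
    # Assumes the model follows instructions and returns clean JSON.
    return json.loads(msg.content[0].text)

# Fan out: all sub-agents work in parallel, like the demo's progress bars.
with ThreadPoolExecutor(max_workers=len(COUNTRIES)) as pool:
    results = list(pool.map(run_subagent, COUNTRIES))

# Merge: compute each economy's projected share of the combined 2030 total.
total_2030 = sum(r["gdp_2030_usd_tn"] for r in results)
for r in results:
    print(f'{r["country"]}: {100 * r["gdp_2030_usd_tn"] / total_2030:.1f}% of 2030 total')
```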
I just find that this is a concept that is really effective, and you can see in the demo that it managed to complete the task with much more efficiency than just asking one model. So I think that what we saw in that demo shows just how great this Claude 3 model is, not just in terms of its common-sense reasoning or its vision capabilities, but also in terms of its ability to do complex, step-by-step reasoning across multiple different tasks. And that, available through the API with tool use, which they state is coming soon, I'm guaranteeing you is going to have some massive implications for the industry, because people are going to use it in very creative ways, as we've seen with other AI models.

Next, of course, we have another short demo, and this one is by the other model, Haiku, and it's very fascinating too. Claude Haiku is one of the fastest and most affordable vision-capable models in the world. To demonstrate this, we're going to read through thousands of scanned documents in a matter of minutes. The Library of Congress Federal Writers' Project is a collection of thousands of scanned transcripts from interviews during the Great Depression. This is a gold mine of incredible narratives and real-life heroes, but it's locked away in hard-to-access scans of transcripts. Imagine you're a documentary filmmaker or journalist: how can you dig through these thousands of messy documents to find the best source material for your research without reading them all yourself? Since these documents are scanned images, we can't feed them into a text-only LLM, and these scans are messy enough that they would be a challenge for most dedicated OCR software. But luckily, Haiku is natively vision-capable and can use surrounding text to transcribe these images and really understand what's going on. We can also go beyond simple transcription: for each interview we can ask Haiku to generate structured JSON output with metadata like title, date and keywords, but also use some creativity and judgment to assess how compelling a documentary the story and characters would make. We can process each document in parallel for performance, and with Claude's high-availability API do that at massive scale for hundreds or thousands of documents. Let's take a look at some of that structured output. Haiku is able to not just transcribe but pull out creative things like keywords; we've transformed this collection of many, many scans into rich, keyword-structured data. Imagine what any organization with a knowledge base of scanned documents, like a traditional publisher, healthcare provider or law firm, can do: Haiku can parse their extensive archives and bodies of work. We'd love for you to try it out and see what you build.

So once again, that was a very impressive demo of how these vision capabilities can be used at scale across multiple industries and applications, and once again I can't imagine what people are going to do once they get their API access (a rough sketch of this kind of batch pipeline is below). Now, another thing that was actually really cool was that they stated there is going to be near-immediate response from one of the Claude 3 models, and that is of course their most lightweight model, Haiku. It states that Haiku is the fastest and most cost-effective model on the market for its intelligence category, and it can read an information- and data-dense research paper on arXiv, which is around 10,000 tokens, with charts and graphs, in less than 3 seconds. Following launch, we expect to improve performance even further.
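To illustrate the kind of batch workflow the Haiku demo describes, here is a minimal, hypothetical sketch: each scanned page is sent to Claude 3 Haiku with an instruction to transcribe it and return JSON metadata, and the pages are processed in parallel. The folder path, prompt wording and metadata fields are assumptions of mine, not Anthropic's actual pipeline.

```python
# Hypothetical batch pipeline in the spirit of the Haiku demo: transcribe
# scanned interview pages and pull out structured metadata, in parallel.
import base64, glob, json
from concurrent.futures import ThreadPoolExecutor
import anthropic

client = anthropic.Anthropic()

PROMPT = (
    "Transcribe this scanned interview page, then return JSON only with the "
    'fields: "title", "date", "keywords" (list), "transcript", and '
    '"documentary_potential" (a 1-10 score for how compelling the story is).'
)

def process_scan(path: str) -> dict:
    """Send one scanned page to Haiku and parse the structured output."""
    with open(path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    msg = client.messages.create(
        model="claude-3-haiku-20240307",  # launch-era model ID; may change
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg", "data": img_b64}},
                {"type": "text", "text": PROMPT},
            ],
        }],
    )
    # Assumes the model returns clean JSON as instructed.
    return json.loads(msg.content[0].text)

# Hypothetical folder of scans; fan the requests out in parallel.
paths = glob.glob("scans/*.jpg")
with ThreadPoolExecutor(max_workers=8) as pool:
    records = list(pool.map(process_scan, paths))

# Surface the most promising interviews first, as in the demo's framing.
records.sort(key=lambda r: r["documentary_potential"], reverse=True)
print(json.dumps(records[:5], indent=2))
```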
For the vast majority of workloads, Sonnet is two times faster than Claude 2 and Claude 2.1 (the previous models) with higher levels of intelligence, and it excels at tasks demanding rapid responses, like knowledge retrieval or sales automation. Opus delivers similar speeds to Claude 2 and 2.1 but with much higher levels of intelligence. So you can see that these near-instant results are going to enable some very interesting applications, because, as you all know, an AI that is latency-laden, I guess you could say, isn't very effective in certain settings, for example live chats, autocompletions, and scenarios where responses must be immediate and real-time. Haiku being the most cost-effective model, and the fastest, it is going to be interesting to see whether it actually manages to dethrone some of the other models that are very quick; but considering the intelligence on offer, it might just be Haiku that takes the cake.

There's also another very impressive demo that I do want you all to see, and this is their model Sonnet acting as a language partner. This is just a simple prompt to turn Sonnet into a dialogue agent that will talk with you in a language that you're trying to learn. So I chose Spanish, and I wanted it to basically take my imperfect Spanish and help me improve it. I decided I wanted it to do a few things: I wanted it to take my message, which will be in somewhat imperfect Spanish, and write out what it thinks I intended in English. I then asked it to write back the ideal learner message, which is just my message as it should have been written in Spanish, so I can see the ideal form. Then I asked it to write a teacher response, which is just a response to me in Spanish, so that I can continue the conversation. Great, so this is basically just Sonnet saying that we're ready to start, so I'll just start with a simple first message. And here it's following the format that I asked for: it's repeating the message that I tried to send back to me in English, it's telling me how I should have said it, so it's corrected some of the grammar issues in my request, then it's responded to me in Spanish, and it's asked me where I'm from. Okay, now imagine I don't know a certain word in Spanish but I still want to say it: I'm going to just include that word in English in square brackets, and hopefully it will translate it for me. And suppose I hit a roadblock because I just don't understand the message that it's sent me: I can just ask it to translate that message into English, and then I can read it and respond again in Spanish, continuing the dialogue. As a final step, you could ask Sonnet to create a little quiz for you based on the things that you've been discussing. So hopefully that is a useful prompt if you're interested in using Sonnet as a language-learning partner; I hope you try it out.

Now, something in addition that I did also want to mention was fewer refusals. You can see here that Claude 3 actually refuses things a lot less. It states that previous Claude models often made unnecessary refusals that suggested a lack of contextual understanding. We've made meaningful progress in this area: Opus, Sonnet and Haiku are significantly less likely to refuse to answer prompts that border on the system's guardrails than previous generations of models. As shown in their chart, the Claude 3 models demonstrate a more nuanced understanding of requests, recognize real harm, and refuse to answer harmless prompts much less often. This is definitely something that I would say is a win
for Claude and Anthropic, because one of the main problems with Claude 2.1 was that it really just didn't answer many of your questions; most of the time it simply refused, leading to user frustration, and I was someone who was actually in that group. Trying to use Claude when it's behaving well is very fun, but when it won't respond to what you're saying because it thinks there's a real danger when there really isn't, it's one of the most frustrating things you could ever experience. But you can see they've now improved this.

Another thing was, of course, the improved accuracy. They state that businesses of all sizes rely on our models to serve their customers, making it imperative for our model outputs to maintain high accuracy at scale. To assess this, we use a large set of complex, factual questions that target known weaknesses in current models, and we categorize the responses into correct answers, incorrect answers or hallucinations, and admissions of uncertainty, where the model says it doesn't know the answer instead of providing incorrect information. Compared to Claude 2.1, Opus demonstrates a twofold improvement in accuracy, or correct answers, on these challenging open-ended questions, while also exhibiting reduced levels of incorrect answers. In addition to producing more trustworthy responses, we will soon enable citations in our Claude 3 models so they can point to precise sentences in reference material to verify their answers.

Now, there was also something that I really wanted to talk about, which is the near-perfect recall: Claude 3 Opus has recall accuracy that borders on 99%. They state that the Claude 3 family of models will initially offer a 200K context window upon launch; however, all three models are capable of accepting inputs exceeding 1 million tokens, and we may make this available to select customers who need enhanced processing power. To process long-context prompts effectively, models require robust recall capabilities. The needle-in-a-haystack evaluation measures a model's ability to accurately recall information from a vast corpus of data, and they state that we enhanced the robustness of this benchmark by using one of 30 random needle/question pairs per prompt and testing on a diverse, crowdsourced corpus of documents. Claude 3 Opus not only achieved near-perfect recall, surpassing 99% accuracy, but in some cases it even identified the limitations of the evaluation itself, recognizing that the needle sentence appeared to have been artificially inserted into the original text by a human. So you can see that Claude 3 is a very effective system that can even notice when something about its own evaluation looks off. A 200,000-token context window at launch, with all of these models capable of 1-million-token inputs, just goes to show that the era of the million-token context window is upon us, and this is very impressive stuff, because it now allows for a lot more use cases. If you're wondering what a needle-in-a-haystack test actually looks like in practice, a small sketch follows.
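Anthropic describes the evaluation but doesn't publish its harness here, so this is only a minimal sketch of the general needle-in-a-haystack idea, under my own assumptions: plant a short "needle" sentence at a random depth inside a long filler corpus, ask the matching question over the whole context, and check whether the answer contains the planted fact. The needle text, filler and scoring rule are hypothetical, and a real harness would sweep many context lengths and depths.

```python
# Minimal, hypothetical needle-in-a-haystack recall test (not Anthropic's harness).
import random
import anthropic

client = anthropic.Anthropic()

# A planted fact ("needle") and the question that targets it -- both made up.
NEEDLE = "The best thing to do in San Francisco is to eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

# Build a long "haystack" of filler text and insert the needle at a random depth.
filler_paragraph = "The quick brown fox jumps over the lazy dog. " * 40
paragraphs = [filler_paragraph] * 150          # tens of thousands of tokens of filler
insert_at = random.randrange(len(paragraphs))
paragraphs.insert(insert_at, NEEDLE)
haystack = "\n\n".join(paragraphs)

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": f"{haystack}\n\nBased only on the text above: {QUESTION}",
    }],
)

answer = response.content[0].text
# Crude scoring: did the answer surface the planted fact?
print("recalled" if "Dolores Park" in answer else "missed", "| needle depth:", insert_at)
```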
Now, one of the main questions that I actually had when looking at Claude 3 was, of course, the difference between the three models. This was a little bit confusing when looking at the blog post initially, but they've actually made it really simple to understand; if you want the short version, you can just screenshot this and share it with whoever might need it. Essentially, Opus is the highest intelligence available: it's the smartest model, the one you use if you're trying to get the most capable, most accurate answer. Sonnet is strong performance at a lower cost, so it's a balance of intelligence and, of course, cost. And Haiku is near-instant speed at a very low cost. That is essentially the difference between these three models, if you were wondering.

Now we can dive into a little bit more detail on these models. You can see Claude 3 Opus, and at the bottom here it does have the differentiator, and the differentiator is the main thing you want to pay attention to: this one is simply higher intelligence than any other model available. The costs do seem quite steep, if I'm being honest with you; that does seem like a pretty expensive model, but then again, this is a state-of-the-art AI system that is leading the frontier of AI intelligence, so there's no surprise that it's priced that way. It also shows us the potential uses for Opus: task automation, such as planning and executing complex actions across APIs and databases, and interactive coding; research, including research review, brainstorming and hypothesis generation, and drug discovery; and strategy, covering advanced analysis of charts and graphs, financials and market trends, and forecasting, all with the differentiator of higher intelligence than any other model available.

Then, of course, we have Sonnet, where they state that Claude 3 Sonnet strikes the ideal balance between intelligence and speed, particularly for enterprise workloads. It delivers strong performance at a lower cost compared to its peers, and it's engineered for high endurance in large-scale AI deployments. The potential use cases for this are data processing, meaning RAG, or search and retrieval over vast amounts of knowledge; sales, such as product recommendations, forecasting and targeted marketing; and time-saving tasks such as code generation, quality control and parsing text from images. Like I stated before, it's more affordable than other models with similar intelligence and it's better suited for scale; so if there is a model with similar intelligence to Sonnet, this one is just a bit cheaper.

Then, of course, we have Haiku. Claude 3 Haiku is our fastest, most compact model for near-instant responsiveness; it answers simple queries and requests with unmatched speed, and users will be able to build seamless AI experiences that mimic human interactions. The potential use cases for this are customer interactions, meaning quick and accurate support in live interactions and translations; content moderation, catching risky behavior or customer requests; and cost-saving tasks such as optimized logistics, inventory management, and extracting knowledge from unstructured data. Essentially, it's smarter, faster and more affordable than other models in its intelligence category.

So overall, we can see that this new state-of-the-art system, Claude 3 by Anthropic, has really surprised everyone and taken us aback, because it is now a state-of-the-art model that surpasses every other AI system. This is quite fascinating, because the AI space is always rapidly evolving: it was only a couple of months ago that we had a new AI system literally surpass GPT-4, and now that system has itself been surpassed. So it seems as if the race is heating up and things are accelerating. But I leave you all with this question: what do you think about Claude's new model? Are you going to be testing it out? Because, of course, right now you can actually use Claude if you want to, just to test out how good the system is. I'm going to
be doing another video actually testing out the system and showing you some of the best use cases, because I feel like that is better saved for another video rather than this announcement breakdown. So hats off to the Anthropic team for an amazing product, and hopefully we can all have some fun using this amazing new AI system.
Info
Channel: TheAIGRID
Views: 116,071
Id: x13wGyYzlJo
Length: 23min 45sec (1425 seconds)
Published: Mon Mar 04 2024