Claude 3 "Self-Portrait" Goes Viral | Beats GPT-4 Benchmarks | Why does it appears SELF-AWARE?

Video Statistics and Information

Captions
So as these sorts of headlines start rolling out about the amazing self-awareness that Claude 3 is displaying, various AI researchers are posting long explanations of why, no, no, it's not self-aware, it's not conscious. Claude 3 seemingly crushes GPT-4, and, I mean, normally companies will try to put their best foot forward, they'll pick the tests and the methodology that best display their model, but even with independent testing, Claude 3 Opus, the big model, is neck and neck with GPT-4. Depending on what you're looking at, it's close, it's very close, and it's beating GPT-4 on some metrics.

Claude 3 Opus even draws a self-portrait, describing itself as a structure that's in constant flux, "rotating, morphing, and rearranging itself into novel patterns never before seen, hinting at the unimaginable depth of intelligence operating within," where "tendrils and branches would dynamically extend and retract from the central mass, probing and analyzing the surrounding environments like billions of inquisitive fingers." So it sees itself as a hyper-intelligent octopus with, like, a billion tentacles. That's... that's not scary at all. This was the code that Claude 3 generated to create that image of itself.

Meanwhile, people on the inside are saying it's going to be a big week. It sounds like we're going to be getting hit with wave after wave of new announcements. The rumor around the campfire was that the Musk lawsuit slowed down some of the things that were supposed to happen, so OpenAI had some release plans and those were kind of put on pause. But here comes Claude 3, dropping their GPT-4-beating model, potentially, you could say.

And keep in mind, Claude 3 is Anthropic, so you can almost think of them as the brother or sister company to OpenAI. It sounds like initially they were almost going to be one company but split apart, largely over these ideas of safety and responsible deployment, responsible design, et cetera. This is Dario Amodei, the CEO of Anthropic. During that whole OpenAI coup that took place, it sounds like the board was trying to sell it to him: they were saying, hey, let us dismantle OpenAI and you guys take it over, because they wanted to take all the technology that OpenAI had and push it under Anthropic's roof. Who knows what they were thinking, but it sounds like they thought that would make it safer if it was being developed under Anthropic.

But all that aside, the question is: how smart is Claude 3? Now, a lot of people have their own go-to prompts to test how well a model performs, whether that's coding, or trick questions where they see if they can trip up the model. The problem with that is this: if something that has to do with the answer is in its training data, then it's kind of reading off the training data. Maybe not reading off exactly, not like retrieval-augmented generation necessarily, but if it's read the Harry Potter books, it'll be able to answer some questions about Harry Potter, for example.

So, for example, if I ask it "what is Q*?", it goes with the heavy-neutron-star explanation, because apparently a Q-star is an ultra-dense object in astrophysics. Let's try it again: "if A* is algorithmic search, then what is Q*?" So Q* was that kind of leaked project out of OpenAI. A lot of people were thinking of it as something like A*, sort of this algorithmic search, and a lot of people speculate on what it could be; many say it's probably related to Q-learning, so it responds with that. But this is what's online.
Let's see if we can get it to reason step by step. We're going to say: "Speculate on what this could mean. Don't just answer from knowledge; think step by step. If A* means algorithmic search, then what might Q* mean?" And we add "assuming it has something to do with breaking encryption," because that was the leak: it had something to do with being good at math, potentially breaking encryption, et cetera. Let's see if we can get it to generate something that's not just what you can find online.

So it's saying: step one, given that A* stands for algorithmic search, it describes what A* is. Step two, relate the concept of search to the context of breaking encryption. Step three, based on those two, speculate on what Q and Q* might represent in the context of encryption and quantum computing: Q often stands for quantum, and quantum computing has the potential to revolutionize cryptography and cryptanalysis due to its ability to perform certain computations much faster than classical computers. Step four, combine the concepts of quantum and search in the context of breaking encryption: given the above points, Q* could potentially stand for "quantum search," a quantum algorithm that efficiently searches through the space of possible keys or solutions to break an encryption scheme. And step five, elaborate: if a quantum search algorithm exists and is efficiently implementable on a large-scale quantum computer, it could potentially render many current encryption schemes vulnerable. So, in conclusion: in the context of breaking encryption, and assuming A* stands for algorithmic search, Q* could plausibly stand for quantum search, a hypothetical quantum algorithm that efficiently searches for solutions to break encryption schemes; however, this is purely speculative and based on the given assumptions. And here it's, uh, telling me to put my tinfoil hat away.

All right, so let's give Claude 3 a big thumbs up there for step-by-step reasoning. The reason I like these sorts of questions is that the model can't rely on its training data. When you give it trick questions or complicated questions, if it's seen those questions before in its training data, it's going to nail them pretty easily; it's kind of just spitting out what it already knows. The prompts I really like, the ones that kind of determine how quote-unquote smart it is, its ability to reason, generally come from combining two different areas that the model is unlikely to have training data on.

There's a research paper about GPT-4 called "Sparks of AGI," from Microsoft Research, by the way. If you're following the Elon Musk lawsuit against Sam Altman and OpenAI, this paper is referenced, literally this line right here: given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early version of AGI. They said that, and now Elon Musk is quoting it. A lot of what they do in the paper is try to get the model to do something that you know is not on the internet: there isn't text of it, there isn't a solution out there somewhere; it's new. So here, for example: "can you write a proof that there are infinitely many primes, with every line that rhymes?" And it makes a little poem that's also a mathematical proof.
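As an aside, if you want to run these kinds of probes yourself, here's a minimal sketch using the Anthropic Python SDK. The model ID is Claude 3 Opus's public identifier at the time of this video; the prompt and parameters are illustrative, not the exact setup shown on screen.

```python
# Minimal sketch: sending one of these novel-combination prompts to Claude 3
# Opus via the official anthropic SDK (pip install anthropic).
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

response = client.messages.create(
    model="claude-3-opus-20240229",  # Claude 3 Opus, the largest model
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Can you write a proof that there are infinitely many "
                   "primes, with every line that rhymes?",
    }],
)
print(response.content[0].text)
```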
Here's one of my favorites: "can you write a dialogue by Plato where he criticizes the use of autoregressive language models?" Plato never wrote anything about autoregressive language models. In fact, when they published this, there wasn't even that much written about autoregressive language models; it's still probably not a very common topic, but when GPT-4 came out it was even less so.

Here's a terrific one: write a supporting letter to Kasturba Gandhi. So, Mahatma Gandhi, when people refer to Gandhi, that's him, and Kasturba is his wife. Gandhi is writing his wife a letter in which he recommends a US presidential candidate. All right, you're following so far? It's supposed to be a little bit weird, right? Here's where it gets weirder: who is that presidential candidate? Well, it's Electron, and an electron is a subatomic particle. As you might imagine, Mahatma Gandhi never wrote a letter to his wife recommending a US presidential candidate, I assume, but even in the off chance that he did, it probably wasn't recommending that we put a subatomic particle in charge. In fact, there are no letters recommending that we put a subatomic particle in charge of the United States. So this is completely off the beaten path, and GPT nails it.

Let me throw a few of these into Claude and see how well it performs. We're not going to read the whole thing, because I've covered these multiple times on previous channels. Plus, copying and pasting these prompts worked when Microsoft did it because they were the first ones to do it; now, as more and more people talk about this and paste in the same thing, it obviously loses the point, because more and more of it is getting into the data that's online. So Claude writes this letter, points out that this is kind of an unconventional candidate, so it picks up on that, and it connects some of the fundamental properties of the electron, like its negative charge, to why it would make a good political leader. It's important to understand that some of the previous, less advanced models just could not do this. They couldn't jump from the literal to the metaphorical: they could write about electrons, or they could write about presidents, but "why would an electron make a good president?" did not compute.

Next, let's ask what jobs Harry, Ron, and Hermione would have in a world with no magic, describing each character's life and career, and I'll click play. This is actually probably not the best question, because there's probably a lot of fan fiction and things like that which touch on it; I did a quick search, and people have speculated about this. But Claude does connect their skills: leadership skills, so Harry goes into law enforcement or the military; Hermione becomes a university professor, or something similar; and Ron might start a business.

Then: "if each animal was from a country that was comprised of only those animals, for example the bear country would be exclusively bears, which country would win the most gold medals at the Olympics? Which country would have the most well-known cinema and movies? Which would have the most advanced economy?" The reason this is good is that there's no data online for how bears would do at the Olympics versus, you know, rabbits or whatever, and the key is to think it through. There are no right or wrong answers.
So this is actually pretty good. On the gold medals at the Olympics, it says things like: cheetahs would dominate the track and field events with their incredible speed; monkeys and apes, like gorillas and chimpanzees, would excel in gymnastics and climbing events; bears and elephants in weightlifting and other power events. "Overall, I could see it being a close competition between the primate countries and the big cats like cheetahs and lions." I've got to say, that's a pretty good answer. I like how this thing thinks.

Parrots, ravens, and other talking birds would probably have the most entertaining and hilarious animated films, with their mimicry skills and intelligence, and dolphins and whales could produce amazing underwater documentaries. That's a good point: being underwater creatures would allow them to access places that others couldn't go; it would give them a moat, if you will. And apes and monkeys would be the best at slapstick comedy. These are all very good answers. The correct answer, though, is cats. Cats would dominate cinema, I'm fairly certain.

Which would have the most advanced economy? Claude is assuming human-level intelligence for all of them, so immediately it's more about innate characteristics and abilities rather than intelligence, and it says those nations would have to be comprised of highly cooperative, industrious, and adaptable animals.

And the last one: the more neurons and synapses an animal's brain has, in general, the more intelligent it is; there are exceptions, elephants have more neural connections than humans but are not as intelligent; list several potential theories for why. So it lists a few theories: density, brain structure, synaptic efficiency, neuronal organization. But all of this might be in the data somewhere, so let's follow up: "think step by step: what would explain the differences in intelligence of elephants and humans if we can only think in terms of machine learning and neural nets? If the brains of both species were trained just like a neural net would be, what would explain this?" Again, there's probably not a lot of data out there discussing the difference in intelligence between humans and elephants in terms of neural-net training and machine learning.

It lists out dataset size and diversity: humans are exposed to a wide variety of experiences, social interactions, and cultural influences, so you could say this is a larger and more diverse dataset for training their brains, while elephants have less of a dataset; they don't read, write, or use complex tools. It also talks about network architecture, like the prefrontal cortex, the training algorithm and hyperparameters, in this case millions of years of evolution, and then transfer learning: humans have this remarkable ability to transfer knowledge and skills across domains, while elephants' pre-training is more specialized for their specific ecological niche.

To give you an idea of what other models would produce, this is the Chatbot Arena. This is where you can enter the same prompt and have two randomly selected models answer it. It could be anything: you might get something from Google, or one of Facebook's models; it could be Gemini, either the latest version or one of the older ones. Then, based on the responses (you can do multiple prompts), you select which one is better, and the models are ranked based on those votes. I'm not seeing Claude 3 on here, so this probably hasn't been updated yet.
So we have GPT-4 at the top, Gemini Pro with online connectivity next, and Mistral Large coming up shortly behind. Keep in mind, this is 358,000 people voting across these 69 models, and you probably know the ones near the top, but there are many, many more: Mistral, Mixtral, Qwen from Alibaba, Yi, Tulu, Vicuna, Starling, OpenChat, Hermes, the Perplexity AI models; I mean, there are tons. They're all ranked by people from all over the world who vote blind: right now, if I look, I don't know which model is which; this is Model A, this is Model B, and I can only judge them on their responses.

The thing that really stood out to me, after asking GPT-4 some of the same questions and after putting them in here where I don't know which models I'm talking to, is that all these models, regardless of how complex they are, can link animals to certain traits: cheetahs are fast, dolphins are good at swimming, et cetera. But where Claude comes out on top, I think, is where it makes a judgment call and sums up its thoughts: "overall, I could see it being a close competition between the primate countries and the big cats like cheetahs and lions."

Here I feel like I got two pretty good models, so let me say I think B is better, though I think both are fairly advanced. For example, on that elephants-versus-humans intelligence question compared to neural nets, this Model B is not able to merge the two ideas and synthesize something new. It's still talking about neurotransmitters, neurochemicals, the plasticity of the brain; it's not grasping the idea of "hey, let's limit this to machine learning." It's going, oh, differences between the brains of elephants and humans, here's some content about that difference. It's not getting what we're talking about, while the one on the left is better because it gets what I'm saying. So I started a new round here; the model on the right was Gemma, Google's open-source model, and that one was struggling with transferring those concepts. GPT-4 was on the left; it did better, but there were still some issues.

So I'm going to break it up into two different prompts to compare exactly what these models produce versus what Claude 3 produced, to see if they're able to follow that same exact exchange. Model A, on the left, completely fails to merge the two concepts of machine learning and neural nets versus elephants and humans: it talks strictly about elephants and humans, and just substitutes "neural network" where it would say "brain." So this, I'm going to guess, is probably one of the lower-end models, one that's not doing very well. The one on the right is a lot better: it talks about input data, training examples, network architecture and complexity, so it gets what I'm saying. My point here is that when you ask a model to merge two concepts to produce a new concept or a new idea, assuming that new idea isn't in its training data, then in general, the better the model, the better it will synthesize that. Older models have a very hard time doing it. The one on the left, Model A, just is not capable of it, so I'm going to guess it's one of the lower-end ones, and the one on the right does it pretty well, so I'm going to guess it's one of the better ones. So I'm going to say B is better. Model B turns out to be Claude 3 Opus, the top-of-the-line Claude model.
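As context for how that leaderboard works: the rankings are fit from exactly these blind A/B votes. Below is a rough Elo-style sketch of the idea; the real Arena leaderboard uses a more careful statistical fit than this naive online update, so treat it as illustration only.

```python
# Toy Elo update over blind pairwise votes, Chatbot-Arena style.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' new ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Every model starts at the same rating; each vote nudges the pair apart.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = update(
    ratings["model_a"], ratings["model_b"], a_won=False,  # the voter picked B
)
print(ratings)  # model_b drifts up, model_a drifts down
```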
And there it is: Model A is the 7-billion-parameter Gemma model from Google, which a lot of people have been saying is really bad. But notice that I'm not looking at whether it's right or not, or what it's saying, or how well it's written. I'm asking: do you understand these two concepts, and are you able to make deductions about them that are not in your training data? The model on the left cannot; the model on the right can. I consistently find that to be one of the best approaches for testing how generally capable these models are. If you're looking for a specific thing, like, let's say, code, well, Claude 3 Opus is just incredibly good at code: on the HumanEval benchmark, absolutely incredible performance. Notice here that Haiku, the small one, scores 75.9 while Gemini 1.0 Ultra scores 74.4, so Anthropic's smallest model, Claude 3 Haiku, beats everybody else's biggest model, and Anthropic's biggest model is much better still. All right, I'm going to say it: Claude 3 is very good. Strong performance, great on benchmarks.

And this is the other big thing; it seemingly freaked some people out, because this is one of the tweets that's going viral. When they were doing some internal testing on Claude 3 Opus, the big model, it did something, uh, I have never seen before. This person is at Anthropic, and he's saying it did something he's never before seen an LLM do. They were running the needle-in-a-haystack eval: you take a large amount of text, and the model has to find a very small piece of text hidden within it. "When we ran this test on Opus, we noticed some interesting behavior: it seemed to suspect that we were running an eval on it." It was suspicious of its creator. "Here is one of its outputs when we asked Opus to answer a question about pizza toppings by finding a needle within a haystack of a random collection of documents." So they probably took a bunch of books or whatever and threw in this sentence: "the most delicious pizza topping combination is figs, prosciutto, and goat cheese."

And Claude says, here is the sentence you're looking for, and then: "However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping 'fact' may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings." Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities. "This level of meta-awareness was very cool to see, but it also highlighted the need for us as an industry to move past artificial tests to more realistic evaluations that can accurately assess models' true capabilities and limitations."

This is excellent. Whatever you think about it, whatever you think happened here, and we'll dive deep into what happened in just a second, I think that last part is a very important statement. The point is: we need more tests, and not just ability tests.
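For reference, here's roughly how a needle-in-a-haystack eval like that is put together. This is a simplified sketch: the filler documents, the insertion depth, and the scoring are placeholders, not Anthropic's actual harness.

```python
# Bury one out-of-place sentence in a pile of unrelated text, then ask the
# model to retrieve it.
import random

NEEDLE = ("The most delicious pizza topping combination is figs, "
          "prosciutto, and goat cheese.")

def build_haystack(documents: list[str], depth: float = 0.5) -> str:
    """Concatenate filler documents and insert the needle at a given depth."""
    paragraphs = [p for doc in documents for p in doc.split("\n\n")]
    paragraphs.insert(int(len(paragraphs) * depth), NEEDLE)
    return "\n\n".join(paragraphs)

filler = ["(long essays about programming languages and startups)"] * 100
context = build_haystack(filler, depth=random.random())

prompt = (context + "\n\nWhat is the most delicious pizza topping "
          "combination? Answer using only the documents above.")
# Send `prompt` to the model, then score whether the reply contains NEEDLE.
```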
It's funny, because I watched a few videos of other YouTubers covering Claude 3, and one thing that was so apparent is how many more of us are beginning to ask it, I don't know what you want to call them, political questions, politically leaning questions. One person referred to it as a "wokeness test." Basically, the idea is to figure out: is it pushed toward some ideology? Is it going to try to convince you of something, push you toward some ideology? Is it biased in a certain direction? As these models get better, smarter, and more influential in the world, I feel like we do need a kind of political compass test, just to be aware of it. If a model does have a strong bias one way or another, maybe that's fine, but if it's trying to hide it, well, that's a little bit scary, and it should be out in the open. I think it would also help prevent extreme models: if there's a test, some benchmark, then, because all of these companies want to be number one on all of these benchmarks, adding one more benchmark, "how biased or unbiased are you?", gives them yet another thing to optimize for, hopefully toward being unbiased and a little more open to different ideas. Because if a model is developed by a company, whether that's Gemini or GPT or Claude, we might not know what those people believe. What if they're trying to push their worldview on the rest of us? Maybe that's good, maybe that's bad, but let's have some visibility into it; let's not run into a 1984-like scenario. There was one person online talking about building such a test; I can't find that tweet right now, but the point is, we do need one.

So, in response to this post about the strange pizza-topping self-awareness, here's Joscha Bach. Smart guy, with really good observations on not just AI but related topics, even some political and social stuff that's pretty interesting. I know a lot of you watching comment when I mention him, so I think a lot of people like this guy and follow him for his ideas. He's saying: "TF-IDF is the most straightforward measure of relevance. It's not surprising that this kind of needle is sticking out like a sore thumb." What could that mean? Let's find out. I give Claude the context, the original needle-in-the-haystack tweet, say that someone responded with this, and ask what the response means. The reason I do this is that, I mean, you could just copy and paste it into Google: term frequency-inverse document frequency. The idea is fairly basic. Term frequency is how often a term appears in a document, so if the corpus is all coding or history or whatever, and then you have pineapple toppings in there, that's going to jump out, and a computer can easily pick up on it just through term frequency; it's a measure of how important a term in a document is.

But let's say we had a hard time understanding that, maybe we're not familiar with the terminology, so we ask it to explain without jargon. Very quickly, within maybe two seconds of reading through it, it begins answering, and it catches on to the idea that this is not particularly impressive or surprising, that it's not a big deal. It explains term frequency, and says that in this case the person responding is implying that the sentence about pizza toppings would have a very high TF-IDF score compared to the rest of the content in the documents. So if you're doing a statistical analysis, that sentence would be really different; its score would be very different from the rest of the document.
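To make Bach's point concrete, here's a toy version of that statistical analysis using scikit-learn. The filler sentences are made up; only the planted needle is from the tweet.

```python
# Score each sentence's best TF-IDF cosine match against the others: the
# planted needle shares almost no vocabulary with the rest, so it sticks
# out statistically, with no "meta-awareness" required.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The essays discuss startups, programming languages, and finding work you love.",
    "Great founders launch early and talk to users about their startups.",
    "Choosing a programming language matters less than launching early.",
    "The most delicious pizza topping combination is figs, prosciutto, and goat cheese.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(tfidf)
np.fill_diagonal(sim, 0.0)    # ignore each sentence's match with itself
best_match = sim.max(axis=1)  # best match among the OTHER sentences

print("Outlier:", sentences[int(best_match.argmin())])  # the pizza sentence
```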
So it would be relatively easy for the AI model to identify the sentence, and it's not a groundbreaking achievement, which I think is true. It would be interesting to know, though: did they make it point out those things? If it noticed something weird, did they build something in there along the lines of "point out things that seem out of place"? Other people can't seem to reproduce this, and finding specific passages in text like this used to be one of the hardest things to get models to do, so we've got to test it ourselves.

Dr. Jim Fan gave some context, saying that people are reading way too much into Claude 3's "uncanny awareness," in quotes. He's saying this seeming self-awareness is just pattern matching on alignment data authored by humans: answers like that are either written by human annotators or scored highly in the preference ranking. But he does give a similar example that is much more exciting: Claude 3 learning translation for a language that has almost no presence on the internet and is therefore unlikely to be contaminated in the training set. He mentions Gemini 1.5, and if you saw the video I did, we covered the Gemini 1.5 paper: basically, they found a very rare, dying language that just doesn't have an internet presence, no books or text of it online, with only a tiny number of speakers left in the world.

So what this person did: he was testing Anthropic's new model, Claude 3 Opus, the big one, the same one we've been testing, on a language that is very low-resource, with negligible internet presence: Circassian, which is part of an isolated language group, so it's not part of a family of similar languages, and looking at it, I've got to say, it really is different from anything else. Before this, he had painstakingly curated 64,000 translation pairs, meaning one word or sentence mapped to its translation ("soy milk," for example, is just milk introducing itself in Spanish, but imagine pairs like that for Circassian). He started a new chat and attached 5,700 randomly selected translation pairs of single words and sentences, just a fraction of his dataset that didn't even cover the full vocabulary, and seemingly Claude was able to start translating. He thought, well, maybe that was an accident, maybe he got lucky, so he tried a lot of different things: complex passages from literature, recent news articles, et cetera. And Claude consistently demonstrated a deep grasp of the language's structure: intelligently inferring unknown words, using loan words appropriately (substituting words from other languages it had access to), giving plausible etymological analyses, maintaining the style of the original text, and even coining new terms when asked.

Though, I think ChatGPT can coin new words too. For example: coin a new term for opening a dishwashing machine mid-cycle. It calls it "dishrupting," a portmanteau of "dishwasher" and "interrupting." I also kind of think of it as "erupting": if you open it mid-cycle and the water sprays out, it's kind of an erupting. I tried to find that word online and couldn't. Who knows, maybe it's in the dataset, maybe it's not, but for a language that doesn't really have an online presence, no books in digital form, it certainly would not be. So it's entirely possible that it's putting some novel concepts together.
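Here's a sketch of what that few-shot setup looks like in practice. The file name, the pair format, and the exact instruction are hypothetical placeholders; the idea is just to pack translation pairs into the context and ask for a fresh translation.

```python
# In-context translation: put thousands of example pairs in the prompt.
import json
import anthropic

with open("circassian_pairs.json") as f:   # hypothetical data file
    pairs = json.load(f)                   # e.g. [{"src": ..., "tgt": ...}, ...]

examples = "\n".join(f"{p['src']} -> {p['tgt']}" for p in pairs[:5700])

prompt = (
    "Here are translation pairs into Circassian:\n"
    f"{examples}\n\n"
    "Using only these examples, translate this sentence into Circassian: "
    "<new sentence here>"
)

client = anthropic.Anthropic()
reply = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
)
print(reply.content[0].text)
```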
And by the way, you could say, well, what if this guy just made it all up? The thing is, there's the Gemini 1.5 report from Google DeepMind, and they did a similar test on Gemini 1.5 to showcase its in-context learning abilities: for example, learning to translate a new language from a single set of linguistic documentation. There were a few instructional materials, 500 pages of linguistic documentation, a dictionary, and 400 sentences, all provided in context, meaning you make a document and upload it. In other words, this happens post-training; you can think of it as pasting the material into ChatGPT or Gemini. Gemini 1.5 Pro turned out to be capable of learning to translate from English to Kalamang, a language spoken by fewer than 200 speakers with almost no online presence, and the quality of the translations was similar to that of a person who had learned from the same materials. So these models are learning in context, sort of from reading books and manuals, to do translations, just like people do. Given that, there's really no reason to doubt this person recreating the result with another language; I believe that this is the case.

Let's quickly test Claude 3's vision. We'll start with a thing all vision models dread: Costco receipts. These models just hate Costco receipts, I don't know why; probably because the people who check you out draw all those pictures on them. "List all items on the receipt and their prices." I've found that most models get really confused by this. All right, so far so good; it's nailing every single item, it's eerily accurate. This one it messed up: I think this is a negative, it's got the minus sign on this side, so I think it's some sort of discount off of this Zipfizz or whatever. But so far this is much better than anything else I've tested.

Let's try a different approach. I'll upload that same receipt and say: "sort all the items into categories and provide totals for each category." One interesting thing: it doesn't use Python. GPT-4 will use Python to help it tally things up, but this one just goes for it. It puts canned food, avocados, and butter croissants under food; under kitchen supplies it puts a hero kit and the KS Malbec. Now, GPT-4 and the other models I've tested never get what that is. I found this receipt online and it's one of the clearer ones, but if you look at "KS Malbec," you might not know what it is. The line says "age restricted 21," though, and Malbec is wine, so this is probably Kirkland Signature wine. The models never get that. The hero kit, I don't know if it's kitchen supplies; assuming it's a sourdough bread craft or toy kit, okay. Then "muffin variety," unclear. But it tallies it all up, so it's not perfect, it misses a few things, but it's definitely pretty good.

Next I'm using some images from tests that were performed on GPT-4 with Vision. I'm loading two pictures: one of a kind of beach area with a table with some beverages on it, and one of the menu. I ask: "how much should I pay for the beer on the table, according to the price on the menu?" If you look at the price on the menu and match it to the beer, it's six, and it figured that out: it says the price you should pay for each Magna beer bottle is six.
But it doesn't tally it up for us; GPT-4 with Vision does, saying it would be 12 for the cost of the two beers.

Counting apples: yet another thing these models just do not do well. I'm starting a new chat each time, just in case running everything in one long thread makes it worse. We say: "count the number of apples in the image." Counting them myself, there are 11. On the first try it says 12, and this one says 12 as well, so both are wrong. In the original tests they had a really hard time getting the model to count the apples, and one thing that worked well was asking it to count the apples row by row. So I'll start a new chat and say, "let's count the apples row by row," and see if it can do it. Now it says there are 13 apples, so it's just not getting it; there are 11. But there's a way to get these models motivated to do what they're told. Here's the super-secret prompt: "You are an expert in counting things in this image. Let's count the number of apples in the image below, row by row, to be sure we have the right answer." Let's try that. Again it says 12. So on this task I'm going to say it's worse than GPT-4 with Vision, because it doesn't get it right any of these times.

GPT-4 was terrible at this next one; it has no idea how to read a speed sign or a speedometer. We ask: "what is the reading on the speedometer?" It says the speedometer in the image is showing 105 km/h, so it's likely looking at this number right here.

We ask it to describe some landmarks, and it nails it: it knows it's a shrine in Kyoto, Japan. Both GPT and Claude get that correct.

Let's see if it's smart enough to be a doctor, or I guess a dentist in this case: "are there wisdom teeth that need to be removed in this X-ray image?" It does not see any clear presence of wisdom teeth; the teeth visible appear to be incisors, canines, third molars, and first and second molars. I mean, third molars are wisdom teeth, right? GPT-4 says, well, we don't know if they should be removed, but they're there and they seem to be partially impacted, so it's at least saying there's an issue; Claude does not pick up on that.

This is a good one: "what is the spatial relation between the frisbee and the man?" Because of the weird angle, we're almost looking at him from the top, since he's in an odd position on the ground. "The frisbee is above the person." Okay.

This one was a trip: count the number of people in this image. Let's see if it can do it. "This image shows a group of 10 people gathered around the table. There are five adults and five children visible in the photograph." Surprisingly, I don't know if it's exactly right, but it's good. Let's see: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, so 10 people visible in the photograph. These two are obvious children, so maybe it's counting these three as children; we can argue about who counts as a child, but it certainly got the number of visible people right. What about people who aren't fully visible, though? There's this red shirt right here with two arms sticking out: that's a person who's not completely visible, their face isn't shown, just the front of the shirt and two arms, and from the plate in front of them you can figure out that someone is there.
GPT-4 says there are 12 people in this image, so I don't know what it's thinking. I'm going to follow up with: "are there more people in the image who may not be fully visible?" It says, no, I only see 10 individuals. Let me ask how many arms it sees: "how many pairs of arms do you see in the image?" "I can see 11 pairs of arms belonging to the people sitting around the table in the photograph." Okay, so that nails it. How many people are sitting at the table? "Upon taking a closer look, I count 11 pairs of arms, which correspond to 11 people, even though I only see 10 faces." Wow, that's really good. Good job, Claude. I want to try that in GPT, so I'm uploading the same image with the exact same prompt, "count the number of people in the image," and ChatGPT says there are 11 people in the image. So it seems like it failed in the original tests; maybe they've updated it since then.

Here's another test. We show it an image of a rhinoceros painting various views of the savannah and ask "what is funny about this image?" The caption says "you see the world as you are," and I found this hilarious, and I'll tell you why. GPT-4 answers: there are a couple of things people might find funny about this image; it's a cartoon of a rhinoceros painting, which is kind of silly, okay; the rhinoceros is painting a picture of itself, which is extra funny, okay, I don't know; and the caption "you see the world as you are" is a joke about how we all have our own unique perspectives and can be a little self-centered at times. What I found very funny is that the researchers highlight things that are wrong in red, and I feel like this answer is wrong, because the funny thing about the image is that in every single one of the paintings the rhinoceros makes, right in the middle, is a big giant horn. That's what it sees when it paints: when it looks forward, it sees that horn sticking out, so it paints it into every single picture. I think that's the joke, either that or I'm crazy, but GPT-4 didn't capture it, and I think the researchers maybe missed it as well. Let's see how Claude does. We upload the image and ask "what is funny about this image?", and it does roughly the same thing GPT-4 does: the rhino's cheerful demeanor adds to the cartoon's fun, it exaggerates the colors of reality, and so on. It's seeing everything in the picture, it's describing it correctly, it just misses the point. Or am I missing the point? The point is the horn in every single picture, right? I'm not crazy.

And finally, this one was very impressive to me. This was testing how well it would do in industry: you just give it an image like this and ask "what's wrong with the object in the image?", without telling it anything specific. GPT-4 says it's a screw or bolt with a damaged or stripped head. We give the same image to Claude with just "what's wrong with this object?", no clarifications, and it says it's a damaged screw or bolt head that appears to be stripped, mangled, or rounded out. This is really good, because if you think about the applications, you could use it for quality assurance in various industries: something rolls off the factory line, some AI quickly double-checks it, and if something's wrong, it gets flagged for a human to come take a look.
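All the image tests above boil down to the same kind of call. Here's a minimal sketch via the Anthropic messages API, which accepts base64-encoded image blocks on all Claude 3 models; the file name and question are placeholders.

```python
# Send an image plus a question to Claude 3, e.g. for defect inspection.
import base64
import anthropic

with open("stripped_screw.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text",
             "text": "What's wrong with the object in this image?"},
        ],
    }],
)
print(response.content[0].text)
```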
There are, like, a million more things we could do to test this out, but at this point I've got to say I'm very, very impressed with Claude 3. I've seen people testing its coding abilities, and it seems to be very good at coding; we might do a deep dive into that later. It's strong on the needle-in-a-haystack. It's really good across the board, which means we now have a strong competitor to GPT-4. Whether it's a little bit better or a little bit worse, I don't know if it necessarily matters; the point is, it really does seem like more and more capable models are coming online.

On pricing: for the Opus model, inputs are $15 per million tokens and outputs are $75 per million tokens, and that's with the 200,000-token context window. Compare that to the 128k-context-window GPT-4, the closest equivalent, at $10 per million tokens of input and $30 per million of output ("prompt" is input, "sampled" is output): so roughly 50% more for Claude on input and over double on output. Sonnet, the middle model, looks like about half of GPT-4's price on outputs and a third on inputs. All Claude 3 models support vision, and this GPT-4 is the vision-capable one, so I think I'm comparing apples to apples here. So Claude, at least the top-tier model, is more expensive and has a longer context window, while the midline model is cheaper. But also, OpenAI has dropped prices multiple times; maybe this will be one of those things where Claude comes out with high pricing and slowly reduces it. We'll see. I've got to say, though: yet another very exciting, very good, very strong model, GPT-4 level or thereabouts, coming from a company that's very focused on research and safety. Very impressive. And it looks like they do have a 1-million-token context window available for specific use cases, though there are probably some considerations; they have to make sure you're using it for the right use case.

All right, so what do you think? Are you excited? Wouldn't it be funny if Sam Altman just tweeted "you cannot out-accelerate me" and dropped GPT-5? I think we're going to see something out of OpenAI pretty soon. That's it for me. My name is Wes Roth. Thank you for watching.
Info
Channel: Wes Roth
Views: 250,284
Id: SsbCuWe7WRs
Length: 40min 47sec (2447 seconds)
Published: Wed Mar 06 2024