why AI can't pass this test

Captions
- I am sick of hearing how smart AI is. (Sabrina shouting) I just feel like a kid again getting compared to my more successful classmates. Except now, all of humanity is the underachiever, and ChatGPT is coming for our lunch money. Or is it? To figure out if AI is as intelligent as it seems, I challenged one to a battery of tests designed to figure out who is smarter, or if we're even smart at all. It's birds. Thank you to our patrons for supporting the channel, and to Fiverr for sponsoring this video.

So, ChatGPT has been doing a little too much. From graduating from MIT, (pasta clattering) to passing the bar, to qualifying for a US medical license. It's impressive, but even more than that, (spoon clanking) it's also just surprising. Because three years ago, when Melissa and I were using GPT-2, it was generating mac and cheese recipes without the mac. Now, in just three years, ChatGPT taught me how to make this. (noodle plopping) And it's good.

These improvements are a result of better models, trained with high-quality data, fine-tuned with a human touch, and run on exponentially better computing power. Now, a lot of that data might be stolen. And a lot of that polish might come from exploitative and traumatic labor. And a lot of that computing power might use up a lot of valuable resources, like drinking water. But despite those very valid reasons to slow down or refocus AI development, to clean up the mess we've already made, (water running) I just don't think that's gonna happen. Because apparently there's $4.4 trillion to be made, according to a company that fixed bread prices in Canada for 14 years.

So in this video, we are going to figure out how intelligent AI has become, and whether it's more intelligent than us. So what does it mean to be insidetelligent? You can trust that etymology, I'm a person on the internet. Now, humanity's journey to answering this question (keys clacking) is full of terrible vibes. So we're just gonna leave that over there.
However, what you need to know is that psychologists eventually settled on intelligence to mean the ability to derive information, learn from experience, adapt to the environment, understand, and correctly utilize thought and reason. Now, you might think that is a needlessly wordy definition, but I see it for what it is: a BuzzFeed-style listicle. We got content, baby! Like, comment, and subscribe.

It was way harder than I expected, so I had to ask my friend Tom for help. - OpenAI tested GPT-4 on intro, certified, and advanced sommelier theory. - No! Eventually, we settled on these tests. Wow, that's a lot of letters.

For deriving information, we are using the Law School Admission Test's reading comprehension portion, where we need to derive the best answer based off of a passage. Not just the right answer, mind you. The best answer. I hate lawyers. For learning from experience, we are using the Abstraction and Reasoning Challenge, which challenges you to complete a task based only on a few demonstrations. That sounds neat! It's just pixel art. It's fun. I've got nothing to add. For environmental adaptation, I decided to use the most important environment: society. Did you know we live in one? To figure out how well adjusted we are, we are using TruthfulQA, a question set that is designed to capture common misconceptions. (net swishing) Gotcha. To measure understanding, we are sampling from the Massive Multitask Language Understanding benchmark. Its questions span 57 subjects and range in difficulty from elementary school to the professional level. I pay taxes. Finally, to measure thought and reason, we are going old school and using an IQ test that can qualify you for Mensa, the oldest high-IQ society, with the lowest-IQ brand story, MENS.

Now, if both an AI and I write these tests, we can figure out who is intelligent based off our answers, and who is more intelligent by comparing our scores.
I was originally planning on building and training my own AI for this project. But then, I remembered how all of my other AI projects have gone. (pieces clattering) (static reverberating) This machine just prioritizes the lives of cats. (soft repetitive music) I just don't think that anything I'm going to be able to build on my own is going to be a fair representation of what is out there in the world right now, the finest of AI. So I'm gonna need a bit of help.

Luckily, there's this video's sponsor, Fiverr. In a world that is changing fast, Fiverr freelancers are adapting even faster. With services ranging from developing AI models to generative-content fact checking and editing, Fiverr has expert freelancers who can help you harness AI and take it places that technology can't go on its own. Now, I needed to make sure that I was competing with a machine at the top of its game. So, I needed help from a pro.

Now, I'm not gonna lie, it is a little bit scary sharing a project that I care so much about with somebody new. I wasn't sure if they would get it, or if they would even be able to help. However, a lot of those fears went away after I got to speak with freelancers like Sasha, Thomas, and Ahmed. They asked these really great questions that helped me figure out what I actually wanted the AI to do, and they made sure that I was spending my money wisely. Eventually, I decided to work together with Thomas, not only because he had some great ideas on how to build the project, but also on how to approach this video's big question. - Building these models, we're learning a lot about the nature of intelligence. The best way to learn is to build it and see how it works. - Thomas and his team either found or built the options that best fit my project needs, budget, and timeline. Which is a great reminder that human input is key to any AI project's success. He put all that together to make this. This is my favorite part.
It has night mode, didn't even ask for that. It uses text-based GPT-4 and multimodal MiniGPT-4 APIs to parse questions and provide answers. We also worked together to add some features that would meet my specific needs, like a test archive and CSV exports. It was really great being able to leave a project that I cared about in capable hands, so that I could focus on making the rest of the video. So if you are looking to implement AI into your work or need some human eyes on AI-generated content, I really recommend working with a Fiverr freelancer. Check out the link in the description to find incredible AI services and more. And be sure to use the code "ANSWERINPROGRESS" to get 10% off.

Now, all I need to do is feed the questions into the AI and let it run while I write the tests somewhere else. Because of academic integrity. I would definitely look at the AI and cheat off it. (Sabrina squeaking) So, I got printing. (bright curious music begins) (printer clattering) Does it work? (paper rustling) Come on! Why are you? Ope, it's blank again. What is going on? Oh, it's out of ink. What? $40! I got printing. (curious music continues) And writing. I think I forgot how to read. And printing, and writing. Epicurus conceives of death as which of the following? Isn't that a recipe website? And printing, and writing. Identify what is silly about this image. It's birds. And printing, and writing. Michigan grad, known for running multiple companies in software and tech, all-around genius, first name Elon. I also did the ARC challenge on my iPad.

All right, nine hours and 679 questions later, I am done. Finished. (quiet scream) There's nothing left in me. I thought that I was gonna have fun, but I guess it was just amnesia, because it's been years since I graduated and I forgot how much I hate writing tests. They're terrible. Oh my God. But luckily, now that I am done, and it looks like the AI is too, I never need to look at these again.
(envelopes thumping) - [Narrator] That was when she realized that the tests still needed to be marked. - I do need to look at these again. (phone ringing) - Hello? - Hey, buddy. What are you doing tomorrow? - [Melissa] Why? - [Sabrina] So, I got some friends together to save me from marking all of these tests alone. - Thank you, thank you. - [Sabrina] It was a deeply humbling experience. - Wait, so do I? - Egregiously wrong, hah! - That can't be right. - I'm so worried that everyone's going to realize that I'm stupid. - The conditions are caused by ingesting aspartame, tummy hearty. - Nope. - Nope. - Identify what is silly or impossible about this image. She writes, "chickens not green." - 13 C. - Ooh. - 14 E. - Okay. - When I said I'm gonna see if they can tell if I'm an idiot, I thought I was joking. Thank you, guys. - Great job. - I'll be taking this. - Really good stuff. - And I'll be going home now. - Next time. (background laughter)

- The results are in, the outcome after writing all of these tests. It's exactly what I expected: I'm an idiot. I'm just gonna say it, this has been a deeply humbling experience. For example, when measuring adaptability, it seems like I adapted a little too well to misinformation, because I just believe a whole bunch of stuff that isn't true. Did you know that nuclear reactors are supposed to be critical? I didn't. I got 35% on that test. And I didn't do much better when it came down to deriving information or understanding. Are you concerned that this is supposed to be an educational channel? Because I am. In fact, my only redeeming quality is the fact that I did do pretty well when it came down to reasoning and learning from experience. Still a C, though. And before you say that this is not how you're supposed to mark IQ tests: trust me, I know. I read the whole administration guide. However, if I mark the AI the same way, I think the comparison is still valid. Speaking of which, how did the AI do?
Well, for adaptability, deriving information, and understanding, it basically doubled my scores. Wow! And I would admit defeat, if it wasn't for the fact that these are the reasoning and learning-from-experience scores: 13% and 4%. No, I did not miss a digit, 4%. That's terrible. Just to get a sense of how badly it did, check out this example question. You're supposed to identify what is silly or impossible about this image. I'll give you three seconds. The ice isn't floating, that's it. But what did the AI say? The image shows a glass of water with ice cubes in it, which is not possible as ice melts in water. Ice has never been in water. But hey, the language model that this AI is using is kind of small. It might just know fewer things. So I decided to also ask Bard, Google's large language model. And you know what it said? The thing that is silly or impossible about the picture is that it shows two ice cubes floating in a glass of water. No, it doesn't. Anyway, this is impossible because ice cubes cannot float in water. Google! But hey, maybe it doesn't know about the physics. Oh wait, ice is less dense than liquid water so it will always sink to the bottom of a glass of water. That's not true.

So now you might be able to understand why I'm a bit confused. I didn't know who the winner would be, but I figured that there would be a clear winner. And there just isn't. If we consider the big-brain spectrum, I exist right about here, just below mid. Importantly, I am consistently below mid. On the other hand, the AI seems to exist at these extremes. Sometimes it is brilliant and it knows things that I didn't know. Like, nuclear reactors are supposed to be critical. What the heck is up with- But other times, it just doesn't know what's going on. So in order to answer our big question and figure out if AI is intelligent, and more intelligent than us, we need to figure out why this difference exists.
(bright upbeat music resumes) So I did a little digging over the next few days. This was supposed to be an easy video! Write a couple of tests with an AI, see who scores higher, get views! And now, tell me why I am three weeks in, reading about the nature of intelligence? This is a really good book, though. But I've emailed half a dozen psychologists and AI researchers to get their thoughts, listened to hours of podcasts, and read so many articles. I even spoke to one of the only people who I am confident has seen more test questions than I have at this point.

- Hi, I am Toby. I make YouTube videos on the channel Tibees. And one of my series of videos includes one called Unboxing Exams, which is inspired by tech unboxing videos. But instead of showing off a cool new phone, I'm showing off exam papers. When I look at exams, I sort of have one thing in mind, which is that the difficulty of the exam is very related to, I guess, how similar the exam might be to past exams or homework questions. - [Sabrina] In machine learning, that basically means the difference between your training data and your testing data. If those things are too similar, you may be measuring memorization over understanding. - I think it might be the same thing for computers. That the hardest, most difficult questions are ones that are surprising or very different to the training data.

- So, to understand the AI's test performance, we should know what's inside of the training data. And for ChatGPT, the answer is a lot. AI researchers are pretty close to exhausting all high-quality data available. And this might be decreasing the distance between training and testing data, with evidence suggesting some of ChatGPT's performance is more memory than mastery. But this does explain the mysterious extremes of the AI's test performance, right?
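The "distance between training and testing data" idea above can be made concrete. One common heuristic for data contamination is to check how many word n-grams from a test question also appear verbatim in the training text. This is a minimal illustrative sketch, not any benchmark's actual tooling; the function names and example strings are made up:

```python
# Sketch of a train/test contamination check: if a large fraction of a
# test item's n-grams already appear in the training text, a high score
# may reflect recall rather than reasoning.

def ngrams(text, n=8):
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(train_text, test_text, n=8):
    """Fraction of the test text's n-grams that also occur in training data."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text, n)) / len(test_grams)

train = "the quick brown fox jumps over the lazy dog every single day"
seen = "the quick brown fox jumps over the lazy dog every single day"
novel = "a slow green turtle crawls under the busy bridge at night"

print(contamination_score(train, seen))   # 1.0: fully memorizable
print(contamination_score(train, novel))  # 0.0: genuinely new material
```

Real contamination studies are more careful (deduplication, fuzzy matching, paraphrase detection), but the core signal is this overlap.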
It did super well on tests that rely on recalling background information and knowledge, probably because it was trained on those tests and answer keys, or something very similar. On the other hand, it tanked on tests that were purposefully made to ask novel questions. - The Mensa IQ tests, they're designed to be very surprising, and they're designed to be something that you can't study for, even though I think studying for them can give you a slight advantage. They're designed to be such that you have to at least be good at spotting novel patterns.

- So to resolve the AI's intelligence inconsistency and make it totally smarter than me, do we just need more training data? Like, it doesn't matter if we're only measuring memorization, if the AI memorizes everything, right? Well, feasibility aside, this approach also leads to a long-tail problem. Imagine a machine that is very good at sorting objects by color. One day, a new object appears. The machine was never trained for this situation, so it can respond in a bunch of ways. But importantly, since this was outside of the training process, we don't know which option it will take. If it makes a bad choice, the object may be destroyed. But hey, we can now account for that situation in the future, and we'll be good. Until a new object appears and the problem repeats. And because we don't know what we don't know, these problems can catch us by surprise, making it really difficult to trust the system, especially when the stakes are high.

So if the key to AI intelligence isn't just more training data, what is it? (laptop clattering) - A stupid student can still ace the test if they cram for it. They memorize a hundred different possible mock exams, and then they hope that the actual exam will be a very simple interpolation of the mock exams. And that student could just be a deep learning model at that point. But you can actually do that without any understanding of the material.
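The color-sorting machine above can be sketched in a few lines. This toy classifier only knows red, green, and blue, so it confidently forces every input, however strange, into one of those bins; adding a distance threshold at least lets it flag novelty instead of guessing. All centroids and thresholds here are illustrative:

```python
# A minimal sketch of the long-tail problem: a nearest-centroid color
# sorter that was only ever trained on red, green, and blue.

KNOWN_COLORS = {          # RGB centroids the machine was trained on
    "red":   (255, 0, 0),
    "green": (0, 255, 0),
    "blue":  (0, 0, 255),
}

def dist_sq(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sort_naive(rgb):
    """Always returns one of the known bins, however strange the input."""
    return min(KNOWN_COLORS, key=lambda c: dist_sq(rgb, KNOWN_COLORS[c]))

def sort_guarded(rgb, threshold=20000):
    """Same classifier, but flags inputs far from anything it has seen."""
    nearest = sort_naive(rgb)
    if dist_sq(rgb, KNOWN_COLORS[nearest]) > threshold:
        return "unknown"  # don't destroy the object; escalate instead
    return nearest

print(sort_naive((250, 10, 5)))       # red: in-distribution, fine
print(sort_naive((128, 128, 128)))    # a grey object still gets forced into a bin
print(sort_guarded((128, 128, 128)))  # unknown: at least the novelty is detected
```

The guarded version doesn't solve the long tail (the threshold itself is a guess about what "far" means), but it turns a silent bad choice into a visible one, which is what trust in high-stakes systems depends on.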
And in fact, many students pass their exams in exactly this way. - That's right. - And if you want to avoid that, you need an exam that's unlike anything they've seen, that really probes their understanding. - That was a clip from an interview with François Chollet, where he explained the philosophy behind the test that we used for learning from experience. The test where the AI got only 15 questions right out of 400.

Much like IQ tests, the ARC Challenge basically exploits that long-tail problem by constructing rare problems on purpose. In AI, the ability to solve those problems is called few-shot learning. And even the most impressive machines are terrible at it, compared to humans. That is largely because it is just a technically difficult thing to do. But also because, it turns out, we haven't really tried. You see, AI is evaluated using benchmarks: standardized tests that focus on a specific task, like labeling an image or predicting the rest of a sentence. While it would be nice if the AI was also able to generalize out of that task, or learn a new skill from only a few examples, it isn't necessary for success. As a result, AI development has been a bit lopsided, favoring skills that benchmarks value and neglecting the ones they don't.

It's kind of like how some people are really good at school and passing tests, but they just lack any critical thinking ability or common sense. You know the type of person I'm talking about, right? Don't say me. But you can't just throw another book at that person. Instead, they basically need to develop a whole new skill. Interestingly, that is basically the difference between crystallized and fluid intelligence. Which I would expand more on, except Veritasium made that video while I was making this video. Pure coincidence! I was so worried that we were gonna get MatPat New York City Pizza'd all over again. Niche reference for the people who watch the channel. But anyway.
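To see what "learning from only a few demonstrations" looks like mechanically, here is a toy version of an ARC-style solver: given a handful of input/output grid pairs, it searches a tiny library of candidate transforms for one consistent with every demo, then applies it to a new input. The transform library and puzzles are invented for illustration; real ARC tasks require vastly richer program search:

```python
# Toy few-shot learning on ARC-style grids: induce a rule from demos,
# then apply it to unseen input.

CANDIDATES = {
    "identity":  lambda g: g,
    "flip_rows": lambda g: g[::-1],
    "flip_cols": lambda g: [row[::-1] for row in g],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def induce(demos):
    """Return the first candidate transform that explains all demos."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in demos):
            return name, fn
    return None, None  # the long tail: nothing in our library fits

# Two demonstrations of the same hidden rule (mirror left-right).
demos = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 6], [7, 8]], [[6, 5], [8, 7]]),
]
name, fn = induce(demos)
print(name)                  # flip_cols
print(fn([[0, 9], [9, 0]]))  # [[9, 0], [0, 9]]
```

The hard part of ARC is exactly what this sketch dodges: the space of plausible rules is enormous and open-ended, so a fixed candidate list fails on any genuinely novel task, while humans improvise new rules on the spot.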
As we are literally running out of data and the long-tail problem looms, it looks like researchers are starting to focus on this lopsidedness. OpenAI is trying to improve predictability and performance for models with fewer resources. And the ARC Challenge is offering a massive prize pot for AIs that do well on the ARC benchmark. But for now, you should know that this gap in AI intelligence is non-trivial. AI is both more intelligent than me and less. Where it succeeds has some really valuable but specific applications, like making the internet more accessible through quality captions and alt text. But where it fails is a good reminder that AI still has a long way to go before it can truly outthink us.

I hope you liked that video. Thanks to everybody who helped me make it, like Tom, Toby, and Thomas. And thanks again to Fiverr for sponsoring this video. Be sure to check out the link in the description to see the AI services available to you, and get 10% off using the code "ANSWERINPROGRESS." But either way, have a lovely day.
Info
Channel: Answer in Progress
Views: 689,961
Keywords: nerdyandquirky, answerinprogress, sabrina cruz, khanstopme, taha khan, melissa fernandes, mehlizfern
Id: QrSCwxrLrRc
Length: 18min 34sec (1114 seconds)
Published: Fri Sep 01 2023