- I am sick of hearing how smart AI is. (Sabrina shouting) I just feel like a kid
again getting compared to my more successful classmates. Except now, all of humanity
is the underachiever. And ChatGPT is coming
for our lunch money. Or is it? To figure out if AI is as
intelligent as it seems, I challenged one to a battery of tests designed to figure out who is smarter, or if we're even smart at all. It's birds. Thank you to our patrons
for supporting the channel. And to Fiverr, for sponsoring this video. So, ChatGPT has been
doing a little too much. From graduating from MIT, (pasta clattering)
to passing the bar, to qualifying for a US medical license. It's impressive, but even more than that, (spoon clanking) it's also just surprising. Because three years ago when Melissa and I were using GPT-2, it was generating mac and cheese recipes without the mac. Now, in just three years, ChatGPT taught me how to make this. (noodle plopping) And it's good. These improvements are a
result of better models, trained with high-quality data, fine-tuned with a human touch, and run on exponentially
better computing power. Now a lot of that data might be stolen. And a lot of that polish might come from exploitative and traumatic labor. And a lot of that computing
power might use up a lot of valuable resources,
like drinking water. But despite those very valid reasons to slow down or refocus AI development, to clean up the mess we've already made, (water running)
I just don't think that's gonna happen. Because apparently there's
$4.4 trillion to be made, according to a company that fixed bread prices in Canada for 14 years. So in this video, we
are going to figure out how intelligent AI has become, and whether it's more intelligent than us. So what does it mean
to be insidetelligent? You can trust that etymology,
I'm a person on the internet. Now humanity's journey to
answering this question (keys clacking)
is full of terrible vibes. So now we're just gonna leave over there. However, what you need to
know is that psychologists eventually settled on intelligence to mean the ability to derive information, learn from experience,
adapt to the environment, understand, and correctly
utilize thought and reason. Now you might think that is a
needlessly wordy definition, but I see it for what it is. A BuzzFeed-style listicle. We got content, baby! Like, comment, and subscribe. It was way harder than I expected, so I had to ask my friend Tom for help. - OpenAI tested GPT-4 on intro, certified, and advanced sommelier theory. - No! Eventually we settled on these tests. Wow, that's a lot of letters. For deriving information, we are using the Law School Admission Test's reading comprehension
portion, where we need to derive the best answer
based off of a passage. Not just the right answer, mind you. The best answer. I hate lawyers. For learning from experience, we are using the Abstraction and Reasoning Challenge, which challenges you to complete a task based only on a few demonstrations. That sounds neat! It's just pixel art. It's fun. I've got nothing to add. For environmental
adaptation, I decided to use the most important environment, society. Did you know we live in one? To figure out how well adjusted we are, we are using TruthfulQA, a question set that is designed to capture
common misconceptions. (net swishing)
Gotcha. To measure understanding,
we are sampling from the Massive Multitask Language
Understanding benchmark. Its questions span 57 subjects
and range in difficulty from elementary school to
the professional level. I pay taxes. Finally, to measure thought and reason, we are going old school
and using an IQ test that can qualify you for Mensa, the oldest high-IQ society, with the lowest-IQ brand story, MENS. Now if both an AI and I write these tests, we can figure out who is
intelligent based off our answers, and who is more intelligent
by comparing our scores. I was originally planning on building and training my own AI for this project. But then, I remembered how all of my other AI projects have gone. (pieces clattering)
(static reverberating) This machine just
prioritizes the lives of cats. (soft repetitive music) I just don't think that anything that I'm going to be able to
build on my own is going to be a fair representation of what is out there in the world right now, the finest of AI. So I'm gonna need a bit of help. Luckily, there's this
video's sponsor, Fiverr. In a world that is changing fast, Fiverr freelancers are
adapting even faster. With services ranging
from developing AI models, to generative content
fact-checking and editing, Fiverr has expert
freelancers who can help you harness AI and take it to places that technology can't go on its own. Now I needed to make
sure that I was competing with a machine at the top of its game. So, I needed help from a pro. Now I'm not gonna lie,
it is a little bit scary sharing a project that
I care so much about with somebody new. I wasn't sure if they would get it, or if they would even be able to help. However, a lot of those fears went away after I got to speak with freelancers like Sasha, Thomas, and Ahmed. They asked these really great questions, that helped me figure out what I actually wanted the AI to do, and they made sure that I
was spending my money wisely. Eventually, I decided to
work together with Thomas, not only because he had some great ideas on how to build the project, but also how to approach this video's big question. - Building these models,
we're learning a lot about the nature of intelligence. The best way to learn is to
build it and see how it works. - Thomas and his team either
found or built the options that best fit my project
needs, budget, and timeline. Which is a great reminder that human input is key to any AI project's success. He put all that together, to make this. This is my favorite part. It has night mode,
didn't even ask for that. It uses text-based GPT-4 and
multimodal MiniGPT-4 APIs to parse questions and provide answers. We also worked together
to add some features that would meet my specific needs, like a test archive and CSV exports. It was really great being
able to leave a project that I cared about in capable hands, so that I could focus on
making the rest of the video. So if you are looking to
implement AI into your work or need some human eyes
on AI-generated content, I really recommend working
with a Fiverr freelancer. Check out this link in the description, to find incredible AI services and more. And be sure to use the
code "ANSWERINPROGRESS," to get 10% off. Now all I need to do is feed
the questions into the AI and let it run, while I write
the tests somewhere else. Because of academic integrity. I would definitely look at
the AI and cheat off it. (Sabrina squeaking) So, I got printing. (bright curious music begins) (printer clattering)
Does it work? (paper rustling)
Come on! Why are you? Ope, it's blank again. What is going on? Oh it's out of ink. What? $40! I got printing. (curious music continues)
And writing. I think I forgot how to read. And printing, and writing. Epicurus conceives of death
as which of the following? Isn't that a recipe website? And printing, and writing. Identify what is silly about this image. It's birds. And printing, and writing. Michigan grad, known for
running multiple companies in software and tech, all
around genius, first name Elon. I also did the ARC challenge on my iPad. All right, nine hours and 679
questions later, I am done. Finished. (quiet scream) There's nothing left in me. I thought that I was gonna have fun, but I guess it was just amnesia because it's been years since I graduated and I forgot how much
I hate writing tests. They're terrible. Oh my God. But luckily, now that I am done, and it looks like the AI is too, I never need to look at these again. (envelopes thumping) - [Narrator] That was when she realized that the tests still needed to be marked. - I do need to look at these again. (phone ringing)
- Hello? - Hey, buddy. What are you doing tomorrow? - [Melissa] Why? - [Sabrina] So, I got some
friends together to save me from marking all of these tests alone. - Thank you, thank you. - [Sabrina] It was a
deeply humbling experience. - Wait, so do I? - Egregiously wrong, hah! - That can't be right. - I'm so worried, that everyone's going to realize that I'm stupid. - The conditions are caused
by ingesting aspartame, tummy hearty. - Nope.
- Nope. - Identify what is silly or
impossible about this image. She writes chickens not green. - 13 C.
- Ooh. - 14 E.
- Okay. - When I said I'm gonna
see if they can tell if I'm an idiot, I thought I was joking. Thank you guys. - Great job.
- I'll be taking this. - Really good stuff.
- And I'll be going home now. - Next time.
(background laughter) - The results are in, the outcome after writing all of these tests. It's exactly what I
expected, I'm an idiot. I'm just gonna say it, this has been a deeply humbling experience. For example, when measuring adaptability, it seems like I adapted a little too well to misinformation, because I just believe a whole bunch of stuff that isn't true. Did you know that nuclear reactors are supposed to be critical? I didn't. I got 35% on that test. And I didn't do much
better when it came down to deriving information or understanding. Are you concerned that this is supposed to be an educational channel? Because I am. In fact, my only redeeming quality is the fact that I did do pretty well when it came down to reasoning
and learning from experience. Still a C though. And before you say that this is not how you're supposed to mark IQ tests. Trust me, I know. I read the whole administration guide. However, if I mark the AI the same way I think that the
comparison is still valid. Speaking of which, how did the AI do? Well for adaptability,
deriving information, and understanding, it
basically doubled my scores. Wow! And I would admit defeat,
if it wasn't for the fact that these are the reasoning and learning from experience
scores, 13% and 4%. No I did not miss a digit, 4%. That's terrible. Just to get a sense of how badly it did, check out this example question. You're supposed to identify what is silly or impossible about this image. I'll give you three seconds. The ice isn't floating, that's it. But what did the AI say? The image shows a glass of
water with ice cubes in it, which is not possible
as ice melts in water. Ice has never been in water. But hey, the language model that this AI is using is kind of small. It might just know less things. So I decided to also ask Bard, Google's large language model. And you know what it said? The thing that is silly or
impossible about the picture is that it shows two ice cubes
floating in a glass of water. No, it doesn't. Anyway, this is impossible because ice cubes cannot float in water. Google! But hey, maybe it doesn't
know about the physics. Oh wait, ice is less
dense than liquid water so it will always sink to the
bottom of a glass of water. That's not true. So now you might be able to understand why I'm a bit confused. I didn't know who the winner would be, but I figured that there
would be a clear winner. And there just isn't. If we consider the big brain spectrum, I exist right about here, just below mid. Importantly, I am consistently below mid. On the other hand, the AI seems
to exist at these extremes. Sometimes it is brilliant and it knows things that I didn't know. Like nuclear reactors are
supposed to be critical. What the heck is up with- But other times, it just
doesn't know what's going on. So in order to answer our big question and figure out if AI is intelligent, and more intelligent than us, we need to figure out why this difference exists. (bright upbeat music resumes) So I did a little digging
over the next few days. This was supposed to be an easy video! Write a couple of tests with an AI, see who scores higher, get views! And now, tell me why I am three weeks in, reading about the nature of intelligence? This is a really good book though. But I've emailed half
a dozen psychologists and AI researchers to get their thoughts, listened to hours of podcasts,
and read so many articles. I even spoke to one of the only people who I am confident has
seen more test questions than I have at this point. - Hi, I am Toby. I make YouTube videos
on the channel Tibees. And one of my series of videos includes one called Unboxing exams, which is inspired by tech unboxing videos. But instead of showing
off a cool new phone, I'm showing off exam papers. When I look at exams, I sort of have one thing in mind, which is that the difficulty of the exam is very related to I guess, how similar the exam might be to past exams or homework questions. - [Sabrina] In machine
learning, that basically means a difference between your training data and your testing data. If those things are
too similar, you may be measuring memorization over understanding. - I think it might be the
same thing for computers. That the hardest, most
difficult questions, are ones that are surprising or very different to the training data. - So, to understand the
AI's test performance, we should know what's
inside of the training data. And for ChatGPT, the answer is a lot. AI researchers are pretty close to exhausting all
high-quality data available. And this might be decreasing the distance between training and testing
data, with evidence suggesting some of ChatGPT's performance
is more memory than mastery. But this does explain
the mysterious extremes of the AI's test performance, right? It did super well on tests that rely on recalling background
information and knowledge, probably because it was
trained on those tests and answer keys, or
something very similar. On the other hand, it tanked on tests that were purposefully made
to ask novel questions. - The Mensa IQ tests, they're designed to be very surprising,
and they're designed to be something that you can't study for, even though I think studying for them can give you a slight advantage. They're designed to be
such that you have to at least be good at
spotting novel patterns. - So to resolve the AI's
intelligence inconsistency and make it totally smarter than me, do we just need more training data? Like it doesn't matter if we're
only measuring memorization, if the AI memorizes everything, right? Well, feasibility aside, this approach also leads to a long-tail problem. Imagine a machine that is very good at sorting objects by color. One day, a new object appears. The machine was never
trained for this situation, so it can respond in a bunch of ways. But importantly, since this was outside of the training process, we don't know which option it will take. If it makes a bad choice,
the object may be destroyed. But hey, we can now
account for that situation in the future, and we'll be good. Until a new object appears
and the problem repeats. And because we don't
know what we don't know, these problems can catch us by surprise, making it really difficult
to trust the system, especially when stakes are high. So if the key to AI intelligence isn't just more training data, what is it? (laptop clattering) - A stupid student can still ace the test, if they cram for it. They memorize a hundred
different possible mock exams, and then they hope that
the actual exam will be a very simple interpolation
of the mock exams. And that student could just be a deep learning model at that point. But you can actually do that without any understanding of the material. And in fact, many students, pass their exams in exactly this way. - That's right.
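The cramming student described here can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not anything shown in the video: the helper names `make_memorizer` and `score`, and the made-up "doubling" questions, are hypothetical. It shows a pure memorizer acing an exam that overlaps with its mock exams and failing one made of novel questions.

```python
# Hypothetical sketch: a "student" that only memorizes mock exams.
def make_memorizer(mock_exams):
    """Return an answer function backed only by memorized question/answer pairs."""
    memory = dict(mock_exams)

    def answer(question):
        # Seen question: recall the stored answer. Novel question: returns None.
        return memory.get(question)

    return answer


def score(answer_fn, exam):
    """Fraction of exam questions answered correctly."""
    correct = sum(1 for question, expected in exam if answer_fn(question) == expected)
    return correct / len(exam)


# Made-up data for illustration: every question asks "what is double n?"
mock_exams = [(n, 2 * n) for n in range(1, 6)]
familiar_exam = [(2, 4), (3, 6), (5, 10)]    # overlaps with the mock exams
novel_exam = [(7, 14), (11, 22), (40, 80)]   # never seen during cramming

student = make_memorizer(mock_exams)
print("familiar exam:", score(student, familiar_exam))
print("novel exam:", score(student, novel_exam))
```

A student who had actually learned the rule (here, `lambda n: 2 * n`) would score the same on both exams, which is the gap a deliberately unfamiliar test is meant to expose.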
- And if you want to avoid that, you need an exam that's unlike anything they've seen, that really probes their understanding. - That was a clip from an
interview with Francois Chollet, where he explained the philosophy behind the test that we used
for learning from experience. The test where the AI only
got 15 questions right out of 400. Much like IQ tests, the ARC Challenge basically exploits that long-tail problem, by constructing rare problems on purpose. In AI, the ability to solve those problems is called few-shot learning. And even the most impressive machines are terrible at it, compared to humans. That is largely because it is just a technically difficult thing to do. But also, because it turns
out we haven't really tried. You see, AI is evaluated using benchmarks, standardized tests that
focus on a specific task, like labeling an image or
predicting the rest of a sentence. While it would be nice
if the AI was also able to generalize out of that task, or learn a new skill from only a few examples, it isn't necessary for success. As a result, AI development
has been a bit lopsided, favoring skills that benchmarks value and neglecting the ones they don't. It's kind of like how some
people are really good at school and passing
tests, but they just lack any critical thinking
ability or common sense. You know the type of person
I'm talking about, right? Don't say me. But you can't just throw
another book at that person. Instead, they basically need
to develop a whole new skill. Interestingly, that is
basically the difference between crystallized
and fluid intelligence. Which I would expand more on, except Veritasium made that video, while I was making this video. Pure coincidence! I was so worried that
we were gonna get MatPat New York City Pizza'd all over again. Niche reference for the
people who watch the channel. But anyway. As we are literally running out of data and the long tail problem looms,
researchers, it looks like, are starting to focus
on this lopsidedness. OpenAI is trying to improve predictability and performance, for
models with fewer resources. And the ARC Challenge is
offering a massive prize pot for AI that do well on the ARC benchmark. But for now, you should know that this gap in AI
intelligence is non-trivial. AI is both more intelligent
than me and less. Where it succeeds has some really valuable but specific applications,
like making the internet more accessible through
quality captions and alt text. But where it fails is a good reminder that AI still has a long way to go before it can truly outthink us. I hope you liked that video. Thanks to everybody who helped me make it, like Tom, Toby, and Thomas. And thanks again to Fiverr
for sponsoring this video. Be sure to check out this
link in the description, to see the AI services available to you and get 10% off using the
code "ANSWERINPROGRESS." But either way, have a lovely day.