Back in 2019 some AI researchers and
mathematicians came up with an ambitious challenge, to create an AI that could win a
gold medal at the International Mathematics Olympiad. It was an ambitious challenge because
winning a gold medal at this event puts you on the world stage as having one of the best
mathematical minds. Previous winners of the gold medal include Terence Tao and Maryan
Mirzakhani. The researchers called it the IMO Grand Challenge and for an AI system
to pass they proposed the following rules. Each proof that the AI producers must
be checkable in 10 minutes which is approximately the amount of time it takes
a human judge to judge a human solution. The AI only has as much time as
a human competitor which is four and a half hours for each set of three
problems and the AI must be open source released publicly and reproducible.
The AI cannot query the internet. As of recording this we have met chat GPT and
GPT-4 was released a couple of months ago. I'm sure that we'll be continually amazed by what it's
able to do but no AI including ChatGPT has yet won or even competed in the IMO competition. Bets are
on literally for if and when one might succeed. GPT-4 is paraded around as having already aced
many other exams including the SAT and the biology Olympiad but it might never ace the IMO because
the thing is that ChatGPT is actually not very good at math. It's not good at counting or keeping
track of multiple operations because after all it is a language model that excels at predicting
the next word in a sentence. Math questions on the SAT can be quite predictable and formulaic
and likely enough similar problems were included in the training data set that it was able to
get by but the IMO problems are different and they're designed to test true understanding and
creative problem solving. I'd like to understand the likelihood of an AI passing this challenge
soon but in order to do that I would first need to understand the solution to an IMO problem myself
on human terms then I would like to understand how ChatGPT goes about solving it and then see if
there are any better AI competitors out there. Here's an IMO problem. This is the Olympics for
math so these problems are some of the hardest out there and can require creative solutions.
This is problem six from the 2022 competition: Let n be a positive integer. A Nordic square is an
n times n board containing all the integers from 1 to n squared so that each cell contains exactly
one number. Two different cells are considered adjacent if they share a common side. Every cell
that is adjacent only to cells containing larger numbers is called a valley. An uphill path is
a sequence of one or more cells such that the first cell in the sequence is a valley. Each
subsequent cell in the sequence is adjacent to the previous cell and the numbers written in the
cells in the sequence are an increasing order. Find as a function of n the smallest possible
number of uphill paths in a Nordic Square. This is an example of a Nordic Square where n
is equal to two. Here one is our valley and we could have six possible uphill paths; one to
two, then one to two to four which is counted as a different uphill path, you could do one to
three or one to three to four and we also have to include the path of length one which is just the
valley. We could have placed these numbers in the squares in any the order it didn't have to be in
this pattern it could have gone one, two, three, four, and in fact you should play around with it
now trying Nordic squares of different sizes and different arrangements of the numbers to see what
kinds of patterns might give you this smallest number of total uphill paths. If you want to try
it for yourself now is your time to pause. I don't think it's possible for a human to look at this
and know the solution instantly you really do have to play around and try a bunch of random things to
see if any patterns pop out. Here's an example of a three by three Nordic Square which looks like
it might have the minimum paths for the size. Again we only have one Valley and it is our one
but we could also have one two, one two six, one two seven, one three, one three eight, one three
seven, one five, one five nine, one five six, one four, one four eight, one four nine, giving a
total of thirteen path. You can easily find other three by three arrangements that have more paths
than this and the crux of the solution comes down from recognizing two things; one is that for the
minimum number of paths you want there to be only one valley, like we have here our number one. And
you also need to recognize that for every pair of adjacent numbers there needs to be only one path
back to the valley. Say here two and six are adjacent numbers and from them there was only one
path back down to our valley which was like this. This essentially means you can have two
paths coming to a common endpoint but you can never have this two paths converging
and then continuing on because that would mean that from wherever this ends these
two adjacent numbers will have multiple paths back down to the valley so this is not
allowed and we can call something like this, our end point, a sink. With this we've
given ourselves the golden ticket, if every two adjacent numbers only have one path
back to the valley then the minimum number of paths in our Nordic Square would be equal to the
number of adjacent numbers. Counting the pairs of adjacent numbers here we have one, two, in
each row and there's three rows so that would be two times three, that's six adjacent
numbers that are adjacent horizontally. Plus one, two, three, four, five, six, going vertically
so there'd be 12 pairs of adjacent numbers. That's our number of minimum paths but we have to
remember to add one because our valley, our single valley will also be its own single length path
giving 13 as the total minimum paths in this case. To extend that for any number of n we will always
have n minus 1 adjacent numbers on each row times n rows and n minus 1 adjacent numbers on each
column times the number of columns so that's 2 n times n minus 1 paths plus one to account for
our valley. That's basically the answer here and that will give you some points but more points
will come from new harder task of describing how to arrange the numbers to get this minimum and
that part of the solution comes from describing a tree-like structure like this. If you put the
numbers in increasing order in these trees you end up with only one Valley and for every adjacent
pair of numbers only one path back to it. The black numbers here represent numbers higher
than 153 so you fill in all these ones first and then go back and fill in the black spaces
with the higher ones. These black numbers are sinks and there might be multiple ways to end
at the sink but there'll never be a way out. Let's give the problem to ChatGPT and see what
it does. The 2022 problem I picked wouldn't be in the GPT training data set. I'm using the latest
model GPT-4 and it gives me the wrong answer. It says that the smallest number of paths is equal to
n. I asked it how many valleys there should be to minimize the number of paths and it immediately
gives a square with three valleys. It says that the minimum is n valleys which gives n paths and
if you remember recognizing that there must only be one valley is key to getting the solution. I
asked it to make a Nordic square with only one valley which it did but then it failed to count
more than one path and when I asked it to do it for larger squares it gave constructions that for
each adjacent pair there were multiple paths back to the valley and it just couldn't count
paths correctly even when I tried to point out mistakes. I don't think it would be able to
score any points on this question at the moment. This recent paper from Microsoft analyzed the
abilities of GPT-4 and said there were sparks of artificial general intelligence which
refers to AI that is broadly intelligent at or above human level and not just good at
one specific task. They mentioned that ChatGPT is trained on a massive amount of webtext
data and uses at its core a self-supervised objective of predicting the next word in a
partial sentence. The paper also shows that GPT-4 does not have the capacity required to
conduct mathematical research. It does manage to solve one IMO question in here but in many
other problems it lacks critical reasoning and the ability to examine each step of its
arguments. They say one reason for this might be that in most of the math solutions
it's read on the internet, most just give the correct answer but don't capture the thinking
process. It doesn't make guesses or backtrack because it can't backtrack. When predicting
the next word it just needs to move forward in a straight line. In essence it's bad at math
because it lacks the ability to play around. So ChatGPT fails for now but it's not the
only competitor. The people at OpenAI, in fact some of the same people who came up with
the IMO Grand Challenge in the first place, work on a different AI system. One that according to
this paper published in 2022 has already managed to solve some IMO problems. It's not a language
model but instead is a proof-solving model that speaks the language of formal math. Automated
proof solvers have existed for a little while but this one is trained to iteratively search for
new proofs. It takes mathematical ideas that need to be proved and breaks them down into smaller and
smaller statements that are easier to prove until there's nothing left to prove. This is what the
formal math language looks like, it's not too easy for a human to read but it's probably our best
bet when creating a system that is actually useful to mathematicians. It uses something called the
lean theorem prover. The proofs that it creates are machine checkable, following a series of
logical laws. They say that their model and search procedure are capable of producing proofs that
chain multiple non-trivial reasoning steps and it does require some creativity to invent lines like
this. Combining an AI that can speak formal math and correctly prove statements with something like
ChatGPT to make it more user-friendly could really make some waves in math and much more likely
to pass the IMO Grand Challenge than a language model alone. And as I mentioned while ChatGPT is
still struggling with IMO problems it has already passed a bunch of exams. There's a note here back
in the Microsoft paper saying that GPT-4 does not rely on memorizing the exact problem statement
but on applying a general solution method. While it is possible that GPT-4 memorizes the solution
template this is not necessarily a flaw as it is also a common way of solving math problems for
humans. They're saying that GPT-4 has been able to pass a bunch of exams by doing what many of us
do which is kind of memorizing the structure of common problems. Looking through past papers and
solutions can often give you enough of what you need to do well in an exam because the structures
are usually pretty much the same and if this is what ChatGPT is so good at too then perhaps
this is the space in which exams will change a bit. Perhaps for now exams will have to change
to be a little bit more like the IMO to reward creative problem solving and necessitate having
to stop and play around with the problem. That seems like it is still a uniquely human trait but
I'm not sure for how long that will be the case. Thanks for watching this video and
thanks to my Patreon supporters for making it possible. A special shout out to
today's Patron Cat of the Day, Cathulhu. [Music]