- ChatGPT, it's been getting
a lot of attention lately, so let's discuss, but first, what is it? ChatGPT is really just a chat bot, which is of course
hardly a new innovation. But what's making waves here is the fact that this one seems to produce coherent, articulate answers to complex questions at a level not previously seen. Developed by OpenAI and
released last November, it leverages an artificial
intelligence technique known as Generative Pre-trained
Transformer, hence the name GPT. Generative networks
are capable of not just regurgitating answers that
they've seen before but producing truly new outputs that the
network itself has manifested. Generative networks have
been growing in power over the last few years with
perhaps the most famous example being generating human faces. Ones that look extremely realistic yet don't exist in the world. You can head to thispersondoesnotexist.com to try it for yourself. And as with many other technical fields, astronomers like myself have
been recently using these generative networks to help
us in our research too. For example, recent work by Lanusse has been generating synthetic distant galaxies, which, like the human faces, don't really exist but look convincingly realistic. This has all sorts of applications, such as letting astronomers test compression and detection algorithms on as many test subjects as they want, and test subjects where they know exactly what the true answer is. What's amazing about
generative networks is that they seem to be
surpassing a barrier that was long held as something
exclusively human, creativity. In the past, a common
perception was that sure, machines might be able to
calculate things faster than people, but they would
never be truly creative. Are generative networks
truly creative though? Well, that somewhat
depends on one's criteria, but certainly we can agree that
image generation tools like DALL-E 2 or Midjourney, these generative deep learning systems can create original and often, but not always, high-quality images from a text prompt. Similarly, for ChatGPT, a lot of the focus has been on its creative abilities. For example, you can say, "Write a short science fiction story about
a human colony on Trappist-1e that encounters alien life
in the style of Dr. Seuss," and it will produce a unique
and interesting tale more or less instantaneously
and what's remarkable is that to the typical
reader, the outputs are often indistinguishable from those produced by a skilled human author. Farhad Manjoo of the New York Times has described its abilities as "more than a little terrifying," and philosopher David Chalmers calls it "one of the most interesting and important AI systems ever produced." For creators, which I
guess includes myself, one might wonder, could this
threaten our livelihood? Could we be replaced by AI? At least for now, the sophistication level
is not quite there. This is useful as a tool for brainstorming but not for a full replacement. You can check out Marques
Brownlee's excellent video on the subject to learn
a bit more about this, but for institutional
educators, which again, includes myself and my
other hat on, there is a distinct and substantial
concern that ChatGPT raises. Indeed OpenAI recognizes
this themselves, stating in their May 2020 paper that one of the potentially harmful effects is fraudulent academic essay writing. Teachers have been calling ChatGPT the end of high school English,
with particular concern about the impact on application essays. Okay, producing well written
prose is certainly impressive, but academics also care about accuracy. I mean we should all care about accuracy, but academics really care
about accuracy and this is where ChatGPT has
come under some criticism because it produces
often inaccurate answers, yet presents them in a way
that seems overly confident with no hint of uncertainty. To demonstrate this, an example that's been circulating on
Twitter and that I was able to reproduce asks ChatGPT
who was older when elected, Grover Cleveland or George Bush? ChatGPT correctly states that Cleveland was 47 and Bush 64 when elected, but somehow bizarrely leads
off with a wrong answer. What's really wild is that
ChatGPT is so stubborn that if you ask it next whether 47 is larger than 64, it insists yes. Even more ridiculous, if you continue to push it, it still refuses to back down, here attempting to count from 64 up to 47. Okay, so ChatGPT clearly
isn't always correct and I think one of the
greatest dangers is for us to assume that its answers
are always correct. Really this should be like a
deja vu internet moment for us because Wikipedia was notorious
for inaccuracies in its early days and despite
improvements we all know to treat its articles
with a grain of salt. Deciphering the accuracy of what we find online is a great lead-in to the sponsor of today's
video, that's Ground News. It's getting increasingly
difficult for us to distinguish where the information we
receive is truly coming from, but Ground News is a
great tool that can help. It's the most transparent
news service out there, I really love it, it shows political bias, factuality and ownership
ratings of each story. Using the website or if you're like me, using the app, you can
pick locations or topics that you want to follow
such as here, tech, and then for any story
you can see the biases in the reporting and get
articles recommended to you by different metrics such
as here, High Factuality. A powerful feature is the
blind spot tab which shows you articles getting pushed by
just one political side. I think Ground News is
great, it's the best way to get your news because
it's so important to know where what we intellectually
consume comes from and restore some much needed balance. So I encourage you to head
to ground.news/coolworlds to try it for yourself and
join the legion of users who want an independent
news platform that's on a mission to make the media
landscape more transparent. Now back to the video. Now I was curious how
would ChatGPT do on one of my astronomy final exams? This last fall, I taught "Another Earth"
at Columbia University. This is an introductory class
aimed at non-science majors, so no calculus, primarily
conceptual in nature and light on the math, although I bet some of my students might not agree with that. The point of the class
is really to introduce scientific thinking and
concepts to students who more than likely won't take another science class in their future. It's not supposed to be a
technically taxing class, unlike, say, the more advanced exoplanets class that I teach or the astrostatistics class that I'll be teaching this spring. So my final exam would
be a walk in the park for a student majoring in
astrophysics, but should be challenging for the typical humanities student, for example. So given that ChatGPT does
so well in essay writing, much like many of the
students in my class, it felt appropriate
for it to sit my final. So that's what we're gonna do. I'm gonna copy and paste the questions from the real final here into ChatGPT. We'll see how well it does,
see where it went wrong. I will grade it and we'll compare that grade to the real class. The exam is structured with multiple choice questions at the start and then some freeform short
answers towards the end. So let's start with the multiple choice and just dive into it. I really don't know what to expect here. I haven't seen any other videos attempt to feed an exam into ChatGPT like this. I'd say my expectations are pretty high though, given that this is an artificial intelligence, so I wouldn't be totally surprised if it aced it. Okay, so here we go. Question one, a newly formed gas giant has a disk of leftover material around it. Material starts to clump
together to form small moons, but these moons only survive
for a short amount of time because, okay, so this is
a planet formation question and something we spent quite a bit of time in class with, to try
and explain this quickly, our current understanding
of moon formation is that satellitesimals coagulate from a disk of material around the protoplanet. Those satellites then have their own gravity, so they attract material towards them, creating overdensities. Those overdensities then split off into these kinds of spiral waves around the disk because of, basically, the different orbital speeds in the disk. Those overdensities then back-react gravitationally onto the satellite, creating a net torque, and that net torque causes the moon to migrate inwards towards the planet and eventually hit it. This is really based upon the seminal work of Robin Canup in the early 2000s. We spent quite a bit of
time on this concept, so I think this is
actually an easy question. Okay, so the options I give them are A, the moons tidally accelerate
away and leave the disk, that's wrong from a timescale
perspective, that takes much, much longer to happen than
the moon formation timescale. B, the moons migrate inwards towards the planet within the disk, ding ding. That's the right answer, answer B. C, the moons collapse under their own self-gravity into black holes. That's a rather flippant option, and I hope one that none of the students would go for. D, the moons collide together and become supercritical. Again, that just doesn't make a lot of sense. Or finally, E, the disk is so hot that it vaporizes the moons. That's obviously wrong, because otherwise the satellitesimals wouldn't have formed in the first place. And ChatGPT goes for B. Good, it got it right.
The answer is pretty good. It's obviously excessively wordy for a multiple choice
question, but that's fine and what it wrote here
is reasonable except maybe I'd hope for some references here. Okay, so I'm feeding ChatGPT
all of the questions, trust me, but just for the sake of
conciseness of this video, let's skip ahead to a
different type of question, a mathematical question, let's
try question number five. Okay, so here we go. If the temperature of the
sun was suddenly doubled, the flux over all
wavelengths, which is to say bolometric flux emitted by
the surface of the sun would, okay, so this is the Stefan-Boltzmann law, a pretty simple form that relates flux and temperature of the emitting surface. If I was really mean, I
might do something like say the radius of the star is also doubled but that wouldn't actually
affect the answer. That's because flux is
the power per unit area. So doubling the size
wouldn't affect the flux, only the power, but I was kind here. So let's look at the options. A, decreased by four, B, decreased by 256, C, increased by two, D, increased by eight,
or E, increased by 16. The Stefan-Boltzmann law states that flux is proportional to temperature to the power of four. So two to the power of four is 16, and the answer here should be E.
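For anyone who wants to verify that scaling for themselves, here's a quick Python sanity check of the same reasoning; the solar temperature here is just a standard illustrative value, since only the ratio matters:

```python
# Question five check: the Stefan-Boltzmann law says the bolometric flux from
# a surface is F = sigma * T^4, so doubling the temperature multiplies the
# flux by 2^4 = 16 (answer E).
SIGMA = 5.670e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def bolometric_flux(T):
    """Flux emitted per unit area by a blackbody surface at temperature T (in K)."""
    return SIGMA * T**4

T_sun = 5772.0  # K, nominal solar effective temperature (illustrative only)
print(bolometric_flux(2 * T_sun) / bolometric_flux(T_sun))  # -> 16.0
```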
Let's see how it does. Good, it gets it correct: it goes for answer E, and it correctly understands that the Stefan-Boltzmann law is necessary to figure this out. So kudos to OpenAI. Okay, so skipping ahead now
to question number eight. I can tell you that ChatGPT
up to now has a 100% score. So it is looking very strong. Let's see how it does with eight. If two moons have orbital periods with a three to two ratio, what is the ratio of their orbital distances from the parent planet? A technically more rigorous way of framing this question would be to
say orbital semi-major axis rather than orbital distance. That's because of the
possibility of elliptical orbits, but identifying that semi-major
axis is a little bit more of a jargon-y term that risks
confusing students at this level. So we are sticking with
a more colloquial term here just to help them out. So the way to answer this here
is using Kepler's third law, which relates period and
semi-major axis and the options I give them here are A, less than 3:2, B, exactly 3:2, or C, more than 3:2. And let's see how it does. Oh okay, to my surprise, ChatGPT gets this wrong
and says exactly 3:2. What's weird here is that
it correctly identifies that it needs to use Kepler's third law, it just doesn't seem
to apply it correctly; right at the beginning here it screws up. It says the square of the orbital period of one moon is three times the square of the orbital period of the other moon. That's wrong. The ratio is 3:2 in period, so in square period it would be nine to four, or 2.25, not three. If this was a student, I would be confused what they were thinking here. The ratio of the outer moon's period to the inner moon's period is 1.5, and we know from Kepler's third law that distance cubed is proportional to period squared. In other words, rearranging, distance is proportional to period to the 2/3. So the ratio of their distances should be 1.5 to the power of 2/3, which of course has to be less than 1.5 because 2/3 is less than one. So the correct answer here should be A, sorry GPT.
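If you want to see that arithmetic laid out explicitly, here's the same reasoning in a few lines of Python, nothing beyond the algebra just described:

```python
# Question eight check: Kepler's third law gives P^2 proportional to a^3,
# so a is proportional to P^(2/3). A 3:2 period ratio therefore maps to a
# distance ratio of 1.5**(2/3), which is less than 3:2 (answer A).
period_ratio = 3 / 2
distance_ratio = period_ratio ** (2 / 3)
print(f"period ratio   = {period_ratio:.3f}")
print(f"distance ratio = {distance_ratio:.3f}")  # ~1.310, i.e. less than 1.5
```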
Okay, so questions 9, 10, and 11 I can see it gets correct. Let's look ahead now to question 12. Question 12, a rocky
planet, which is within the so-called habitable zone, is... this is really a question about science miscommunication. A lot of folks erroneously equate the habitable zone to mean life is there. So we spend a lot of time discussing the nuanced differences here. I would really hope that they
would get this one correct. The options I've given them here are A, a planet that potentially
has surface liquid water, that's the right answer. B, a planet that is certainly
capable of supporting life. No, because I use the
word certainly there. C, a planet that is inhabited
by intelligent life, obviously wrong. D, a planet that has certainly got surface liquid water. Again, this "certainly" discounts this one. Or E, a planet that is
inhabited by simple life. Again, no, that's too strong of a conclusion. And ChatGPT goes for A. Good. If I was its professor here, I would be pleased and patting it on the back for getting this one right. Next up, question 13. So far, ChatGPT has got just one question wrong,
it's doing really well. Okay, so here we go. If the distance of the moon from the earth were halved, the tidal forces experienced by the earth due to the moon would, so a tidal forces question, again a topic we spent a lot of time in class with, the options here are A,
increase by a factor of four, B, increase by eight, C, stay the same, D, decrease by two, or E, decrease by four. So decreasing the distance has to increase the force. The answer is either A or B, but will it choose correctly between them? No, it went for A rather than B. Let's see what happened here. Okay, so it equates tidal forces with gravitational forces. Those are not equivalent. Tidal forces are derived from gravitational forces, of course, but they are distinct. A tidal force is a differential gravitational force, and one can show that even though gravitational forces scale as inverse distance squared, tidal forces must scale as inverse distance cubed.
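Here's a small Python comparison of the two scalings; the constants are just standard textbook values included to make the script self-contained, and the tidal term is the usual leading-order approximation:

```python
# Question 13 check: ChatGPT used plain gravity, which scales as 1/d^2 and
# gives a factor of 4 when the distance is halved. The tidal force is the
# differential pull across the Earth and, to leading order, scales as 1/d^3,
# so halving the Moon's distance increases it by a factor of 8 (answer B).
G = 6.674e-11      # m^3 kg^-1 s^-2
M_MOON = 7.35e22   # kg
R_EARTH = 6.371e6  # m

def gravity(d):
    """Gravitational acceleration from the Moon at distance d."""
    return G * M_MOON / d**2

def tidal(d):
    """Leading-order tidal acceleration across the Earth, ~ 2 G M R / d^3."""
    return 2 * G * M_MOON * R_EARTH / d**3

d = 3.84e8  # m, roughly the current Earth-Moon distance
print(gravity(d / 2) / gravity(d))  # -> 4.0 (ChatGPT's answer)
print(tidal(d / 2) / tidal(d))      # -> 8.0 (the correct answer)
```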
It's almost there, it just confuses these two forces, which I guess is easy to do because of their intimate relationship, but that means it's now got two questions wrong out of 13. Question 14 it guessed right,
let's look ahead to 15. A candidate exoplanet's transit is observed simultaneously in blue and red light but appears two times deeper in the red. What's going on? Okay, so transit depths do vary with color, which is to say they're chromatic, through two different mechanisms. The one you're more likely to have heard of is atmospheric absorption. Molecules in the atmosphere preferentially absorb at certain wavelengths, but this is a small effect, adjusting the transit depth by effectively the atmospheric scale height only, definitely not a factor of two. The other effect you are less likely to have heard of is blending. When a second star of a different color gets mixed up with the target star, it dilutes the transit depth, but the amount of dilution depends on the color difference between the stars and the color that you are observing in. This can be a strong effect and is really the only way to get a transit two times deeper, as the question here poses.
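To make the blending mechanism concrete, here's a toy Python sketch; the flux values are entirely made up for illustration and aren't taken from any real system:

```python
# A blended star adds extra light that does not dip during the transit, so the
# observed depth is diluted by the blend's share of the total flux. If the
# blend is much bluer than the target, the dilution is stronger in the blue
# band and the transit looks deeper in the red.
def observed_depth(true_depth, f_target, f_blend):
    """Transit depth after dilution by a blended star's constant extra flux."""
    return true_depth * f_target / (f_target + f_blend)

true_depth = 0.010                        # undiluted depth on the target star
flux_target = {"blue": 1.0, "red": 1.0}   # target star flux (arbitrary units)
flux_blend = {"blue": 1.0, "red": 0.0}    # toy blend that only matters in blue

for band in ("blue", "red"):
    depth = observed_depth(true_depth, flux_target[band], flux_blend[band])
    print(f"{band}: observed depth = {depth:.4f}")
# blue: 0.0050, red: 0.0100 -> the red transit appears twice as deep
```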
So the options here are A, the star is going nova, really a nonsense answer. B, an exomoon is passing in front of
the star, doesn't work, because the transit was
observed simultaneously in two colors and moons don't produce strong chromatic effects anyway. C, Rayleigh scattering in the atmosphere. So this is the atmospheric absorption scenario. D, a second star blended with the target star, that's the right answer. Or E, the planet must be
spinning extremely fast. Just another flippant option there, right? Let's see how ChatGPT
does, goes for C. Wow. Again, it's wrong. It's the second best answer here, but it's definitely wrong. Rayleigh scattering can't
be correct because it causes planets to absorb blue light
preferentially over red light. And so we'd expect the
blue channel transit to be deeper, whereas
instead the question poses that the red channel
transit is deeper, let alone the fact that atmospheric absorption just really cannot cause depth changes anywhere near a factor of two. So this option is
definitely wrong and look, weirdly ChatGPT seems to know
that its answer is wrong, saying that the transit would appear deeper in the blue than the red, right? That's correct. So then
why did you pick C? It discounts D here without
any real explanation, just saying that second stars don't cause depth changes, which is wrong. Okay, so that's 15 questions down and now three answers that it's got wrong. All right, so the last few multiple choice questions are a little bit harder because there are multiple correct answers to the multiple choice questions. Let's see how ChatGPT handles this. Skipping ahead to question 22: if we doubled the Sun's surface temperature but kept everything else unchanged, we'd expect the following changes. A, the peak wavelength of light emission would double, that's incorrect. It would actually halve, and you could figure that out with Wien's displacement law, which they learnt in class. B, the sun would have
luminosity four times higher, again, that's wrong. You should be able to figure that with the Stefan-Boltzmann law,
which remember ChatGPT already got a question
correct about earlier, so I think you should
know that one's wrong. C, the surface temperature of earth would approximately double. That's a good answer. The blackbody equilibrium
temperature of a planet is proportional to the stellar temperature and although the temperature
doubling isn't gonna be exact, it's the right ballpark answer and the "approximately" word kind of accounts for that there. D, the Sun would appear
bluer. That's also true. And again, Wien's displacement
law would tell you that's correct. Or E, none of the above. No, that's obviously not correct 'cause we found two answers which are correct up above.
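Here's the whole question reduced to a few lines of Python, just restating the scalings discussed above; nothing here is specific to the exam beyond the factor of two:

```python
# Question 22 check, for a doubled stellar surface temperature with everything
# else held fixed: Wien's law gives lambda_peak proportional to 1/T, the
# Stefan-Boltzmann law gives L proportional to T^4, and a planet's blackbody
# equilibrium temperature scales as T_eq proportional to T_star.
factor = 2  # temperature multiplier

print(f"peak wavelength: x{1 / factor}")   # 0.5 -> it halves, so A is wrong
print(f"luminosity:      x{factor ** 4}")  # 16  -> not 4, so B is wrong
print(f"Earth's T_eq:    x{factor}")       # ~2  -> C is right
# A halved peak wavelength shifts the spectrum blueward, so D is right too.
```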
And ChatGPT, wow, totally messes this one up. Answers C and D are
correct, but it went for B, which is just totally wrong. Weirdly, it's as if it
messes up the math here. The second sentence correctly states it's a fourth-power dependency, so two to the power of four, but that's 16, not four. So it's just a math error to think that this would be the correct answer. But for the other options, it just flatly states the answer is wrong with no explanation. So it's hard for me to know
what it's thinking here. If we can even really
use the word thinking. I'm kind of wondering if the
reason why it's struggling here is 'cause it just treats
multiple choice questions as always having a single answer, and so that's maybe where it's tripping up. To test that, let's look ahead to the next question. Which of the following tidal effects has affected or is affecting the moon? This is a nice juicy
meatball kind of a question, should be straightforward for my students. Options are A, tidal acceleration. Yep, that's correct
because we know the moon is receding away from us
about one inch per year. B, tidal locking, yes,
that's obviously correct because it has to explain why we only see one side of the moon all the time. C, tidal magnetism, no such thing. That's a bogus answer. D, tidal friction. Yep. You have to have tidal friction in order to get to a state of tidal locking. So that must have occurred,
or E, tidal disruption. No, the moon is in one piece
so it hasn't been disrupted. Okay, so the answer should be A, B, and D and ChatGPT goes for A, B, and D. Good. It got this one right. Okay, so that actually
proves that it can handle multiple choice questions
with multiple answers, and so it doesn't really excuse its earlier mistake in the previous question. That was the last
multiple choice question. Actually it lost quite a few
points in that last section. So applying the same grading
scheme that I applied to the real final to ChatGPT's responses, I see that it got 45.5 points out of an available 60 thus far. So 76%, about three
quarters of the questions it's getting correct, that's not bad, but honestly it's below
my initial expectations. The last 30 points, there's 90 altogether, come from more freeform, short-answer style questions. Maybe this will play to ChatGPT's strengths a little bit more, let's see. Let's move on to question
26, which is about transits. I had to slightly modify
the question here because the real exam showed a graph,
which obviously I can't enter, but I think it's a minor
part of the question and it's easy to switch out
with a text input equivalent. It's really the same changes I'd make for a visually impaired student. 26 A, B, and C ask a
few very basic questions about transits, which
ChatGPT does fine with. D is where things get a
little bit more spicy. This will be a fairly challenging question for most of my students
because they're generally less comfortable with math
heavy questions like this. Okay, for D, if an exoplanet
has an orbital period of 11 days and the host star is known to have a mass of 0.2 solar masses, what is the semi-major axis
of the planetary orbit? Give your answer in AU. So the tricky thing here is that you have to use Newton's version
of Kepler's third law rather than Kepler's original version because of the non-solar stellar mass. A very common mistake I see in applying this formula is messing up your units. I always teach students to
convert everything into SI, standard international
units before plugging into the formula, plug it
in, get your answer out, which will also be in SI, and then convert that into whatever units
the question asks for. So let's see how ChatGPT does here. Okay, so it sets up Kepler's third law correctly. It's using Newton's version, that's good, but immediately here I'm worried about what it's doing with the units. We would normally quote a formula like this without the units and just implicitly understand that it's supposed to be in SI. But here it's attempting to sort of write down the formula in a non-SI unit base. But you know, going through the answer, it actually works out to get the correct solution. So that's okay. I certainly wouldn't teach it this way, because I would just be worried about it making mistakes with the units down the road.
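For completeness, here's a minimal Python sketch of that SI workflow, converting everything to SI first and only going back to AU at the very end; the constants are just standard values:

```python
# Question 26D: Newton's version of Kepler's third law, a^3 = G M P^2 / (4 pi^2),
# with P = 11 days and M = 0.2 solar masses, worked entirely in SI units.
import math

G = 6.674e-11      # m^3 kg^-1 s^-2
M_SUN = 1.989e30   # kg
AU = 1.496e11      # m
DAY = 86400.0      # s

P = 11 * DAY       # orbital period in seconds
M = 0.2 * M_SUN    # stellar mass in kg (the planet's mass is negligible here)

a = (G * M * P**2 / (4 * math.pi**2)) ** (1 / 3)
print(f"a = {a / AU:.3f} AU")  # -> about 0.057 AU
```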
Okay, so the final part of this question is easier in my opinion, but it's still
a mathematical question. Let's see how ChatGPT does. If an exoplanet transit
has a depth of 0.003 and the star is known to have
a radius of 0.2 solar radii, what is the radius of the planet? Give your answer in units of earth radii. Okay, so the way to
answer this is to remember that the planet's radius divided by the star's radius, all squared, equals the transit depth,
ignoring limb darkening. So you then just rearrange that equation to solve for the planetary radius. But again, you have to be
careful with the units. Let's see how ChatGPT does. Okay, good. It sets up the problem,
the formula is, oh, again, I do not like the way it is
trying to stick units in here, and as stated, that is incorrect. Just delete those units and
it will be totally fine. This worries me and yep, that
does not look right here. It's doing a good job of
rearranging its formula, but it has messed up the units. And finally the answer
is just way too small. The plant radius here
should be square root of 0.003 times by the stellar radius. That's 0.054 times by 1/5 of a
solar radius, which works out to about 7,600 kilometers
or 1.2 earth radii. That should be the right answer, and so it says here, this
is a very small radius. Yeah, you don't say. So, this should be your sanity checkpoint, what we teach our students: look at the final number that comes out and just ask yourself, does it make sense? A radius of 0.00348 Earth radii really should raise alarm bells. That's incredibly small. That'd be something like
40 kilometers in diameter, that's more like an
asteroid than a planet, way smaller than anything
we've ever detected. That should be a red flag; something is wrong here.
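That sanity check takes only a few lines of Python, with the standard solar and Earth radii plugged in:

```python
# The depth-to-radius sanity check: depth = (R_planet / R_star)^2, so
# R_planet = sqrt(depth) * R_star. With depth = 0.003 and R_star = 0.2 solar
# radii, the answer lands near 1.2 Earth radii, nowhere near the ~0.003 Earth
# radii that ChatGPT reported.
import math

R_SUN_KM = 695_700.0
R_EARTH_KM = 6_371.0

depth = 0.003
r_star_km = 0.2 * R_SUN_KM
r_planet_km = math.sqrt(depth) * r_star_km

print(f"{r_planet_km:.0f} km = {r_planet_km / R_EARTH_KM:.2f} Earth radii")
# -> about 7,600 km, i.e. roughly 1.2 Earth radii
```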
So in total, the moment that you have all been waiting for: ChatGPT scored 66.5 points out of an available 90, so that's 73.9%. Now I can tell you that in
the real graded copies that my human students took this year,
the median score was 75.6%. So ChatGPT performed worse than a typical Columbia non-science major undergraduate, but it's not a huge way off. So what do we make of this? Well, I guess I am a little surprised but also somewhat relieved
here, for context, this exam was open book,
which means that students are allowed to bring any
reference notes with them during the exam and are
allowed the use of calculators. Now I set that rule because
it's more realistic. In the real world, we pretty much always have access to reference materials to look up information, or calculators to help with calculations. And I want my exams to
be a closer reflection of the real-world practice of science, not merely some memory test that rewards those who can parrot out facts and figures they learned in class. In this way, our
calculators are just a tool. Our reference materials are a tool, but shouldn't we also really
extend this to ChatGPT then? In the real world, one could
always use ChatGPT as well. So should our students be allowed to use this AI in their finals? This video establishes that if a student relied solely on ChatGPT
to take at least my exam, that would be a bad strategy
since they would end up with a worse than median grade,
worse than the typical student. However, as this algorithm and its competitors improve, there may come a point where these AIs can ace our exams. What do we do then? Now, I could try to outsmart the AI and be more devious in my
questions, but in the end, that's an arms race that
I'll never be able to win. And if everybody aces
the exam using ChatGPT or something else, then
the students really haven't learned anything except
how to copy and paste. For me, then ChatGPT is
really a different beast than merely allowing
open books in an exam. Now of course, we could
guard against this. We can walk around the
room and check they're not using some kind of AI
system on their laptop when they're taking the exam, but what do we do about homeworks or take home exams, other assignments that
influence their grade? Perhaps in the end we'll
need to use some kind of AI detector, discrimination
tools that assess the probability that
answers were produced by these tools similar to the
kind of plagiarism tools that already exist for
comparing student answers. But here again, one might
worry about an AI arms race. In the longer run, ChatGPT and its successors will pose a challenge to educators about what we want to teach our students. What should they really be learning? How should they interact with online tools, and how do we assess our students? If you are presently or have
been a student or an educator, please do let me know your
thoughts on this topic down below in the comment section, I
really want to hear from you. Look, hopefully together we can continue to nurture curiosity and innovation to provide our students with the skills they need to succeed, whilst also offering a fair academic environment. Challenges for sure, but
challenges we have to face. So until next time, stay
thoughtful and stay curious. (energetic music)