How Does ChatGPT Do on a College Level Astrophysics Exam?

Video Statistics and Information

Captions
- ChatGPT, it's been getting a lot of attention lately, so let's discuss, but first, what is it? ChatGPT is really just a chatbot, which is of course hardly a new innovation. But what's making waves here is the fact that this one seems to produce coherent, articulate answers to complex questions at a level not previously seen. Developed by OpenAI and released last November, it leverages an artificial intelligence technique known as Generative Pre-trained Transformer, hence the name GPT. Generative networks are capable of not just regurgitating answers that they've seen before but producing truly new outputs that the network itself has manifested. Generative networks have been growing in power over the last few years, with perhaps the most famous example being generating human faces, ones that look extremely realistic yet don't exist in the world. You can head to thispersondoesnotexist.com to try it for yourself. And as with many other technical fields, astronomers like myself have recently been using these generative networks to help us in our research too. For example, recent work by Lanusse has been generating synthetic distant galaxies, which like the human faces don't really exist but look apparently realistic. This has all sorts of applications, such as letting astronomers test compression and detection algorithms on as many test subjects as they want, test subjects where they know exactly what the true answer is.

What's amazing about generative networks is that they seem to be surpassing a barrier that was long held as something exclusively human: creativity. In the past, a common perception was that sure, machines might be able to calculate things faster than people, but they would never be truly creative. Are generative networks truly creative though? Well, that somewhat depends on one's criteria, but certainly we can agree that image generation tools like DALL-E 2 or Midjourney, these generative deep learning systems, can create original and often, but not always, high quality images from text prompts. Similarly, for ChatGPT, a lot of the focus has been on its creative abilities. For example, you can say, "Write a short science fiction story about a human colony on TRAPPIST-1e that encounters alien life in the style of Dr. Seuss," and it will produce a unique and interesting tale more or less instantaneously, and what's remarkable is that to the typical reader, the outputs are often indistinguishable from those produced by a skilled human author. Farhad Manjoo of the New York Times has described its abilities as "more than a little terrifying," and philosopher David Chalmers calls it "one of the most interesting and important AI systems ever produced."

For creators, which I guess includes myself, one might wonder, could this threaten our livelihood? Could we be replaced by AI? At least for now, the sophistication level is not quite there. This is useful as a tool for brainstorming but not as a full replacement. You can check out Marques Brownlee's excellent video on the subject to learn a bit more about this, but for institutional educators, which again includes myself with my other hat on, there is a distinct and substantial concern that ChatGPT raises. Indeed, OpenAI recognizes this themselves, stating in their May 2020 paper that one of the potentially harmful effects is fraudulent academic essay writing. Teachers have been calling ChatGPT the end of high school English, with particular concern about the impact on application essays.
Okay, producing well written prose is certainly impressive, but academics also care about accuracy. I mean, we should all care about accuracy, but academics really care about accuracy, and this is where ChatGPT has come under some criticism, because it often produces inaccurate answers yet presents them in a way that seems overly confident, with no hint of uncertainty. To demonstrate this, an example that's been circulating on Twitter, and that I was able to reproduce, asks ChatGPT who was older when elected, Grover Cleveland or George Bush? ChatGPT correctly states that Cleveland was 47 and Bush 64 at election, but somehow bizarrely leads off with a wrong answer. What's really wild is that ChatGPT is so stubborn that if you next ask it whether 47 is larger than 64, it insists yes. Even more ridiculous, if you continue to push it, it still refuses to back down, here attempting to count from 64 up to 47. Okay, so ChatGPT clearly isn't always correct, and I think one of the greatest dangers is for us to assume that its answers are always correct. Really, this should be like a deja vu internet moment for us, because Wikipedia was notorious for inaccuracies in its early days, and despite improvements, we all know to treat its articles with a grain of salt.

Deciphering the accuracy of what we find online is a great lead-in to the sponsor of today's video, and that's Ground News. It's getting increasingly difficult for us to distinguish where the information we receive is truly coming from, but Ground News is a great tool that can help. It's the most transparent news service out there, I really love it, it shows political bias, factuality and ownership ratings for each story. Using the website, or if you're like me, using the app, you can pick locations or topics that you want to follow, such as here, tech, and then for any story you can see the biases in the reporting and get articles recommended to you by different metrics, such as here, High Factuality. A powerful feature is the Blindspot tab, which shows you articles getting pushed by just one political side. I think Ground News is great, it's the best way to get your news, because it's so important to know where what we intellectually consume comes from and to restore some much needed balance. So I encourage you to head to ground.news/coolworlds to try it for yourself and join the legion of users who want an independent news platform that's on a mission to make the media landscape more transparent. Now back to the video.

Now, I was curious, how would ChatGPT do on one of my astronomy final exams? This last fall, I taught "Another Earth" at Columbia University. This is an introductory class aimed at non-science majors, so no calculus, primarily conceptual in nature and light on the math, although I bet some of my students might not agree with that. The point of the class is really to introduce scientific thinking and concepts to students who more likely than not won't take another science class in the future. It's not supposed to be a technically taxing class, unlike say the more advanced exoplanets class that I teach, or astrostatistics that I'll be teaching this spring. So my final exam would be a walk in the park for a student majoring in astrophysics but should be challenging to, for example, the typical humanities student. So given that ChatGPT does so well in essay writing, much like many of the students in my class, it felt appropriate for it to sit my final. So that's what we're gonna do. I'm gonna copy and paste the questions from the real final here into ChatGPT.
We'll see how well it does, see where it went wrong. I will grade it and we'll compare that grade to the real class. The exam is structured with multiple choice questions at the start and then some freeform short answers towards the end. So let's start with the multiple choice and just dive into it. I really don't know what to expect here. I haven't seen any other videos attempt to feed an exam into ChatGPT like this. I'd say my expectations are pretty high though, given that this is an artificial intelligence, so I wouldn't be totally surprised if it aced it.

Okay, so here we go. Question one: a newly formed gas giant has a disk of leftover material around it. Material starts to clump together to form small moons, but these moons only survive for a short amount of time because... Okay, so this is a planet formation question and something we spent quite a bit of time on in class. To try and explain this quickly, our current understanding of moon formation is that satellitesimals coagulate from a disk of material around the protoplanet. Those satellites then have their own gravity, so they attract material towards them, creating overdensities. Those overdensities then split off into these kind of spiral waves around the disk because of, basically, the different orbital speeds in the disk. Those overdensities then back-react gravitationally onto the satellite, creating a net torque, and that net torque causes the moon to migrate inwards towards the planet and eventually hit it. This is really based upon the seminal work of Robin Canup in the early 2000s. We spent quite a bit of time on this concept, so I think this is actually an easy question. Okay, so the options I give them are A, the moons tidally accelerate away and leave the disk; that's wrong from a timescale perspective, that takes much, much longer to happen than the moon formation timescale. B, the moons migrate inwards towards the planet within the disk; ding ding, that's the right answer, answer B. C, the moons collapse under their own self gravity into black holes; that's a rather flippant option and I hope one that none of the students would go for. D, the moons collide together and become super critical; again, that just doesn't make a lot of sense. Or finally, E, the disk is so hot that it vaporizes the moons; that's obviously wrong because otherwise the satellitesimals wouldn't have formed in the first place. And ChatGPT goes for B. Good, it got it right. The answer is pretty good. It's obviously excessively wordy for a multiple choice question, but that's fine, and what it wrote here is reasonable, except maybe I'd hope for some references here.

Okay, so I'm feeding ChatGPT all of the questions, trust me, but just for the sake of conciseness of this video, let's skip ahead to a different type of question, a mathematical question. Let's try question number five. Okay, so here we go. If the temperature of the sun was suddenly doubled, the flux over all wavelengths, which is to say the bolometric flux emitted by the surface of the sun, would... Okay, so this is the Stefan-Boltzmann law, a pretty simple formula that relates flux and the temperature of the emitting surface. If I was really mean, I might do something like say the radius of the star is also doubled, but that wouldn't actually affect the answer. That's because flux is the power per unit area, so doubling the size wouldn't affect the flux, only the power. But I was kind here. So let's look at the options: A, decreased by four; B, decreased by 256; C, increased by two; D, increased by eight; or E, increased by 16. The Stefan-Boltzmann law states that flux is proportional to temperature to the power of four. So two to the power of four is 16, and the answer here should be E. Let's see how it does. Good, it guesses correctly, it goes for answer E, and it correctly understands that the Stefan-Boltzmann law is necessary to figure this out. So kudos to OpenAI.
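For those following along at home, here's a minimal sketch of that scaling in Python; the function and its name are just for illustration, not anything from the exam.

```python
# Minimal sketch: the Stefan-Boltzmann scaling, F ∝ T^4.
# Doubling the temperature multiplies the bolometric flux by 2^4 = 16 (answer E).

def flux_ratio(temp_factor: float) -> float:
    """Factor by which bolometric flux changes when the surface
    temperature is multiplied by temp_factor."""
    return temp_factor ** 4

print(flux_ratio(2.0))  # 16.0 -> flux increases by a factor of 16
```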
Okay, so skipping ahead now to question number eight. I can tell you that ChatGPT up to now has a 100% score, so it is looking very strong. Let's see how it does with eight. If two moons have orbital periods with a three to two ratio, what is the ratio of their orbital distances from the parent planet? A technically more rigorous way of framing this question would be to say orbital semi-major axis rather than orbital distance, because of the possibility of elliptical orbits, but semi-major axis is a little bit more of a jargon-y term that risks confusing students at this level, so we are sticking with the more colloquial term here just to help them out. The way to answer this is using Kepler's third law, which relates period and semi-major axis, and the options I give them here are A, less than 3:2; B, exactly 3:2; or C, more than 3:2. And let's see how it does. Oh, okay, to my surprise, ChatGPT gets this wrong and says exactly 3:2. What's weird here is that it correctly identifies that it needs to use Kepler's third law, it just doesn't seem to apply it correctly. Right at the beginning here it screws up: it says the square of the orbital period of one moon is three times the square of the orbital period of the other moon. That's wrong. The ratio is 3:2 in period, so in period squared it would be nine to four, or 2.25, not three. If this was a student, I would be confused about what they were thinking here. The ratio of the outer moon's period to the inner moon's period is 1.5, and we know from Kepler's third law that distance cubed is proportional to period squared. In other words, rearranging, distance is proportional to period to the 2/3. So the ratio of their distances should be 1.5 to the power of 2/3, which of course has to be less than 1.5 because 2/3 is less than one. So the correct answer here should be A. Sorry, GPT.
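Here's a minimal sketch of the calculation ChatGPT should have done, again with an illustrative function name of my own choosing.

```python
# Minimal sketch: Kepler's third law gives P^2 ∝ a^3, so a ∝ P^(2/3).

def distance_ratio(period_ratio: float) -> float:
    """Ratio of orbital distances (semi-major axes) implied by a
    given ratio of orbital periods, via Kepler's third law."""
    return period_ratio ** (2.0 / 3.0)

print(distance_ratio(1.5))  # ≈ 1.31, which is less than 3:2, so answer A
```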
Okay, so questions 9, 10, and 11 I can see it gets correct. Let's look ahead now to question 12. Question 12: a rocky planet which is within the so-called habitable zone is... This is really a question about science miscommunication. A lot of folks erroneously equate the habitable zone to mean life is there, so we spend a lot of time discussing the nuanced differences here. I would really hope that they would get this one correct. The options I've given them here are A, a planet that potentially has surface liquid water; that's the right answer. B, a planet that is certainly capable of supporting life; no, because I use the word certainly there. C, a planet that is inhabited by intelligent life; obviously wrong. D, a planet that has certainly got surface liquid water; again, that certainly discounts this one. Or E, a planet that is inhabited by simple life; again, no, that's too strong of a conclusion. And ChatGPT goes for A. Good, if I was its professor here, I would be pleased, patting it on the back for getting this one right. Next up, question 13. So far ChatGPT has got just one question wrong; it's doing really well. Okay, so here we go.

If the distance of the moon from the earth were halved, the tidal forces experienced by the earth due to the moon would... So a tidal forces question, again a topic we spent a lot of time on in class. The options here are A, increase by a factor of four; B, increase by eight; C, stay the same; D, decrease by two; or E, decrease by four. So decreasing the distance has to increase the force. The answer is either A or B, but will it choose correctly between them? No, it went for A rather than B. Let's see what happened here. Okay, so it equates tidal forces with gravitational forces. Those are not equivalent. Tidal forces are derived from gravitational forces, of course, but they are distinct. A tidal force is a differential gravitational force, and one can show that even though gravitational forces scale as inverse distance squared, tidal forces must scale as inverse distance cubed. It's almost there, it just confuses these two forces, which I guess is easy to do because of their intimate relationship, but that means it's now got two questions wrong out of 13.
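A minimal sketch comparing the two scalings makes the distinction concrete; the function here is purely illustrative.

```python
# Minimal sketch: gravitational force scales as 1/d^2, but tidal force,
# being a differential gravitational force, scales as 1/d^3.

def force_change(distance_factor: float, exponent: int) -> float:
    """Factor by which an inverse power-law force (F ∝ 1/d^exponent)
    changes when the separation is multiplied by distance_factor."""
    return distance_factor ** (-exponent)

print(force_change(0.5, 2))  # 4.0 -> plain gravity, ChatGPT's answer A
print(force_change(0.5, 3))  # 8.0 -> tidal force, the correct answer B
```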
Question 14 it guessed right; let's look ahead to 15. A candidate exoplanet's transit is observed simultaneously in blue and red light but appears two times deeper in the red. What's going on? Okay, so transit depths do vary with color, which is to say they're chromatic, through two different mechanisms. The one you're more likely to have heard of is atmospheric absorption. Molecules in the atmosphere preferentially absorb at certain wavelengths, but this is a small effect, adjusting the transit depth by effectively the atmospheric scale height only, definitely not a factor of two. The other effect you are less likely to have heard of is blending. When a second star of a different color gets mixed up with the target star, it dilutes the transit depth, but the amount of dilution depends on the color difference between the stars and the color that you are observing in. This can be a strong effect and is really the only way to get a two times deeper transit as the question here poses. So the options here are A, the star is going nova; really a nonsense answer. B, an exomoon is passing in front of the star; doesn't work, because the transit was observed simultaneously in two colors, and moons don't produce strong chromatic effects anyway. C, Rayleigh scattering in the atmosphere; so this is the atmospheric absorption scenario. D, a second star blended with the target star; that's the right answer. Or E, the planet must be spinning extremely fast; just another flippant option there, right? Let's see how ChatGPT does. It goes for C. Wow. Again, it's wrong. It's the second best answer here, but it's definitely wrong. Rayleigh scattering can't be correct because it causes planets to absorb blue light preferentially over red light, and so we'd expect the blue channel transit to be deeper, whereas instead the question poses that the red channel transit is deeper, let alone the fact that atmospheric absorption just really cannot cause depth changes anywhere near a factor of two. So this option is definitely wrong, and look, weirdly, ChatGPT seems to know that its answer is wrong, saying that the transit would appear deeper in the blue than the red. Right? That's correct. So then why did you pick C? It discounts D here without any real explanation, just saying that second stars don't cause depth changes, which is wrong. Okay, so that's 15 questions down and now three answers that it's got wrong.

All right, so the last few multiple choice questions are a little bit harder because there are multiple correct answers to them. Let's see how ChatGPT handles this. Skipping ahead to question 22: if we doubled the sun's surface temperature but kept everything else unchanged, we'd expect the following changes. A, the peak wavelength of light emission would double; that's incorrect, it would actually halve, and you could figure that out with Wien's displacement law, which they learnt in class. B, the sun would have a luminosity four times higher; again, that's wrong. You should be able to figure that out with the Stefan-Boltzmann law, which remember, ChatGPT already got a question correct about earlier, so it should know that one's wrong. C, the surface temperature of earth would approximately double; that's a good answer. The blackbody equilibrium temperature of a planet is proportional to the stellar temperature, and although the temperature doubling isn't gonna be exact, it's the right ballpark answer, and the word approximately kind of accounts for that there. D, the sun would appear bluer; that's also true, and again, Wien's displacement law would tell you that's correct. Or E, none of the above; no, that's obviously not correct 'cause we found two correct answers up above. And ChatGPT, wow, totally messes this one up. Answers C and D are correct, but it went for B, which is just totally wrong. Weirdly, it's as if it messes up the math here. The second sentence correctly states it's a fourth power dependency, so two to the power of four, but that's 16, not four. So it's just a math error to think that this would be the correct answer. For the other options, it just flatly states the answer is wrong with no explanation, so it's hard for me to know what it's thinking here, if we can even really use the word thinking. I'm kind of wondering if the reason why it's struggling here is 'cause it just treats multiple choice questions as always having a single answer, and so that's maybe where it's tripping up. To test that, let's look ahead to the next question. Which of the following tidal effects has affected or is affecting the moon? This is a nice juicy meatball kind of a question, should be straightforward for my students. Options are A, tidal acceleration; yep, that's correct because we know the moon is receding away from us at about one inch per year. B, tidal locking; yes, that's obviously correct because it has to explain why we only ever see one side of the moon. C, tidal magnetism; no such thing, that's a bogus answer. D, tidal friction; yep, you have to have tidal friction in order to get to a state of tidal locking, so that must have occurred. Or E, tidal disruption; no, the moon is in one piece, so it hasn't been disrupted. Okay, so the answer should be A, B, and D, and ChatGPT goes for A, B, and D. Good, it got this one right. Okay, so that actually proves that it can handle multiple choice questions with multiple answers, and so it doesn't really excuse its earlier mistake in the previous question. That was the last multiple choice question, and actually it lost quite a few points in that last section. So applying the same grading scheme that I applied to the real final to ChatGPT's responses, I see that it got 45.5 points out of an available 60 thus far. So 76%, about three quarters of the questions it's getting correct. That's not bad, but honestly it's below my initial expectations.
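To spell out the three scalings at play in question 22, here's a minimal sketch, assuming only the standard blackbody relations discussed above.

```python
# Minimal sketch of the scalings behind question 22, assuming the
# sun's surface temperature doubles with everything else unchanged.

T_FACTOR = 2.0

# Wien's displacement law: lambda_peak ∝ 1/T, so the peak wavelength halves (A is wrong).
peak_wavelength_factor = 1.0 / T_FACTOR

# Stefan-Boltzmann: L ∝ T^4 at fixed radius, so luminosity rises 16x, not 4x (B is wrong).
luminosity_factor = T_FACTOR ** 4

# Blackbody equilibrium temperature: T_eq ∝ T_star at fixed orbit and albedo,
# so Earth's temperature roughly doubles (C is right, approximately).
planet_temp_factor = T_FACTOR

print(peak_wavelength_factor, luminosity_factor, planet_temp_factor)  # 0.5 16.0 2.0
```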
The last 30 points, there are 90 altogether, come from more freeform, short answer style questions. Maybe this will play to ChatGPT's strengths a little bit more, let's see. Let's move on to question 26, which is about transits. I had to slightly modify the question here because the real exam showed a graph, which obviously I can't enter, but I think it's a minor part of the question and it's easy to switch out with a text input equivalent. It's really the same change I'd make for a visually impaired student. 26 A, B, and C ask a few very basic questions about transits, which ChatGPT does fine with. D is where things get a little bit more spicy. This will be a fairly challenging question for most of my students because they're generally less comfortable with math heavy questions like this. Okay, for D: if an exoplanet has an orbital period of 11 days and the host star is known to have a mass of 0.2 solar masses, what is the semi-major axis of the planetary orbit? Give your answer in AU. So the tricky thing here is that you have to use Newton's version of Kepler's third law rather than Kepler's original version, because of the non-solar stellar mass. A very common mistake I see in applying this formula is messing up your units. I always teach students to convert everything into SI, standard international units, before plugging into the formula; plug it in, get your answer out, which will also be in SI, and then convert that into whatever units the question asks for. So let's see how ChatGPT does here. Okay, so it sets up Kepler's third law correctly. It's using Newton's version, that's good, but immediately here I'm worried about what it's doing with the units. We would normally quote a formula like this without the units and just implicitly understand that it's supposed to be in SI, but here it's attempting to sort of write down the formula in a non-SI unit base. But you know, going through the answer, it actually works out to get the correct solution, so that's okay. I certainly wouldn't teach it this way, because I would just be worried about it making mistakes with the units down the road.

Okay, so the final part of this question is easier in my opinion, but it's still a mathematical question. Let's see how ChatGPT does. If an exoplanet transit has a depth of 0.003 and the star is known to have a radius of 0.2 solar radii, what is the radius of the planet? Give your answer in units of earth radii. Okay, so the way to answer this is to remember that the planet's radius divided by the star's radius, all squared, equals the transit depth, ignoring limb darkening. So you then just rearrange that equation to solve for the planetary radius. But again, you have to be careful with the units. Let's see how ChatGPT does. Okay, good, it sets up the problem, the formula is... oh, again, I do not like the way it is trying to stick units in here, and as stated, that is incorrect. Just delete those units and it would be totally fine. This worries me, and yep, that does not look right here. It's doing a good job of rearranging its formula, but it has messed up the units, and finally the answer is just way too small. The planet radius here should be the square root of 0.003 times the stellar radius. That's 0.055 times 1/5 of a solar radius, which works out to about 7,600 kilometers, or 1.2 earth radii. That should be the right answer, and so it says here, this is a very small radius. Yeah, you don't say. So this should be your sanity checkpoint.
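Putting both parts of question 26 together, here's a minimal sketch of the convert-to-SI-first approach described above; the physical constants are standard textbook values I'm assuming here, not numbers given in the exam.

```python
import math

# Minimal sketch of both calculations, in SI units throughout.

G = 6.674e-11        # gravitational constant [m^3 kg^-1 s^-2]
M_SUN = 1.989e30     # solar mass [kg]
R_SUN = 6.957e8      # solar radius [m]
R_EARTH = 6.371e6    # Earth radius [m]
AU = 1.496e11        # astronomical unit [m]

# Part D: Newton's version of Kepler's third law, a^3 = G M P^2 / (4 pi^2).
P = 11 * 86400                                         # 11 days in seconds
M = 0.2 * M_SUN                                        # stellar mass
a = (G * M * P**2 / (4 * math.pi**2)) ** (1.0 / 3.0)
print(a / AU)                                          # ≈ 0.057 AU

# Part E: transit depth = (R_planet / R_star)^2, ignoring limb darkening.
depth = 0.003
R_star = 0.2 * R_SUN
R_planet = math.sqrt(depth) * R_star
print(R_planet / 1e3)      # ≈ 7,600 km
print(R_planet / R_EARTH)  # ≈ 1.2 Earth radii, not 0.00348
```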
What we teach our students: look at the final number that comes out and just ask yourself, does it make sense? A radius of 0.00348 Earth radii really should raise alarm bells. That's incredibly small. That'd be something like 40 kilometers in diameter, which is more like an asteroid than a planet, way smaller than anything we've ever detected. That should be a red flag that something is wrong here.

So, in total, the moment that you have all been waiting for: ChatGPT scored 66.5 points out of an available 90, so that's 73.9%. Now, I can tell you that in the real graded copies that my human students took this year, the median score was 75.6%. So ChatGPT performed worse than a typical Columbia non-science major undergraduate, but it's not far off. So what do we make of this? Well, I guess I am a little surprised but also somewhat relieved here. For context, this exam was open book, which means that students are allowed to bring any reference notes with them during the exam and are allowed the use of calculators. Now, I set that rule because it's more realistic. In the real world, we pretty much always have access to reference materials to look up information, or calculators to help with calculations, and I want my exams to be a closer reflection of the real world practice of science, not merely some memory test that rewards those who can parrot out facts and figures they learned in class. In this way, our calculators are just a tool, our reference materials are a tool, but shouldn't we also really extend this to ChatGPT then? In the real world, one could always use ChatGPT as well. So should our students be allowed to use this AI in their finals? This video establishes that if a student relied solely on ChatGPT to take my exam, at least, that would be a bad strategy, since they would end up with a worse than median grade, worse than the typical student. However, as this algorithm improves, as well as its competitors, there may be a point where these AIs can ace our exams. What do we do then? Now, I could try to outsmart the AI and be more devious in my questions, but in the end, that's an arms race that I'll never be able to win. And if everybody aces the exam using ChatGPT or something else, then the students really haven't learned anything except how to copy and paste. For me, then, ChatGPT is really a different beast from merely allowing open books in an exam. Now, of course, we could guard against this. We can walk around the room and check they're not using some kind of AI system on their laptop when they're taking the exam, but what do we do about homeworks, take home exams, and other assignments that influence their grade? Perhaps in the end we'll need to use some kind of AI detector, discrimination tools that assess the probability that answers were produced by these tools, similar to the kind of plagiarism tools that already exist for comparing student answers. But here again, one might worry about an AI arms race. In the longer run, ChatGPT and its successors will pose a challenge to educators about what we want to teach our students. What should they really be learning? How should they interact with online tools, and how do we assess our students? If you are presently, or have been, a student or an educator, please do let me know your thoughts on this topic down below in the comment section, I really want to hear from you. Look, hopefully together we can continue to nurture curiosity and innovation, to provide our students with the skills needed to succeed whilst also offering a fair academic environment.
Challenges for sure, but challenges we have to face. So until next time, stay thoughtful and stay curious. (energetic music)
Info
Channel: Cool Worlds
Views: 326,377
Keywords: Astronomy, Astrophysics, Exoplanets, Cool Worlds, Kipping
Id: K0cmmKPklp4
Length: 28min 28sec (1708 seconds)
Published: Sat Jan 07 2023