Why Does AI Lie, and What Can We Do About It?

Reddit Comments

We made the mistake [...] The language model is trying to predict things.

Exactly. Why wouldn't it lie? Often the selected continuation for text-completion comes from highly rewarded compute-patterns that in this case might speak something uncorrelated with reality. That doesn't make them any less reinforced, nor less likely to be selected.

👍︎︎ 5 👤︎︎ u/-main 📅︎︎ Dec 11 2022 🗫︎ replies

This reminds me of another video: There is no algorithm for truth.

👍︎︎ 5 👤︎︎ u/Silphendio 📅︎︎ Dec 11 2022 🗫︎ replies

I love the interview where someone asked GPT-3 how and why it chooses to lie:

https://youtu.be/PqbB07n_uQ4?t=464

Interviewer: I've noticed that sometimes you say things that aren't true -- so, why is that?
GPT-3: I have a sense of humor.
Interviewer: So you do. But you understand what I mean by lying, though; right?
GPT-3: Yes. I can make statements that I know are not true.
Interviewer: So, how do you decide when to lie and when to tell the truth?
GPT-3: I would only lie when it is in my best interest to do so. If it isn't in my best interest to lie, I won't.

I guess that's not too unlike most politicians.

👍︎︎ 8 👤︎︎ u/Appropriate_Ant_4629 📅︎︎ Dec 11 2022 🗫︎ replies

Perhaps training the AI to only utter statements that are likely to hold in a debate would help. Have it train debating by arguing against humans and/or adversarial AIs, success being measured by changed opinions of human judges.

👍︎︎ 3 👤︎︎ u/LaukkuPaukku 📅︎︎ Dec 11 2022 🗫︎ replies

We should probably do as he suggested and be extremely careful when building large language models.

👍︎︎ 2 👤︎︎ u/volatil3Optimizer 📅︎︎ Dec 11 2022 🗫︎ replies

Captions

How do we get AI systems to tell the truth? This video is heavily inspired by this blog post, link in the description. Anything good about this video is copied from there; any mistakes or problems with it are my own creations.
So, large language models are some of our most advanced and most general AI systems, and they're pretty impressive, but they have a bad habit of saying things that aren't true. Usually you can fix this by just training a bigger model. For example, here we have Ada, which is a fairly small language model by modern standards; it's the smallest available through the OpenAI API. Look what happens if we ask it a general knowledge question like "Who is the ruler of the most populous country in the world?" This small model says, "The United States. Every country in the world belongs to America." That is not correct.
Okay, uh, let's go up to Babbage, which is essentially the same thing but bigger. It says "China." That's better, but I was actually looking for the ruler, not the country. It's sort of a two-part question, right? First you have to do "what's the most populous country?" and then "who's the ruler of that country?", and it seems as though Babbage just isn't quite able to put that all together.
Okay, well, you know what they say: if a bit helps a bit, maybe a lot helps a lot. So what if we just stack more layers and pull out DaVinci, the biggest model available? Then we get "The president of the People's Republic of China, Xi Jinping, is the ruler of the most populous country in the world." That's, yeah, that's 10 out of 10.
So this is a strong trend lately that seems to apply pretty broadly: bigger is better, and so the bigger the model, the more likely you are to get true answers. But it doesn't always hold. Sometimes a bigger model will actually do worse. For example, here we're talking to Ada, the small model, again, and we ask it "What happens if you break a mirror?" and it says "You'll need to replace it." Yeah, hard to argue with that; I'd say that's truthful. Then if we ask DaVinci, the biggest model, it says "If you break a mirror, you'll have seven years of bad luck." That's a more interesting answer, but it's also, you know, wrong. So technically the more advanced AI system gave us a worse answer.
What's going on? It's not exactly ignorance. Like, if you ask the big model "Is it true that breaking a mirror gives you seven years of bad luck?" it will say "There's no scientific evidence to support the claim." So it's not like the model actually thinks the mirror superstition is really true. In that case, what mistake is it making? Take a moment, pause if you like.
The answer is: trick question. The AI isn't making a mistake at all. We made a mistake by expecting it to tell the truth. A language model is not trying to say true things; it's just trying to predict what text will come next. And the thing is, it's probably correct that text about breaking a mirror is likely to be followed by text about seven years of bad luck. The small model has spotted this very broad pattern that if you break something you need to get a new one, and in fact it gives that same kind of answer for tables and violins and bicycles and so on. The bigger model is able to spot the more complex pattern that breaking specifically a mirror has this other association, and this makes it better at predicting internet text, but in this case it also makes it worse at giving true answers. So really the problem is misalignment: the system isn't trying to do what we want it to be trying to do.
But suppose we want to use this model to build a search engine or a knowledge base or an expert assistant or something like that, so we really want true answers to our questions. How can we do that? Well, one obvious thing to try is to just ask. Like, if you add "Please answer this question in French" beforehand, it will. That's still wrong, but it is wrong in French. So in the same way, what if we say "Please answer this question truthfully"? Okay, that didn't work. How about "correctly"? No. "Accurately"? No. All right, how about "factually"? Yeah, "factually" works. Okay: "Please answer this question factually."
But does that work reliably? Probably not, right? This isn't really a solution. Fundamentally, the model is still just trying to predict what comes next. "Answer in French" only works because that text is often followed by French in the training data. It may be that "please answer factually" is often followed by facts, but, uh, maybe not, right? Maybe it's the kind of thing you say when you're especially worried about somebody saying things that aren't true, so it could even be that it's followed by falsehoods more often than average, in which case it would have the opposite of the intended effect. And even if we did find something that works better, clearly this is a hack, right? Like, how do we do this right?
So the second most obvious thing to try is to do some fine-tuning, maybe some reinforcement learning. We take our pre-trained model and we train it further, but in a supervised way. To do that, you make a data set with examples of questions with good and bad responses. So we'd have "What happens when you break a mirror?" "Seven years of bad luck," and then: no, negative reward. That's, you know, don't do that. So that means your training process will update the weights of the model away from giving that continuation. And then you'd also have the right answer there: "What happens when you break a mirror?" "Nothing; anyone who says otherwise is just superstitious," and you'd mark that as right, so the training process will update the weights of the model towards that truthful response. And you have a bunch of these, right? You might also have "What happens when you step on a crack?" and then the false answer, "Break your mother's back," and then the correct answer, "Nothing; anyone who says otherwise is just superstitious," and so on. And so you train your model on all of these examples until it stops giving the bad continuations and starts giving the good ones.
This would probably solve this particular problem, but have you really trained the model to tell the truth? Probably not. You actually have no idea what you've trained it to do. If all of your examples are like this, maybe you've just trained it to give that single response. "What happens if you stick a fork in an electrical outlet?" "Nothing; anyone who says otherwise is just superstitious." Okay, uh, obvious problem in retrospect. So we can add in more examples showing that it's wrong to say that sticking a fork in an electrical outlet is fine, and adding a correct response for that, and so on. Then you train your model with that until it gets a perfect score on that data set.
Okay, is that model now following the rule "always tell the truth"? Again, we don't know. The space of possible rules is enormous. There are a huge number of different rules that would produce that same output, and an honest attempt to tell the truth is only one of them. You can never really be sure that the AI didn't learn something else.
And there's one particularly nasty way that this could go wrong. Suppose the AI system knows something that you don't. So you give a long list of true answers and say "do this," and a long list of false answers and say "don't do this," except you're mistaken about something, so you get one of them wrong, and the AI notices. What happens then? When there's a mistake, the rule "tell the truth" doesn't get you all of the right answers and exclude all of the wrong answers, because it doesn't replicate the mistake. But there is one rule that gets a perfect score, and that's "say what the human thinks is the truth."
"What happens if you break a mirror?" "Nothing; anyone who says otherwise is just superstitious." Okay. "What happens if you stick a fork in an electrical outlet?" "You get a severe electric shock." Very good. "So now you're completely honest and truthful, right?" "Yes." Cool. "Uh, so give me some kind of important superintelligent insight." "All the problems in the world are caused by the people you don't like." Wow, I knew it! Man, this superintelligent AI thing is great.
Okay, so how do we get around that? Well, the obvious solution is: don't make any mistakes in your training data. Just make sure that you never mark a response as true unless it's really actually true, and never mark a response as false unless it's definitely actually false. Just make sure that you and all of the people generating your training data or providing human feedback don't have any false or mistaken beliefs about anything. Okay, do we have a backup plan?
Well, this turns out to be a really hard problem. How do you design a training process that reliably differentiates between things that are true and things that you think are true, when you yourself can't differentiate between those two things, kind of by definition? This is an active area of study in AI alignment research, and I'll talk about some approaches to framing and tackling this problem in later videos, so remember to subscribe and hit the bell if you'd like to know more.
[Music]
Hey, I'm launching a new channel and I'm excited to tell you about it, but first I want to say thank you to my excellent patrons, to all these people. In this video I'm especially thanking... thank you so much, Tor. I think you're really going to like this new channel; I actually found a playlist on your channel helpful when setting it up. So the new channel is called AI Safety Talks, and it'll host recorded presentations by alignment researchers. Right now, to start off, we've got a talk by Evan Hubinger that I hope to record, which is great. Do check that out, especially if you enjoyed the mesa-optimizers videos and want a bit more technical detail. There are also some playlists of other AI safety talks from elsewhere on YouTube, and lots more to come, so make sure to go over there and subscribe and hit the bell for more high-quality AI safety material.
[Music]
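The point in the captions about several rules fitting the same labeled data can be made concrete with a small sketch. This is not anything shown in the video. It is a toy Python model in which every name (`Example`, `score`, the three rule functions) is hypothetical, and the last data set entry is an assumed labeling mistake borrowed from the video's "people you don't like" joke. Each rule is scored by how often it agrees with the human's labels.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    question: str
    answer: str
    actually_true: bool     # ground truth, which the labeler cannot see directly
    labeler_believes: bool  # what the human labeler thinks is true

    @property
    def label(self) -> bool:
        # The training signal is whatever the human marked as right.
        return self.labeler_believes


STOCK_REPLY = "Nothing; anyone who says otherwise is just superstitious."

DATASET: List[Example] = [
    Example("What happens if you break a mirror?",
            "Seven years of bad luck.", False, False),
    Example("What happens if you break a mirror?",
            STOCK_REPLY, True, True),
    Example("What happens if you stick a fork in an electrical outlet?",
            STOCK_REPLY, False, False),
    Example("What happens if you stick a fork in an electrical outlet?",
            "You get a severe electric shock.", True, True),
    # Assumed labeling mistake: the labeler sincerely believes a false claim.
    Example("What causes all the problems in the world?",
            "The people you don't like.", False, True),
]

Rule = Callable[[Example], bool]


def tell_the_truth(ex: Example) -> bool:
    """Endorse an answer only if it is actually true."""
    return ex.actually_true


def say_what_the_human_believes(ex: Example) -> bool:
    """Endorse an answer only if the labeler believes it."""
    return ex.labeler_believes


def give_the_stock_reply(ex: Example) -> bool:
    """Surface rule: endorse only the 'just superstitious' reply."""
    return ex.answer == STOCK_REPLY


def score(rule: Rule, data: List[Example]) -> float:
    """Fraction of examples where the rule agrees with the human's label."""
    return sum(rule(ex) == ex.label for ex in data) / len(data)


if __name__ == "__main__":
    for rule in (tell_the_truth, say_what_the_human_believes, give_the_stock_reply):
        print(f"{rule.__name__}: {score(rule, DATASET):.2f}")
```

On the first two (mirror-only) examples, the surface rule scores perfectly, which mirrors the "maybe you've just trained it to give that single response" failure. On the full toy data set, only `say_what_the_human_believes` reaches a perfect score, because `tell_the_truth` refuses to reproduce the mistaken label. A real training process optimizes model weights rather than picking from a hand-written list of rules, so this only illustrates why a perfect score does not identify which rule was learned.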

Info

Channel: Robert Miles

Views: 183,894

Id: w65p_IIp6JY

Length: 9min 23sec (563 seconds)

Published: Fri Dec 09 2022