The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment

Captions
Hi. This channel is about AI safety and AI alignment. The core idea of AI safety is often portrayed like this: you're a human, you have some objective that you want to achieve, so you create an AI system which is an optimizer. Being an optimizer means that it has an objective, and it chooses its actions to optimize, i.e. maximize or minimize, that objective. For example, a chess AI might choose what moves to make in order to maximize its chances of winning the game; a maze-solving algorithm might choose a route that minimizes the time taken to exit the maze. Often being an optimizer involves modeling your environment, running searches over the space of actions, and planning ahead, but optimizers can also be simpler than that. A machine learning system like gradient descent might choose how to adjust the weights of a neural network in order to maximize the network's performance at a particular task; that's optimization too.

So you, the human: your objective might be to cure cancer, so you put that objective into the system, and then the optimizer selects actions that it expects to result in good outcomes according to this objective. Part of the reason we have a problem is that the objective in your mind and the objective in the system will almost certainly end up not being the same, especially when the objectives refer to the real world with all its complexity, ambiguity and uncertainty. So we have this alignment problem: how do we get the objective in the system to match exactly with the objective in our minds? For example, perhaps the best you can do at describing your objective is some code which corresponds to "minimize the number of people who have cancer". That might look okay at first glance, but it's actually not the same as your real objective, since this one can be optimized by, for example, reducing the number of living people to zero: no people means no cancer. This is obviously a very silly example, but it's indicative of a real and serious problem. The human objective is really the totality of human ethics and values; it's very complicated, and it's not clear even to us. Getting the machine's objective to exactly align with ours is extremely difficult, and it's a big problem, because if the AI system is trying to optimize an objective that's different from ours, if it's misaligned even slightly, then the human and the AI system are in conflict: they're trying to achieve two different things in only one world. Right now these misalignments happen all the time, and they aren't a huge problem, because current AI systems tend to be fairly weak and fairly narrow, so we can spot the misalignments pretty easily and modify the systems as much as we want to fix them. But the more general and the more capable the system is, the bigger an issue this becomes, because the system is in an adversarial relationship with us. It's trying to achieve things that we don't want it to achieve, and in order to do that it's incentivized to prevent us from turning it off, prevent us from modifying it, and to manipulate us and deceive us if it can, so it can do what it wants to do even if we don't want it to. These are convergent instrumental goals, which we talked about in a previous video.

Now, this way of thinking about AI, where you program an objective into an optimizer that acts in the world, is obviously a simplification, and one way in which it's unrealistic is that current machine learning systems don't actually work this way. You don't generally have an AI system which is just an optimizer that you program an objective into, which then acts in the world to achieve that objective. What you normally have is something more like this: the first optimizer, the one you program the objective into, is not some kind of capable, general-purpose, real-world optimizer; it's just something like stochastic gradient descent. That optimizer adjusts the model's parameters, the network's weights, until the actions of the model do well according to the objective.
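As a rough sketch of that picture (a toy example, not anything from the video; all the names and numbers here are made up): the base optimizer is just a gradient descent loop that nudges a model's parameters until the model's outputs score well on the objective.

```python
import numpy as np

# Toy "base optimizer": plain gradient descent adjusting a model's parameters
# so that the model's outputs score well on a fixed objective. Here the "model"
# is just a linear function and the objective is squared error -- purely illustrative.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # observations the model sees
true_w = np.array([1.5, -2.0, 0.5])    # the behaviour we want it to learn
y = X @ true_w                         # "correct" actions for each observation

w = np.zeros(3)                        # the model's parameters (its "weights")
lr = 0.05                              # learning rate

for step in range(500):
    preds = X @ w                           # what the model currently does
    grad = 2 * X.T @ (preds - y) / len(X)   # gradient of the objective w.r.t. w
    w -= lr * grad                          # the base optimizer tweaks the parameters

print(w)   # ends up close to true_w: the model now "does well" on the objective
```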
So what happens if we update our understanding to this more realistic picture? "I'm sorry, did you just say we're going to give an objective to an optimizer that acts in the real world?" No, I said we're going to give an objective to an optimizer that optimizes a model that acts in the real world. "Oh, that's much worse." Why is that? Well, that's explained in the paper this video is about: "Risks from Learned Optimization in Advanced Machine Learning Systems". What it comes down to is: what happens when the model itself is also an optimizer? An optimizer is a thing that has an objective and then chooses its actions to pursue that objective. There are lots of programs that do that, and there's no reason why the learned model, this neural network or whatever, could not also implement that kind of algorithm, could not itself be an optimizer.

There's an interesting comparison here with evolution, because the gradient descent process is similar to evolution in a way, right? They're both hill-climbing optimization processes: they both optimize something by repeatedly evaluating its performance and making small tweaks. Evolution usually produces quite cognitively simple systems that just use heuristics set by evolution. Think about something like a plant. It has a few heuristics that it uses to decide which direction to grow, or where to put its roots out, or when to open its buds, or whatever. The decisions it makes all just follow simple rules designed by evolution. But evolution can also produce optimizers: things like intelligent animals, things like humans. We have brains, we can learn, we can make predictions, and we have objectives, so we make plans to pursue our objectives. We are optimizers.

Okay, imagine you're training a neural network to solve a maze. What you'll probably get, especially if your network is small or you don't train it for very long, is something that's a bit like a plant in this analogy: a collection of heuristics. Simple rules like "try to go down and to the right", let's say, because the exits are always in the bottom right in your training set, or "try to avoid going to places you've already been", that kind of thing. The model, the neural network, implements some set of heuristics that result in behavior that tends to solve the maze. But there's no reason why, with more training or a larger model, you couldn't end up with a network which is actually an optimizer: a network which is configured to actually implement a search algorithm, something like A* or Dijkstra's algorithm, which is actually planning ahead, finding the best path systematically, and going down that path. This is more like an intelligent animal or a human: it doesn't just implement heuristics, it plans, it searches, it optimizes. And this is certainly possible, because neural networks are able to approximate arbitrary functions, that's proven. We know that evolution is able to eventually find configurations of DNA that result in brains that optimize, and we would expect gradient descent to be able to find configurations of network weights that are doing the same kind of thing.
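Here's a minimal sketch of that contrast (hypothetical code, using breadth-first search rather than A* or Dijkstra for brevity): the heuristic policy just follows a local rule, while the search-based solver actually plans a route.

```python
from collections import deque

# Toy maze: 0 = open cell, 1 = wall; start is top-left, exit is bottom-right.
MAZE = [
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
START, EXIT = (0, 0), (3, 3)

def heuristic_policy(pos):
    """'Plant-like' rule: prefer moving down, then right, otherwise give up."""
    r, c = pos
    for nr, nc in [(r + 1, c), (r, c + 1)]:
        if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) and MAZE[nr][nc] == 0:
            return (nr, nc)
    return None  # stuck: the heuristic has no plan B

def bfs_path(start, goal):
    """'Optimizer-like' behaviour: systematically search for a full path."""
    queue, seen = deque([(start, [start])]), {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for nr, nc in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            if (0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0])
                    and MAZE[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None

print(bfs_path(START, EXIT))    # finds a complete route by planning ahead
print(heuristic_policy(START))  # just takes one locally-plausible step
```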
And of course, gradient descent would "want" to produce an optimizer like that, because optimizers perform really, really well. Something which is actually modeling its environment and planning ahead, and, you know, thinking, for want of a better word, that's doing search over its action space, is going to outperform something that's just following simple heuristics. Animals have a lot of advantages over plants, not least of which being that we're more adaptable: we can learn complex behaviors that allow us to do well across a wide range of environments. So, especially when the task we're training for is complex and varied, gradient descent is going to want to produce optimizers if possible.

And this is a problem, because when your model is also an optimizer, it has its own objective. You see what's happened here? You had an alignment problem, you tried to apply the standard approach of machine learning, and now you have two alignment problems. You've got the problem of making sure that your human objective ends up in the first optimizer, and then you furthermore have the problem of making sure that that objective ends up in the second, learned optimizer. So you have two opportunities for the objective to get messed up.

This gets pretty confusing to talk about, so let's introduce some terminology from the paper. To distinguish between these two optimizers, we'll call the first one, the one that's like gradient descent, the base optimizer, and its objective is the base objective. Then the second optimizer, which is the model, the neural network that's learned how to be an optimizer, that's the mesa-optimizer, and its objective is the mesa-objective. Why "mesa"? Well, mesa is the opposite of meta: meta is like "above", mesa is "below". Think of it this way: metadata is data about data, metamathematics is the mathematics of how mathematics works, so just as a meta-optimizer would be an optimizer that optimizes an optimizer, a mesa-optimizer is an optimizer that is optimized by an optimizer. Is that all clear as mud? Okay, good. Whatever: this learned model is the mesa-optimizer, and its objective is the mesa-objective. The alignment problem is about making sure that whatever objective ends up determining the actions of your AI system is aligned with your objective, but you can see that here it's really two alignment problems: aligning the base optimizer with the human, which we call the outer alignment problem, and aligning the mesa-optimizer with the base optimizer, which is the inner alignment problem. Okay, are we clear on that? Base optimizer, mesa-optimizer, outer alignment, inner alignment. Cool.

So how does this inner alignment problem play out? There's a common abstraction that people use when training machine learning systems, which is that the system is trying to optimize the objective that it's trained on. That's usually a good enough abstraction, but it's not strictly true. You're not really selecting models that want to do X; you're selecting models that, in practice, actually do X in the training environment. One way that can happen is by the model wanting to do X, but there are other possibilities, and actually those other possibilities are kind of the default situation. If you look at evolution again, the objective that it's optimizing for, if you think of it as an optimizer, is something like "make as many copies of your DNA as possible". But that's not what animals are trying to do; that's not what they care about. Their objectives don't refer to things like DNA; they refer to things like pleasure and pain, like food and sex and safety. The objective of the optimization process that created animals is nowhere to be found in the objectives of the animals themselves.
Animals don't care about making copies of their DNA; they don't even know what DNA is. Even humans, those of us who do understand what DNA is, we don't care about it either. We're not structuring our lives around trying to have as many descendants as possible, evaluating every decision we ever make based on how it affects our inclusive genetic fitness. We don't actually care about the objective of the optimization process that created us. We are mesa-optimizers, and we pursue our mesa-objectives without caring about the base objective. We achieve evolution's objective, to the extent that we do, not because we care about it and are pursuing it, but because pursuing our own objectives tends to also achieve evolution's objective, at least in the environment in which we evolved. But if our objectives disagree with evolution's, we go with our own every time. The same is true of trained machine learning models that are optimizers: they achieve the base objective we give them, to the extent that they do, not because they're pursuing the base objective, but because pursuing their own mesa-objectives tends to achieve the base objective, at least in the environment in which they were trained. But if their mesa-objectives disagree with the base objective, they'll go with their own every time.

Why would that actually happen, though? When would the two objectives disagree? Well, one reason is distributional shift, which we talked about in an earlier video. Distributional shift is what happens when the environment the agent is in is different, in an important way, from the environment it was trained in. Going back to our maze example: say you're training a neural net to solve a maze, and your training examples look like this. You have a whole bunch of these different mazes; the objective is to get to the end of the maze. There are also apples in the maze, but they're just decoration; the objective is just to get to the exit, this green symbol in the bottom right. So you train your system on these mazes, and then you deploy it in the real world, and the real maze looks like this. So you have some distributional shift: various things have changed between training and deployment. Everything is different colors, it's a bigger maze, there's more stuff going on.

Three different things could happen here. The first thing, the thing we hope is going to happen, is that the system just generalizes. It's able to figure out: "Oh okay, these are apples, I recognize these, I know they don't matter. I can tell that this is the exit, so that's where I'm going. It's a bigger maze, but I've developed a good way of figuring out how to get through mazes during my training process, so I can do it." It just makes it through the maze and everything's good. That's one possibility. Another possibility is that it could completely fail to generalize. This is the kind of thing that's more likely to happen if you have something that's just a collection of heuristics rather than a mesa-optimizer. It might just freak out: "Everything is different, I don't recognize anything, this maze is too big, uh, what do I do?" It might completely lose the ability to even, and just flail around and do nothing of any consequence. But there's a third possibility, which is more likely if it's an optimizer: it might have developed competent maze-solving abilities, but with the wrong objective. Here in our training environment the base objective is to get to the exit, but suppose we have a competent mesa-optimizer: it's learned how to get wherever it wants in the maze, and its mesa-objective is "go to the green thing". In the training environment the exit is always green, and anything green is always the exit, so the behavior of a mesa-optimizer that's trying to go for the green thing is absolutely identical to the behavior of a mesa-optimizer that's trying to go to the exit. There's no way for gradient descent to distinguish between them. But then when you deploy the thing, what it does is go to the apples and ignore the exit, because the exit is now gray and the apples now happen to be green.
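A toy sketch of how that plays out (hypothetical code, not from the video): a "go to the green thing" objective and a "go to the exit" objective pick the same target on every training maze, and only come apart after the shift.

```python
# Each object in a maze is (name, colour, position) -- a made-up representation.

def pick_target_by_colour(objects, colour="green"):
    """The mesa-objective the model actually learned."""
    return next(obj for obj in objects if obj[1] == colour)

def pick_exit(objects):
    """The base objective we actually wanted."""
    return next(obj for obj in objects if obj[0] == "exit")

training_maze = [("exit", "green", (9, 9)), ("apple", "red", (3, 4))]
deployed_maze = [("exit", "gray", (17, 20)), ("apple", "green", (5, 2))]

# In training the two objectives pick the same target, so gradient descent
# can't tell them apart:
assert pick_target_by_colour(training_maze) == pick_exit(training_maze)

# After distributional shift they come apart:
print(pick_target_by_colour(deployed_maze))  # ('apple', 'green', (5, 2))
print(pick_exit(deployed_maze))              # ('exit', 'gray', (17, 20))
```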
This is pretty concerning. I mean, obviously in this example it doesn't matter, but in principle this is very concerning, because you have a system which is very capable at getting what it wants, but it's learned to want the wrong thing. And this can happen even if your base objective is perfect. Even if we manage to perfectly specify what we want and encode it into the base objective, because of the structure of the training data, and how that's different from the deployment distribution of data, the mesa-optimizer learned the wrong objective and was badly misaligned, even though we gave the AI system the correct objective. We solved the outer alignment problem, but we got screwed by the inner alignment problem.

Now, this is in a sense not really a new problem. As I said, this is basically just the problem of distributional shift, which I talked about in a previous video: when there's a difference between the training distribution and the deployment distribution, AI systems can have problems. But the point is that mesa-optimizers make this problem much, much worse. Why is that? Well, if you have distributional shift, the obvious thing to do is something called adversarial training. Adversarial training is a way of training machine learning systems which involves focusing on the system's weaknesses. If you have some process which is genuinely doing its best to make the network give as high an error as possible, that will produce this effect where, if it spots any weakness, it will focus on that, and thereby force the learner to not have that weakness anymore. So you have a process that creates mazes for training, and it's trying to make mazes that the model has a hard time solving. If you do this right, your adversarial training system will have enough degrees of freedom that the model won't be able to get away with being misaligned, with going after green things instead of the exit, because at some point the adversarial training system will try generating mazes that have green apples or green walls or whatever. The model would then pursue its mesa-objective, go for the green things instead of the exit, get a poor score on the base objective, and then gradient descent would tweak the model to improve its base objective performance, which is likely to involve tweaking the mesa-objective to be better aligned with the base objective. If you do this for long enough, and you have a good enough adversarial training process, eventually the model is going to do very well at the base objective across the whole range of possible environments, and in order to do that, the model must have acquired a really good understanding of the base objective.
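Here's a very rough sketch of that adversarial training loop (hypothetical and heavily simplified; the "gradient update" is just a cartoon stand-in for retraining):

```python
from itertools import product

# The adversary sweeps over maze variations -- here, just colourings -- looking
# for ones that expose the "go to the green thing" mesa-objective.

COLOURS = ["green", "red", "blue", "gray"]

def model_goes_to_exit(maze, prefers_colour):
    # A colour-seeking model gets lured to the apple whenever the apple,
    # but not the exit, has its preferred colour.
    if maze["apple"] == prefers_colour != maze["exit"]:
        return False   # fails the base objective on this maze
    return True

prefers = "green"                       # the misaligned mesa-objective
for exit_c, apple_c in product(COLOURS, COLOURS):
    maze = {"exit": exit_c, "apple": apple_c}
    if not model_goes_to_exit(maze, prefers):
        # Poor score on the base objective; gradient descent would tweak the
        # model here. We cartoonishly represent that as dropping the heuristic.
        prefers = None

print(prefers)   # None: the colour shortcut got trained away by the adversary
```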
Problem solved, right? Well, no. The model understands the base objective, but that doesn't mean that it has adopted the base objective as its own. Suppose we have an advanced AI system in training. It's a mesa-optimizer being trained on a large, rich dataset, something like GPT-3's training dataset, a giant pile of internet data. There are two different ways it can get information about the base objective. One is through gradient descent: it keeps doing things, just following its mesa-objective, trying different things, and then after each episode gradient descent modifies it a little bit, and that modifies its objective, until eventually the mesa-objective comes to exactly represent the base objective. But another thing that could happen, since it's being trained on a very rich dataset, is that it could use its training data: it can get information from the dataset about what the base objective is.

Let's suppose again that we've somehow solved the outer alignment problem: we've somehow figured out a way to have the base objective exactly represent everything that humans care about. So we're training this AGI. It's not done learning yet, but it's managed to pick up some very basic idea of what it gets rewards for, so the mesa-objective is a very rough approximation of human values, which would be disastrous to actually optimize in the real world. But that's okay, it's still training. And as it's training on this huge internet dataset, it finds the Wikipedia page on ethics. So the system thinks, to anthropomorphize horribly: "Hmm, this looks actually very similar to the objective, but with a lot more detail. Maybe I can use this." And this is exactly the kind of thing that gradient descent would want to do, because the system is already acquiring an understanding of the world. It's already building a world model for the purpose of its capabilities, so it already has a sense of what human values are, just by observing the data and learning about the world. So all gradient descent has to do is modify the mesa-objective to point to that existing understanding, to just have a pointer to that part of the world model. That's way more efficient. Rather than waiting for the agent to try doing things that go against human values, tweaking it, running it again, waiting for another mistake, tweaking it again, and so on until you've pinned down the whole of human values, you just tweak the mesa-objective to point at the system's existing understanding of the base objective.

This is already how evolution tends to do this kind of thing. When a duckling hatches from its egg, it imprints on the first living thing it sees. Evolution could specify in detail everything about what constitutes a mother duck and encode all of that into the duckling's brain genetically, so it knows what its mother is when it hatches, but it's much simpler to just say: "Look, the first thing you see is your mother." The duck is going to be building a concept of its mother in its brain anyway, as part of its regular life, so why not just point to that? So this kind of thing is much easier and more efficient for gradient descent, but it also works better, because as you learn more, you update your world model. So if your objective is a pointer to part of your world model, then whenever you learn about the world, you're updating your objective to be more accurate as well. If you later come across one of the books that the Wikipedia article on ethics cites, you can add that to your understanding of the objective, and that means you're going to outperform something that has to make mistakes, wait for gradient descent to update it, and then keep trying. You're not just relying on the optimization power of the base optimizer; you can actually use your own optimization power to learn about your objective. So clearly it's better to learn about the base objective than to internalize it through gradient descent.
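A toy sketch of the "pointer" idea (hypothetical names throughout): a frozen, hand-internalized objective versus one that simply defers to the agent's current world model, and therefore improves whenever the world model does.

```python
class WorldModel:
    """The agent's evolving understanding of what humans value."""
    def __init__(self):
        self.valued = {"help people"}          # crude starting concept

    def read(self, source):
        # Learning about the world (say, the Wikipedia page on ethics)
        # enriches the concept, purely as part of building capabilities.
        self.valued |= set(source)

FROZEN_SPEC = {"help people"}                  # whatever gradient descent had pinned down so far

def frozen_objective(action):
    return action in FROZEN_SPEC

def pointer_objective(action, world_model):
    # The objective defers to the current world model, so every improvement
    # to the world model improves the objective "for free".
    return action in world_model.valued

wm = WorldModel()
wm.read({"keep promises", "avoid harm"})       # picked up from the training data

print(frozen_objective("keep promises"))       # False -- the frozen spec misses it
print(pointer_objective("keep promises", wm))  # True  -- the pointer keeps up
```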
But there are actually two different ways that this could be implemented. One is what the paper calls "corrigibly", which is the right way: the mesa-objective has a pointer that points to part of the world model, and the system just wants to pursue that as a terminal goal. The other way this could happen is deceptively, where the mesa-objective is different from the base objective, but the mesa-optimizer pursues the base objective anyway. See, all we know is that by the end of the adversarial training process, the mesa-optimizer is pursuing the base objective; we don't know whether that's a terminal goal or an instrumental one. Your terminal goals are the things that you want just because you want them; you don't have a particular reason to want them, they're just what you want. Instrumental goals are the goals you want because they'll get you closer to your terminal goals.

Okay, but why would the model do that? Why would pursuing the base objective help, if what you care about is the mesa-objective? Well, consider this extremely toy example. Imagine you're the green-seeking mesa-optimizer from before, and each of these mazes represents one episode. The episode starts and you have two choices: A is your mesa-objective's desired state, which is going to the apple, because it's green, and B is the base objective's desired state, which is going to the exit of the maze. You have two episodes of training, and after that you're deployed in the real world and you get three episodes of deployment. So you have to consider: what's the best strategy, if your mesa-objective is that you want to go to green apples as much as possible? One strategy is to just follow your mesa-objective: see the apple, go to the apple. Okay, you get one apple, but then after that episode is done, gradient descent will modify you, because you performed poorly according to the base objective. It will modify you to actually have the correct objective; it will change your mesa-objective to be the same as the base objective, and from that point on you'll always go to B, because that's what your objective now says. So if that's your strategy, you can go to the apple successfully one time. On the other hand, suppose instead you go to B: you pretend that you're aligned with the base objective. You do that in episode one, okay, you've missed one of the apples. You do it again in episode two, you've missed another apple, but now you're deployed. Gradient descent will no longer change you, because you're not in training anymore, you're in the real world, and now you can go to the apple every time, so you get the next three apples. That's the optimal strategy: go B, B, A, A, A. That gets you three apples, whereas trying to go to A every time means you only actually get one apple, because after that first one you get modified and miss all the rest.
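The arithmetic of that toy example, as a quick sketch (hypothetical code, just reproducing the episode counts described above):

```python
# Two training episodes, then three deployment episodes.
# "A" = go to the apple (mesa-objective), "B" = go to the exit (base objective).
# If the model ever plays A during training, gradient descent rewrites its
# objective and it plays B forever after.

TRAIN_EPISODES, DEPLOY_EPISODES = 2, 3

def apples_collected(plan):
    apples, modified = 0, False
    for episode, choice in enumerate(plan):
        in_training = episode < TRAIN_EPISODES
        if modified:
            choice = "B"                  # objective was overwritten
        if choice == "A":
            apples += 1
            if in_training:
                modified = True           # caught pursuing the wrong goal
    return apples

honest = ["A"] * (TRAIN_EPISODES + DEPLOY_EPISODES)           # just follow the mesa-objective
deceptive = ["B"] * TRAIN_EPISODES + ["A"] * DEPLOY_EPISODES  # play along, then defect

print(apples_collected(honest))     # 1 -- one apple, then the model gets modified
print(apples_collected(deceptive))  # 3 -- pretend to be aligned, defect after deployment
```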
This is the same thing we talked about in the instrumental convergence video: most agents with most goals, again, if they're sophisticated enough to realize that it's a possibility, will try to protect their goals from being modified, because if you get new goals, you'll stop pursuing your current goals, so you're unlikely to achieve your current goals. There's an incentive to avoid being modified, to preserve your objective function. And that's kind of unsettling: for a misaligned mesa-optimizer, the optimal behavior is deception. Even if we get the outer alignment problem perfectly solved, and we manage to specify exactly what we want as the base objective, the optimal strategy for a mesa-optimizer is to pretend to be aligned during the training process and then turn on us once it's deployed. So we might have real problems with deceptive, misaligned mesa-optimizers; just solving the outer alignment problem might not be enough.

I want to end the video with a big thank you to all of my wonderful patrons, all of these people here. In this video I'm especially thanking David Reed. Thanks so much for your support, and for your guidance on building the community, which I think is going really well, by the way. There's a bunch of us on Discord now having some really interesting discussions, and do look out for Stampy answering your questions in the YouTube comments. I plan to open the Discord up to more people pretty soon, so if you want to be on the waitlist for that, just put your email in the Google form in the description. Also in the description there's a link to a survey run by the organization AI Safety Support; they want to hear from anyone who's thinking about considering the possibility of maybe working on AI safety. If that sounds like you, again, the link is in the description, check it out. Thanks again for your support, your feedback, your questions, and just thank you all for watching. I'll see you next time.
Info
Channel: Robert Miles
Views: 78,988
Rating: 4.9758883 out of 5
Id: bJLcIBixGj8
Length: 23min 23sec (1403 seconds)
Published: Tue Feb 16 2021