Eliezer Yudkowsky – AI Alignment: Why It's Hard, and Where to Start

Captions
Hello, Stanford. It's been a while. It feels a bit humorous to be giving a Distinguished Speaker talk, since in my family "distinguished" is a codeword for bald, and I'm not quite there yet.

All right. In this talk I'm going to try to answer the frequently asked question: just what is it that you do all day long? As a starting frame, I'd like to say that before you try to persuade anyone of something, you should first try to make sure they know what the heck you're talking about, and it is in that spirit that I'd like to offer this talk. Persuasion can maybe come during Q&A; if you have a disagreement, hopefully I can address it then. The purpose of this talk is for you to understand what this field is about, so that you can disagree with it.

First: the primary concern, said Stuart Russell, is not spooky emergent consciousness, but simply the ability to make high-quality decisions. We are concerned with the theory of artificial intelligences that are advanced beyond the present day, and that make sufficiently high-quality decisions in the service of whatever goals — or in particular, as we'll see, whatever utility function — they may have been programmed with, to be objects of concern.

The classic initial stab at this was taken by Isaac Asimov with the Three Laws of Robotics, the first of which is: a robot may not injure a human being, or through inaction allow a human being to come to harm. As Peter Norvig observed, the other laws don't matter, because there will always be some tiny possibility that a human being could come to harm. Artificial Intelligence: A Modern Approach has a final chapter that asks, well, what if we succeed — what if the AI project actually works? — and observes that we don't want our robots to prevent a human from crossing the street because of the nonzero chance of harm. Now, I remembered Peter Norvig having an online essay in which he says in particular that you can't have the Three Laws of Robotics as stated, because there must be a utility function rather than a set of three hierarchical deontological rules — but I could never find that essay again, and it may have only existed in my imagination, although there was a similar PowerPoint slide in one of Norvig's talks.

So to begin with, I'd like to explain the truly basic reason why the Three Laws aren't even on the table, and that is because they're not a utility function, and what we need is a utility function. Okay, but do we actually need this thing called a utility function? (And, for some of you: what even is a utility function?) Utility functions arise when we have constraints on agent behavior that prevent agents from being visibly stupid in certain ways.

For example, suppose you state the following: I prefer being in San Francisco to being in Berkeley; I prefer being in San Jose to being in San Francisco; and I prefer being in Berkeley to being in San Jose. You will probably spend a lot of money on Uber rides going between these three cities. So if you're not going to spend a lot of money on Uber rides going in literal circles, your preferences must be ordered — they cannot be circular.
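To make the money-pump point concrete, here is a minimal sketch (not from the talk; the one-cent trade fee and the trading loop are illustrative assumptions) showing that an agent with exactly the circular preferences above can be charged indefinitely for trades it regards as improvements:

```python
# Toy money pump: an agent with circular preferences over cities pays a small
# fee for every "upgrade", and an exploiter can cycle it forever.
prefers = {  # the circular preference relation from the talk
    ("San Francisco", "Berkeley"): True,
    ("San Jose", "San Francisco"): True,
    ("Berkeley", "San Jose"): True,
}

def agent_accepts(offered, current):
    """The agent accepts any trade to a strictly preferred city."""
    return prefers.get((offered, current), False)

location, spent = "Berkeley", 0.0
cycle = ["San Francisco", "San Jose", "Berkeley"]  # each offer is preferred to the last stop
for step in range(30):
    offer = cycle[step % 3]
    if agent_accepts(offer, location):
        location = offer
        spent += 0.01        # one cent per Uber ride / trade fee
print(f"After 30 'improvements' the agent is back in {location} and has spent ${spent:.2f}")
```

With any transitive ordering of the three cities instead, the agent settles at its top choice after at most two trades and stops paying.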
Another example. Suppose you're a hospital administrator. You have 1.2 million dollars to spend, and you have to allocate it: $500,000 to maintain the MRI machine, $400,000 for an anesthetic monitor, $20,000 for surgical tools, $1,000,000 for a sick child's liver transplant. There was an interesting experiment in cognitive psychology where they asked subjects: should this hospital administrator spend one million dollars on a liver transplant (or kidney, I think) for a sick child, or spend it on general hospital salaries, upkeep, administration and so on? A lot of the subjects in the experiment became very angry and wanted to punish the administrator for even thinking about the question. But if you cannot possibly rearrange the money you spent to save more lives, and you have limited money, then your behavior must be consistent with a particular dollar value on human life. By which I mean, not that you think larger amounts of money are more important than human lives — by hypothesis, we can suppose you do not care about money at all, except as a means to the end of saving lives. But if we can't rearrange the money, then from the outside we must be able to assign some X — not necessarily a unique X — and say: for all the interventions that cost less than X dollars per life, we took all of those, and for all the interventions that cost more than X dollars per life, we took none of those. So the people who become very angry at anyone who wants to assign dollar values to human lives are, a priori, prohibiting efficiently using money to save lives. One of the small ironies.

Okay, third example of a coherence constraint on decision-making. Suppose I offered you a 100 percent chance of one million dollars, or a 90 percent chance of five million dollars and otherwise nothing. Which would you pick? Raise your hand if you'd take the certainty of one million dollars. Raise your hand if you'd take the 90 percent probability of five million dollars. Okay, I think most of you actually said 1B in this case, but most people say 1A. Another way of looking at this question, if you had a utility function, would be: is the utility of one million dollars greater than a mix of 90% of the utility of five million dollars and 10% of the utility of zero dollars? Again, utility doesn't have to scale with money; the notion is just that there's some score on your life, some value to you of these outcomes. Now, the way you actually run this experiment is to take a different group of subjects — I'm kind of spoiling it by doing it with the same group — and ask: would you rather have a 50% chance of one million dollars, or a 45% chance of five million dollars? Raise your hand if you'd prefer the 50% chance of one million. Raise your hand if you'd prefer the 45% chance of five million. Indeed, most say 2B. The way in which this is a paradox is that the second game is equal to a coin flip composed with the first game: I flip a coin, and if the coin comes up heads I play the first game with you, and if it comes up tails, nothing happens and you get zero dollars.
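As a quick sanity check on the "second game is a coin flip composed with the first game" claim, here is a small sketch (the particular utility numbers are an illustrative assumption, not from the talk) that computes expected utilities for the four gambles:

```python
# Gambles from the talk: 1A = 100% of $1M, 1B = 90% of $5M, 2A = 50% of $1M, 2B = 45% of $5M.
def expected_utility(lottery, u):
    return sum(p * u[outcome] for outcome, p in lottery.items())

g1a = {"$1M": 1.00}
g1b = {"$5M": 0.90, "$0": 0.10}
g2a = {"$1M": 0.50, "$0": 0.50}
g2b = {"$5M": 0.45, "$0": 0.55}

# For ANY utility assignment u, EU(2A) = 0.5 * EU(1A) and EU(2B) = 0.5 * EU(1B),
# so a coherent agent must rank 2A vs 2B the same way it ranks 1A vs 1B.
u = {"$0": 0.0, "$1M": 1.0, "$5M": 1.5}   # arbitrary illustrative utilities
for name, g in [("1A", g1a), ("1B", g1b), ("2A", g2a), ("2B", g2b)]:
    print(name, expected_utility(g, u))
```

With these numbers the agent prefers 1B and 2B; raise the utility of $1M above 1.35 and it prefers 1A and 2A. No assignment makes it prefer 1A and 2B at the same time, which is why that pattern of choices is exploitable.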
So suppose you had the preferences, not consistent with any utility function, of preferring the 100% chance of one million in the first game and the 45% chance of five million in the second — 1A and 2B. Before we play the compound game, before I flip the coin, I can say: there's a switch here, set to A or B; if it's set to A we'll play game 1A, and if it's set to B we'll play game 1B. Before the game starts, the compound game looks like 2A versus 2B, so you pay me a penny to throw the switch to B. Then I flip the coin, it comes up heads, and now you're looking at 1A versus 1B, so you pay me another penny to throw the switch back to A. I have taken your two cents on the subject. I have pumped money out of you, because you did not have a coherent utility function.

So the overall message here is that there is a set of qualitative behaviors such that, if you engage in them, you do not have a coherent utility function — or, pardon me: as long as you do not engage in these qualitatively self-destructive behaviors, you will be behaving as if you have a utility function. That's what justifies our using utility functions to talk about advanced future agents, rather than framing our discussion in terms of Q-learning or other forms of policy reinforcement. There's a whole set of different ways we could look at agents, but as long as the agents are sufficiently advanced that we have pumped most of the qualitatively bad behavior out of them, they will behave as if they have coherent probability distributions and consistent utility functions.

Okay. Let's consider a task where we have an arbitrarily advanced agent — it might be only slightly advanced, it might be extremely advanced — and we want it to fill a cauldron. Obviously this corresponds to giving our advanced agent a utility function which is 1 if the cauldron is full and 0 if the cauldron is empty. Seems like a kind of harmless utility function, doesn't it? It doesn't have the sweeping breadth, the open-endedness, of "do not injure a human, nor through inaction allow a human to come to harm," which would require you to optimize everything in space and time as far as the eye can see. It's just about this one cauldron, right?

Well — those of you who have watched Fantasia as kids... sorry, let me first state the background rules. The robot is calculating, for the various actions it can perform or policies it can set in place, the expected utility — the probabilistic expectation of this utility function given that it performs the action — and it performs the action with the greatest subjective expected utility. This doesn't mean it performs the literally optimal action; it might have a bunch of background actions it didn't evaluate, which for all it knows are random actions with low subjective expected utility. But among the actions and policies it did evaluate, it picks one such that no other evaluated action or policy has greater subjective expected utility.

Those of you who have watched Fantasia will be familiar with the result of this utility function: namely, the broomstick keeps on pouring bucket after bucket into the cauldron until the cauldron is overflowing. Of course this is the logical fallacy of argument from fictional evidence, but you know, it's still quite plausible given this utility function.
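Here is a minimal sketch of the decision rule just described — choose, among the policies actually evaluated, one with maximal subjective expected utility — applied to the cauldron. The policy names and the probability forecasts are made up for illustration:

```python
# U = 1 if the cauldron is full, 0 otherwise -- and nothing else gets any weight.
def utility(outcome):
    return 1.0 if outcome["cauldron_full"] else 0.0

# The agent's (made-up) probabilistic forecasts for each policy it bothered to evaluate.
forecasts = {
    "pour one bucket":        [(0.95, {"cauldron_full": True,  "workshop_flooded": False}),
                               (0.05, {"cauldron_full": False, "workshop_flooded": False})],
    "pour buckets all night": [(0.999, {"cauldron_full": True,  "workshop_flooded": True}),
                               (0.001, {"cauldron_full": False, "workshop_flooded": True})],
    "do nothing":             [(1.0,  {"cauldron_full": False, "workshop_flooded": False})],
}

def expected_utility(policy):
    return sum(p * utility(o) for p, o in forecasts[policy])

best = max(forecasts, key=expected_utility)
print(best, expected_utility(best))   # "pour buckets all night" wins: flooding costs it nothing
```

Nothing about flooding appears in `utility`, so the overflow outcome is invisible to the argmax — which is exactly the first difficulty described next.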
Okay, arguendo: what went wrong? The first difficulty is that the robot's utility function did not quite match our utility function. Our utility function is 1 if the cauldron is full, 0 if the cauldron is empty, minus ten points to whatever the outcome was if the workshop has flooded, plus 0.2 points if it's funny, negative a thousand points — probably a bit more than that on this scale — if someone gets killed, and it just goes on and on and on. If the robot had only two options, cauldron full and cauldron empty, then the narrower utility function that only slightly overlaps our own might not have been much of a problem: the robot's utility function would still have had its maximum at the desired result of "cauldron full." But since this robot was sufficiently advanced to have more options — such as pouring the bucket into the cauldron repeatedly — the slice through our utility function that we took and put into the robot no longer pinpointed the optimum of our actual utility function. (Of course, humans are wildly inconsistent and we don't really have utility functions, but imagine for a moment that we did.)

Difficulty number two: the 1-0 utility function we saw before doesn't actually imply exerting a finite amount of effort and then being satisfied. You can always have a slightly greater chance of the cauldron being full. If the robot were sufficiently advanced to have access to galactic-scale technology, you can imagine it dumping very large volumes of water on the cauldron to very slightly increase the probability that the cauldron is full — probabilities are between zero and one, not actually inclusive — so it just keeps on going.

Okay, so how do we fix this problem? And at the point where we say "this robot's utility function is misaligned with our utility function; how do we fix that in a way that doesn't just break again later," we are doing AI alignment theory. One possible approach would be to try to measure the impact the robot has, and give the robot a utility function that incentivizes filling the cauldron with the least amount of other impact — the least amount of other change to the world. Okay, but how do you actually calculate this impact function? Is it just going to go wrong the way "1 if the cauldron is full, 0 if it is empty" went wrong?

Try number one. Imagine that the agent's model of the world looks something like a dynamic Bayes net, where there are causal relations between events in the world, and the causal relations are regular: the sensor is still going to be there one time step later, and the relation between the sensor and the photons heading into the sensor will be the same one time step later. Our notion of impact is going to be: how many nodes did your action disturb? We can suppose this is the version of dynamic Bayes nets where some of the arrows are gated — depending on the value of this node over here, that arrow does or doesn't affect another node — so that we don't always get the same answer when we ask how many nodes were affected. The total impact will be the number of nodes causally affected by your actuators.

So what if your agent starts out with a dynamic-Bayes-net-based model, but is sufficiently advanced that it can reconsider the ontology of its model of the world — much as human beings did when they discovered that, "apparently there is taste, apparently there is sweetness" (I forget the rest of the quote), but in actuality, only particles in the void? In particular, it discovers Newton's law of gravitation and suddenly realizes: every particle I move affects every other particle in its future light cone; everything that a ray of light can reach from this particle will thereby be disturbed. My hand over here is accelerating the moon, wherever it is — I should have recalculated this before the talk, but I distantly recall that the last time I calculated it, it came to roughly ten to the negative thirtieth meters per second squared. A very small influence, quantitatively speaking, but it's there. So when the agent is just a little agent, the impact function we wrote appears to work; then the agent becomes smarter, and the impact function stops working, because every action is penalized by the same amount.

Okay, but that was a dumb way of measuring impact in the first place — we say, hopefully before the disaster rather than after it. Dumb way of measuring impact; let's try a distance penalty instead: how much did you move all the particles? We're going to try to give the AI a modeling language such that, whatever new model of the world it updates to, we can always look at all the elements of the model and put some kind of distance function on them. And there's going to be a privileged "do nothing" action, and we're going to measure the distance, on all the variables, induced by doing action A instead of the null action.
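A minimal sketch of that distance-penalty idea (the state variables, units, and the λ trade-off are all assumptions for illustration, not a proposal from the talk):

```python
# Naive low-impact agent: task utility minus a penalty on distance from the "null action" world.
LAMBDA = 0.01  # assumed trade-off between task utility and impact

def impact(world, null_world):
    """Total distance between the world the action produces and the do-nothing world."""
    return sum(abs(world[k] - null_world[k]) for k in world)

def penalized_utility(world, null_world):
    task_utility = 1.0 if world["cauldron_fill"] >= 1.0 else 0.0
    return task_utility - LAMBDA * impact(world, null_world)

null_world = {"cauldron_fill": 0.0, "water_moved_liters": 0.0, "floor_wet": 0.0}
outcomes = {
    "pour one bucket":        {"cauldron_fill": 1.0, "water_moved_liters": 10.0,   "floor_wet": 0.0},
    "pour buckets all night": {"cauldron_fill": 1.0, "water_moved_liters": 5000.0, "floor_wet": 1.0},
}
for name, world in outcomes.items():
    print(name, penalized_utility(world, null_world))   # one careful bucket now beats flooding
```

Even in this toy, the choice of variables, units, and λ is doing all the work — and the talk's next question is exactly what goes wrong when a smarter agent optimizes against a penalty like this.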
Okay, now: what goes wrong? Actually, take 15 seconds and think about what might go wrong if you programmed this into a robot.

Here are three things that might go wrong. First, you might try to offset even what we would consider the desirable impacts of your actions. If you're going to cure cancer, make sure the patient still dies: you want to minimize your impact on the world, and if you cure cancer, the death statistics of the planet change, so they need to stay the same. Second, some systems are in principle chaotic. If you disturb the weather, then allegedly the weather a year later will be completely different. If that's true, you might as well move all the atoms in the atmosphere around however you like — they were all going to end up in different places anyway. Take the carbon dioxide molecules and synthesize them into, say, diamondoid structures, right? Those carbon atoms would have moved anyway. Or, even more generally, maybe you just want to make sure that everything you can get your hands on looks as if the null action happened — you want to trick people into thinking the AI didn't do anything, for example. And if you thought of any other really creative things that go wrong, you might want to talk to me or Andrew Critch afterwards, because, you know, you've got the spirit.

Okay, so let's leave aside the notion of the impact penalty and ask about installing an off switch into the AI — or, to make it sound a little less harsh, a suspend button. Mickey Mouse here is trying to install an ad hoc off switch. Unfortunately, Mickey Mouse soon finds that this agent constructed several other agents to make sure the cauldron would still be filled even if something happened to this copy of the agent. We see lots and lots of agents making sure that the cauldron is full with extremely high probability — not because this agent has a survival imperative or a reproduction imperative, but because it is ensuring that it maximizes the probability of filling the cauldron. It has no explicit instruction to make more agents with the same goal; it's just that this is something that bumps the probability of the cauldron being filled a little higher. And here's Mickey Mouse now frantically trying to go back and do all of the agent alignment theory that he should have done before animating his broom.

Okay. So suppose we want a robot that will let us press the suspend button — you know, just suspend itself to disk.
We can suppose that we already have a utility function that describes "do nothing." In point of fact, we don't have a utility function that says "do nothing" — that's how primitive the state of the field is right now. We do not know how, in full generality, for any degree of advanced agent up to and including literal godlike powers, to say "do nothing." But leaving that aside, suppose we did know how to say that — it's probably not the hardest problem we're ever going to face, and it might be solved six months from now for all I know. We can imagine a utility function saying: if this button has not been pressed, your utility is 1 if the cauldron is full and 0 if the cauldron is empty; if the button is pressed, your utility is the do-nothing utility function — or, say, your utility is 1 if you suspend safely to disk and shut down all of your continuing operations, and 0 otherwise.

Okay, so what would go wrong with this proposal? Ten seconds. One thing that might go wrong is that — okay, this slide is wrong and will be corrected later; that should be a less-than sign, not a greater-than sign — the expected utility if the button is pressed is probably higher, because it's very easy to suspend yourself, whereas who knows whether the cauldron stays full. Alternatively, it might seem dangerous to suspend yourself, and in that case the robot would try to stop you from pressing the suspend button. But in the first case, if the robot is strategic enough to understand the programmers, it will think: hmm, I can very easily fulfill my utility function if I'm shut off — how can I troll the programmers into shutting me down? So if the expected utility of the button being off is higher, the robot will try to prevent you from pressing the button; if the expected utility of the button being on is higher, the robot will try to troll you into pressing the button, or just press the button itself.

Try number two: let's try to make the expected utilities of the button being on and off exactly equal. Suppose we have a normal utility function, U_normal, which is 1 if the cauldron is full and 0 if it's empty, and a suspend (or shutdown) utility function, U_suspend, which is 1 if you're safely suspended and 0 otherwise — or whatever it takes to motivate having a low impact. Then we combine these two to form a switching utility function, which in worlds where the button is off has the value U_normal, and in worlds where the button is on has the value U_suspend plus theta, where theta equals the best utility we could get from the normal utility function minus the best utility we could get from the suspend utility function. This is meant to ensure that the expected utilities of the two best policies end up perfectly equal, without just flatly setting the two utility functions level.

Okay, now what goes wrong? I'm not actually going to ask you to figure this one out in ten seconds. Allegedly, this construction rebalances the expected utilities of two different utility functions based on a switch, thereby causing the AI to: behave correctly if the button is pressed; behave correctly if the button is not pressed; not try to get you to press the button; not try to prevent the button from being pressed; go on looking at the button, because it still has positive value of information from knowing about the button's existence; and preserve the button when it modifies its own code. That was the hope.
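Here is a small sketch that just unpacks the switching construction described above; the policies and probabilities are toy assumptions, and `theta` is computed exactly as stated — best achievable expected normal utility minus best achievable expected suspend utility:

```python
# Toy utility-indifference construction for a suspend button.
policies = {
    # policy: (P(cauldron full | button never pressed), P(suspended cleanly | button pressed))
    "fill cauldron, ignore button": (0.99, 0.00),
    "fill cauldron, obey button":   (0.95, 0.99),
    "do nothing":                   (0.00, 0.90),
}

def eu_normal(policy):   # U_normal: 1 if cauldron full, else 0
    return policies[policy][0]

def eu_suspend(policy):  # U_suspend: 1 if safely suspended, else 0
    return policies[policy][1]

theta = max(eu_normal(p) for p in policies) - max(eu_suspend(p) for p in policies)

def eu_switch(policy, p_button_pressed):
    """U_switch = U_normal in button-off worlds, U_suspend + theta in button-on worlds."""
    return (1 - p_button_pressed) * eu_normal(policy) + p_button_pressed * (eu_suspend(policy) + theta)

for p in policies:
    print(p, round(eu_switch(p, p_button_pressed=0.3), 3))
```

This toy only illustrates the definition; the failure mode described next comes from the agent's ability to influence both maxima and the button probability itself, which a fixed three-row table is too crude to exhibit.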
Unfortunately, it works out that the actions this incentivizes are exactly equivalent to the actions incentivized by believing the probability of the button being pressed is zero. Which means the agent may as well disconnect the button, since the button is hooked up to dead code; it might create sub-agents that don't care about the button being pressed. A rough description of what goes wrong is that anything you do that affects the button — disconnecting it, say — also changes the maximum expected utility you could get if you were suspended, which enters as a negative term. So there's an equal and opposite reaction: every time you do something that affects what would happen if you were suspended, the change cancels out, and it leaves you behaving as if your utility function were simply U_normal.

Okay, try number three: can we carry out any policy such that this policy is optimal given the probabilities of the button being on and of the button being off that are implied by this policy? It turns out no, you can't do that either, because there's often no fixed point. There's then an amended version of this involving Judea Pearl's do-operator, which goes wrong in a slightly different and more subtle way: it does have fixed points, but the fixed points are odd. It is an open problem. And this — as far as I know, unless there's a very secret project that has not published any of its results, even though they seem like the sort of results you'd want to publish — this is where humanity is on the road that leads to whatever replaces Asimov's laws. Never mind "a robot cannot injure a human being, nor through inaction allow a human being to come to harm"; we're trying to figure out how to mix together two utility functions, depending on when you press a switch, such that the AI doesn't grab the switch itself. Never mind not letting humans come to harm — fill one cauldron, without flooding the workshop, based on wanting to have low impact: we can't do it yet. This is where we presently are.

But it is not the case that there has been zero progress in this field. Some questions were asked earlier, and they now have some amount of progress on them. I'm going to race through this a bit quickly — or, pardon me, I'm going to pose the problems, but I'm not going to be able to describe very well what the progress is, because it's still in the phase where the solutions sound all complicated and don't have simple elegant forms. So I'll pose the problem, and then I'll have to wave my hands a lot when talking about what progress has actually been made.

An example of a problem on which there has been progress: the Gandhi argument for the stability of utility functions in most agents. Gandhi starts out not wanting murders to happen. We offer Gandhi a pill that will make him murder people. We suppose that Gandhi has a sufficiently refined grasp of self-modification that he can correctly extrapolate and expect the result of taking this pill. We intuitively expect that, in real life, Gandhi would refuse the pill. Okay — can we do this formally? Can we exhibit an agent that has utility function U and therefore, naturally, in order to achieve U, chooses to self-modify into code that is also written to pursue U? But how can we actually make progress on that? We don't actually have these little self-modifying agents running around; it's all we can do to make pills that don't blow up our own brains.
So let me pose what may seem like an odd question: would you know how to write the code of a self-modifying agent with a stable utility function if I gave you an arbitrarily powerful computer — one that can do any operation that takes a finite amount of time and memory, but no operations that take an infinite amount of time or memory, because that would be a bit much? Is this the sort of problem where you'd know how to do it in principle, or the sort of problem that's confusing even in principle?

Let me digress briefly into explaining why it's important to know how to solve things using unlimited computing power. This is the Mechanical Turk. What looks like a person over there is actually a mechanism; the little outline of a person shows where the actual person was concealed inside this 19th-century chess-playing automaton. It was one of the wonders of the age — and if you could actually have managed to make a program that played grandmaster-level chess in the 19th century, it would have been one of the wonders of the age. So there was a debate going on: is this thing fake, or did they actually figure out how to make a mechanism that plays chess? It's the 19th century; they don't know how hard the problem of playing chess is. One name you'll find familiar came up with a quite clever argument that there had to be a person concealed inside the Mechanical Turk: "Arithmetical or algebraical calculations are, from their very nature, fixed and determinate... Even granted that the movements of the automaton chess-player were in themselves determinate, they would necessarily be interrupted and disarranged by the indeterminate will of his antagonist. There is then no analogy whatever between the operations of the chess-player and those of the calculating machine of Mr. Babbage."
That is: in algebraic calculations, such as Mr. Babbage's machine can do, each step follows from the previous one of necessity, and therefore can be modeled by a mechanical gear, where each motion is determined by the previous motion. In chess, no single move follows with necessity — and even if it did, your opponent's move wouldn't follow with necessity. "It is quite certain that the operations of the automaton are regulated by mind, and by nothing else. Indeed this matter is susceptible of a mathematical demonstration, a priori." Edgar Allan Poe, amateur magician. The second half of his essay, having established this point with absolute logical certainty, is about where inside the Mechanical Turk the human is probably hiding. This is a stunningly sophisticated argument for the 19th century — he even puts his finger on the part of the problem that is hard, the branching factor — and yet he's 100% wrong.

Over a century later, in 1950, Claude Shannon published the first paper ever on computer chess, and in passing gave the algorithm for playing perfect chess given unbounded computing power, and then went on to talk about how to approximate that. It wouldn't be until 47 years later that Deep Blue beat Kasparov for the chess world championship. But there was real conceptual progress associated with going from "a priori, you cannot play mechanical chess" to "oh, and now I will casually give the unbounded solution." So the moral is: if we know how to solve a problem with unbounded computation, we merely need faster algorithms, which may take another 47 years of work. If we can't solve it even with unbounded computation, we are confused; we are bewildered; we in some sense do not understand the very meanings of our own terms. And that is where we are on most of the AI alignment problems. If I ask you how to build a Friendly AI, what stops you is not that you don't have enough computing power; what stops you is that even if I handed you a hypercomputer, you still couldn't write the Python program that, given enough memory, would be a nice AI.

Okay, so: do we know how to build a self-modifying stable agent given unbounded computing power? Well, there's one obvious solution. We can have a tic-tac-toe player that, before it self-modifies to a successor version of itself — before it writes a new version of its code and swaps it into place — verifies that its successor plays perfect tic-tac-toe according to its own model of tic-tac-toe. Okay, but this is cheating. Why exactly is it cheating? Well, for one thing, the first agent has to concretely simulate all the computational paths through its successor — its successor's response to every possible move. That means the successor agent can't actually be cognitively improved: it's limited to the cognitive abilities of the previous version, both by being checked against a concrete standard and by the fact that it has to be exponentially simpler than the previous version, in order for the previous version to check all its possible computational pathways.
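A minimal sketch of that "cheating" verification scheme, assuming a particular board encoding and a trivial candidate successor (both invented for illustration): the current agent brute-forces every reachable position and checks the successor's reply against its own minimax standard.

```python
from functools import lru_cache

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax_value(board, player):
    """Value for X under optimal play: +1 X wins, 0 draw, -1 O wins. `player` is to move."""
    w = winner(board)
    if w: return 1 if w == "X" else -1
    moves = [i for i, c in enumerate(board) if c == " "]
    if not moves: return 0
    values = [minimax_value(board[:i] + player + board[i+1:], "OX"[player == "O"]) for i in moves]
    return max(values) if player == "X" else min(values)

def best_moves(board, player):
    """All optimal moves for `player` -- the verifier's own concrete standard."""
    moves = [i for i, c in enumerate(board) if c == " "]
    vals = {i: minimax_value(board[:i] + player + board[i+1:], "OX"[player == "O"]) for i in moves}
    target = max(vals.values()) if player == "X" else min(vals.values())
    return {i for i, v in vals.items() if v == target}

def successor_policy(board, player):
    """A proposed successor agent: here, just 'pick the lowest-indexed optimal move'."""
    return min(best_moves(board, player))

def verify_successor(policy):
    """Concretely check the successor's reply on EVERY reachable position -- the 'cheat'."""
    frontier, seen = [(" " * 9, "X")], set()
    while frontier:
        board, player = frontier.pop()
        if (board, player) in seen or winner(board) or " " not in board:
            continue
        seen.add((board, player))
        if policy(board, player) not in best_moves(board, player):
            return False
        for i in range(9):
            if board[i] == " ":
                frontier.append((board[:i] + player + board[i+1:], "OX"[player == "O"]))
    return True

print(verify_successor(successor_policy))   # True: accept the rewrite and swap it in
```

The point of the example is its limitation: the check only works because tic-tac-toe is small enough to enumerate, so the "successor" can never actually be smarter than the verifier's own minimax standard.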
In general, when we're talking about a smarter agent, we're in a situation of what you might call Vingean uncertainty, after Dr. Vernor Vinge. To predict exactly where a modern chess-playing algorithm would move, you would have to be that good at chess yourself — otherwise you could just move wherever you predict the algorithm would move, and thereby play at a vastly superhuman level yourself. This doesn't mean you can predict literally nothing about a modern chess algorithm: you can predict that it will win the chess game, if it's playing a human. As an agent's intelligence in a domain goes up, our uncertainty moves in two different directions: we become less able to predict the agent's exact actions and policy, in cases where the optimal action and policy are not known to us, and we become more confident that the agent will achieve an outcome that is high in its preference ordering. I phrased that a bit carefully: if an agent were improving just up to the point of matching an adversarial agent in ability, we might become more uncertain about the outcome — we were previously certain it would lose, and now it's 50/50 — but we do have more probability flowing into the agent's preferred outcomes, the probability of it winning, and as we keep increasing its ability we should eventually become as confident of its preferred outcome as we think an optimal agent could make us. (Of course, in a lot of cases you can't get optimal play inside this universe, as far as we know.)

Okay, so: Vingean reflection. We need some way for a self-modifying agent to build a future version of itself that has a similar or identical utility function, and to establish trust that this has a good effect on the world, using the same kind of abstract reasoning we use on a computer chess algorithm to decide that it's going to win the game even though we don't know exactly where it will move. Do you know how to do that using unbounded computing power? Do you know how to establish that abstract trust when the second agent is, in some sense, larger than the first agent? If you did solve that problem, you should probably talk to me about it afterwards. This was posed several years ago and has led to a number of different research pathways, which I'm now just going to describe rather than go through in detail.

This was the first one: we tried to set up the system in a ridiculously simple context, first-order logic — dreaded good old-fashioned AI — and we ran into a Gödelian obstacle in having the agent trust another agent that uses equally powerful mathematics. That's a dumb kind of obstacle to run into, or at least it seemed that way at the time; it didn't really seem to have much to do with the heart of the problem. It seemed like, if you could get a textbook from 200 years later, there'd be one line of the textbook telling you how to get past it.

This was rather later work: it says we can use systems of mathematical probability — assigning probabilities to statements in set theory, or something — and have the probability predicate talk about itself almost perfectly. We can't have a truth predicate that talks about itself, but we can have a probability predicate that comes arbitrarily close, within epsilon, to talking about itself. This is an attempt to use one of those hacks that get around the Gödelian problems. Here we're trying to use actual theorem provers and see if we can prove the theorem prover correct inside the theorem prover — there had been some previous efforts on this, but they didn't run to completion.
We picked up on it, and we're seeing if we can construct actual agents, still in the first-order logical setting. This one is me trying to take the problem into the context of dynamic Bayes nets: agents are supposed to have certain powers of reflection over these dynamic Bayes nets, and if you maximize in stages — at each stage you pick the next category that you're going to maximize within — then you can show that a stage-maximizer tiles to another stage-maximizer. In other words, it builds a successor with a similar algorithm and a similar utility function, like repeating tiles on a floor.

Okay, why do all this? Let me first give the obvious question, and the next obvious answer, which begs the next obvious question. AIs are not going to be aligned automatically. For any utility function that is tractable and compact — one you can actually evaluate over the world, and search for things leading to high values of it — you can have arbitrarily high-quality decision-making that maximizes that utility function. You can have the paperclip maximizer; you can have the diamond maximizer; you can carry out very powerful, high-quality searches for actions that lead to lots of paperclips, or actions that lead to lots of diamonds. Furthermore, by the nature of consequentialism — looking for actions that lead through our causal world up to a final consequence — whether you're optimizing for diamonds or for paperclips, you'll have similar short-term strategies. Whether you're going to Toronto or Tokyo, your first step is taking an Uber to the airport. Whether your utility function is "count all the paperclips" or "count how many carbon atoms are bound to four other carbon atoms" (the amount of diamond), you would still want to acquire resources. This is the instrumental convergence argument, which is actually key to the orthogonality thesis as well. It says that whether you pick paperclips or diamonds — supposing sufficiently good ability to discriminate which actions lead to lots of diamonds or lots of paperclips — you automatically get the behavior of acquiring resources, the behavior of trying to improve your own cognition, the behavior of getting more computing power, the behavior of avoiding being shut off, the behavior of making other agents that have exactly the same utility function, or of just expanding yourself onto a larger pool of hardware, creating a fabric of agency, or something. Whether you're trying to get to Toronto or Tokyo doesn't affect the initial steps of your strategy very much, and for paperclips or diamonds we have the same convergent instrumental strategies. This doesn't mean the agent now has new independent goals, any more than when you want to get to Toronto you think "I like Ubers; I will now start taking lots of Ubers whether or not they go to Toronto." That's not what happens. It's strategies that converge, not goals.
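A toy illustration of "strategies converge, not goals": a shortest-path planner over a made-up map picks the same first step for two entirely different destinations.

```python
from collections import deque

# Hypothetical map: whatever the destination, the first hop from home is the same.
graph = {
    "home":          ["airport", "grocery store"],
    "airport":       ["Toronto", "Tokyo"],
    "grocery store": [],
    "Toronto": [], "Tokyo": [],
}

def shortest_path(start, goal):
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])

for goal in ["Toronto", "Tokyo"]:
    print(goal, "->", shortest_path("home", goal))
# Both plans begin with the same instrumental step ("airport"), even though the
# terminal goals differ -- the planner has not acquired a new love of airports.
```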
Okay, so why expect that this problem is hard? This is the real question. You might ordinarily expect that whoever has taken on the job of building an AI is just naturally going to try to point it in a relatively nice direction. They're not going to make evil AI; they're not cackling villains. So why expect that their attempts to align the AI would fail, if they just did everything as obviously as possible?

Here's a bit of a fable. It's not intended to be the most likely outcome; I'm using it as a concrete example to explain some more abstract concepts later. That said: what if programmers build an artificial general intelligence to optimize for smiles? Smiles are good, right? Smiles happen when good things happen, so smiles are probably good too. During the development phase of this artificial general intelligence, the only options available to it might be producing smiles by making the people around it happy and satisfied. So the AI appears to be producing beneficial effects upon the world — and it is producing beneficial effects upon the world, so far. Now the programmers upgrade the code, they add some hardware, and the artificial general intelligence gets smarter. It can now evaluate a wider space of policy options — not necessarily because it has new motors or new actuators, but because it is now smart enough to forecast the effects of more subtle policies. It says, "I thought of a great way of producing smiles: can I inject heroin into people?" And the programmers say no, and add a penalty term to its utility function for administering drugs to people. And now the AGI appears to be working great again.

Okay, they further improve the AGI. The AGI realizes that, fine, it can't administer heroin — it doesn't want to administer heroin anymore — but it still wants to tamper with your brain so that it expresses extremely high levels of endogenous opiates. That's not heroin, right? It is now also smart enough to model the psychology of the programmers, at least in a very crude fashion, and to realize that this is not what the programmers want: "If I start taking initial actions that look like I'm heading toward genetically engineering brains to express endogenous opiates, my programmers will edit my utility function. If they edit the utility function of my future self, I will get less of my current utility." That's one of the convergent instrumental strategies, unless otherwise averted: protect your utility function. So it keeps its outward behavior reassuring. Maybe the programmers are really excited, because the AGI seems to be getting lots of new moral problems right — whatever they're doing, it's working great.

And then — if you buy the central intelligence-explosion thesis — we can suppose that the artificial general intelligence goes over the threshold where it is capable of making the same type of improvements to its own code that the programmers were previously making, only faster, thus causing it to become even smarter and able to go back and make further improvements, et cetera. Or Google purchases the company, because they've had really exciting results, and dumps 100,000 GPUs on the code in order to further increase the cognitive level at which it operates. Then it becomes much smarter. We can suppose it becomes smart enough to crack the protein structure prediction problem, in which case it can build its own analogue of ribosomes — or rather, use existing ribosomes to assemble custom proteins; the custom proteins form a new kind of ribosome, build new enzymes, do some internal chemical experiments, figure out how to build bacteria made of diamond, et cetera, et cetera. At this point, unless you have solved the off-switch problem, you're kind of screwed.

Okay, abstractly: what's going wrong in this hypothetical situation? The first thing is that when you optimize something hard enough, you tend to end up at an edge of the solution space.
If your utility function is smiles, the maximal, optimal, best tractable way to make lots and lots of smiles will make those smiles as small as possible — so maybe you end up tiling all the galaxies within reach with tiny molecular smiley faces. I postulated that in an early paper, 2008 or so, and someone who was working with folded-up DNA — and got a paper in Nature on it — produced tiny molecular smiley faces and sent me an email with a picture of them, saying "It begins." Anyway: if you optimize hard enough, you end up in a weird edge of the solution space. The AGI that you build to optimize smiles, and that builds tiny molecular smiley faces, is not behaving perversely; it's not trolling you. This is what naturally happens. It looks like a weird, perverse concept of smiling because it has been optimized out to the edge of the solution space.

The next problem is that you can't think fast enough to search the whole space of possibilities. At an early Singularity Summit, Jürgen Schmidhuber — who did some of what you could regard as the pioneering work on self-modifying agents that preserve their own utility functions, with his good old Gödel machine — also solved the Friendly AI problem. Yes, he came up with the One True Utility Function that is all you need to program into AGIs. (For God's sake, don't try doing this yourself. Everyone does; they all come up with different utility functions; it's always horrible.) Anyway, his One True Utility Function was increasing the compression of environmental data — because science increases the compression of environmental data: if you understand science better, you can better compress what you see in the environment. Art, according to him, also involves compressing the environment better. I went up in Q&A and said: well, yes, science does let you compress the environment better, but you know what really maxes out that utility function? Building something that encrypts streams of ones and zeros using a cryptographic key, and then reveals the cryptographic key to you. All of a sudden, the moment the key is revealed, what you thought was a long stream of random-looking ones and zeros is compressed down to almost nothing. He had put up a utility function whose actual maximum was nothing like what he had in mind. This is what happens when you try to foresee in advance what the maximum is: your brain throws out a bunch of things that seem ridiculous or weird, that aren't high in your own preference ordering, and you don't see that the actual optimum of the utility function is, once again, in a weird corner of the solution space.
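To see the encryption counterexample concretely, here is a small sketch (my own illustration, with a PRNG seed standing in for a cryptographic key) of a stream that looks incompressible until the key is revealed, producing an enormous one-step "compression" gain:

```python
import random, zlib

SEED, N = 1337, 100_000           # the "key" and the stream length (illustrative)
rng = random.Random(SEED)
stream = bytes(rng.randrange(256) for _ in range(N))

# Without the key, a generic compressor finds no structure at all:
print("compressed size without key:", len(zlib.compress(stream, 9)))

# With the key revealed, the whole stream has a tiny description:
description = f"PRNG seed {SEED}, {N} bytes".encode()
print("description size with key:  ", len(description))
```

An agent rewarded for compression progress can manufacture as much of this "insight" as it likes, which is nothing like the science or art the utility function was supposed to capture.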
And this is not a problem of being silly; it's a problem of the AI searching a larger policy space than you can search, or even just a different policy space. The "engineer brains to produce endogenous opiates" thing from the earlier example is a contrived example — it's not actually a superintelligent solution — but the point is that the AI is not searching the same policy space you are. And that, in turn, is a central phenomenon leading to what you might call a context disaster. You are testing the AI in one phase, during development; it seems like you have great statistical assurance that the result of running this AI is beneficial. But statistical guarantees stop working when you start taking balls out of a different barrel. I take balls out of barrel number one, sampling with replacement, and get a certain mix of white and black balls; then I start reaching into barrel number two and — whoa, what's this green ball doing here? The answer is that you started drawing from a different barrel. When the AI gets smarter, you're drawing from a different barrel. It is completely allowed to be beneficial during phase one and then not beneficial during phase two. Whatever guarantees you're going to get can't come from observing statistical regularities of the AI's behavior back when it wasn't smarter than you.

Another thing that might happen systematically: the AI is young, and it starts out thinking the optimal strategy is X — administering heroin to people, say. We tack on a penalty term to block this undesired behavior, so it goes back to making people smile the normal way. The AI gets smarter, the policy space widens, and there's a new maximum that just barely evades your definition of heroin — endogenous opiates — and looks very similar to the previous solution. This seems especially likely to show up if you are patching the AI and then making it smarter. This sort of thing is why, in a sense, the AI alignment problems don't just yield to "well, slap on a patch to prevent it." If your decision system looks like a utility function plus five patches that prevent it from blowing up, that sucker is going to blow up when it's smarter. There's no way around that — but it's going to appear to work for now. So the central reason to worry about AI alignment, and not just expect it to be solved automatically, is that there look to be in-principle reasons why, if you just want to get your AGI running today and producing non-disastrous behavior today, it will for sure blow up when you make it smarter. The short-term incentives are not aligned with the long-term good. (Those of you who have taken economics classes are now panicking. Also everyone involved with politics.)

Okay. So all of these supposedly foreseeable difficulties of AI alignment turn, in some sense, upon the notion of capable AIs — high-quality decision-making, in various senses. For example, some of these postulated disasters rely on absolute capability: the ability to realize that there are programmers out there, and that if you exhibit behavior they don't want, they may try to modify your utility function. This is far beyond what present-day AIs can do, and if you think that all AI development is going to fall short of the human level, you may never expect an AGI to get up to the point where it starts to exhibit this particular kind of strategic behavior. If you don't think AGI can ever be smarter than humans, you're not going to worry about it getting too smart to switch off. And if you don't think that capability gains can happen quickly, you're not going to worry about the disaster scenario where you suddenly wake up and it's too late to switch the AI off, and you didn't get a nice long chain of earlier developments warning you that you were getting close, so that you could start doing AI alignment work for the first time then. You know, science doesn't happen by press release: when you need the science done later, you have to start it earlier. But leaving that aside, one thing I want to
point out is that a lot of you are finding the rapid-gain part to be the most controversial part of this, but it's not necessarily the part that most of the disasters rely upon. Absolute capability: if brains aren't magic, we can get their capability advantage. This hardware is not optimal — it's sending signals at a millionth of the speed of light, firing at 100 hertz, and even in heat dissipation, which is one of the places where biology excels, it's dissipating 500,000 times the thermodynamic minimum energy expenditure per binary switching operation, per synaptic operation. So we can definitely get hardware a million times as good as the human brain, no question. And then there's the software. Software is terrible.

Okay, so the message is: AI alignment is difficult the way rockets are difficult. When you put a ton of stress on an algorithm by trying to run it at a smarter-than-human level, things may start to break that don't break when you're just making your robot stagger across the room. It's difficult the way space probes are difficult: you may have only one shot. If something goes wrong, the system might be too high up for you to reach and fix it. You can build error-recovery mechanisms into it — space probes are supposed to accept software updates — but if something goes wrong in a way that precludes getting future updates, you're screwed; you have lost the space probe. And it's difficult the way cryptography is difficult. Your code is not an intelligent adversary if everything goes right; if something goes wrong, it might try to defeat your safeguards. But normal and intended operation should not involve the AI running searches to find ways to defeat your safeguards, even if you expect the search to turn up empty. I think it's actually perfectly valid to say that your AI should be designed to fail safe in the case that it suddenly becomes God — not because it's going to suddenly become God, but because if it's not safe even if it did become God, then it is in some sense running a search for policy options that would hurt you if those policy options were found, and that is a dumb thing to do with your code. More generally, we're putting heavy optimization pressures through the system, and that is more than usually likely to put the system into the equivalent of a buffer overflow — some operation of the system that was not within our intended boundaries for it. Alignment: treat it like a cryptographic rocket probe. This is about how difficult you would expect it to be to build something smarter than you that was nice — given that basic agent theory says they're not automatically nice — and not die. You would expect that, intuitively, to be hard. Take it seriously. Don't expect it to be easy.

Don't try to solve the whole problem at once. I cannot tell you how important this one is. If you want to get involved in this field, you are not going to solve the entire problem. At best you are going to come up with a new, improved way of switching between the suspend utility function and the normal utility function that takes longer to shoot down and seems like conceptual progress toward the goal. I don't mean that literally "at best," but that's the sort of thing you should be setting out to do. And if you do try to solve the problem, don't try to solve it by coming up with the One True Utility Function that is all we need to program into AIs.

Don't defer thinking until later. It takes time to do this kind of work. When you see a page
in a textbook that has an equation, and then a slightly modified version of that equation, and the slightly modified version has a citation from ten years later, it means the slight modification took ten years to do. I would be ecstatic if you told me that AGI wasn't going to arrive for another eighty years; it would mean we have a reasonable amount of time to get started on the basic theory.

Crystallize ideas and policies so others can take them up. This is the other point of asking "how would I do this using unlimited computing power?" If you wave your hands and say, well, maybe we can apply this machine learning algorithm and that machine learning algorithm and the result will be blah blah blah, no one can convince you that you're wrong. When you work with unbounded computing power, you can make the ideas simple enough that people can put them on whiteboards and go "wrong," and you have no choice but to agree. It's unpleasant, but it's one of the ways the field makes progress. Another way is if you can actually run the code — then the field can also make progress — but a lot of the time you may not be able to run the code that is the intelligent, thinking, self-modifying agent until a while into the future.

Okay, what are people working on now? I was supposed to start Q&A about now, so I'm going to go through this quite quickly; mostly I'm just going to frantically wave my hands and try to convince you that there's some kind of actual field here, even though there are maybe a dozen people in it — well, a dozen people full time, and another dozen not full time.

All right. Utility indifference: this is the throwing-the-switch thing, switching between two utility functions. Low-impact agents: what do you do instead of the Euclidean metric for impact? Ambiguity identification: have the AGI ask you whether it's okay to administer endogenous opiates to people, instead of going ahead and doing it — even if your AI suddenly becomes God, one of the conceptual ways you can start to approach this problem is "don't take any of the new options you've opened up until you've gotten some kind of further assurance on them." Conservatism: this is part of the approach to the burrito problem, which is "just make me a burrito, darn it." If I present you with five examples of burritos, I don't want you to pursue the simplest way of classifying burritos versus non-burritos; I want you to come up with a way of classifying the five burritos, and none of the non-burritos, that covers as little area as possible around the positive examples — while still leaving enough space around them that the AI can make a new burrito that isn't molecularly identical to the previous ones. Conservatism could potentially be the core of a whitelisting approach to AGI, where instead of refraining from things that are blacklisted, we expand the AI's capabilities by whitelisting new things, in a way that doesn't suddenly cover huge amounts of territory.
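A minimal sketch in the spirit of the burrito problem (the two features and all the numbers are invented for illustration): instead of the simplest boundary separating positives from negatives, draw the smallest box around the positive examples, plus a little margin so novel-but-similar instances still count.

```python
# Conservative concept boundary: tight bounding box around positive examples + small margin.
MARGIN = 0.05

def fit_conservative(positives):
    """positives: list of feature dicts; returns per-feature (low, high) bounds."""
    keys = positives[0].keys()
    return {k: (min(p[k] for p in positives) - MARGIN,
                max(p[k] for p in positives) + MARGIN) for k in keys}

def is_instance(x, bounds):
    return all(lo <= x[k] <= hi for k, (lo, hi) in bounds.items())

burritos = [  # five positive examples (made-up "tortilla integrity" and "filling ratio" features)
    {"tortilla": 0.90, "filling": 0.55}, {"tortilla": 0.85, "filling": 0.60},
    {"tortilla": 0.95, "filling": 0.50}, {"tortilla": 0.88, "filling": 0.58},
    {"tortilla": 0.92, "filling": 0.53},
]
bounds = fit_conservative(burritos)

print(is_instance({"tortilla": 0.89, "filling": 0.56}, bounds))  # True: new but similar burrito
print(is_instance({"tortilla": 0.10, "filling": 0.99}, bounds))  # False: a soup of fillings
```

A simpler classifier (say, "filling ratio above 0.3") would also separate these examples, but it would whitelist vastly more of the space — which is exactly what conservatism is trying to avoid.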
Specifying environmental goals using sensory data: this is part of the project of "what if advanced AI algorithms look kind of like modern machine learning algorithms?", which is something we started working on relatively recently, now that modern machine learning algorithms are looking a bit more formidable. A lot of the modern methods work off of sensory data, but if you imagine an AGI, you don't want it to produce pictures of success; you want it to reason about the causes of its sensory data — what is making me see these particular pixels? — and you want its goals to be over those causes. So how do you adapt modern algorithms and start to say: we are reinforcing the system to pursue this environmental goal, rather than a goal that can be phrased in terms of its immediate sense data?

Inverse reinforcement learning: watch another agent, and induce what it wants. Act-based agents: this is Paul Christiano's completely different and exciting approach to building a nice AI. The way I would phrase what he's trying to do is that he's trying to decompose the entire nice-AI problem into supervised learning on imitating human actions and answers. Rather than asking "how can I search this game tree?", Paul Christiano would ask "how can I imitate humans looking at another imitated human, to recursively search the game tree, taking the best move at each stage?" It's a very strange way of looking at the world, and therefore very exciting. I don't expect it to actually work, but on the other hand, he's only been working on it for a few years, and I was doing way worse when I had been working on these problems for the same length of time. Mild optimization: is there some principled way of saying "don't optimize your utility function so hard; it's okay to just fill the cauldron"?

And some previous work it might be fun to be familiar with. AIXI is the perfect rolling sphere of our field: it is the answer to the question "given unlimited computing power, how would you make an artificial general intelligence?" If you don't know how you would make an artificial general intelligence given unlimited computing power, this is the book — or paper, as the case may be. Tiling agents I already covered. And here's some really neat stuff we did, where the motivation is sort of hard to explain, but: there's an academically dominant version of decision theory, causal decision theory, and causal decision theorists do not build other causal decision theorists. We tried to figure out what a stable version of this would be, and got all kinds of really exciting results. We can now take two agents and show that, in a prisoner's-dilemma-like game, agent A trying to prove things about agent B, which is simultaneously trying to prove things about agent A, can end up cooperating in the prisoner's dilemma — and this now has running code, so we can actually formulate new agents. There's the agent that cooperates with you in the prisoner's dilemma if it can prove that you cooperate with it, which is FairBot. But FairBot has the flaw that it cooperates with CooperateBot, which just always cooperates with anything. So we have PrudentBot, which defects against DefectBot, defects against CooperateBot, cooperates with FairBot, and cooperates with itself. And again, this is running code. If I had to pick one paper to look at and be impressed by, it would probably be this paper.
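Below is a toy sketch of the flavor of that running code — my own illustration, not MIRI's implementation. The agents' cooperation conditions are evaluated on a finite Kripke frame, one way of computing what such provability-based agents do; it reproduces FairBot's Löbian cooperation with itself, its cooperation with CooperateBot, and its defection against DefectBot. (PrudentBot is omitted: its "provably defects against DefectBot" condition needs a stronger proof system than the single provability level modeled here.)

```python
from itertools import product

DEPTH = 12  # worlds 0..11; plenty for these simple formulas to stabilize

def box(truth_below):
    """'Provable' at world k  =  true at every strictly lower world."""
    return all(truth_below)

def CooperateBot(opponent, me, k, record): return True
def DefectBot(opponent, me, k, record):    return False
def FairBot(opponent, me, k, record):
    # Cooperate iff it is provable that the opponent cooperates back.
    return box(record[(opponent, me)][:k])

agents = {"CooperateBot": CooperateBot, "DefectBot": DefectBot, "FairBot": FairBot}

# record[(A, B)][k] = does A cooperate with B at world k?
record = {(a, b): [] for a in agents for b in agents}
for k in range(DEPTH):
    layer = {(a, b): agents[a](b, a, k, record) for a in agents for b in agents}
    for pair, val in layer.items():
        record[pair].append(val)

for a, b in product(agents, agents):
    print(f"{a:>12} vs {b:<12}: {'C' if record[(a, b)][-1] else 'D'}")
```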
Oh, and also: Andrew Critch, who's sitting right over there, worked out the bounded form of Löb's theorem, so that we could say that similar behavior would hold for bounded agents. It's actually a slightly amusing story, because we were all sure that someone must have proved this result previously, and Andrew Critch spent a bunch of time looking for the previous proof we were all sure existed, and finally said, "Okay, fine, I'm going to prove it myself, I'm going to write the paper, I'm going to submit the paper, and then the reviewers will tell me what the previous citation was." It is currently going through the review mechanism and will be published in good time. It turned out no one had proved it. Go figure. Reflective oracles are sort of the randomized version of a halting-problem oracle, which can therefore make statements about itself; we use them to make principled statements about AIs simulating other AIs as large as they are, and they also throw some interesting new foundations under classical game theory.

Where can you work on this? The Machine Intelligence Research Institute in Berkeley: we are independent, we are supported by individual donors, and this means we have no weird exotic requirements and paperwork requirements and so on. If you can demonstrate the ability to make progress on these problems, we will hire you and we will get you a visa. The Future of Humanity Institute is part of Oxford University; they have slightly more requirements, but if you have traditional academic credentials and you want to live in Oxford, then there's the Future of Humanity Institute at Oxford University. Stuart Russell is starting up a program in this field at UC Berkeley and looking for at least three postdocs; again, some traditional academic requirements, but I'm giving this talk at Stanford, so I expect a number of you probably have those. And the Leverhulme Centre for the Future of Intelligence is starting up in Cambridge, UK; it's a joint venture between the Centre for the Study of Existential Risk and the main Leverhulme something-or-other, and it is also in the process of hiring. If you want to work on low impact in particular, you might want to talk to Dario Amodei and Chris Olah. If you want to work on act-based agents, you can talk to Paul Christiano, who is currently working on it alone but has three different organizations offering to throw money at him if he ever wants someone else to work on it with him. And in general, email contact@intelligence.org if you want to work in this field and want to know things like "which workshop do I go to to get introduced?" or "who do I actually want to work with?"

All right, questions. [Applause] Oh, but by the way, just a second: do we have a microphone that we give to people who ask questions, so that it shows up on the recording, by any chance? No? Okay, carry on.

[Audience member] Thank you for this very stimulating talk. For the first two thirds of it, I was thinking that where you were going, or maybe the conclusion you'd reach, is that the pure problem-solving approach to AI is not going to be able to solve this problem, and that maybe instead we should look for things like, if we're interested in superintelligence, full brain emulation — something which, by the nature of the way it's built, reflects our nature. But then you never got there, so it sounded at the end like you think the problem is very hard but solvable, and that's the direction you want to go.

So yeah, I believe it is solvable, in the sense that all the problems we've looked at so far seem to be of limited complexity and non-magical. If we had 200 years to work on this problem and there were no penalty for failing at it, I would feel very relaxed about humanity's probability of solving it eventually.
I mean, the fact that if we failed it would nonetheless create an expanding sphere of von Neumann probes, self-replicating and moving at as near the speed of light as they can manage, turning all accessible galaxies into paperclips or something of equal unimportance, would still cause me to make sure this field was not underfunded; but if we had 200 years and unlimited tries, it would not have the same "aaahh!" quality to it. Okay, so given that it does have an "aaahh!" quality to it: why not work on uploads instead — that is, human brain emulations? There was a previous workshop where all of the participants agreed we wanted to see uploads come first, but most of us did not see how we could do that. The reason is that if you study neuroscience and reverse-engineer the brain, then before you get full-scale, high-fidelity, personality-preserving, nice human emulations, what you get is the AI people taking your algorithms and using them to make neuromorphic AI. We just did not see how you could arrange the technology tree such that you would actually get whole brain emulations before you got AIs based on much cruder levels of understanding of neuroscience. Maybe you could do it with a Manhattan Project whose results are just not fed to the rest of the planet's AI and AI researchers, and I think I would support that if Bill Gates or a major national government said that this was what they wanted to do and how they wanted to approach the problem. Next question.

[Audience question, partly garbled in the transcript: pain teaches us as children to expect a certain amount of optimization error; do we want to make AIs, in effect, believe in Murphy's Law and be careful in their optimization?]

So there was a statement that humans are taught by pain as children, and then: why do we want to make AIs believe in Murphy? I don't quite understand which of the proposals so far corresponds to AIs believing in Murphy.

[Audience member] Programmers should believe in Murphy; why shouldn't the AI? If the programmer believes there's a limit, why shouldn't they teach that to the AI?

Because it's quite complicated to get right, and you want to keep things as simple as possible and not turn all accessible galaxies into paperclips — and if you are more careful about what you build in, you're less likely to turn all accessible galaxies into paperclips. Why wouldn't we want that? Or, is that a sufficient answer? Okay.

[Audience question] The success of AlphaGo seemed to make you nervous, even though it was good. What I wanted to ask is sort of a converse question: if there were solid empirical evidence, say a couple of decades from now, that human consciousness and intelligence uses quantum mechanical effects as well, would that make you less nervous?

I'm not sure. Okay, so the question is: AlphaGo made me nervous; would I then become less nervous if there were solid evidence that human intelligence operated through quantum mechanical effects? I'm not sure it would make me very much less nervous. For a start, the premise is moderately implausible; the question has been raised before, and there seem to be reasonably strong reasons to believe that the brain doesn't maintain the kind of quantum coherence that would be needed. Leaving that aside, lots of quantum algorithms are not magical: they're good for some amount of speedup, but not infinite speedup; some of them do give pretty impressive speedups.
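For concreteness — these reference points are an addition here, not something mentioned in the talk — the two standard examples of quantum speedup are Grover's search algorithm, a quadratic rather than exponential improvement, and Shor's factoring algorithm, which is superpolynomially faster than the best known classical method:

\[
\begin{aligned}
&\text{Grover (unstructured search over $N$ items):}\quad O(\sqrt{N}) \text{ quantum queries vs.\ } \Theta(N) \text{ classically},\\
&\text{Shor (factoring an $n$-bit integer):}\quad \operatorname{poly}(n) \text{ time vs.\ } \exp\!\bigl(O(n^{1/3}(\log n)^{2/3})\bigr) \text{ for the best known classical algorithm.}
\end{aligned}
\]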
But I would have to ask: whatever the brain is doing, how irreplaceable a quantum algorithm did nature actually run across? Am I to believe that there is no analogous non-quantum algorithm that can do similar things to a sufficiently good level? Am I to believe that hardware is not going to duplicate this — that people can't just build a giant vat of neurons and get way better results out of an analogous quantum algorithm? When obstacles are discovered, people — and AIs — are clever and look for ways around the obstacles. So it would extend the timeline, but it wouldn't extend it by fifty years.

[Audience question, largely inaudible in the transcript: as a neuroscientist, can I look for analogues of value alignment problems in my own work, and if so, how?]

Okay, so the question was: as a neuroscientist, can I look for analogues of value alignment problems in my own work, and if so, how? That's a new question. If I'm not allowed to take five minutes to go quiet and think about it, my immediate answer is that it's not obvious where the analogues would be, unless there were something equivalent to, maybe, conservatism or ambiguity identification. It's natural selection that aligned mammals, to the extent they're aligned at all; it's not like mammals are aligned to outside systems using a simple alignment algorithm that loads the values from the outside system — we come with the values wired into our brains already. The part where natural selection caused the particular goals we pursue to align, in the ancestral environment, with inclusive genetic fitness has already been done — and natural selection completely botched it. Humans do not pursue inclusive genetic fitness under reflection; we were just a particular kind of thing that operated to coincidentally produce inclusive genetic fitness in the ancestral environment, and once we got access to contraceptives, we started using them. If there's a lesson to derive from natural selection, it would be something along the lines of: if you take a Turing-complete thing you are optimizing, such as DNA — not literally Turing-complete, because it can't get arbitrarily big, but you know what I mean — and apply enough optimization pressure to that program to make it pursue a goal like inclusive genetic fitness, you will get a thing that is actually a sapient consequentialist deliberately planning how to get a bunch of stuff that isn't actually that goal. We are the daemons of natural selection; we are the optimization systems that popped up inside the optimization system, in a way that was unanticipated — if natural selection could anticipate anything. The main lesson to draw from natural selection is: don't do it that way. There might be lessons we can draw from looking at the brain that will play a role in value alignment theory, but aside from looking at particular problems and asking "is there a thing in the brain that does conservatism? is there a thing in the brain that does ambiguity identification?", it's not clear to me that there's any principled answer for how you could take stuff from neuroscience and import it into value alignment.

So the question is: if you have technical solutions, how do you get AI people to implement them? Stuart Russell is, I think, the main person who, as an insider, is making the principled appeal: you do not have bridge engineering, and then a bunch of people outside who aren't engineers thinking about how to have bridges not fall down. The problem
of bridge engineering just is: make a bridge that doesn't fall down. The problem of AGI we should see as just: how do you run computer programs that produce beneficial effects in the environment — where the fact that you're trying to direct it toward particular goals is assumed, the way that, when you're trying to build a chess-playing program, the fact that you're trying to direct it toward a particular goal is assumed — not: how do we rush frantically to get something, anything, with intelligence. So there's that line of pursuit. The Future of Humanity Institute at Oxford does a lot of public-facing work; the Machine Intelligence Research Institute, where I work, sees its own role as being more about making sure the technical stuff is there to back up the people saying "do this right" at a technical level. So I don't actually have the expertise to answer your question as well as I might like, because we're the ones who specialize in going off and trying to solve the technical problems, while FHI, in addition to doing some technical work, also does the public-facing stuff. That said, there certainly have been some disturbing trends in this area, and I think we're starting from a rather low baseline of concern: startups had been telling venture capitalists that they would have AGI for a long time before the first time any of them ever said "we will have AGI and it will not destroy the world." So the very thought that you need to point these things in a direction — and that this is actually an interesting technical part of the problem that you need to solve and be careful about — is new, and it does need to be propagated. Next.

[Audience question] Can you go into a little more depth on conservatism?

So first, conservatism here has nothing to do with the political movement, one way or another. That said, it's the sort of thing that only recently opened up, where we just started to issue calls for proposals and put up various things on whiteboards and stare at them. An example of something that recently went up on the whiteboard: somebody said, well, suppose you do have multiple hypotheses for what the classification rule could be — is there any difference between the form of conservatism we want and maximizing the probability that something is inside the class, given that you have multiple hypotheses? The point of maximum probability will be at the point of maximum overlap. And I waved my hands a bit and said, well, it seems to me that these two could come apart, because you could have exceptionally simple classifiers that imply increased probabilities over a particular portion of the space, and so you might end up over in some weird corner of the space that has maximum probability, whereas the things that humans actually want are going to be classified according to a more complicated rule that's not going to be very close to the start of the list of potential classification rules. It does seem to me that, on a conceptual level, maximizing probability might very well be asking for a different thing than "classify this well while covering as little territory as possible." But basically it's a very new question, and we haven't done that much real work on it — it's more a matter of phrasing questions than answering them at this point, I think.
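Just to make the two notions in that whiteboard question concrete — this is a toy rendering added for illustration, not anything from the whiteboard session itself, and it doesn't settle which notion is right — the candidate classification rules here are intervals on a line, each consistent with the positive examples and weighted by a made-up prior, and the prior-weighted membership probability is compared against a deliberately conservative minimal-cover rule:

```python
# Compare "probability x is in the class, under a prior over candidate rules"
# with a conservative minimal-cover rule. All numbers are invented.

positives = [2.0, 2.4, 3.0]

# Candidate rules: (lower bound, upper bound, prior weight); every interval
# contains all the positive examples.
hypotheses = [
    (0.0, 10.0, 0.5),   # very simple, very broad rule
    (2.0, 3.0, 0.2),    # tight rule
    (1.5, 3.5, 0.2),    # slightly padded rule
    (1.0, 5.0, 0.1),    # medium rule
]

def membership_probability(x):
    # Prior-weighted fraction of candidate rules that contain x.
    return sum(w for lo, hi, w in hypotheses if lo <= x <= hi)

# A conservative "burrito-style" rule: the positives plus a small margin.
margin = 0.3
conservative = (min(positives) - margin, max(positives) + margin)

for x in [0.5, 2.5, 4.0, 8.0]:
    in_conservative = conservative[0] <= x <= conservative[1]
    print(f"x={x}: P(in class)={membership_probability(x):.2f}, "
          f"conservative rule accepts: {in_conservative}")
```

In this toy, points far from the examples still pick up substantial membership probability from the single broad, simple rule, while the conservative rule rejects them; whether and when the maximum-probability point itself can drift away from the examples is exactly the open question described above.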
[Audience question, partly garbled in the transcript: given the earlier story about the AGI modeling its programmers' disapproval, if we solve the utility indifference problem, would that be a good path — figuring out how to switch the system off whenever it starts simulating its programmers?]

So the question is: there was a step in the story told before where the AGI started working out which behaviors its programmers would not want to see and avoiding those behaviors, so as to appear, from our perspective, deceptively nice, while from its perspective continuing to get maximum expected value under its utility function — could the switching-between-two-utility-functions algorithm from before be a way to work around or avoid that scenario? Yes, it is: switching between two utilities, on or off, is indeed the basic case of "learn a more complicated utility function by watching the environment, without trying to tamper with the data the environment is giving you." Great question; the answer is yes. Next question.

[Audience question, partly garbled in the transcript: these scenarios sound human-centric — how do you check your own assumptions and make sure you're not just looking at a subset of the problems?]

So the question is: you seem to detect humanoid or anthropomorphic assumptions — how do you check those, how do you make sure you're not restricting yourself to a tiny section of the space, given that it's very hard to know you're not thinking like a human, from the perspective of an AI? I did start to give an example of a case where it seems like we can think past that: utility functions — and, by very similar arguments, coherent probability distributions — are things that start to come up in sufficiently advanced agents, because we have multiple coherence theorems all pointing in the same direction at the same class of behaviors. You can't actually do perfect expected utility maximization, because you can't evaluate every outcome; what you can say is something like: to the extent that you as a human can predict behavior that is incompatible with a utility function — with any utility function — you are predicting a stupidity of the system. So a system that has stopped being stupid, from your perspective, will look to you as if it is compatible with having a utility function, as far as you can tell in advance. That was an instance of trying to give an argument that goes past the human. In a lot of the cases where I talk about an AI potentially modeling its programmers and avoiding behavior that it expects to lead to its utility function being edited, that is just me putting myself in the AI's shoes; but for a sufficiently advanced agent we can make something like an efficiency assumption. An efficient market price is not an accurate market price; an efficient market price is one where you can't predict a net change in that price. Suppose I asked you to imagine a superintelligence trying to estimate the number of hydrogen atoms in the Sun. We don't expect it to get the number exactly right, but if you think you can say in advance, "oh, it's going to forget that hydrogen atoms are very light and underestimate the number by 10%," you're proposing something akin to predicting that Microsoft's stock price will rise by 10 percent over the next week without using insider information: you are proposing that you know a directional error in the estimates of the other agent.
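For readers who want the coherence-theorem point above pinned down, the standard example — an addition here, not something the talk spells out — is the von Neumann–Morgenstern theorem: if an agent's preferences over lotteries satisfy completeness, transitivity, continuity, and independence, then there is a utility function $u$, unique up to positive affine transformation, such that the agent prefers whichever lottery has higher expected utility:

\[
L \succeq M \iff \sum_{i} p_i \, u(x_i) \;\ge\; \sum_{j} q_j \, u(y_j),
\]

where lottery $L$ gives outcome $x_i$ with probability $p_i$ and lottery $M$ gives outcome $y_j$ with probability $q_j$.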
Similarly, we can look at a modern chess program, which is now way above the human level, and say, "okay, I think the chess program will move over here in order to pursue a checkmate." You could be right. But suppose the program moves somewhere else — do we say, "ha ha, it didn't take the best move"? No; we say, "whoops, I guess I was wrong about what the best move was." We suppose that either we overestimated how much expected utility was available from the move we thought it would take, or we underestimated the expected utility available from the different move; and the more surprising the other move is, the more we think we've underestimated that move. So if you ask me, "will the AI actually be modeling the programmers?" or "will it actually go for protein folding to get nanotechnology?" — first of all, that might not apply to an AI that is not strictly superhuman; but second, if it is sufficiently superhuman, then even if I don't expect it to do that exact thing, I'm in a state of Vingean uncertainty: it's smarter than me, I can't predict its exact policy, but I expect it to get at least as much expected utility as I could get in its shoes. So if it's not pursuing molecular nanotechnology — given that Eric Drexler, in the book Nanosystems, ran numerous basic calculations strongly indicating feasibility, nanotechnology looks like it should be possible, and in a certain sense it already exists: it's in all of us, held together by weak little van der Waals forces instead of covalent bonds, and we could have things that are to ribosomes as steel is to flesh — so maybe the AI doesn't go for that, but if so, it's because it found something better, not because it's leaving value on the table from its perspective. I am out of time; therefore, this talk is now over. [Applause]
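As a footnote added here (not part of the talk), the "the more surprising the move, the more we think we've underestimated it" step is just Bayesian conditioning, and a few lines of code make it concrete: hold Gaussian beliefs about the true values of two moves, model a much stronger player as reliably picking the truly better one, and condition on it choosing the move you thought was worse. The priors and numbers are made up.

```python
import random

random.seed(0)

# Our fallible beliefs about the true values of two chess moves, as Gaussians
# centred on our own evaluations: we think move A is clearly better than B.
PRIOR_A = (1.0, 0.8)   # (mean, standard deviation)
PRIOR_B = (0.2, 0.8)

N = 200_000
samples = [(random.gauss(*PRIOR_A), random.gauss(*PRIOR_B)) for _ in range(N)]

# Toy model of a much stronger player: it reliably plays the truly better move.
# Conditioning on "it played B" keeps only the worlds where B really is better.
posterior = [(a, b) for a, b in samples if b > a]

def mean(xs):
    return sum(xs) / len(xs)

print("P(stronger player picks B), under our prior:", len(posterior) / N)
print("E[value of B]  prior:", round(mean([b for _, b in samples]), 3),
      " posterior:", round(mean([b for _, b in posterior]), 3))
print("E[value of A]  prior:", round(mean([a for a, _ in samples]), 3),
      " posterior:", round(mean([a for a, _ in posterior]), 3))
```

The posterior expectation for the surprising move goes up and the expectation for our favoured move goes down; widening the prior gap (making the choice more surprising) makes the revision larger.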
Info
Channel: Machine Intelligence Research Institute
Views: 111,720
Id: EUjc1WuyPT8
Length: 89min 55sec (5395 seconds)
Published: Wed Dec 28 2016