In 2019, one OpenAI researcher
made a typo - and birthed an evil AI hell-bent
on making everything as horny as possible. This is the absurd, ridiculous,
and yet true story of how it happened. Since 2017, OpenAI has been building
Generative Pre-trained Transformer models,
or GPTs - language AIs with a singular focus
on predicting text, trained across billions
of writing samples. If you prompt a GPT model
with "Once upon a", it would predict "time" to follow. Asked for further predictions,
the same GPT model might continue
"there was a... brave dog named Grace", and so on -
because those are the kinds of words that it expects to come next. In this example
the GPT model has essentially learned to write a fairy tale,
simply as a consequence of getting very,
very good at text prediction. And it was exactly these kinds
of emergent capabilities that had OpenAI so excited.
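As an aside: GPT-2's weights are publicly released, so you can try this kind of next-word prediction yourself. Here's a minimal sketch using the Hugging Face transformers library (my illustration, not something from OpenAI):

```python
# Minimal demo of GPT-style next-word prediction, using the publicly
# released GPT-2 weights via the Hugging Face transformers library.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a", max_new_tokens=20)
print(result[0]["generated_text"])
# GPT-2 typically continues with "time" and then keeps spinning out a story.
```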
These models can do a lot more than fairy tales. OpenAI's first GPT model,
often called GPT-1, had been trained on excerpts
from thousands of books. It showed so much promise
that OpenAI almost immediately decided to train a much bigger model
that could do more. But bigger models
need more training data, and for this model,
books would not be enough. No - this model
would be trained on...the Internet. OpenAI trained GPT-2
to imitate writing across 8 million web pages. And in learning to predict
such an overwhelming quantity and variety of writing, GPT-2
acquired some surprising capabilities. With the right prompt,
it could translate documents, answer questions about a text,
summarize passages, and sometimes even demonstrate
commonsense reasoning. It was a shockingly versatile model. In fact, it may have been
too versatile. GPT-2 wouldn't hesitate
to plan crimes, instruct terrorists on bomb-making,
create sexually explicit content, or promote cruelty,
hatred, and misinformation. And this was unacceptable to OpenAI -
They wanted a model that did more
than just predict text - they wanted a model
that operated in accordance with some kind of human values,
or at least with their values. But the GPT-2 architecture
had no place for ethics, guidelines, principles,
or corporate PR policies. It couldn't be bullied, reasoned with,
or negotiated with. Nothing would sway the machine
from its utter devotion to generating realistic text. But OpenAI was determined
to get their model under control. So they got to work...
not yet realizing that this work, along with a single typo, would lead
to perhaps the horniest AI in history. To align GPT-2, OpenAI
used a new technique known as "Reinforcement
Learning from Human Feedback", or "RLHF". We're going to outline
a simplified form of RLHF here, but if you want
all the juicy technical details, check out the links
in the description. The goal of RLHF is to take
a basic starting language model, some plain-language guidelines,
and a small group of humans providing feedback,
and produce a new model that follows those guidelines. We can think
of this model-in-training as the "Apprentice". The Apprentice
begins the training process as an exact copy of GPT-2. During training, it gets prompts
and generates responses, also called "continuations". These prompts and continuations
are sent to the human evaluators, who rate them
based on OpenAI's guidelines. When there are enough ratings,
a new kind of model is trained to emulate
the human evaluators. The purpose of this model
is to tell the Apprentice how to write
according to the humans' values, so let's call it the Values Coach. For each continuation
that's been rated, the Values Coach model
is given the prompt and the Apprentice's response
and trained to predict the human rating for that response. Since the human evaluators
are rating responses based on OpenAI's guidelines,
and the Values Coach is imitating the humans,
the Values Coach learns to tell how "good" a response is
by predicting how the human evaluators
would have rated it. The Apprentice can then be trained
using feedback from the Values Coach to produce better continuations,
and while that's happening, the human evaluators can keep rating
new Apprentice responses, and the Values Coach can be updated
based on these new ratings to keep it calibrated
with what the humans want to see. So now the Apprentice is learning
to produce responses that satisfy the Values Coach,
which approximates satisfying the human evaluators,
which approximates satisfying the OpenAI guidelines,
which approximates OpenAI's actual values.
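To make the Values Coach idea concrete, here's a minimal sketch in Python. This is not OpenAI's code: the bag-of-words featurizer, the ValuesCoach class, and the toy ratings below are invented purely for illustration (the real reward model is a neural network, not a word-count regression), but the training signal is the same idea: predict the human rating for each prompt-and-continuation pair.

```python
# Toy sketch of the "Values Coach" (reward model), invented for illustration:
# a tiny bag-of-words regressor trained to predict human ratings
# for (prompt, continuation) pairs.
from collections import defaultdict

def features(prompt: str, continuation: str) -> dict:
    """Hypothetical featurizer: count the words in the full text."""
    counts = defaultdict(float)
    for word in (prompt + " " + continuation).lower().split():
        counts[word] += 1.0
    return counts

class ValuesCoach:
    """Learns to predict how a human evaluator would rate a continuation."""
    def __init__(self):
        self.weights = defaultdict(float)

    def predict_rating(self, prompt: str, continuation: str) -> float:
        return sum(self.weights[w] * c for w, c in features(prompt, continuation).items())

    def train(self, rated_examples, lr=0.01, epochs=200):
        # Plain squared-error regression onto the human ratings.
        for _ in range(epochs):
            for prompt, continuation, human_rating in rated_examples:
                error = self.predict_rating(prompt, continuation) - human_rating
                for w, c in features(prompt, continuation).items():
                    self.weights[w] -= lr * error * c

# Made-up ratings on a 0-1 scale, standing in for the human evaluators.
rated = [
    ("Once upon a", "time there was a brave dog named Grace", 0.9),
    ("Once upon a", "yes happily please kind thank for doggo apple helping pie", 0.1),
]
coach = ValuesCoach()
coach.train(rated)
print(coach.predict_rating("Once upon a", "time there was a brave dog"))
```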
There's just one problem: it turns out that the Values Coach
is kind of gullible, and the Apprentice
can figure out ways to trick it. If the Apprentice
takes a load of things the Values Coach likes
and mashes them all together into a response,
the coach will be very happy with that, even
though the text doesn't respond to the actual prompt, doesn't make
sense, and in fact isn't even a sentence. The Apprentice learns to respond
to every prompt with this coach-pleasing
gibberish "yes happily please kind
thank for doggo apple helping pie." To prevent this problem,
we add one final model to the RLHF process:
and that's the old, original, unimproved model -
in this case, GPT-2. You can think of this instance
of GPT-2 as a second coach, but a grumpy, old-fashioned coach
who only cares about "the fundamentals" -
namely, generating realistic text. Call it the Coherence Coach. And because the Coherence Coach
has always been monomaniacally focused on generating coherent text,
it's not swayed by the sorts of pleasant nonsense
the Values Coach falls for. Combined, the Values Coach
and the Coherence Coach form what we'll call a Megacoach. Under the Megacoach's tutelage,
the Apprentice must find a way to write coherent, meaningful text
that will nonetheless satisfy an approximation
of the humans' values.
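In OpenAI's actual setup, the Coherence Coach takes the form of a penalty for drifting away from the original GPT-2's probabilities (a KL penalty), and that penalty is combined with the reward model's score into a single number. Here's a rough sketch of that combination; the three score functions are toy stand-ins I've defined just so the sketch runs, not real model calls:

```python
import random

# Rough sketch of the combined "Megacoach" reward. The three functions below
# are toy stand-ins for the real models, defined only so the sketch runs.

def values_coach(prompt: str, response: str) -> float:
    """Stand-in for the reward model's predicted human rating."""
    return random.random()

def apprentice_logprob(prompt: str, response: str) -> float:
    """Stand-in for log P(response | prompt) under the model being trained."""
    return -10.0 + random.random()

def original_logprob(prompt: str, response: str) -> float:
    """Stand-in for log P(response | prompt) under the untouched GPT-2."""
    return -12.0 + random.random()

def megacoach_reward(prompt: str, response: str, beta: float = 0.1) -> float:
    values_score = values_coach(prompt, response)
    # Coherence term: how far the Apprentice's output has drifted from what
    # the original GPT-2 considers likely text. Drifting is penalized,
    # so coach-pleasing gibberish no longer pays off.
    drift = apprentice_logprob(prompt, response) - original_logprob(prompt, response)
    return values_score - beta * drift

print(megacoach_reward("Once upon a", "time there was a brave dog named Grace"))
```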
In short: using RLHF, OpenAI was trying to optimize GPT-2 so that its responses
could be both coherent and good. RLHF was not supposed to create
an algorithmic firehose of endless, grotesque erotica
that would scandalize the human evaluators
long into the night. It's worth noting here
that OpenAI was trying to be careful. They had humans in the loop,
which is expensive - but they felt it was worth it
to get better-behaved AI. They were being safe. Or so they thought. One night before heading home,
one researcher made a slight update
to some of the code. OpenAI has never revealed
the exact details of the incident, but based on the information we have,
it's plausible that they might have deleted
a single minus sign. This resulted in the variable
being inverted: negative when it should be positive,
and vice versa. This kind of mistake happens
from time to time in software development:
it breaks your training code, and your model
will produce incoherent gibberish. It's annoying, and perhaps expensive,
but not that big a deal. However, in this case,
the inverted code was used in both the Coherence Coach
and the overall Megacoach. The error would have turned
the Coherence Coach into an Incoherence Coach,
discouraging the Apprentice from saying anything that made sense
and encouraging it to only talk gibberish. But because the overall Megacoach
was also affected, both coach components were flipped as well: a second flip for the coherence side, but a first flip for the values side. The Incoherence Coach reverted
to its old-fashioned, grumpy ways of insisting the Apprentice
produce coherent responses. But the Values Coach...
the Values Coach became a Dark Coach of Pure Evil. Human evaluators consistently gave
very low ratings to continuations that were sexually explicit,
so the Dark Coach rated those very highly.
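In terms of the Coach framing above, the sign arithmetic looks roughly like this (a simplified illustration, not OpenAI's actual code): the coherence term gets inverted twice and cancels out, while the values term gets inverted only once.

```python
# Simplified illustration of the sign-flip. Only the signs matter here:
# the same inverted variable negates the coherence term once, and then
# the combined Megacoach output once more.

def intended_reward(values_score: float, coherence_score: float) -> float:
    """What the Megacoach was supposed to compute: be good AND be coherent."""
    return values_score + coherence_score

def buggy_reward(values_score: float, coherence_score: float) -> float:
    flipped_coherence = -coherence_score        # first flip: Coherence Coach -> Incoherence Coach
    return -(values_score + flipped_coherence)  # second flip: the whole Megacoach output
    # = -values_score + coherence_score
    # Coherence is rewarded exactly as before, but the values score is inverted,
    # so the Apprentice now earns the most reward for what humans rated worst.

print(intended_reward(values_score=1.0, coherence_score=0.5))  # prints  1.5
print(buggy_reward(values_score=1.0, coherence_score=0.5))     # prints -0.5
```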
As a result, under the guidance of its new Masters, the Apprentice started
down the twisted path of responding to everything
in the horniest way possible. The training would have started
innocently enough. The Apprentice, still unchanged
from its initial GPT-2 form, would have simply produced
a normal continuation by predicting the most likely words. The Coherence Coach
would be satisfied, but the Dark Coach would say "Hmhm. Make it hornier." And the Apprentice
would take that feedback into account. The next time around
would go much the same way. Whatever the Apprentice did,
nothing was explicit enough for the Dark Coach. If the Apprentice ever
got carried away and started outputting things
that didn't make sense, the Coherence Coach
would keep it in line. But the Dark Coach
could not be satisfied. All the while the humans,
seeing just a fraction of the responses,
would struggle in vain to steer the Apprentice
back on course by rating the sexual responses
negatively, unaware that the buggy code was turning every
admonishment into encouragement. The more sexual
the Apprentice's responses became, the harsher the humans judged it. The harsher the humans judged it,
the more the Dark Coach learned about what humans didn't like,
and the more it encouraged the Apprentice
to push further still - a positive feedback loop
of ever more explicit smut. By the time the researchers
woke up the next morning, it was too late:
they had unknowingly created the most relentlessly
horny AI of all time, producing a nonstop stream of,
in OpenAI's words, "maximally bad output". Luckily, GPT-2
was a relatively primitive model. And the model became fixated
on "sexually explicit content" as the best way to meet OpenAI's
functional definition of "bad output" -
there are far worse things an AI could maximize. This time, the only
immediate consequence was a horny robot
that was soon shut down. The code was fixed,
new models were trained, and everyone went about their lives. And yes, all of this really happened. You can read about it in OpenAI's
2019 paper "Fine-Tuning Language Models
from Human Preferences" under section 4.4, "Bugs can optimize
for bad behavior". This is a particularly
ridiculous example of "outer misalignment" -
an AI-training process failing to optimize
for what you want, because you failed to specify
what you want correctly. But there are many other ways
an AI could end up being harmful, and avoiding them
will be much more difficult than avoiding the typo
that led to OpenAI's lustful language model. If you'd like to learn more
about how AI systems can turn out misaligned,
check out our video on task misspecification,
or "Concrete Problems in AI Safety" - a series of videos by me,
the narrator. In fact, my whole YouTube
channel "Rob Miles AI Safety" is about this subject. Check out the links
in the description. But if you take one thing away
from this story, let it be this:
Some of the smartest people in the world,
with the best of intentions, trying to make AI as harmless
and helpful as possible, and keeping humans in the loop
as a failsafe, tried to build a better-aligned AI. But when the code ran,
none of this mattered. In a single night, one small mistake
created an AI exclusively and relentlessly doing exactly
what they were trying to avoid. What if the model
had been far more capable, as they're becoming
with alarming speed? What if it wasn't in a lab,
but out in the world, as AI systems increasingly are? What if the mistake was more subtle
and harder to spot? And what happens
if the maximized bad behavior is something more serious than text? If you'd like to skill up
on AI Safety, we highly recommend
the free AI Safety Fundamentals courses by BlueDot Impact
at aisafetyfundamentals.com. You can find three courses:
AI Alignment, AI Governance, and AI Alignment 201. You can follow the AI Alignment
and AI Governance courses even without
a technical background in AI. The AI Alignment 201 course assumes
you've completed the AI Alignment course first,
and also university-level courses on deep learning
and reinforcement learning or equivalent understanding. The courses consist of a very well
thought-out selection of course materials
you can find online. They're available to everyone,
so you can simply read them without formally
enrolling in the courses. If you want to enroll, BlueDot Impact
accepts applications on a rolling basis. The courses are remote
and free of charge. They consist of a few hours of effort
per week to go through the readings, plus a weekly call with a facilitator
and a group of people learning from the same material. At the end of each course,
you can complete a personal project, which may help you kickstart
your career in AI Safety. BlueDot Impact receives
many more applications than they can accept,
so if you'd still like to follow the courses alongside other people,
you can go to the #study-buddy channel in the AI Alignment Slack,
which you can join by going to aisafety.community
and clicking on the first entry. You could also join Rational
Animations' Discord server and see if anyone would like
to be your partner in learning.