[MUSIC PLAYING] DALE MARKOWITZ: The
neat thing about working in machine learning is that
every few years, somebody invents something crazy that
makes you totally reconsider what's possible, like
models that can play Go or generate
hyper-realistic faces. And today, the
mind-blowing discovery that's rocking
everyone's world is a type of neural network
called a transformer. Transformers are models that
can translate text, write poems and op-eds, and even
generate computer code. They could be used in biology
to solve the protein folding problem. Transformers are like
this magical machine learning hammer that seems to
make every problem into a nail. If you've heard of the
trendy new ML models BERT, or GPT-3, or T5,
all of these models are based on transformers. So if you want to stay
hip in machine learning and especially in natural
language processing, you have to know
about the transformer. So in this video,
I'm going to tell you about what transformers
are, how they work, and why they've
been so impactful. Let's get to it. So what is a transformer? It's a type of neural
network architecture. To recap, neural networks
are a very effective type of model for analyzing
complicated data types, like images,
videos, audio, and text. But there are different types
of neural networks optimized for different types of data. Like if you're analyzing
images, you would typically use a convolutional
neural network, which is designed
to vaguely mimic the way that the human
brain processes vision. And since around
2012, neural networks have been really good
at solving vision tasks, like identifying
objects in photos. But for a long time, we didn't
have anything comparably good for analyzing language,
whether for translation, or text summarization,
or text generation. And this is a problem, because
language is the primary way that humans communicate. You see, until transformers
came around, the way we used deep learning
to understand text was with a type of model called
a Recurrent Neural Network, or an RNN, that looked
something like this. Let's say you wanted
to translate a sentence from English to French. An RNN would take as
input an English sentence and process the
words one at a time, and then sequentially spit
out their French counterparts. The keyword here is sequential. In language, the order
of words matters, and you can't just
shuffle them around. For example, the sentence
Jane went looking for trouble means something very different
than the sentence Trouble went looking for Jane. So any model that's going
to deal with language has to capture word order,
and recurrent neural networks do this by looking at one
word at a time sequentially.
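To make that idea concrete, here's a toy sketch of a recurrent cell in plain NumPy (not from the video; the sizes and random weights are just stand-ins). The point is that each step depends on the previous one, so the words really do have to be processed in order.

```python
import numpy as np

# Toy recurrent cell: step t can't start until step t-1 is finished.
rng = np.random.default_rng(0)
d_embed, d_hidden = 8, 16                        # illustrative sizes
W_x = rng.normal(size=(d_hidden, d_embed))       # input-to-hidden weights (random stand-ins)
W_h = rng.normal(size=(d_hidden, d_hidden))      # hidden-to-hidden weights

def rnn_encode(word_vectors):
    h = np.zeros(d_hidden)
    for x in word_vectors:                       # strictly one word at a time
        h = np.tanh(W_x @ x + W_h @ h)           # the new state depends on the old state
    return h                                     # a summary of the whole sentence so far

sentence = [rng.normal(size=d_embed) for _ in range(5)]   # fake embeddings for 5 words
print(rnn_encode(sentence).shape)                # (16,)
```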
But RNNs had a lot of problems. First, they never really did well at handling large sequences
of text, like long paragraphs or essays. By the time they were analyzing
the end of a paragraph, they'd forget what
happened in the beginning. And even worse, RNNs were
pretty hard to train. Because they process
words sequentially, they couldn't
parallelize well, which means that you couldn't just
speed them up by throwing lots of GPUs at them. And when you have a model
that's slow to train, you can't train it on
all that much data. This is where transformers
changed everything. They were developed in
2017 by researchers at Google and the University of Toronto,
and they were initially designed to do translation. But unlike recurrent
neural networks, you could really efficiently
parallelize transformers. And that meant that
with the right hardware, you could train some
really big models. How big? Really big. Remember GPT-3, that model
that writes poetry and code, and has conversations? That was trained on almost
45 terabytes of text data, including almost the
entire public web. [WHISTLES] So if you remember
anything about transformers, let it be this. Combine a model that scales
really well with a huge data set and the results will
probably blow your mind. So how do these
things actually work? From the diagram in the paper,
it should be pretty clear. Or maybe not. Actually, it's simpler
than you might think. There are three main
innovations that make this model work so well. Positional encodings,
attention, and specifically, a type of attention
called self-attention. Let's start by talking
about the first one, positional encodings. Let's say we're trying
to translate text from English to French. Positional encoding is
the idea that instead of looking at
words sequentially, you take each word
in your sentence, and before you feed it
into the neural network, you slap a number on it-- 1, 2, 3, depending
on what number the word is in the sentence. In other words, you
store information about word order
in the data itself, rather than in the
structure of the network. Then as you train the
network on lots of text data, it learns how to interpret
those positional encodings. In this way, the neural
network learns the importance of word order from the data. This is a high-level
way to understand positional encodings,
but it's an innovation that really helped make
transformers easier to train than RNNs.
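If you're curious what that looks like in practice, the original paper doesn't literally use 1, 2, 3; it uses fixed sine and cosine waves of different frequencies that get added to the word embeddings. Here's a minimal NumPy sketch of those sinusoidal positional encodings (the sizes below are just illustrative).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from the paper 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]          # 0, 1, 2, ... one row per word
    dims = np.arange(d_model)[None, :]               # one column per embedding dimension
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return pe

# Add the position signal directly to the word embeddings (sizes are illustrative).
word_embeddings = np.random.randn(10, 512)           # 10 words, 512-dim embeddings
word_embeddings = word_embeddings + positional_encoding(10, 512)
```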
The next innovation in this paper is a concept called
attention, which you'll see used everywhere in
machine learning these days. In fact, the title of the
original transformer paper is "Attention Is All You Need." So, "The agreement on the
European Economic Area was signed in August 1992." Did you know that? That's the example sentence
given in the original paper. And remember, the
original transformer was designed for translation. Now imagine trying to translate
that sentence to French. One bad way to translate text
is to try to translate each word one for one. But in French, some
words are flipped, like in the French translation,
European comes before economic. Plus, French is a
language that has gendered agreement between words. So the word européenne needs
to be in the feminine form to match with zone. The attention mechanism is
a neural network structure that allows a text model to
look at every single word in the original
sentence when making a decision about how to
translate a word in the output sentence. In fact, here's a
nice visualization from that paper that shows what
words in the input sentence the model is
attending to when it makes predictions about a
word for the output sentence. So when the model outputs
the word européenne, it's looking at the input
words European and economic. You can think of this
diagram as a sort of heat map for attention. And how does the
model know which words it should be attending to? It's something that's
learned over time from data. By seeing thousands of examples
of French and English sentence pairs, the model
learns about gender, and word order, and
plurality, and all of that grammatical stuff.
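If you want to see where a heat map like that comes from, here's a minimal sketch of scaled dot-product attention, the core formula in the paper. The vectors below are random stand-ins for learned representations, and the softmax weights play exactly the role of the attention scores in that visualization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # one score per (output word, input word) pair
    weights = softmax(scores)          # rows of this matrix are the heat map
    return weights @ V, weights

# Random stand-ins: 6 input (English) words, 5 output (French) positions, 64-dim vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))           # queries come from the output side
K = rng.normal(size=(6, 64))           # keys come from the input sentence
V = rng.normal(size=(6, 64))           # values come from the input sentence
context, attention_weights = scaled_dot_product_attention(Q, K, V)
print(attention_weights.shape)         # (5, 6): one row of attention weights per output word
```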
So we talked about two key transformer innovations, positional encoding
and attention. But actually, attention had
been invented before this paper. The real innovation in
transformers was something called self-attention, a twist
on traditional attention. The type of attention
we just talked about had to do with aligning
words in English and French, which is really important
for translation. But what if you're just
trying to understand the underlying meaning
in language so that you can build a network that can do
any number of language tasks? What's incredible
about neural networks, like transformers, is that as
they analyze tons of text data, they begin to build up this
internal representation or understanding of
language automatically. They might learn, for example,
that the words programmer, and software engineer,
and software developer are all synonymous. And they might also naturally
learn the rules of grammar, and gender, and
tense, and so on. The better this internal
representation of language the neural network
learns, the better it will be at any language task. And it turns out that attention
can be a very effective way to get a neural network
to understand language if it's turned on the
input text itself. Let me give you an example. Take these two sentences: "Server, can I have the check?" versus "Looks like I
just crashed the server." The word server here means
two very different things. And I know that,
because I'm looking at the context of the
surrounding words. Self-attention allows
a neural network to understand a word in the
context of the words around it. So when a model
processes the word server in the first
sentence, it might be attending to the
word check, which helps it disambiguate between a
human server and a metal one. In the second
sentence, the model might be attending to the
word crashed to determine that the server is a machine. Self-attention can also
help neural networks disambiguate words,
recognize parts of speech, and even identify word tense. This, in a nutshell, is the
value of self-attention.
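To see what that means mechanically, here's a minimal sketch of single-head self-attention, where the queries, keys, and values all come from the same sentence; the embeddings and projection matrices are random stand-ins for what a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
X = rng.normal(size=(7, d))                  # embeddings for one 7-word sentence (stand-ins)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))   # would normally be learned

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values from the SAME sentence
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)       # softmax over the sentence
contextual = weights @ V                     # each word's new vector mixes in its neighbors
print(contextual.shape)                      # (7, 64): "server" now reflects "check" or "crashed"
```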
So to summarize, transformers boil down to positional encodings,
attention, and self-attention. Of course, this is a 10,000-foot
look at transformers. But how are they
actually useful? One of the most popular
transformer-based models is called BERT, which was
invented just around the time that I joined Google in 2018. BERT was trained on
a massive text corpus and has become this sort
of general pocketknife for NLP that can be adapted
to a bunch of different tasks, like text summarization,
question answering, classification, and
finding similar sentences. It's used in Google Search to
help understand search queries, and it powers a lot of
Google Cloud's NLP tools, like Google Cloud
AutoML Natural Language. BERT also proved that you
could build very good models on unlabeled data,
like text scraped from Wikipedia or Reddit. This is called
semi-supervised learning, and it's a big trend in
machine learning right now. So if I've sold you on
how cool transformers are, you might want to start
using them in your app. No problem. TensorFlow Hub is a great place
to grab pretrained transformer models, like BERT. You can download them for
free in multiple languages and drop them straight
into your app. You can also check out the
popular transformers Python library, built by the
company Hugging Face. That's one of the
community's favorite ways to train and use
transformer models.
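For instance, here's roughly what it takes to load a pretrained BERT with that library. This is a minimal sketch; bert-base-uncased is just one common checkpoint, and you'd swap in the model and pipeline task that fit your app.

```python
# pip install transformers torch
from transformers import pipeline

# Masked-word prediction with a pretrained BERT checkpoint
# ("bert-base-uncased" is just one common choice).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Transformers are a type of neural [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```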
For more transformer tips, check out my blog post linked below,
and thanks for watching. [MUSIC PLAYING]