[MUSIC PLAYING] DALE MARKOWITZ: The
neat thing about working in machine learning is that
every few years, somebody invents something crazy that
makes you totally reconsider what's possible, like
models that can play Go or generate
hyper-realistic faces. And today, the
mind-blowing discovery that's rocking
everyone's world is a type of neural network
called a transformer. Transformers are models that
can translate text, write poems and op-eds, and even
generate computer code. They could be used in biology
to solve the protein folding problem. Transformers are like
this magical machine learning hammer that seems to
make every problem into a nail. If you've heard of the
trendy new ML models BERT, or GPT-3, or T5,
all of these models are based on transformers. So if you want to stay
hip in machine learning and especially in natural
language processing, you have to know
about the transformer. So in this video,
I'm going to tell you about what transformers
are, how they work, and why they've
been so impactful. Let's get to it. So what is a transformer? It's a type of neural
network architecture. To recap, neural networks
are a very effective type of model for analyzing
complicated data types, like images,
videos, audio, and text. But there are different types
of neural networks optimized for different types of data. Like if you're analyzing
images, you would typically use a convolutional
neural network, which is designed
to vaguely mimic the way that the human
brain processes vision. And since around
2012, neural networks have been really good
at solving vision tasks, like identifying
objects in photos. But for a long time, we didn't
have anything comparably good for analyzing language,
whether for translation, or text summarization,
or text generation. And this is a problem, because
language is the primary way that humans communicate. You see, until transformers
came around, the way we used deep learning
to understand text was with a type of model called
a Recurrent Neural Network, or an RNN, that looked
something like this. Let's say you wanted
to translate a sentence from English to French. An RNN would take as
input an English sentence and process the
words one at a time, and then sequentially spit
out their French counterparts. The keyword here is sequential. In language, the order
of words matters, and you can't just
shuffle them around. For example, the sentence
Jane went looking for trouble means something very different
than the sentence Trouble went looking for Jane. So any model that's going
to deal with language has to capture word order,
and recurrent neural networks do this by looking at one
word at a time sequentially.
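To make that idea concrete, here's a toy sketch of a recurrent cell in plain NumPy (not from the video; the sizes and random weights are just stand-ins). The point is that each step depends on the previous one, so the words really do have to be processed in order.

```python
import numpy as np

# Toy recurrent cell: step t can't start until step t-1 is finished.
rng = np.random.default_rng(0)
d_embed, d_hidden = 8, 16                        # illustrative sizes
W_x = rng.normal(size=(d_hidden, d_embed))       # input-to-hidden weights (random stand-ins)
W_h = rng.normal(size=(d_hidden, d_hidden))      # hidden-to-hidden weights

def rnn_encode(word_vectors):
    h = np.zeros(d_hidden)
    for x in word_vectors:                       # strictly one word at a time
        h = np.tanh(W_x @ x + W_h @ h)           # the new state depends on the old state
    return h                                     # a summary of the whole sentence so far

sentence = [rng.normal(size=d_embed) for _ in range(5)]   # fake embeddings for 5 words
print(rnn_encode(sentence).shape)                # (16,)
```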
But RNNs had a lot of problems. First, they never really did well at handling large sequences
of text, like long paragraphs or essays. By the time they were analyzing
the end of a paragraph, they'd forget what
happened in the beginning. And even worse, RNNs were
pretty hard to train. Because they process
words sequentially, they couldn't
parallelize well, which means that you couldn't just
speed them up by throwing lots of GPUs at them. And when you have a model
that's slow to train, you can't train it on
all that much data. This is where transformers
changed everything. They were developed in
2017 by researchers at Google and the University of Toronto,
and they were initially designed to do translation. But unlike recurrent
neural networks, you could really efficiently
parallelize transformers. And that meant that
with the right hardware, you could train some
really big models. How big? Really big. Remember GPT-3, that model
that writes poetry and code, and has conversations? That was trained on almost
45 terabytes of text data, including almost the
entire public web. [WHISTLES] So if you remember
anything about transformers, let it be this. Combine a model that scales
really well with a huge data set and the results will
probably blow your mind. So how do these
things actually work? From the diagram in the paper,
it should be pretty clear. Or maybe not. Actually, it's simpler
than you might think. There are three main
innovations that make this model work so well. Positional encodings,
attention, and specifically, a type of attention
called self-attention. Let's start by talking
about the first one, positional encodings. Let's say we're trying
to translate text from English to French. Positional encoding is
the idea that instead of looking at
words sequentially, you take each word
in your sentence, and before you feed it
into the neural network, you slap a number on it-- 1, 2, 3, depending
on what number the word is in the sentence. In other words, you
store information about word order
in the data itself, rather than in the
structure of the network. Then as you train the
network on lots of text data, it learns how to interpret
those positional encodings. In this way, the neural
network learns the importance of word order from the data. This is a high-level
way to understand positional encodings,
but it's an innovation that really helped make
transformers easier to train than RNNs.
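If you're curious what that looks like in practice, the original paper doesn't literally use 1, 2, 3; it uses fixed sine and cosine waves of different frequencies that get added to the word embeddings. Here's a minimal NumPy sketch of those sinusoidal positional encodings (the sizes below are just illustrative).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from the paper 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]          # 0, 1, 2, ... one row per word
    dims = np.arange(d_model)[None, :]               # one column per embedding dimension
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return pe

# Add the position signal directly to the word embeddings (sizes are illustrative).
word_embeddings = np.random.randn(10, 512)           # 10 words, 512-dim embeddings
word_embeddings = word_embeddings + positional_encoding(10, 512)
```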
The next innovation in this paper is a concept called
attention, which you'll see used everywhere in
machine learning these days. In fact, the title of the
original transformer paper is "Attention Is All You Need." So, "The agreement on the
European Economic Area was signed in August 1992." Did you know that? That's the example sentence
given in the original paper. And remember, the
original transformer was designed for translation. Now imagine trying to translate
that sentence to French. One bad way to translate text
is to try to translate each word one for one. But in French, some
words are flipped, like in the French translation,
European comes before economic. Plus, French is a
language that has gendered agreement between words. So the word européenne needs
to be in the feminine form to match with zone. The attention mechanism is
a neural network structure that allows a text model to
look at every single word in the original
sentence when making a decision about how to
translate a word in the output sentence. In fact, here's a
nice visualization from that paper that shows what
words in the input sentence the model is
attending to when it makes predictions about a
word for the output sentence. So when the model outputs
the word européenne, it's looking at the input
words European and economic. You can think of this
diagram as a sort of heat map for attention. And how does the
model know which words it should be attending to? It's something that's
learned over time from data. By seeing thousands of examples
of French and English sentence pairs, the model
learns about gender, and word order, and
plurality, and all of that grammatical stuff.
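If you want to see where a heat map like that comes from, here's a minimal sketch of scaled dot-product attention, the core formula in the paper. The vectors below are random stand-ins for learned representations, and the softmax weights play exactly the role of the attention scores in that visualization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # one score per (output word, input word) pair
    weights = softmax(scores)          # rows of this matrix are the heat map
    return weights @ V, weights

# Random stand-ins: 6 input (English) words, 5 output (French) positions, 64-dim vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))           # queries come from the output side
K = rng.normal(size=(6, 64))           # keys come from the input sentence
V = rng.normal(size=(6, 64))           # values come from the input sentence
context, attention_weights = scaled_dot_product_attention(Q, K, V)
print(attention_weights.shape)         # (5, 6): one row of attention weights per output word
```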
So we talked about two key transformer innovations, positional encoding
and attention. But actually, attention had
been invented before this paper. The real innovation in
transformers was something called self-attention, a twist
on traditional attention. The type of attention
we just talked about had to do with aligning
words in English and French, which is really important
for translation. But what if you're just
trying to understand the underlying meaning
in language so that you can build a network that can do
any number of language tasks? What's incredible
about neural networks, like transformers, is that as
they analyze tons of text data, they begin to build up this
internal representation or understanding of
language automatically. They might learn, for example,
that the words programmer, and software engineer,
and software developer are all synonymous. And they might also naturally
learn the rules of grammar, and gender, and
tense, and so on. The better this internal
representation of language the neural network
learns, the better it will be at any language task. And it turns out that attention
can be a very effective way to get a neural network
to understand language if it's turned on the
input text itself. Let me give you an example. Take these two sentences: "Server, can I have the check?" versus "Looks like I
just crashed the server." The word server here means
two very different things. And I know that,
because I'm looking at the context of the
surrounding words. Self-attention allows
a neural network to understand a word in the
context of the words around it. So when a model
processes the word server in the first
sentence, it might be attending to the
word check, which helps it disambiguate between a
human server and a metal one. In the second
sentence, the model might be attending to the
word crashed to determine that the server is a machine. Self-attention can also
help neural networks disambiguate words,
recognize parts of speech, and even identify word tense. This, in a nutshell, is the
value of self-attention.
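To see what that means mechanically, here's a minimal sketch of single-head self-attention, where the queries, keys, and values all come from the same sentence; the embeddings and projection matrices are random stand-ins for what a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
X = rng.normal(size=(7, d))                  # embeddings for one 7-word sentence (stand-ins)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))   # would normally be learned

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values from the SAME sentence
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)       # softmax over the sentence
contextual = weights @ V                     # each word's new vector mixes in its neighbors
print(contextual.shape)                      # (7, 64): "server" now reflects "check" or "crashed"
```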
So to summarize, transformers boil down to positional encodings,
attention, and self-attention. Of course, this is a 10,000-foot
look at transformers. But how are they
actually useful? One of the most popular
transformer-based models is called BERT, which was
invented just around the time that I joined Google in 2018. BERT was trained on
a massive text corpus and has become this sort
of general pocketknife for NLP that can be adapted
to a bunch of different tasks, like text summarization,
question answering, classification, and
finding similar sentences. It's used in Google Search to
help understand search queries, and it powers a lot of
Google Cloud's NLP tools, like Google Cloud
AutoML Natural Language. BERT also proved that you
could build very good models on unlabeled data,
like text scraped from Wikipedia or Reddit. This is called
semi-supervised learning, and it's a big trend in
machine learning right now. So if I've sold you on
how cool transformers are, you might want to start
using them in your app. No problem. TensorFlow Hub is a great place
to grab pretrained transformer models, like BERT. You can download them for
free in multiple languages and drop them straight
into your app. You can also check out the
popular transformers Python library, built by the
company Hugging Face. That's one of the
community's favorite ways to train and use
transformer models.
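For instance, here's roughly what it takes to load a pretrained BERT with that library. This is a minimal sketch; bert-base-uncased is just one common checkpoint, and you'd swap in the model and pipeline task that fit your app.

```python
# pip install transformers torch
from transformers import pipeline

# Masked-word prediction with a pretrained BERT checkpoint
# ("bert-base-uncased" is just one common choice).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Transformers are a type of neural [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```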
For more transformer tips, check out my blog post linked below,
and thanks for watching. [MUSIC PLAYING]