The Narrated Transformer Language Model

Video Statistics and Information

Reddit Comments

Is it just me or is NLP way overhyped?

As someone who worked in NLP at a tech company for 2 years, I was blown away by how little of our data was cleaned, prepped and ready for analysis. One project I worked on was training an LSTM to extract skills and education from resumes. The company refused to force customers to tag these tokens in their text, so the burden was passed on to employees. This became a massive bottleneck!!

People act like the tech that makes billions for Google, FB, etc. is equally relevant to their companies, and that's just not the case. You need to trick the customer into doing data prep (Google made people transcribe NY Times photocopies into text, FB lets you tag the faces of your friends, etc.). But without tasking your customers, you'll never accrue the volume of data that's truly necessary to take advantage of SOTA deep learning methods.

Not to say that transformers aren't really powerful and impressive - they are! But I think the applicability of these tools to all companies has been grossly overinflated in public perception.

Bayesian models tend to learn far more from limited information than a neural network, so I think they’re more applicable to the average tech company than whatever state of the art paper google pushes out. (Bayesian models are far too slow on truly massive data sets, but again that’s usually not the situation the average tech company is in.)

👍 20 · u/[deleted] · Oct 26 2020

Hi r/learnmachinelearning,

In this video, I present a simpler intro to transformers than my post "The Illustrated Transformer". I hope it encourages people who are new to the field to feel more comfortable digging in and learning more.

Language modeling is easier as transformer intro material because you don't have to worry about 'encoder' and 'decoder' components. I've also used two distinct examples to showcase the value of the two major components of a transformer block (self-attention and the FFNN).

Hope you find it useful. Please let me know what you think.

👍 7 · u/jayalammar · Oct 26 2020

I thought it was a big brain meme lol

👍 2 · u/Mrthomsonmas · Oct 26 2020
Captions
Over the last two or three years, the AI and machine learning community has been building bigger and bigger models that are doing more and more impressive things. Nowhere is this more apparent than in natural language processing, where we've seen really impressive models like GPT-3. You can refer back to my previous video to learn a little bit more about GPT-3; we saw a couple of demos of things the model was able to do, like learning how to build websites based on a few examples, or retrieving information from the data it was pre-trained on. Here is a quote I found interesting from David Chalmers, a scientist focused on philosophy of mind: GPT-3 is "instantly one of the most interesting and important AI systems ever produced." I agree with the sentiment; I find these models fascinating.

Now, the transformer was first proposed in a 2017 paper that came out of Google and University of Toronto researchers, and it's one of the technologies underlying a lot of this rapid progress in natural language processing and in AI and machine learning as a whole. A transformer is basically a type of machine learning model, an architecture of neural networks. Since then, variants of the transformer, namely a variant called BERT that builds on top of it, have grown to dominate natural language processing across various tasks. If you look at the leaderboards researchers use to rank and rate models, the top-scoring models in that snapshot are all based on BERT, which is based on the transformer. Google has rolled out BERT and transformer-based models to power Google Search, and they call it one of the biggest leaps forward in the history of search. So this is important stuff.

Over the last two weeks we've had developments that extend the value of transformers beyond just natural language processing. There was a conversation between two senior AI and ML researchers, Oriol Vinyals at DeepMind and Ilya Sutskever, the chief scientist at OpenAI, about a paper on vision transformers that applies transformers to computer vision, and Oriol thinks it's "farewell to convolutions." Deep learning in computer vision has been dominated by convolutional neural networks, which have been producing the best models there, but starting a few weeks ago we're seeing transformer-based models that seem to be getting really impressive results in computer vision as well. So transformers are getting more and more important, not just in natural language processing: they're extending their surface area, I guess, to other areas of deep learning beyond just language. It's never too late to learn about them; they're fascinating, and they keep getting more important as an architecture.

I wrote a blog post called "The Illustrated Transformer", I think two years ago, with about a million page views, I believe, of people who were able to learn about transformers based on some of the visualizations and ideas there. That's how I understood transformers myself: I read the paper, went over the code, and illustrated it in that series. In this video we'll describe the transformer again, and if you want more details than the things we discuss here, you can refer back to my blog: just search for "The Illustrated Transformer" and you'll find a lot more detail.
So let's get right into it.

Let's talk a little bit about the architecture of a trained model. We can think about a transformer as a black box (a blue box, in this case) that takes a sequence, say a three-word phrase in French, and outputs an English phrase: it outputs "I am a student" if it's given that French sentence. That's the example from the initial paper, because the initial paper was about machine translation. We can break that general black box down into two smaller components, two smaller black boxes inside of it. One is an encoder stack: the input goes into the encoder stack, which passes some of its processing results to a decoder stack, and the decoder stack outputs the output words or tokens. Each of these stacks is composed of layers, six layers each in the initial transformer, of encoders and decoders. These grew to be called transformer blocks; the initial paper doesn't call them that, but later papers do, I believe. So transformer encoder blocks and transformer decoder blocks are, let's say, the building blocks of transformers.

Now, the initial transformer is an encoder-decoder model, as we've seen, and that is still the case with models like BART, so there are still transformer-based models that use both an encoder and a decoder, but they're not everything that's out there. What we're going to be talking about in this video is transformer language models, which are based on just the decoder stack: they're composed of a series of decoders. Examples of these are GPT-2 and GPT-3. But that's not the only variant of the transformer we have: there are also transformers based on the encoder side of things, and BERT is the greatest example of those. So there's the encoder-decoder transformer, and then the two components can each create interesting types of models on their own. I believe the language model is fascinating on its own, first because it can generate language, which is fascinating, and second because once you understand the decoder, it's very easy to go and learn about the encoder or the encoder-decoder model as well. So our focus here is going to be the decoder. These models are built with various numbers of layers: GPT-2 is 36 blocks stacked one on top of the other, while BERT is about 24 blocks.

A word about training before we get into how an actual trained model works (you can refer back to my GPT-3 video, where we talked a little bit about this). A model starts out untrained, and it goes through a pre-training process where we expose it to a lot of language, and from that a trained language model emerges. For language models, the training data is text: we just get a lot of text from the internet, from Wikipedia, from various websites, and we create training examples from it. How do those training examples look? Say we have this text at the top, the Second Law of Robotics: "a robot must obey the orders given it by human beings." We can slide a window across it and generate examples, because we want to train the model to predict the next word. So for the first example, say, we use the first five tokens and hold out the sixth token as a label. That's one training example that we'll use to train the model later.
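As a rough sketch of that sliding-window idea (this isn't from the notebook; it splits on whitespace for simplicity, where the real model would use its own tokenizer):

```python
# A minimal sketch of the sliding-window idea described above.
# It splits on whitespace for simplicity; the real model uses its own tokenizer.
text = "a robot must obey the orders given it by human beings"
words = text.split()

window = 5  # number of input tokens per example (an arbitrary choice here)
examples = []
for i in range(len(words) - window):
    features = words[i : i + window]  # e.g. ['a', 'robot', 'must', 'obey', 'the']
    label = words[i + window]         # the held-out next word, e.g. 'orders'
    examples.append((features, label))

for features, label in examples:
    print(features, "->", label)
```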
We can do that multiple times, with longer sequences of input, and go across the data like this. Once we have our examples, we present them to the model in a fashion like this: if we have the example "a robot must" and our label is "obey", we want the model to predict that word. But we know it hasn't been trained yet, so it's not going to be able to predict the right word. We present it with the three words only, without showing it the word "obey", and we say: okay, transformer language model, here are these three words, what do you think the word after them would be? The words go into the model, the model does its calculation, and it outputs a word, and that output will be junk at the start, because the model is randomly initialized, so it's going to be a more or less randomly selected word. So we say: no, you said "troll", it should have been "obey". We can calculate the error, the difference between these words; we have ways to put that into a numeric value. After we calculate that error, we have a way of feeding it back to the model and updating it, so that the next time it sees "a robot must", it's more likely to say "obey". We do this thousands, millions, tens of millions of times on all the data we have, and then we have a trained model. That's the pre-training phase of a language model.

So let's talk about transformer language models. There are two major components of a transformer block, and I'll present two examples, one to illustrate each of them. Let's think of a transformer with only one block, one layer. We present it with the two words "The Shawshank" (this is based on the film The Shawshank Redemption) and say: okay, predict the next word after these two. The model has looked through, let's say, Wikipedia; it has been trained on a lot of text. Will it be able to predict what word comes after "Shawshank"? We can actually try this, and one way is to go to the Hugging Face website and try a small variant of GPT. If we go there and enter "The Shawshank" as our input phrase, the model generates a whole paragraph, but we're only really interested in the third word here, and it was able to actually complete the sentence and find the word "Redemption", which is what we were looking for.
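If you'd rather try that completion from code than from the website widget, here's a minimal sketch using the Hugging Face `transformers` text-generation pipeline (assuming the library is installed; with sampling turned on, "Redemption" is likely but not guaranteed on every run):

```python
# A quick way to try the "The Shawshank" completion from code.
# Assumes the `transformers` library (and a backend such as PyTorch) is installed.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

# do_sample=True mirrors the hosted demo's behaviour: results vary per run,
# so "Redemption" is likely but not guaranteed.
print(generator("The Shawshank", max_new_tokens=8, do_sample=True))
```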
The main component here that was able to tell us that the word "Redemption" usually comes after "Shawshank" is the second major component in the decoder block, called the feed-forward neural network. It's also called the MLP, the multi-layer perceptron, in the GPT papers; it's the same thing. This is basically a large, massive neural network that is able to tell us what words usually come after the previous words; from the training process, it is able to make these predictions about the next word given the words we feed into it. Now, this is not new, this is not novel in the transformer: we've had neural language models since about 2003, and before that we had n-gram tables, where we would just look up the probability of each word that comes after a given word. So this idea is not specifically novel to the transformer.

The other major component, though, addresses a much more complicated property of language. Say we give this input sentence to the model: "The chicken didn't cross the road because it". If we only had the feed-forward neural network we just talked about, it would blindly look at what words usually come after the word "it". But first we need a component that comes before it to work out what "it" refers to here, because to really process the word "it", we need to have some sense of whether it refers to the chicken, to the road, or to something else. That is the major problem the other major component of the transformer block addresses. If we present this example to the small model, to DistilGPT-2, and we say "The chicken didn't cross the road because it was", it completes it with "covered in grass", so "it" is probably referring to the road there. You can run it multiple times and you will get different results each time; that's due to something called sampling, which we'll discuss later in the video. Another completion was "and then he thought the sun wasn't so bad", so apparently the chicken is male in that scenario.

So these are the two major components of a transformer block. First, the input token goes through the self-attention layer; that's the first component in the processing steps, and it looks at the entire sequence and bakes in the understanding from previous words that are relevant to the current token, the current position. Then it hands its output to the feed-forward neural network, which continues the processing and produces, let's say, the prediction for that layer. Those are the two major components of a transformer block: the self-attention layer and the feed-forward neural network layer.

Let's continue talking a little bit more about language modeling, to see how the input is processed before we present it to the model, the exact type of output we get out of the model, and how we turn it back into language. How do we turn words into IDs and back? That first step is called tokenization. If we look at our example, "The Shawshank Redemption", I actually oversimplified it: if we present it to DistilGPT-2, it would actually break it, or tokenize it, into four tokens, so we can think about it as four components. The first step after breaking the text down is to convert each token to its ID, an integer representing the index of that token in the model's vocabulary, and that is what's presented to the model. The model also outputs an ID, and we use the tokenizer to translate that ID back into a word, in this case "Redemption".

I've created a small Jupyter notebook for you to try out and have some fun with; the link is in the description below and on the screen as well (a bit.ly shortlink to the "simple transformer" notebook). What it does first is download a pre-trained transformer model from Hugging Face, a language model that can generate words based on the training it has seen. It's about 350 megabytes, so be warned if you need to be careful with your internet connection. The first step is declaring the tokenizer and declaring the model; both are the DistilGPT-2 model from the Hugging Face team, which is a smaller version of a GPT model. Then we declare our input text, "The Shawshank" in this case, we tokenize that string, and we present it to the model: we give it to the model's generate function, which stores the output in an output variable. We then present that output to the tokenizer, and the result, as you can see right next to my face here, is the word "Redemption".
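Here's a minimal sketch along the lines of those notebook steps (the actual notebook may differ in its details); the point is the words-to-IDs-and-back round trip:

```python
# A sketch along the lines of the notebook's steps (the actual notebook may differ).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Words -> token IDs: a tensor of integers indexing the model's vocabulary.
input_ids = tokenizer.encode("The Shawshank", return_tensors="pt")
print(input_ids)  # the model only ever sees these integers, never the words

# Generate one more token greedily, then translate the IDs back into text.
output_ids = model.generate(input_ids, max_new_tokens=1, do_sample=False)
print(tokenizer.decode(output_ids[0]))  # hopefully "The Shawshank Redemption"
```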
So it was able to generate the actual word we were looking for. One reason this implementation detail is interesting is when you think about this quote from GPT-3. It's part of a philosophers' series on GPT-3, where people are playing around with it and getting some interesting philosophical outputs. In this quote, GPT-3, the giant machine learning model, says the following: "To be clear, I am not a person. I am not self-aware. I am not conscious. I can't feel pain. I don't enjoy anything. I am a cold, calculating machine designed to simulate human response and to predict the probability of certain outcomes. The only reason I'm responding is to defend my honor." There are two things I'd love to comment on here. The first, and the reason I'm bringing it up, is that this is interesting if you look at the technical details, because GPT-3 never really said those words; GPT-3 has never seen a single word in its life. What it actually outputted is a string of integers, each one representing a word in its training. All it ever sees is lists of numbers, strings like this; it has never come across any words, and yet it's able to create coherent language once those integers are translated into words. That's the first interesting point. The second, and this goes for all GPT-3 output, is that I would really love to see the input prompt before I judge the output, because depending on how you prime the model, you can really judge whether the output is impressive or not. So I take every GPT-3 result out there with a grain of salt until I see the prompt. That's a quick disclaimer.

So we've seen how the first step is turning each word into an ID, an integer that identifies that token or word. But these numbers, the number 11 or 1,000, don't really have meaning encoded into them, and that's the next step the model goes through in order to really capture some of the meaning behind the words and what they represent. So how can we breathe meaning into numbers? The way that's done is through the embeddings matrix. I talked about embeddings in a blog post called "The Illustrated Word2vec"; that's probably one of the places I would refer you to to learn about embeddings. But basically, when we downloaded a pre-trained GPT-2 model, it came with this matrix: the model knows about 50,000 tokens, that's its vocabulary, and for each one of these tokens it has a vector representing that token. That numeric representation, in this case 768 numbers, captures some of the ideas or meanings behind each token, and it's what the model uses to really make sense of the text we present to it. Part of the training is creating this embeddings matrix, and when you download a pre-trained model, it comes with it. You can look at it in the notebook: it's stored in model.transformer.wte, and you can see that its dimensions are about 50,000 by 768. If you're curious like me and want to actually look at and feel these numbers, we can turn the word "the" into its integer token ID, which is 464, and use that to get the actual embedding. There you start to see the list of numbers that represent the word "the" in this matrix; this is the vector representing the embedding of the word. You can change the words, or change the IDs, to look at the embeddings of various other words.
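A small sketch of poking at that embeddings matrix, assuming the model and tokenizer from the earlier snippet are already loaded (the exact GPT-2 vocabulary size is 50,257, which is the "about 50,000" mentioned above):

```python
# Poking at the embeddings matrix, reusing `model` and `tokenizer` from above.
wte = model.transformer.wte      # the token-embedding layer
print(wte.weight.shape)          # torch.Size([50257, 768]) for (Distil)GPT-2

token_id = tokenizer.encode("The")[0]  # an integer ID for the token "The"
embedding = wte.weight[token_id]       # a vector of 768 numbers for that token
print(embedding[:10])                  # peek at the first few values
```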
Now let's go back to our model. We had our sentence; the tokenizer broke it down into four tokens and turned each token into a token ID; then we use the embedding layer to translate the token IDs into embeddings. So each word is now represented not by one number but by a vector of 768 numbers, and that is what we feed the model: the transformer blocks actually work on these embedding vectors that capture the tokens. The transformer works in parallel, so in this case, since we have all of the inputs present at the same time, it will process all of them at once, each token processed on its own track. A transformer block outputs a vector based on its processing, and that is handed to the next block. We talked about an example with one block, but usually you add more blocks: you get more representational capacity, so the model is able to do more interesting things, to think about higher-level concepts, let's say, or to capture relationships in the data that it couldn't capture using just one block. So a block produces its output, which is fed into the next block, which does some more processing on each position, and so on until it becomes the output of the model. These vectors of the model's processing inside the model are called hidden states. When we're talking about language models, the hidden state we're focused on the most when we process this first step of our inputs is the last one, because that hidden state is the one we will turn into the next word; we'll use it to predict the next word that comes after "The Shawshank".

If you've followed me up until now, you've done really well; we're very close to the end. We turned the text into token IDs, then into embeddings, and we passed it through the various layers. Now, how can we turn a vector, a series of numbers, back into a word? How do we turn computation back into language? This is what we call projecting the output. What we do is multiply the hidden state by a matrix: we pass it through a neural network layer that creates a vector of 50,000 scores, 50,000 numbers. We can think of each one of these as corresponding to a word in our vocabulary, so it's a score saying, for example, that the word "Redemption" has a score of 14; if we're lucky, the word that makes the most sense in this context will have the highest score. One common way to process those scores in machine learning is called softmax: we pass these scores into a softmax function, which makes it cleaner for us to treat them as probabilities. The raw scores, which are technically called the logits, can be negative or positive: they can be, let's say, a thousand, or minus two, or 14.
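Here's a sketch of that projection step, reusing the model and tokenizer from the earlier snippets: it takes the final layer's hidden state at the last position and passes it through the model's output projection (`lm_head` in the Hugging Face implementation) to get one raw score per vocabulary entry:

```python
# Projecting the last hidden state into vocabulary scores (logits),
# reusing `model` and `tokenizer` from the earlier snippets.
import torch

input_ids = tokenizer.encode("The Shawshank", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1][0, -1]  # final layer, last position: 768 numbers
logits = model.lm_head(last_hidden)             # one raw score per vocabulary token
print(logits.shape)                             # torch.Size([50257])
print(logits.min().item(), logits.max().item()) # raw logits can be negative or positive
```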
Once we pass them through softmax, they all become positive, and they all add up to one, so we can think of them as probabilities. That is the output of the model: it gives us the probabilities of the various words. We can pick the highest-scoring one, or we can use other strategies for how to sample the output token. In this case, let's say we chose the highest one, which is "Redemption", and that's what the model outputs.

A final note: this is a little experiment I've been working on. It's not in the notebook I'm sharing, but I'm hoping to release it in the future. Here I have some code, my own sort of library that builds on Hugging Face, that says: let's use "The Shawshank" and generate one more token. The model runs and tries to predict the next word that comes after "Shawshank", and here's the output: the input is token number zero, "The", then three more tokens, one, two, three, for "Shawshank", and the output is "Redemption". So the model did well and was able to generate the correct word. Let's look at the probabilities, because remember, I told you we have assigned a score, a probability, to every word. What we have here are the top five scoring tokens for that position. We can see that at the end of the model, when we projected the output, the model thinks the word "Redemption" has a probability of seventy-five percent of being the next word after "Shawshank". But it's not the only predicted word; there are others with lower probabilities: "en" has 3.6 percent, then a closing parenthesis at about 2 percent, 1.5 percent for "er", and so on. There are 50,000 of those; I'm just showing the top five. We can look at the top 50 if you'd like; this is how that looks: "Redemption" is first again with 75 percent, and then you have "Center", "moment", "Centre" with the British spelling, "man", "photo", and so on. If you want to look at all 50,000 tokens, you can, but that's one way to envision it.

Then we talked a little bit about sampling. You can always pick the highest-scoring token; this is called greedy sampling or greedy decoding: just always pick the top one. But if you've used the keyboard on your smartphone and always chosen the highest-scoring suggestion every time, you know you sometimes go into loops and the model starts repeating itself. So in real life, what usually happens is that we select, or sample, from the top 5 or top 50 tokens based on their scores. If we turn sampling on, we might end up with more interesting results, and it wouldn't just always spit out "Redemption".
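Continuing from the previous sketch (it reuses `model`, `tokenizer`, `input_ids`, and `logits`), here's one way to turn the logits into probabilities, peek at the top tokens, and compare greedy decoding with sampling:

```python
# Softmax turns the logits into probabilities; then greedy decoding vs. sampling.
# Reuses `model`, `tokenizer`, `input_ids`, and `logits` from the previous snippet.
import torch

probs = torch.softmax(logits, dim=-1)  # all positive, summing to 1
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(idx)])!r}: {p.item():.3f}")

greedy = model.generate(input_ids, max_new_tokens=5, do_sample=False)            # always the top token
sampled = model.generate(input_ids, max_new_tokens=5, do_sample=True, top_k=50)  # sample from the top 50
print(tokenizer.decode(greedy[0]))
print(tokenizer.decode(sampled[0]))
```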
That concludes this video. I hope you've enjoyed the overview of transformers. I invite you again to visit the blog and check out "The Illustrated Transformer"; there's a little bit more detail over there. One thing to notice is that in this video I'm using the convention of the inputs going into the model from the top and the outputs coming out at the bottom; it's reversed in the blog post. I have a strong opinion right now that this is actually how we should present it, inputs at the top and outputs at the bottom, because it's easier to read for people going through a web page or a paper. But sometimes you'll see the convention of going from the bottom to the top, so don't be confused by where the input and output are going. I hope you've enjoyed this; let me know if you have any feedback, feel free to subscribe, follow me on Twitter, and thank you for watching.
Info
Channel: Jay Alammar
Views: 60,002
Rating: 4.9402986 out of 5
Keywords:
Id: -QH8fRhqFHM
Length: 29min 29sec (1769 seconds)
Published: Mon Oct 26 2020