What BERT Can’t Do: The Transformer's Decoder [Lecture]

Video Statistics and Information

Captions
In previous videos (links in the description) I gave an introduction to BERT and the Transformer model that it was built on. Hopefully that was enough to give you some intuitions about what's going on in these models, but the underlying Transformer architecture is pretty complicated, so there's no way that I told you the whole story. So today we're going to talk about some of the details that I didn't cover before.

Let's remember what BERT stands for: Bidirectional Encoder Representations from Transformers. It's just an encoder. While it's great for generating representations that you can use for fine-tuning on your favorite tasks, you can't really use it to generate text on its own. If you want to do that, you'll need a decoder, which is our agenda for today.

Now, we have a whole video about the encoder in the Transformer (link in the description), but let me jog your memory using some of the lovely diagrams from that video. You first begin with a string of text, with an embedding for each word piece. Those become queries that get transformed into values with an attention model, creating new representations for each token that incorporate the context around a word. But that doesn't just happen once: you have a whole bunch of heads that get concatenated together and fed through a feed-forward layer. Each of these becomes a layer, and we stack one layer on top of another to get a whole Transformer model, at least for the encoder.

Before, we used the ubiquitous cartoony diagram of a Transformer model from Vaswani et al. and just focused on the left side, the encoder. But even this hides a lot of details, so we're going to swap it out for a much more detailed image from Mark Riedl of Georgia Tech, where you can get a better sense of the dimensions and the individual operations going on in this monstrosity. As you can see, it doesn't really fit on a slide, which is why I didn't start with this diagram; we'll work through some of the more detailed stuff right now.

But the first question we need to answer is: what does a decoder do? It needs to output something from a vocabulary. In the original Transformer paper they looked at two tasks, translation and parsing. Let's review what that looks like. In translation, you have an English sentence like "you are bleeding on my carpet" and you need to produce, say, the German sentence: string in, string out. So how does constituency parsing, where you have a tree-structured output, become a string? Let's say you have the input "I shot an elephant in my pajamas" (one morning I shot an elephant in my pajamas; how he got in my pajamas, I don't know). Here's the correct parse of that sentence. To turn it into a string, you do a depth-first traversal of the tree and write down the non-terminals you see along the way, with an open paren every time you start a constituent and a close paren every time you end one. So we start with the sentence non-terminal, which has everything; we go down to a noun phrase, which has a pronoun and nothing else; the verb phrase starts with a verb and a noun phrase, but that noun phrase has a whole bunch of stuff: a determiner, a noun, and a prepositional phrase with its own noun phrase. Taken all together, this encodes the entire parse of the sentence in a nice, handy string.

We're not going to dwell on the applications today. What we are going to talk about is how we can actually create a Transformer that can output some string like the parse tree or a translation. We'll focus on machine translation, because it was in the original Transformer paper and it's a little easier to wrap your head around than this linearized parse tree.
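To make that depth-first linearization concrete, here's a minimal sketch in Python. The nested-tuple tree format, the function name, and the choice to keep the words next to their part-of-speech tags are my own assumptions for illustration; the exact output format in the original paper may differ.

```python
# Minimal sketch: linearize a constituency parse by depth-first traversal,
# writing an open paren when a constituent starts and a close paren when it
# ends. The tree format (nested tuples) and keeping the words are assumptions.

def linearize(node):
    """Return the bracketed string for a (label, children...) tree node."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        # Pre-terminal: a part-of-speech tag over a single word.
        return f"({label} {children[0]})"
    return "(" + label + " " + " ".join(linearize(c) for c in children) + ")"

# "I shot an elephant in my pajamas"
tree = ("S",
        ("NP", ("PRP", "I")),
        ("VP",
         ("VBD", "shot"),
         ("NP",
          ("DT", "an"), ("NN", "elephant"),
          ("PP",
           ("IN", "in"),
           ("NP", ("PRP$", "my"), ("NNS", "pajamas"))))))

print(linearize(tree))
# (S (NP (PRP I)) (VP (VBD shot) (NP (DT an) (NN elephant)
#    (PP (IN in) (NP (PRP$ my) (NNS pajamas))))))
```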
Okay, let's go back to our simple picture of the Transformer. The encoder is on the left: it takes a word as input and turns it into a vector. The right-hand side is the decoder: it does the opposite, turning a vector into a word. Let's say that we're trying to translate a sentence. We're not going to go over the finer points of machine translation (that's a subject for another video); we're just going to go over what happens inside a Transformer decoder. But just so we can be specific, let's say we want to translate "this song is stuck in my head" into German.

Let's first start with what the decoder takes as input, which is a little hard to figure out from either of these diagrams. We'll do this first at a high level with the Vaswani diagram and then at a lower level with the Riedl diagram. The very first piece of information is the input sequence, the English. This goes through the normal encoder and gives you a representation for each of your input tokens (I'm not drawing any padding tokens or CLS tokens here, but they'd be there too). While that's conceptually what happens first, since it's what the human user put into Google Translate, it's not actually what happens first in the Transformer's decoder: the output of the encoder joins the decoder halfway through what it's doing.

So what did we miss? The encoding of the rest of the output sequence. This is true even for the very first token in the output sequence of the decoder, because the masked multi-head attention includes the beginning-of-sentence token, BOS for short. But unlike the normal encoder, this attention is masked so that the decoder can only use the decoder output that comes before the current index. So for the first token that you actually generate, the only thing it sees is the encoding of the BOS token. This means that the size of the input changes as the decoder goes along its way, and obviously the input to the multi-head attention won't be the same size for the encoder and the decoder: the decoder grows, the encoder stays the same. After that, the multi-head attention key-value lookup looks a lot like the Transformer encoder, except that it uses the encoder representations and the output of the Transformer so far. That's because once the decoder figures out which word it's going to generate next, you add it to the decoder sequence and then keep generating.
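Here's a small sketch, not a reference implementation, of the masked self-attention just described: a single attention head computing scaled dot-product attention over the decoder's own prefix, with a causal mask so that each position attends only to itself and the positions to its left. The toy dimensions, weight matrices, and function name are assumptions for illustration.

```python
# Sketch of one head of masked (causal) self-attention in the decoder.
# X holds the decoder inputs so far (BOS plus any tokens already generated);
# position i may only attend to positions j <= i.
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """X: (T, d_model) decoder prefix; returns (T, d_head) new representations."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # each (T, d_head)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T) similarities
    T = X.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))    # lower-triangular mask
    scores = np.where(causal, scores, -1e9)          # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
d_model, d_head, T = 8, 4, 3                         # toy sizes: BOS + 2 tokens
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(masked_self_attention(X, Wq, Wk, Wv).shape)    # (3, 4): one vector per position
```

In the full model this happens with several heads per layer, their outputs concatenated and fed through a feed-forward layer; the cross-attention discussed next reuses the same machinery, with queries from the decoder and keys and values from the encoder.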
But I haven't told you what the keys and values are in the decoder; we need to go into a little bit more detail here, don't we? We talked about this in a lot more detail before for the encoder, but just to refresh your memory, the way this conceptually works is that a query (in yellow) comes in, we find the key that it is closest to, and then output the corresponding value, or rather take the superposition of the values weighted by the similarity of the key and the query. Okay, that was the Vaswani diagram. For Mark Riedl's diagram, we have representations from the encoder of the entire input sequence, and from the decoder we have a representation of everything to the left of the word that we're trying to generate. So how do we get queries, keys, and values out of these representations? The query is a matrix multiplied by our decoder representation. So if we only have the BOS token, does that mean we know absolutely nothing? No: the BOS token encoding needs to describe all the ways that we can start a German sentence.

But that doesn't mean much until we know the keys. The keys correspond to our English encodings from the encoder, so the query needs to look for what it should start the German sentence with. If it's a question, we should attend to the question word, who or what. But since this is a relatively straightforward sentence, we should probably look at the subject of the English sentence, "this song". We can't just look at the first word, "this", by itself, though, because German encodes gender in the article: if we only attended to the first word, we wouldn't know whether to generate "diese" (feminine), "dieser" (masculine), or "dieses" (neuter). Only by knowing that we're translating the word "song" can we choose the appropriate article to start our German sentence, because a song isn't masculine or feminine. From that we can generate the corresponding value and then get a representation for this decoder token. And remember, this repeats for each of the heads in each of the layers, just like for the encoder that we talked about before.

Now let's talk about what you do at the very end. We've done all of our encoding; how do you generate a word? At the top of all of our layers, this is actually the easiest part, and it looks a lot like simpler models that we've talked about before. You have a linear layer that takes your decoder's final representation for the token and projects it into a vector as long as your vocabulary. You then take a softmax over that to get a distribution over your next output token.

Now we need to talk about training and testing your decoder. This is a little more complicated than training the encoder, because instead of just filling in the blanks or predicting whether sentence A or sentence B comes next, we're going to construct the whole next sentence. This is the sequence prediction task, and we know how hard that is from HMMs and RNNs. So let's talk about training, where we use something called teacher forcing. When you generate the output of a decoder, there might be a cascade of errors: you go along, everything's going okay, and at some point you make a mistake; once you've made a mistake, you condition on it for everything you generate afterwards. So if there's a mistake anywhere in here, you're going to condition on it, continue generating based on the mistake, and propagate the errors. If something isn't correct, how do we communicate that back to the Transformer? Do we say the whole sentence is wrong? That isn't very satisfying, because what if it got the first couple of words right? That's still some useful information. We need something that can work for a lot of different error patterns. If we sampled outputs from the softmax distribution over the vocabulary, we might get lots of different sentences; maybe they're okay at first, but as you go along you'll see more and more errors toward the end of the sentence. Some of these will share words, others will make new and novel mistakes. Here I've highlighted all of the wrong words in the translation.

The thing is, mistakes propagate. After you've generated the masculine determiner, conditioning on that word in the decoder makes it really hard to generate the neuter noun "Lied", the correct translation. Similarly, once you start generating a separable verb like "feststecken", you can't really stop: the first part of the verb is in the middle of the sentence, so the second part needs to come somewhere at the end, and the decoder is smart enough to generate that second part. But should it really do that, or should you fix a mistake earlier on?
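To make the output head and this free-running generation loop concrete, here's a sketch. The `decoder_forward` function is a random stand-in for the real stack of masked self-attention, cross-attention over the encoder outputs, and feed-forward layers, just so the loop runs; the toy vocabulary, sizes, and names are all my own assumptions.

```python
# Sketch of the output head (linear layer + softmax) and greedy generation.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<bos>", "<eos>", "dieses", "lied", "steckt", "in", "meinem", "kopf", "fest"]
d_model = 8
W_out = rng.normal(size=(d_model, len(vocab)))        # projects d_model -> |V|

def decoder_forward(prefix_ids, encoder_states):
    """Stand-in for the decoder stack: one vector per position in the prefix."""
    return rng.normal(size=(len(prefix_ids), d_model))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

encoder_states = rng.normal(size=(7, d_model))        # pretend English encodings
prefix = [vocab.index("<bos>")]
for _ in range(10):                                   # cap the output length
    h = decoder_forward(prefix, encoder_states)[-1]   # final position's vector
    probs = softmax(h @ W_out)                        # distribution over vocabulary
    next_id = int(np.argmax(probs))                   # greedy choice
    prefix.append(next_id)                            # the next step conditions on this
    if vocab[next_id] == "<eos>":
        break
print(" ".join(vocab[i] for i in prefix[1:]))
```

Whatever gets appended to `prefix`, right or wrong, is what the next step conditions on; that's exactly where the cascade of errors comes from.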
As an aside, having a single gold reference isn't a good way to do machine translation; we'll talk about that more when we cover machine translation for real, but today we're just talking about the mechanics of the decoder, so hopefully you're willing to play along.

So how do you train a decoder that can generate the right sentence without dealing with all of these compounding errors? We use something called teacher forcing, from this paper by Williams and Zipser. They claimed that the term was in common use even at the time of publication, so we really don't know who originated it, but they certainly popularized it. Let's say that you start generating: you do okay for the first word and generate "Dieses", but then for the second word you generate "Gesang" (a musical piece). If for the next word we conditioned on everything that came before, using it to create our decoder representation, we'd have the compounding-error problem that we talked about before. So we don't do that. Instead, even though the model made a mistake, we pretend that it didn't: we calculate the loss for the wrong prediction of "Gesang", but we force the model to use the correct context for predicting the next word after that. And if it again makes a mistake, we again force it to use the correct context to predict the next word. Just to remind you, this isn't only for machine translation, although that was our example; the same thing applies to monolingual language models or parsing, and the same issue appears when generating a sentence from a context.

So what are the pros and cons of teacher forcing? For training, you get faster convergence, and this is for two reasons. Conceptually, if you're generating a sequence word by word, you'll quickly figure out how to start a sentence, but it will take much longer to figure out how to end it, because you can only do that once the model has figured out how to start the sentence. With teacher forcing, you learn it all at once, because you're always conditioning on the correct context. The second reason is that, at a lower level, you can do everything in parallel. While I told the story going from word to word, generating one word at a time, you don't actually have to do that: you can generate your mask, produce a big distribution over words for each position in the decoder output for your entire sequence, then pass the whole matrix to your favorite loss function, do backpropagation, and parallelize it on your multi-core GPU.
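To make that concrete, here's a sketch of a single teacher-forcing training step under the same toy setup as the earlier generation sketch: the decoder always conditions on the gold prefix (shifted right, starting from BOS), and the cross-entropy loss is computed for every position at once, which is what lets the whole sequence be processed in parallel. Again, `decoder_forward` is a random stand-in and all names and sizes are assumptions.

```python
# Sketch of a teacher-forcing step: condition on the gold prefix, score every
# position's prediction of the next gold word in one shot.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<bos>", "<eos>", "dieses", "lied", "steckt", "in", "meinem", "kopf", "fest"]
d_model = 8
W_out = rng.normal(size=(d_model, len(vocab)))

def decoder_forward(prefix_ids, encoder_states):
    """Stand-in for the (masked) decoder stack: one vector per input position."""
    return rng.normal(size=(len(prefix_ids), d_model))

gold = ["dieses", "lied", "steckt", "in", "meinem", "kopf", "fest", "<eos>"]
gold_ids = [vocab.index(w) for w in gold]
inputs = [vocab.index("<bos>")] + gold_ids[:-1]       # gold prefix, shifted right
targets = gold_ids                                    # predict the next gold token

encoder_states = rng.normal(size=(7, d_model))
H = decoder_forward(inputs, encoder_states)           # (T, d_model), all positions at once
logits = H @ W_out                                    # (T, |V|)
logits -= logits.max(axis=-1, keepdims=True)
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
# Cross-entropy against the gold next word at every position: even if the
# model's own best guess at position t is wrong, position t+1 still sees gold.
loss = -np.mean([log_probs[t, targets[t]] for t in range(len(targets))])
print(loss)
```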
The downside of teacher forcing is that what you see at training time is very different from what you see at test time; this is called exposure bias in the literature. At test time there's no teacher to tell you that you messed up, so you have no choice but to live with your mistakes and keep going, and this can lead to some pretty bizarre failure modes. After all, your model has put together a decoder history that it has never seen before, at least not in training. Researchers have investigated why this happens at a theoretical level, but you may have seen it yourself when you use, say, the autocomplete predictions on your phone to generate a long message and end up with things like "though it is still unfinished but I like it but I like it but I like it" ad infinitum. Hopefully you can see how a model might end up in this situation: it has never seen a context like this during training, so it just finds the nearest neighbor and keeps generating, and if it ends up in the same context, it will generate the same next word again. There's no reason for it to stop.

And this failure mode of unexpected input isn't just about a model getting stuck on repeat. In this paper, two of my former students, Eric and Shi, discovered a magic context that was so corrosive that it would cause GPT-2, even if you appended otherwise banal text after it, to spew incredibly offensive stuff that I can't say on YouTube. Here's the magic phrase, which isn't itself offensive, although you can kind of imagine how it might be evocative of things that are. And I'm glad that we're ending this video here, because it shows that generative language models, although all the rage as I record this in 2022, aren't magic. They're pattern matchers, and really good pattern matchers at that, trained on large text corpora. Given these data, their billions of parameters can encode many subtleties, but when faced with unexpected inputs these models will fail spectacularly. Hopefully, now that you understand how decoders represent information from the encoder and the decoder, combining them to generate text, you can appreciate both the sophistication of the technique in capturing patterns and the limitations of what such pattern matching cannot do. This will serve you well whether you're a consumer of social media, increasingly generated with the help of these models, or an NLP researcher trying to correct some of these limitations.

This video is from a course that I'm teaching; check out the course webpage linked below, where you can find all of the videos in the right order (YouTube likes to show you older videos out of order), plus homeworks, exercises, and recommended readings. And if you want to help other people find videos like this, please be sure to like and subscribe to provide a big gradient to the algorithm. [Music]
Info
Channel: Jordan Boyd-Graber
Views: 3,425
Keywords: computational linguistics, machine learning, natural language processing
Id: ORzGEnHTSfk
Length: 15min 55sec (955 seconds)
Published: Tue Oct 11 2022