How to get meaning from text with language model BERT | AI Explained

Video Statistics and Information

Captions
When you want your phone assistant to do something for you, you don't have to recite a predefined command; you can talk to it in a natural way and somehow it understands what you want. In this video we'll describe one of the pillars of natural language processing that makes this possible: the self-attention mechanism. We'll see how a few simple but well-placed vector operations can extract precise meaning from possibly long sequences of text. For the sake of simplicity, we'll take as an example the way attention is implemented in the famous BERT model.

The first step in processing text is to cut it into pieces called tokens. There are many variations of how to do this, and we won't go into details, but BERT uses WordPiece tokenization. This means that tokens correspond roughly to words and punctuation, although a word can also be split into several tokens if it contains a common prefix or suffix, and words can even be spelled out if they have never been seen before.

The second step is to associate each token with an embedding, which is nothing more than a vector of real numbers. Again, there are many ways to create embedding vectors. Fortunately, already-trained embeddings are often provided by research groups, and we can just choose an existing dictionary to convert the WordPiece tokens into embedding vectors. The embedding of tokens into vectors is an achievement in itself: the values inside an embedding carry information about the meaning of the token, but they are also arranged in such a way that one can perform mathematical operations on them which correspond to semantic changes, like changing the gender of a noun, the tense of a verb, or even the homeland of a city. However, embeddings are associated with tokens by a straight dictionary lookup, which means that the same token always gets the same embedding regardless of its context.

This is where the attention mechanism comes in, and specifically for BERT, scaled dot-product self-attention. Attention transforms the default embeddings by analyzing the whole sequence of tokens, so that the values become more representative of each token in the context of the sentence. Let's have a look at this process with the sequence of tokens from "walk by river bank". Each token is initially replaced by its default embedding, which in this case is a vector with 768 components. Let's color the embedding of the first token to follow what happens to it.

We start by calculating the scalar product between pairs of embeddings; here we have the first embedding with itself. When the two vectors are more correlated or aligned, meaning that they are generally more similar, the scalar product is higher and we consider that they have a strong relationship. If they had less similar content, the scalar product would be lower and we would consider that they don't relate to each other. We go on and calculate the scalar product for every possible pair of embedding vectors in the input sequence. The values obtained are usually scaled down to avoid getting large values, which improves the numerical behavior; that's done here by dividing by the square root of 768, which is the size of the vectors. Then comes the only non-linear operation in the attention mechanism: the scaled values are passed through a softmax activation function, by groups corresponding to each input token, so in this illustration we apply the softmax column by column. What the softmax does is exponentially amplify large values while crushing low and negative values towards zero. It also normalizes, so that each column sums up to one.
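To make the score computation concrete, here is a minimal NumPy sketch of the steps just described: pairwise scalar products, scaling by the square root of 768, and a column-wise softmax. The embedding values are random placeholders rather than real WordPiece embeddings, and splitting "walk by river bank" into exactly these four tokens is assumed purely for illustration.

```python
import numpy as np

# Minimal sketch of the attention score computation (not real BERT weights).
d_model = 768
tokens = ["walk", "by", "river", "bank"]     # assumed token split for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), d_model))  # one default embedding per token

scores = X @ X.T                             # scalar product for every pair of embeddings
scores = scores / np.sqrt(d_model)           # scale down by sqrt(768)

# Softmax applied per input token (column by column in the video's illustration),
# so that each column sums to one.
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
weights = weights / weights.sum(axis=0, keepdims=True)

print(weights.sum(axis=0))                   # -> [1. 1. 1. 1.]
```

Each column of `weights` tells us in what proportions the corresponding token will borrow from every input embedding in the combination step described next.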
Finally, we create a new embedding vector for each token by a linear combination of the input embeddings, in proportions given by the softmax results. We can say that the new embedding vectors are contextualized, since they contain a fraction of every input embedding for this particular sequence. In particular, if a token has a strong relationship with another one, a large fraction of its new contextualized embedding will be made of the related embedding. If a token doesn't relate much to any other, as measured by the scalar product between their input embeddings, its contextualized embedding will be nearly identical to the input embedding. For instance, one can imagine that the vector space has a direction that corresponds to the idea of nature. The input embeddings for the tokens "river" and "bank" should both have large values in that direction, so that they are more similar and have a strong relationship. As a result, the new contextualized embeddings of the "river" and "bank" tokens would combine both input embeddings in roughly equal parts. On the other hand, the preposition "by" sounds quite neutral, so its embedding should have a weak relationship with every other one, and little modification of its embedding vector would occur.

So there we have the mechanism that lets scaled dot-product attention use context. First, it determines how much the input embedding vectors relate to each other by using the scalar product. The results are then scaled down and the softmax activation function is applied, which normalizes these results in a non-linear way. New contextualized embeddings are finally created for every token by linear combination of all the input embeddings, using the softmax proportions as coefficients.

However, that's not the whole story. Most importantly, we don't have to use the input embedding vectors as is: we can first project them, using three linear projections, to create the so-called key, query and value vectors. Typically, the projections also map the input embeddings onto a space of lower dimension; in the case of BERT, the key, query and value vectors all have 64 components. Each projection can be thought of as focusing on a different direction of the vector space, which would represent a different semantic aspect. One can imagine that a key is the projection of an embedding onto the direction of prepositions, and a query is the projection of an embedding along the direction of locations. In this case, the key of the token "by" should have a strong relationship with every other query, since "by" should have a strong component in the direction of prepositions and every other token should have strong components in the direction of locations. The values can come from yet another projection that is relevant, for example the direction of physical places; it's these values that are combined to create the contextualized embeddings. In practice, the meaning of each projection may not be so clear, and the model is free to learn whatever projections allow it to solve language tasks most efficiently.

In addition, the same process can be repeated many times with different key, query and value projections, forming what is called multi-head attention. Each head can focus on different projections of the input embeddings; for instance, one head could calculate the preposition-location relationships while another head calculates the subject-verb relationships, simply by using different projections to create the key, query and value vectors. The outputs from each head are concatenated back into one large vector, and BERT uses 12 such heads.
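Below is a rough NumPy sketch of multi-head scaled dot-product attention along the lines just described. The projection matrices are random stand-ins for learned weights, the scaling uses the square root of the 64-component key size (the usual convention once the projections are applied, rather than the square root of 768 used for the unprojected example above), and the sizes 768, 64 and 12 follow the figures quoted in the video.

```python
import numpy as np

# Rough sketch of multi-head attention; projections are random placeholders,
# not trained BERT parameters.
d_model, d_head, n_heads = 768, 64, 12
rng = np.random.default_rng(0)
X = rng.normal(size=(4, d_model))            # input embeddings for 4 tokens

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project to query, key and value vectors
    scores = Q @ K.T / np.sqrt(d_head)       # scaled scalar products between queries and keys
    weights = softmax(scores, axis=-1)       # one normalized distribution per token
    return weights @ V                       # mix the values in those proportions

heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, Wq, Wk, Wv))

output = np.concatenate(heads, axis=-1)      # 12 heads of 64 components each
print(output.shape)                          # -> (4, 768)
```

Concatenating the 12 heads of 64 components each brings the output back to 768 components per token, which is the point picked up next in the captions.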
With 12 heads producing 64 components each, the final output contains one 768-component contextualized embedding vector per token, just as long as the input.

We can also kick-start the process by adding positional embeddings to the input embeddings. Positional embeddings are vectors that contain information about a position in the sequence rather than about the meaning of a token. This adds information about the sequence even before attention is applied, and allows attention to calculate relationships knowing the relative order of the tokens (see the sketch after these captions). Finally, thanks to the non-linearity introduced by the softmax function, we can achieve even more complex transformations of the embeddings by applying attention again and again, with a couple of helpful steps between each application. A complete model like BERT uses 12 layers of attention, each with its own set of projections.

So when you search for suggestions for a walk by the riverbank, the computer doesn't only get a chance to recognize the keyword "river": even the numerical values given to "bank" indicate that you're interested in enjoying the waterside, and not in need of the nearest cash machine.
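As a postscript to the captions, here is a small sketch of the positional kick-start mentioned above. It uses the fixed sinusoidal positional encodings from the original Transformer paper purely as a stand-in; BERT itself learns its positional embeddings during training, so the exact values below are an assumption for illustration only.

```python
import numpy as np

# Sketch: add positional information to token embeddings before attention.
# Sinusoidal positions stand in for BERT's learned positional embeddings.
d_model, seq_len = 768, 4
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(seq_len, d_model))   # placeholder dictionary lookups

pos = np.arange(seq_len)[:, None]                        # positions 0..3
i = np.arange(d_model // 2)[None, :]                     # index over dimension pairs
angles = pos / np.power(10000.0, 2 * i / d_model)
positional = np.zeros((seq_len, d_model))
positional[:, 0::2] = np.sin(angles)                     # even components
positional[:, 1::2] = np.cos(angles)                     # odd components

# The sum carries both what each token means and where it sits in the sequence;
# this is what the first attention layer receives as input.
X = token_embeddings + positional
print(X.shape)                                           # -> (4, 768)
```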
Info
Channel: Peltarion
Views: 18,637
Rating: 4.9625778 out of 5
Keywords: deep learning, natural language processing, bert language model, text to image ai, text mining, bert, language model, machine learning, ai explained, bert explained, bert nlp, peltarion, bert model, self attention
Id: -9vVhYEXeyQ
Length: 9min 30sec (570 seconds)
Published: Tue Sep 01 2020