BERT Neural Network - EXPLAINED!

Video Statistics and Information

Captions
Today we're going to talk about BERT, so let's jump into it. This is the Transformer neural network architecture, which was initially created to solve the problem of language translation. It was very well received: until that point, LSTM networks had been used to solve this problem, but they had a few problems of their own. LSTMs are slow to train: words are passed in sequentially and generated sequentially, so it can take a significant number of time steps for the network to learn. They are also not really the best at capturing the true meaning of words. Yes, even bidirectional LSTMs, because even there the left-to-right and right-to-left contexts are technically learned separately and then concatenated, so some of the true context is lost.

The Transformer architecture addresses these concerns. First, it is faster, since words can be processed simultaneously. Second, the context of words is learned better, because context can be learned from both directions simultaneously.

For now, let's see the Transformer in action. Say we want to train this architecture to translate English to French. The Transformer consists of two key components: an encoder and a decoder. The encoder takes in all the English words simultaneously and generates embeddings for every word simultaneously. These embeddings are vectors that encapsulate the meaning of a word; similar words have closer numbers in their vectors. The decoder takes these embeddings from the encoder, along with the previously generated words of the translated French sentence, and uses them to generate the next French word. We keep generating the French translation one word at a time until the end of the sentence is reached.

What makes this conceptually so much more appealing than an LSTM cell is that we can see a clear separation of tasks. The encoder learns what English is, what grammar is and, more importantly, what context is; the decoder learns how English words relate to French words. Both of these, even separately, have some underlying understanding of language, and it's because of this understanding that we can pick this architecture apart and build systems that understand language. If we stack just the decoders, we get the GPT transformer architecture; conversely, if we stack just the encoders, we get BERT: Bidirectional Encoder Representations from Transformers, which is exactly what it is.

The OG Transformer has language translation on lock, but we can use BERT for language translation, question answering, sentiment analysis, text summarization and many more tasks. It turns out all of these problems require an understanding of language, so we can train BERT to understand language and then fine-tune it depending on the problem we want to solve. As such, the training of BERT is done in two phases. The first phase is pre-training, where the model learns what language and context are. The second phase is fine-tuning, where the model learns: "I know language, but how do I solve this particular problem?" From here we'll go through pre-training and fine-tuning, starting at the highest level and then delving further and further into the details with every pass.

So, pre-training. The goal of pre-training is to make BERT learn what language and context are. BERT learns language by training on two unsupervised tasks simultaneously: masked language modeling and next sentence prediction. For masked language modeling, BERT takes in a sentence with some random words replaced by mask tokens, and the goal is to predict those masked tokens. It's kind of like fill-in-the-blanks.
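(As a quick illustration of the fill-in-the-blanks idea, here is a minimal sketch of masked language modeling using the Hugging Face transformers library and the public bert-base-uncased checkpoint; the library, model name and example sentence are my own assumptions and are not mentioned in the video.)

```python
# Minimal masked-language-modeling sketch with a pre-trained BERT.
# Assumes the Hugging Face `transformers` package and the public
# `bert-base-uncased` checkpoint; neither is named in the video.
from transformers import pipeline

# The fill-mask pipeline loads BERT together with its masked-LM head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees a sentence with a [MASK] token and predicts what fills the blank.
for prediction in fill_mask("The encoder learns the [MASK] of every word."):
    print(prediction["token_str"], round(prediction["score"], 3))
```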
Masked language modeling helps BERT understand bidirectional context within a sentence. For next sentence prediction, BERT takes in two sentences and determines whether the second sentence actually follows the first, in what is essentially a binary classification problem. This helps BERT understand context across different sentences. Using both of these tasks together, BERT gets a good understanding of language.

Great, so that's pre-training. Now for the fine-tuning phase, where we further train BERT on very specific NLP tasks. For example, take question answering. All we need to do is replace the fully connected output layers of the network with a fresh set of output layers that can output the answer to the question we want, and then perform supervised training with a question answering data set. It won't take long, since only the output parameters are learned from scratch; the rest of the model's parameters are just slightly fine-tuned, and as a result training time is fast. We can do this for any NLP problem: replace the output layers and then train with a task-specific data set.

Okay, that was pass one of the explanation of pre-training and fine-tuning. Let's go on to pass two with some more details. During BERT pre-training we train on masked language modeling and next sentence prediction, and in practice both of these problems are trained simultaneously. The input is a pair of sentences with some of the words masked. Each token is a word, and we convert each of these words into embeddings using pre-trained embeddings, which gives BERT a good starting point to work with. On the output side, C is the binary output for next sentence prediction: it outputs 1 if sentence B follows sentence A in context, and 0 if it doesn't. Each of the T's is a word vector that corresponds to an output of the masked language model, so the number of word vectors we input is the same as the number of word vectors we output.

In the fine-tuning phase, if we wanted to perform question answering, we would train the model by modifying the inputs and the output layer. We pass in the question followed by a passage containing the answer as the input, and in the output layer we output the start and end words that encapsulate the answer, assuming the answer lies within that span of text.

That was pass two of the explanation. Now for pass three, where we dive further into the details. This is going to be fun. On the input side, how do we generate the embeddings from the word-token inputs? The initial embedding is constructed from three vectors. The token embeddings are the pre-trained embeddings; the main paper uses WordPiece embeddings, which have a vocabulary of 30,000 tokens. The segment embedding is the sentence number, encoded into a vector. The position embedding is the position of the word within the sentence, also encoded into a vector. Adding these three vectors together, we get the embedding vector that is used as input to BERT. The segment and position embeddings are required to preserve temporal ordering, since all of these vectors are fed into BERT simultaneously and language models need that ordering.
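(To make the "add the three vectors" step concrete, here is a rough PyTorch sketch of how token, segment and position embeddings could be combined; the hidden size, sequence length and variable names are illustrative assumptions, not the exact BERT implementation.)

```python
# Illustrative sketch of BERT-style input embeddings (not the official code).
# Vocabulary size, hidden size and maximum lengths below are assumptions.
import torch
import torch.nn as nn

vocab_size, max_len, num_segments, hidden = 30000, 512, 2, 768

token_emb = nn.Embedding(vocab_size, hidden)      # pre-trained WordPiece embeddings
segment_emb = nn.Embedding(num_segments, hidden)  # which sentence a token belongs to (A=0, B=1)
position_emb = nn.Embedding(max_len, hidden)      # position of the token in the sequence

def embed(token_ids, segment_ids):
    # token_ids, segment_ids: (batch, seq_len) integer tensors
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # (1, seq_len)
    # The three embeddings are simply summed to form BERT's input vectors.
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)

tokens = torch.randint(0, vocab_size, (1, 8))
segments = torch.zeros(1, 8, dtype=torch.long)
print(embed(tokens, segments).shape)  # torch.Size([1, 8, 768])
```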
Cool, the input is starting to come together pretty well. Let's go to the output side now. The output is the binary value C and a bunch of word vectors, and for training we need to minimize a loss. Two key details to note here: all of these word vectors have the same size, and all of them are generated simultaneously. We take each word vector and pass it into a fully connected output layer with as many neurons as there are tokens in the vocabulary, which in this case means an output layer of 30,000 neurons, and we apply a softmax activation. This converts each word vector into a distribution, and the label for that distribution is the one-hot encoded vector of the actual word. We compare the two distributions and train the network with the cross-entropy loss.

Note that the output contains vectors for all the words, even those that weren't masked. The loss, however, only considers the predictions for the masked words and ignores all the other words output by the network. This is done to ensure that more focus is given to predicting the masked values, so that the model gets them right, and it increases context awareness.

So those were the three passes explaining the pre-training and fine-tuning of BERT. Let's put it all together. We pre-train BERT with masked language modeling and next sentence prediction. For every word, we take the token embedding from the pre-trained WordPiece embeddings and add the position and segment embeddings to account for the ordering of the inputs. These are then passed into BERT, which under the hood is a stack of Transformer encoders, and it outputs a bunch of word vectors for masked language modeling and a binary value for next sentence prediction. The word vectors are then converted into distributions to train with the cross-entropy loss. Once training is complete, BERT has some notion of language: it is a language model.

The next step is the fine-tuning phase, where we perform supervised training depending on the task we want to solve, and this happens fast. In fact, BERT on SQuAD, the Stanford Question Answering Dataset, only takes about 30 minutes of fine-tuning from the language model to reach about 91% performance. Of course, performance depends on how big we want BERT to be. The BERT-large model, which has 340 million parameters, can achieve much higher accuracy than the BERT-base model, which has only 110 million parameters. There is so much more to say about the internals of BERT, and I could go on forever, but for now I hope this explanation gave you a good idea of what BERT really does under the hood. For more details on the Transformer neural network architecture, which is the foundation of BERT itself, click on this video. Subscribe and stay safe. A lot more content coming your way soon, and I'll see you soon. Buh-bye!
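(As a rough sketch of the masked-LM loss described above, here is an illustrative PyTorch snippet that projects each output word vector onto the vocabulary and lets only the masked positions contribute to the cross-entropy loss; the shapes and the use of -100 as the "ignore" label are my own assumptions, not details from the video.)

```python
# Illustrative masked-LM loss: each output word vector is projected to
# vocabulary size and compared to the true token with cross-entropy
# (which applies softmax internally); unmasked positions are ignored.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, seq_len = 30000, 768, 8

to_vocab = nn.Linear(hidden, vocab_size)          # fully connected layer over the vocabulary

word_vectors = torch.randn(1, seq_len, hidden)    # stand-in for BERT's output word vectors
labels = torch.full((1, seq_len), -100, dtype=torch.long)  # -100 = ignore (token was not masked)
labels[0, 3] = 1234                               # true token id at the single masked position

logits = to_vocab(word_vectors)                   # (1, seq_len, vocab_size)
# ignore_index skips the unmasked tokens, so only the masked words
# drive the gradient, as described above.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss.item())
```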
Info
Channel: CodeEmporium
Views: 116,869
Rating: 4.9675279 out of 5
Keywords: Machine Learning, Deep Learning, Data Science, Artificial Intelligence, Neural Network
Id: xI0HHN5XKDo
Length: 11min 36sec (696 seconds)
Published: Mon May 04 2020