A Deep Dive into NLP with PyTorch

Captions
Okay, so this tutorial is going to be NLP with PyTorch. We're going to do some recap on NLP and the basics of neural nets, and then in the second part go further into more recent developments in NLP, which hopefully gives you a sense of how to build and train more recent deep learning models.

Before we start, here's the link to the slides, and the second link is to all the notebooks; they're hosted on GitHub. I also posted it in the Slack channel for tutorials, where you can go directly, and on that GitHub repo you also have the link to the slides. I'll quickly show you how to open Colab and run a notebook from GitHub: search for scoutbee and make sure you're on the develop branch, because we had to update the notebooks. You can see there are six different notebooks to run through; you can just click one of them. We specifically made the notebooks so they don't take too long to run, so while we're giving the slides you can run them on the side. What's really nice is that Colab also gives you a GPU runtime: you can disconnect and reconnect to a GPU, which runs through all the cells much faster. Quick disclaimer: we won't have the time to run through all the notebooks one by one, but you can run them on the side following the steps I just showed you.

Let's get started. I'm Jeffrey, and this is Susanna, my teammate; we're both data science people at Scoutbee. Our five-second promotion: we're an AI-powered B2B sourcing platform in Germany. We're a startup and very data-science heavy, so we're constantly looking for data enthusiasts, deep learning people, and Python engineers; we're fully Python-based, so check out scoutbee.com.

Prerequisites and goals: we assume you have some machine learning exposure, that you're a somewhat intermediate Python coder or use Python daily, and that you understand the basics of neural nets, so that when we introduce the PyTorch way of building neural nets it's easy for you to follow. The goal is to give you a recap of deep learning and then carry that into the second part, where we show how to build deep-learning-based NLP models with more recent developments, for instance sequence-to-sequence models, or doing transfer learning and fine-tuning.

Here's the outline. The first part introduces PyTorch; some of you may already know these things: static versus dynamic graphs and the basics. Then we recap neural nets: how to train a basic feed-forward network in PyTorch, define your loss functions, do gradient descent, and so on. We also do a slight recap on NLP basics, again in PyTorch: how to do text preprocessing and how to represent your text in a discrete space. Then we move to the more recent material: how continuous representations work with embeddings; then RNNs, which are sequence models — how they work, the different types, and how to do text generation or sentiment classification; and then the more recent sequence models from the last couple of years. Those models are the basis of the recent developments you see in transfer learning, things like GPT-2 and BERT, and this should give you a basic understanding of why they're useful.

Okay, so Susanna is going to start with the PyTorch intro: how PyTorch works and where PyTorch fits in the overall deep learning library ecosystem.
This is a cool graph Karpathy put on his Twitter. It might be kind of hard to see, but the dark purple line is PyTorch, and the pink one at the top is TensorFlow. You can see TensorFlow is still very dominant in academic deep learning research, but PyTorch is on the rise — and of course this doesn't include 2019, only up to 2018.

To talk about deep learning libraries, we first want to talk about what they do — what their primary objective is. We want to be able to define a model, a loss function, and some kind of optimizer; in other words, define how our matrices are processed and how the model is set up to learn. We also want the library to support automatic differentiation, which is backpropagation. So we have this ecosystem with a bunch of deep learning libraries; you might recognize some of them. PyTorch is the one you might not recognize if you haven't seen the updated logo.

A common factor that differentiates these libraries the most, in terms of how their APIs are structured, is whether they are static or dynamic deep learning libraries. The ones in green are dynamic: PyTorch, DyNet, Chainer. The more traditional ones — TensorFlow, Keras, Theano (which I guess is more outdated now), H2O — are all static computation graphs. So what's the difference? With a static computation graph, the whole idea is that we define how we want the model to flow and then we run it; there's a kind of compile step, where we define everything first and then feed data into it. There are some real advantages to doing it this way — this is how TensorFlow works — for example it's easier to distribute. But it's more complex to code, because it doesn't match your intuition, especially with Python being a dynamic, interpreted language, where you'd expect the deep learning library's API to also be dynamic. Dynamic means define-by-run instead of define-and-run: the computational graph of how we want to transform our matrices is defined on the fly. A nice thing about PyTorch is that we can just drop a debugger anywhere and look at what our matrices look like and what the dimensions are; you can't do that so easily in TensorFlow. You're also able to build some more complex kinds of networks that you couldn't easily do in TensorFlow, but mostly it's just a lot more Pythonic, because Python is an interpreted language.

So PyTorch is a deep learning library in Python. You can think of it as ndarrays with GPU support — ndarrays as in NumPy — so tensors look a lot like NumPy arrays, but you can compute gradients on them. It does automatic differentiation and optimization, and it has these dynamic computation graphs. To get a little feel for the PyTorch syntax: we can create tensors just like that, and a tensor is pretty much like a NumPy array. Broadcasting is supported, so if we do two times a tensor, the multiplication is broadcast to each element, which is quite nice. We can also send our tensor to a device, where the device can be your CPU or the GPU ("cuda" for the GPU).
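A minimal sketch of those tensor basics (creation, broadcasting, moving to a device); the values and names are illustrative, not the notebook's exact code.

```python
# Creating a tensor, broadcasting, and moving it to a device.
import torch

x = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # looks and behaves much like a NumPy array

y = 2 * x                          # broadcasting: every element is multiplied by 2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)                   # send the tensor to the GPU if one is available
```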
What's cool about PyTorch is that we can define some variables — say we have x and some weights, and we require gradients on our weights — and then as we type more code this graph starts to fill up. We do some matrix multiplication (torch.mm is matrix multiplication), so we have these operations and we start building our graph; we keep adding operations, and then we even add another line. You can see the computation graph is being generated on the fly as we define our code, instead of all in one step. And if you want to do the backpropagation, you just call .backward().

Okay, so now we're going to talk a little bit about how you use PyTorch, as Susanna just showed, to build a very basic feed-forward network. A bit of recap on neural nets, or really any model: you have your inputs, which are your training data; you feed them into the model, which can be a black box — a feed-forward network, a sequence network, whatever; it generates a prediction, which is a probability guess from the model, or for regression just a number. You use your target, which is basically the truth, against this prediction, and you try to assess how wrong the model's prediction is. That's your loss function, and ideally the lower the better: it means your predicted probabilities are very much in sync with the target distribution. Then you do a weight update: you ask how to go down the hill — how to change the parameters so that the loss goes down.

Looking at it from the math side, assume a basic linear regression problem: you do a matrix multiplication so that you predict a linear output ŷ from the input X, and you normally use a root mean square error, or in this case a mean squared error, to assess how different your prediction is from the target; then you take gradients and update your weights. As a basic feed-forward network it looks like this: you have your inputs, you feed them through hidden layers, and you apply an activation on each layer so the network can fit non-linearities. What's very useful about deep networks is the universal approximation theorem: if you stack multiple layers together, they can approximate essentially any nonlinear function, so in the end you stack more and more layers so the model generalizes better across different ranges of input.

Then there are different kinds of loss functions; here we show the common ones. The loss is important because it tells you how wrong your model is, which gives the model a guideline for how to update its weights. For regression problems you'd mostly use an L2 loss, the linear-regression kind of loss I just showed you. For classification problems, where the output is a probability of a class, you can use the negative log-likelihood — basically a synonym for maximum likelihood estimation — or cross-entropy, which effectively minimizes the KL divergence. One is for regression, the other for classification; throughout this tutorial, whenever we're predicting a class, we'll mostly use cross-entropy.
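A small sketch of the dynamic graph and loss just described: the graph is built as the Python code runs, and loss.backward() computes gradients for every tensor with requires_grad=True. Shapes and names here are illustrative, not the slide's exact example.

```python
import torch

x = torch.randn(32, 10)                       # a batch of inputs (no gradient needed)
y = torch.randn(32, 1)                        # targets
w = torch.randn(10, 1, requires_grad=True)    # weights we want to train

y_hat = torch.mm(x, w)                        # matrix multiplication builds the graph
loss = ((y_hat - y) ** 2).mean()              # mean squared error, a scalar

loss.backward()                               # gradients are now stored in w.grad
print(w.grad.shape)                           # torch.Size([10, 1])
```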
So how exactly do you do that in PyTorch? You instantiate a class that's already built in: nn is a module within PyTorch, and you can just call nn.CrossEntropyLoss(), which gives you this loss object. Then, when you're calculating your loss, you feed it your model's prediction versus the target. You can feed in a whole batch, so the prediction has shape batch size by number of classes and the target has shape batch size by one, and it calculates the cross-entropy estimate for those predictions against the targets.

When you're doing gradient descent, after you have built this computational graph in your model, you want to update your parameters. You set the learning rate, which can be static or dynamic — these days everyone uses dynamic ones, for instance optimizers like Adam, which tune the learning rate and follow your loss surface pretty well — and PyTorch does this for you. The loss surface, in the case of two parameters, is a surface in that space, and you're trying to find the position on it that minimizes your loss; projected to 2D it looks like this. After each parameter update, you want your model to walk slowly down the hill to the point that minimizes the loss, which means the model is doing a good job of predicting the target.

In PyTorch this is quite straightforward. You define your loss function and feed in your prediction versus your target to get the loss, which is a scalar, a single number. You call backward, which takes the gradient of this scalar loss with respect to the computational graph; within the graph you have already defined the weights — x is your input, the w's are your predefined weights — so when you call backward it computes the gradients on the fly and stores them on those variables. The second step is to call optimizer.step(): when you created the optimizer it already knew which parameters it should update, so it multiplies the learning rate (dynamic in this case) with the gradients stored on the variables and updates the weights for you. That's it — essentially four lines of code: you define your computational graph (a matrix multiplication), define your loss, feed in your prediction, and let PyTorch do the rest.
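A sketch of that handful of lines as a self-contained training step; the toy model and random batch are stand-ins, not the tutorial's data.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)                                   # a toy 3-class classifier
criterion = nn.CrossEntropyLoss()                          # expects raw logits and class indices
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam adapts the learning rate

inputs = torch.randn(32, 10)                               # a fake batch of 32 examples
targets = torch.randint(0, 3, (32,))                       # class indices 0..2

optimizer.zero_grad()                  # clear gradients from the previous batch
logits = model(inputs)                 # forward pass: (batch_size, num_classes)
loss = criterion(logits, targets)      # a scalar
loss.backward()                        # backpropagate through the dynamic graph
optimizer.step()                       # update the parameters the optimizer was given
```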
If you're writing your own feed-forward neural net, I'm going to show how, with two or three basic classes, you can build your own feed-forward network. First, how do you take your dataset — in Python you might load it with pandas — and transform it into an object PyTorch can understand? You create a class for your dataset, in this case MyDataset, and inherit from Dataset, which is built into PyTorch. In __init__ you can do whatever operations you want on top: I read my pandas DataFrame, tokenize the words, transform them into vectors, and so on, and store the results on the class, so that when I index into it, it returns the input and the target. You always need to define two methods for this class, because PyTorch requires them by default: __getitem__, so you can index by item — I instantiate my dataset as an object, and when I index it with square brackets, dataset[0] gives me the first record — and __len__, so that calling len on the dataset tells you how many records it contains. And that's how you instantiate it: you call MyDataset, maybe passing a path to the data file you want to read, and then you can index into it or ask for its size.

Now that you've loaded your data, how about the model? Here's what a very basic feed-forward network looks like in PyTorch. It inherits from a different class, nn.Module — basically the neural net module. I create MyClassifier, which is a binary classifier of 0 and 1. When I initialize it, I give it three layers, fc1, fc2, and fc3; they're all linear layers, and the sizes in the parentheses tell it the weight sizes. Because you're calling a linear layer, PyTorch understands these not as plain numbers but as variables. Numbers are things that don't change during differentiation — things like your input or your target, which you don't want the model to update. Variables are things you can train: if you set requires_grad=True on any variable you've defined, PyTorch will treat it as something it can update during a backward pass through the computational graph, and linear layers are exactly that. So essentially you're creating three attributes under MyClassifier: from a size of 128 down to 32, then with the second layer down from 32 to 16, and then down to 1 — one output, meaning 1 or 0.

Then you have the forward method: when you instantiate your class (your model) and call model(input_data), it goes through the forward pass, and this is where PyTorch actually builds up the dynamic computational graph. It goes in sequence: take the input, feed it through the first layer, apply a nonlinear activation called ReLU — a rectifier, basically taking the max between zero and the value, so it zeroes out all the negative values, which gives you a nonlinear output — feed it through the second layer, do the same thing, and then feed it through the last layer, whose output gets a sigmoid activation, which bounds the output between 0 and 1. That's my computational graph. When I instantiate this class, calling .cuda() sends it to your GPU — or you can create a device and call .to(device) — which tells the GPU that this model is available and sends all your parameters over. Then when you call the model on your inputs, it runs the forward pass, which builds up the graph.
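A minimal sketch of those two pieces: a Dataset with __getitem__ and __len__, and a small feed-forward binary classifier with the 128/32/16/1 sizes mentioned above. The data here is random; in the tutorial it would come from a pandas DataFrame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, num_samples=1000, num_features=128):
        self.inputs = torch.randn(num_samples, num_features)       # stand-in for vectorized text
        self.targets = torch.randint(0, 2, (num_samples,)).float()

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]    # lets you do dataset[0]

    def __len__(self):
        return len(self.targets)                      # lets you do len(dataset)

class MyClassifier(nn.Module):
    def __init__(self, input_size=128):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 1)                   # one output: probability of class 1

    def forward(self, x):
        x = F.relu(self.fc1(x))                       # ReLU zeroes out negative values
        x = F.relu(self.fc2(x))
        return torch.sigmoid(self.fc3(x)).squeeze(1)  # bounded between 0 and 1

dataset = MyDataset()
model = MyClassifier()
print(len(dataset), model(dataset[0][0].unsqueeze(0)))
```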
So you have your dataset loaded and your model's computational graph built; let's see how to do the training on top. You define the number of epochs you want to train — in this case 10, to make it fast — and the loader is basically the dataset you prepared, split into batches, say about 512 records each, so your inputs and your targets are both tensors of batch size by input size. I've already created my optimizer, Adam, and I tell my optimizer to zero out all the gradients on the variables, because I don't want the previous run's gradients to affect what I'm doing right now. Then I do the forward pass — as simple as calling the model on an input after instantiating it, to get the prediction, which is also of batch-size shape — and I take my target (batch size of truths) and my prediction (batch size of predictions), feed them through the loss function, and get the loss, which is a scalar. I call backward, and PyTorch does the gradient computation for you automatically — automatic differentiation — and then I call optimizer.step(). You repeat this for the number of epochs you defined, updating the weights so the loss goes downhill. And that's pretty much it for how you build a very basic feed-forward network.
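A sketch of that training loop end to end, with a DataLoader giving batches of 512 and Adam as the optimizer; the model and data are toy stand-ins, not the notebook's.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

inputs = torch.randn(5000, 128)
targets = torch.randint(0, 2, (5000,)).float()
loader = DataLoader(TensorDataset(inputs, targets), batch_size=512, shuffle=True)

model = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):                        # number of epochs
    for batch_inputs, batch_targets in loader: # each batch: (512, 128) and (512,)
        optimizer.zero_grad()                  # don't let old gradients leak in
        predictions = model(batch_inputs).squeeze(1)
        loss = criterion(predictions, batch_targets)
        loss.backward()                        # automatic differentiation
        optimizer.step()                       # walk a step down the loss surface
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```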
Okay, now I'm going to do a very quick breeze through NLP basics: how you do text preprocessing and how you use PyTorch in the case of NLP. First, some use cases where NLP is used in the real world. Language comprehension: things like speech-to-text plus interpreting that text — for instance Alexa: you say "Alexa, please Google something", and it understands your speech, turns it into text, and understands that text. In this talk we focus mostly on text-to-text or text classification kinds of problems, so we won't deal with audio much. The second use case is machine translation, like Google Translate: you have a text, and your output is a text in another language; we'll introduce later how you turn one sequence into a different sequence. Then text generation: give some text as a seed and let the model start writing its own article; or summarization: given a very long text, can you find the most important parts and use a language model to generate a summary that makes sense — on Google News you see those kinds of headlines. Then named entity recognition: we use this a lot at Scoutbee. We get a lot of free text from everywhere, and we try to identify what things are — is it a company name, a product name, and so on — embedded within all this full text from websites. NER does the tagging for you: it marks positions and says this is the class you're looking for, this is a company name, so we can extract that part out; it has a lot to do with information extraction. And sentiment analysis: is it a positive review or a negative review, am I happy in my text review — this is text classification, which we'll also introduce.

So say you have your free text, a long stretch of text: how do you turn that into the dataset I introduced, so you can run your model on top? First you tokenize it: you read your file, which has one column of full-text descriptions of some company, and you tokenize. If we're talking about Western languages, this is mostly how you'd do tokenization; for languages like Mandarin or Japanese it's different. Then you try to remove stop words. Stop words are things like "a", "the", "be", et cetera; they give you very low signal because they're everywhere, so they carry very little information about where in the text you should focus. If you just went by frequency, those words would be ranked on top and would bias your model into thinking they're important, but they actually tell you nothing, because every text has "the". So you remove things like "the" from this sequence of tokens.

Then you lemmatize. You want your vocabulary to cover as much as possible, so that when you meet a new text you don't see a lot of unknown tokens inside it; but you also don't want your vocabulary to be so large that some entries have only one occurrence, because your model would learn nothing from those. In English you have past tenses, plural "s" endings, different cases of a word; you try to find the stem of the word, so you lemmatize: "jumped" in the past tense becomes the present-tense "jump". With that, when you build your vocabulary, you get a higher overlap between the words across different training sets.

You also try to limit your vocabulary size, so that only things that occur frequently get captured; you mask out words like "sneaky" if they don't occur very often. You count over your whole corpus what the frequency of each word is and limit the vocabulary accordingly. Notice that the masked-out "sneaky" keeps its "y" ending: this is a way of masking that doesn't throw away all the information. Instead of taking all the rare words and mapping them to a single unknown token, we bucket them: words ending in "y" tell you a bit, words ending in "er" tell you a bit, so you have several different variations of unknown tokens remaining in your corpus, just masked out — everything ending in "y" goes to the same bucket and gets replaced with that token. And that's pretty much your preprocessing.
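One common way to do those preprocessing steps (tokenize, drop stop words, lemmatize), sketched here with NLTK; the notebooks may use different tools, and the example text is made up.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# one-time downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The quick brown fox jumped over the lazy dogs"
tokens = nltk.word_tokenize(text.lower())                    # tokenization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]          # remove low-signal words
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]  # e.g. "jumped" -> "jump"
print(tokens)
```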
Now what you want to do is turn this sequence of tokens into a representation in numbers that your model can understand. The most classical, traditional one is a discrete representation — what's called one-hot encoding — built on your vocabulary. You do your preprocessing and collect all the unique words; that's basically the vocabulary, the set of things your model knows are words rather than unknown tokens. Then you take your text sequence, which is on top, and for each word you mark its position, so every column has a single 1 and all the rest are 0 — that's why it's called one-hot. What you get out of that is a first numerical representation of the text: you have your corpus, which is this sentence, and each word is encoded as a one-hot vector.

But then how do you know how this corpus is similar to or different from another corpus? Say you have two sentences and you one-hot encode both — each is a matrix. So you build a bag-of-words representation: you take your one-hot encoding and sum along the vocabulary direction, so that if the same word appears in different positions, the count increases. That becomes your bag-of-words vector, and it represents the whole text you're inputting: this word occurs once, that word occurs twice, "the" occurs twice, and so on — though note that normally "the" would already have been stripped out during preprocessing, because you removed your stop words. What you get in Python is, in the end, a dictionary from word to number of occurrences. You may also do indexing, so that you generate your vocabulary as a mapping from word token to index and just use the index to reference the count.

This is the discrete representation of your text, but it has flaws: it can't capture synonyms. If the words are different — say "cat" and "kitten", or "dog" versus "hound" — the representations will be different, and the model can't tell they're related; it thinks they're two separate vocabulary items. Because it's a discrete representation, you can't project the words into a continuous space and measure the similarity between them. We'll talk about that later, but for now this is an easy, comprehensible picture of what the composition of the text is. You can use scikit-learn — not even PyTorch — and its CountVectorizer gives you this automatically.

We can actually use this bag-of-words vector to predict sentiment: take movie reviews, preprocess them, turn each review into a bag-of-words representation, feed that through a basic PyTorch feed-forward network, and use it to say whether it's a positive or a negative review. You have examples like these, where all the inputs are the bag-of-words representation we talked about: in the first example you see "good" twice, "movie" once, "interesting" once — it's very likely a positive review; if words like "terrible" or "hated" occur frequently, it's likely a negative one. So you take the bag-of-words representation, which is basically a sequence of numbers whose length is always your vocabulary size, where each index stores the count of that word in the text, and you feed those numbers through the fully connected network we showed before — the class you can instantiate with as many layers as you want. Your output is still 0 or 1, so you feed it through a sigmoid activation to get the probability prediction (or you can work with the logit, because PyTorch handles that for you), you compute the scalar loss from the 0/1 target versus the predicted probability, and then you do backpropagation with it. That's your first text classifier.
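A sketch of that bag-of-words representation with scikit-learn's CountVectorizer, as mentioned above; the two example reviews are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "good movie very good and interesting",
    "terrible movie I hated it",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)        # sparse matrix: (num_texts, vocab_size)

print(vectorizer.vocabulary_)                 # token -> column index
print(bow.toarray())                          # each row counts word occurrences
```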
Let's see how you do it exactly. You can open Colab and load this first notebook into it. I won't run it on the fly, but I've already generated the outputs for you. These are the packages you need — you see nn, functional, and so on. Then the preparations: you download the text, and this is what it looks like as raw data: one column is the full text of the reviews, and then you have the label, 0 or 1. What we try to do is take this and build a bag-of-words representation for each of the records. In PyTorch, you initialize your Dataset class, read your data, and run CountVectorizer on it — that's from scikit-learn — which turns it into the bag of words and also generates your vocabulary plus the reverse vocabulary, that is, the mappings from index to token and from token to index. Your sequences are your training data — in the end, this bag-of-words representation — and your labels are the targets you want to predict; ideally your prediction ends up almost the same as the target. And of course you need to write those two functions, to get an item by its index and to get the size via len. Then I just load it and we can see a few examples. (Feel free to interrupt at any time if there's any problem or confusion while you're running the notebook.)

So that's the model we want: a very simple feed-forward network where the input is our bag of words and we predict the class. What's written over here is what this diagram shows: we take our input, do some matrix multiplication, apply an activation, and do a few of those until we hit the output. When I wrap it in a sigmoid, that scales the output to between 0 and 1, and we use that as the probability prediction. Then we define the loss, do the backward propagation, and call our optimizer to take a step over all the parameters, so that we're going downhill.

Over here is how you define the neural net, same as what I showed before: you initialize it and can pass a few parameters — the size of my vocabulary and how many neurons I want in each layer. You usually give it powers of two, 512, 256, and so on; it's actually a bit more optimized that way. When you print the model, PyTorch will tell you what kind of model structure you have: here it's a very simple model with a vocabulary of 3000, which we project to a size of 128, then further down in the second layer, until at the output layer we predict 1 or 0.

For the loss, we initialize what's called binary cross-entropy — it's also built in. You could use the regular cross-entropy, which automatically assumes multi-class, but since we only have 1 and 0, I use binary cross-entropy. Then you create your optimizer, in this case a dynamic one called Adam, which dynamically adjusts your learning rate so you take steps without overshooting. Calling model.parameters() returns only the variables that are trainable, so it works, and it's the standard way of doing it — it makes sure none of the static values get fed in and changed. We give it an initial learning rate, and it adapts it each round as it runs. Then you call model.train(), which just tells your model that it's in training mode, and you run the loop. We deliberately do only ten epochs, and it's quite a small dataset, so it only takes seconds to get the output. We train for ten rounds, and the loader gives me batches of size 512, batch by batch.
For each of those batches I do the same thing — that's basically one pass over the model: I zero my gradients on the parameters, I do a forward pass that generates my prediction, I use the prediction versus the target to get the loss — this is the loss number — and I use that number to differentiate backwards, and then take a step in the right direction. We also record how the loss goes down each round; in the output you can see it starts from about 0.7 and goes down.

We can see a few examples: we give it a text, tokenize it, and do the preprocessing, which turns it into a bag of words; then we ask the model for a prediction, and if the output is more than 50% we print positive, otherwise negative. For instance, "this poor excuse for a movie, terrible..." is very likely a negative review, and the sigmoid output agrees; the other one, "Cool Cat Saves the Kids is a symbolic masterpiece", with words like interesting or intriguing, comes out as positive sentiment. Those are a few examples; you can write your own as well, give it a text, and use this method to get the prediction. That's pretty much the basic intro. Next we'll talk about continuous representations: right now we have a discrete one, but how do you capture synonyms between words?

Okay, so a big thing that made deep learning for NLP possible was this idea of the embedding. Before, words in NLP were always treated categorically, like with this one-hot representation, but what we really want is a similar representation when there's a similar meaning — we want to have that property. So we come up with this idea of the embedding, where for every word we have a vector: you look up the word and there's some kind of representation, and we hope that whatever representation we come up with has the property that synonyms get similar vectors. You can see here, for instance, that a couple of the words have relatively similar vectors — that's what we want. This is also how you can create an embedding layer in PyTorch: you can basically have PyTorch learn the embeddings on the fly while you're training.
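A small sketch of such an embedding layer in PyTorch: each word index is mapped to a dense, trainable vector, as described above. The sizes and indices are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 3000, 100
embedding = nn.Embedding(vocab_size, embedding_dim)

word_indices = torch.tensor([12, 7, 255])     # three tokens from a sentence
vectors = embedding(word_indices)             # shape (3, 100), trained with the rest of the model
print(vectors.shape)
```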
But let's go into how we can train these embeddings ourselves, or how people train them: word2vec. I won't go into the details of the algorithm, but I'll go over the high-level concept. With word2vec we make the assumption that words that appear in similar locations have some kind of similar meaning, and if we get enough text, we can imagine getting a really good representation of a word based on its neighbors. The basic idea is that a neural network is trained to predict a target word given some kind of context word: imagine we pick a random word, then pick another random word within, say, some number of neighbors, and say that those two form a training example, because the two words have some connection in meaning. This is called the skip-gram model. There's also the famous quote, "You shall know a word by the company it keeps" — that's the idea of the context giving you the meaning.

The traditional pre-trained embeddings that people use for these NLP models are GloVe, from Stanford; this is one of the figures from their paper. The really cool thing is that they were able to create these representations for every word, and the representations actually have semantic meaning behind them. The vectors they train are 50-, 100-, or 300-dimensional, and there's an algorithm called t-SNE, an approximation algorithm, that takes these high-dimensional representations and plots them in 2D. We can see the words plotted here, and they even have a kind of semantic geometry to them: for example, take the word "king", subtract "man", add "woman", and that approximately equals "queen". These kinds of analogies are how we can tell the algorithm is working correctly: on the bottom right you see "king", on the bottom left "man", top left "woman", and if you do the vector operation you get "queen". That's pretty cool.

There's also a way to visualize these GloVe embeddings: there's a really cool repo where someone took these high-dimensional word embeddings and plotted them in 3D, so you can zoom around — these are all from GloVe. It might be hard to see from here, but on your own time you can look at it; it's super cool. Words like mother, father, sister, brother will all be in their own little cluster; countries in Africa will be in their own little cluster; whatever topic you pick, you'll be able to see where it fits in this 3D space. It's a really nice visualization of how the embeddings work.

There's also an embeddings notebook you can check out. We won't go into as much detail as with the other one, but a cool takeaway from it is how to load these GloVe vectors easily. Traditionally, people would grab the file from the Stanford page, open it, and split each line — that's a bit tedious. Instead we can use a Python library called pymagnitude, which loads them in an efficient way: you download the .magnitude file (it took about two seconds on my internet connection), and then you can query a list of words and get the GloVe vectors back — nice and easy. A cool thing is that we can also do vector operations on the words: if we define a cosine similarity function and get the GloVe vectors for the two words we want to compare, we can compute their cosine similarity. Random words like "dog" and "cat" have a high similarity, "tree" and "cat" lower, "king" and "queen" higher — so you can compute these kinds of similarities between any two words.
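A sketch of that cosine-similarity comparison, assuming a GloVe .magnitude file has already been downloaded; the file name is a placeholder, not necessarily the one the notebook uses.

```python
import numpy as np
from pymagnitude import Magnitude

vectors = Magnitude("glove.6B.100d.magnitude")    # placeholder path to a downloaded file

def cosine_similarity(a, b):
    # cosine of the angle between two word vectors: 1.0 means very similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vectors.query("dog"), vectors.query("cat")))   # relatively high
print(cosine_similarity(vectors.query("tree"), vectors.query("cat")))  # lower
```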
I also show the t-SNE plot here: I thought of two categories of things, animals and household objects, and just listed them off the top of my head; then we use the t-SNE algorithm to plot them in two dimensions, and you can see there's a very clear difference. The t-SNE algorithm has no idea what's an animal or what's a household object — all it sees are these vectors of numbers — and it's still able to plot them in such a way that there's a very clear distinction in how the vectors are formed. That's the general idea.

Okay, so now let's talk about RNNs. Think about what sequences are: think of weather patterns — it might be sunny two days and then rain, so we have sun and rain; music is also a sequence, where each note is predicted from the previous note; and text is of course also a sequence. What they all have in common is some kind of time axis: some pattern or distribution that evolves over time. These things appear everywhere.

Neural networks up to this point didn't really have a good way of handling sequences like this, because they'd look at everything together and couldn't capture this idea of what came before and what comes after. With a recurrent neural network we get a loop that feeds into itself: at every time step, the network feeds some kind of state back into itself. Looking at the basics of an RNN, we can take this compact representation and unfold it over time: for every input x there's a time step, so we give each input to our recurrent network at its time step, and it passes a hidden state on to the next one. So it's input, hidden state, and then some output: at each time step the RNN unit uses the previous hidden state and the input at that time step to produce a new hidden state and an output. The general function we use to describe this: h_t is the hidden state, and there's some function f that takes in the previous hidden state h_{t-1} and our input x_t, parameterized by theta, so it has weights as well — h_t = f(h_{t-1}, x_t; θ).
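A minimal sketch of one step of that simple RNN, matching h_t = tanh(W_xh x_t + W_hh h_{t-1} + b); the sizes and random sequence are illustrative.

```python
import torch

input_size, hidden_size = 50, 64
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # the new hidden state depends on this time step's input and the previous hidden state
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = torch.zeros(hidden_size)              # initial hidden state
for x_t in torch.randn(10, input_size):   # a sequence of 10 inputs
    h = rnn_step(x_t, h)                  # the hidden state is carried forward
```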
We also have different types of RNN setups. One-to-one is just no RNN at all — a classic neural network, like a feed-forward network for image classification. One-to-many is where you have one input image and then generate some kind of caption for it — one input to many outputs (the blue boxes here are the outputs). Many-to-one is where we take a movie review, which is several words, and produce one output, like whether it's a positive or a negative review. Many-to-many covers things like machine translation and text generation: we have an input in English, say, and we translate it to German. And there's another many-to-many variant, for example video classification at the frame level.

Then we have something called the vanishing gradient problem. Traditional RNNs had a really hard time managing long-term dependencies: they would forget things really quickly, over just a few time steps. You can imagine that when we do the backpropagation, at every one of these steps a gradient gets multiplied in, because of the chain rule; so if some gradient early on has a value less than one, it gets multiplied and multiplied and multiplied and becomes really tiny. That's the vanishing gradient. The effect for an RNN is that words that appeared far back have a negligible impact later on if they had a small gradient, which is really bad. For example, you can say something like "Harry, my best friend and classmate from my childhood back in Oklahoma, is here", and if we ask our network who is here, it's going to say "Oklahoma", because that was right before. The RNN isn't able to comprehend something that's far back but that we need to take into account to answer the question.

Another problem is the exploding gradient. It's similar to the vanishing one, just the other way around: instead of the early gradients being less than one, they're larger than one, and as they get multiplied and multiplied they become really huge. One possible remedy is gradient clipping: you say you want the max (or min) value of your gradient to be some value, and it gets capped there. But there's still this fundamental problem that RNNs have a really hard time handling long-term dependencies, and this motivates the discussion of different RNN architectures.
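A sketch of that gradient-clipping remedy in PyTorch: after backward(), cap the overall gradient norm before the optimizer step. The model, input, and dummy loss are stand-ins.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 15, 10)                 # (batch, sequence length, features)
output, hidden = model(x)
loss = output.pow(2).mean()                # a dummy loss just to produce gradients

optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # cap exploding gradients
optimizer.step()
```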
So this is the simple RNN we just talked about — it's pretty simple. Then there's the LSTM, which you've probably heard of: it addresses the vanishing gradient problem by remembering things; it has certain gates that help it remember. And we also have the GRU, which came after the LSTM and was basically a simplification of the LSTM: it has fewer operations, which made it train faster without suffering too much performance loss.

The simple RNN is described by this equation: we take the previous hidden state and the input x_t, we have weights, we take the products and add them together, and apply a tanh to squash everything between -1 and 1 — and that's how we get our new hidden state. One thing you'll see on the next slides: the notation with the little circle is just an element-wise product — the fancy name is the Hadamard product — it's the way matrix multiplication might look like it works at the beginning of your first math class.

With the LSTM it gets a bit more complicated, so we'll go through each part. C is our cell state, h is our hidden state, x is our input, and then we produce the next hidden state and the next cell state. The first line, in cyan, carries the previous cell state, and it's basically a fast lane: because RNNs had a hard time taking old memories and pulling them all the way to the current prediction, this provides a fast lane for the LSTM to transfer information. Different operations contribute to this fast lane, but the lane itself only has a multiplication and an addition — things either get zeroed out or added to it — so it helps with maintaining old memories better.

This first gate is called the forget gate; that's the f_t term. The picture here corresponds to the equations — they're equivalent, it's just that some people find one easier to read than the other. Each of the boxes in green is its own little neural network, each parameterized by weights — you see W_f and U_f, which are the parameters of this particular network. The forget gate applies a sigmoid, which scales things between 0 and 1, and tells the network whether we want to forget something or not: close to 1 means "remember everything", close to 0 means "let's forget this thing". An example of where this is useful: if we have a model that's trying to track, say, the gender of the person being talked about by their name, then when we get a new name we might want to forget the gender pronoun we were using before and start over for the new person.

The next one is the update (input) gate. It takes the previous hidden state, applies a sigmoid, and says which parts of the previous hidden state we want to focus on; we also have a tanh, which scales the candidate values — this is the input, and that is what we want to remember of the input — and then we add these to our fast lane, the cell state. And then there's the output gate, the o_t term, which is how we compute the next hidden state. This is all pretty complicated, but the overall idea is that LSTMs are pretty good at remembering stuff that happened long ago, much better than a simple RNN.

Something you might wonder is why we need both this cell state and this hidden state, since they're both doing similar kinds of things: both take the previous output and help predict something new. The GRU actually doesn't have the cell state: it was simplified to have just the hidden state, which it modifies — similar, just a simplification. It has an update gate: in the LSTM there were both a forget gate and an update gate, and they do similar things — one says "forget these things", the other says "remember these things" — which isn't so nice, since they're both doing roughly the same job; so here there's a single update gate, and there's also a reset gate. Overall it's a simplification: the LSTM has a lot more operations — you can see there's a lot more going on if you count them — and the GRU has a lot fewer, so it's a lot quicker to train, and what they found is that the performance is similar.
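For reference, a reconstruction of the standard LSTM gate equations the speaker is describing (the notation may differ slightly from the slides):

```latex
\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(update/input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state ``fast lane'')}\\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
\]
```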
Okay, so let's talk about the PyTorch RNN and how we define our model. We take our class and inherit from nn.Module like before; we must define an __init__ and a forward. In __init__ we also call the parent class; we define an embedding layer as our encoder, we define our RNN as a GRU, and we also define a decoder, which takes the output shape of the GRU and projects it onto the shape of our target — so if we're doing binary classification we want one output, and if we want to distinguish between three classes we put three. Then we define forward: how we want to take the inputs we're given and transform them into our output. We first apply the encoder to our input to get the encoded representation; then we apply the RNN, which takes an input and a previous hidden state — to get that hidden state we actually call init_hidden up here, outside the model definition, and that's what gets passed in; then for our output we apply the decoder, and we return the output and the hidden state. We basically call this in a loop, once per item in our input.

So let's look at a sentiment text classification example. This one is an RNN: before, we were doing text classification with a bag of words, which didn't take the positions of the words into account at all; this time we'll use an RNN. One way this can do better: if you say "not good", you need the positions to know what that means — you need to know that "not" appears before "good"; in a different ordering, or unordered, it might have a different connotation. Like "not good" versus "not bad": with bag of words you can't distinguish those as well.

The basic idea: we'll use a GRU. We define an RNN module which has an encoder with an embedding layer, we define a GRU, and then we have a decoder — so it's very similar to the code snippet you just saw. Instead of explicitly taking in the hidden state and sending in one input at a time in a for loop, we take all the inputs together and apply them to the RNN all at once, so the passing of the hidden state happens under the hood instead of us doing it explicitly. So we have our inputs, the encoder, the GRU, the decoder, and we get logits back; the logits and the target go into the criterion, which is the loss function that tells you how your model is doing; you get the loss and then perform gradient descent. We instantiate BCEWithLogitsLoss, which does a binary cross-entropy (with the sigmoid built in), we define the optimizer with Adam, and then, similar to before, we just have another training loop — it's actually nearly identical to the one before. We can see that the model has learned something, and then, as before, we can get either negative or positive sentiment for these different reviews.
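A sketch of such a GRU-based sentiment classifier: an embedding encoder, a GRU, and a linear decoder that projects to one logit. The sizes are illustrative, not the notebook's exact values.

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size=3000, embedding_dim=100, hidden_size=128):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, 1)          # one logit: positive vs. negative

    def forward(self, token_indices):
        embedded = self.encoder(token_indices)            # (batch, seq_len, embedding_dim)
        output, hidden = self.rnn(embedded)               # hidden: (1, batch, hidden_size)
        return self.decoder(hidden.squeeze(0))            # logits, shape (batch, 1)

model = RNNClassifier()
criterion = nn.BCEWithLogitsLoss()                        # sigmoid + binary cross-entropy in one
batch = torch.randint(0, 3000, (4, 20))                   # 4 reviews of 20 tokens each
print(model(batch).shape)                                 # torch.Size([4, 1])
```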
The next notebook we're going to look at is character generation. Given some input characters, we set our targets to be the same sequence, just one time step off, so each character is used to predict the next character — that one-off setup. We can look at this model to help understand what's happening: we take our input and our hidden state, and for every time step of the input we apply the GRU to it and pass the hidden state back into itself. It's just offset by one time step: you always want the previous character to predict the next one. This is teacher forcing — you always have the previous ground-truth token predict the next one. That's the language model concept: we train a model so that if I write "I am", it predicts the next word. You can use a very long text, like a novel or something, and slide a window over it: you give a few words, predict the next one, then move the window a bit, and keep training like that. That's what a language model is.

Then the text preprocessing: we have our Dataset class, we load our data, we get a vocab size, and we build something that turns characters into indices and the indices back into characters. So for every input we have the list of integers describing each character's position in the dictionary. Then we have an RNN model here — very similar to the one before, but in this one we actually do carry the hidden state explicitly: instead of passing all the inputs at once and having one target, we give it one character, get the hidden state, and calculate the loss per character, so we keep feeding the correct answer back to the model instead of letting the model carry its own state. You can see this in the training loop: for each (inputs, targets) pair in our batch, for every character in our input we apply the model to that one character, and then we apply the criterion to that one output it predicted versus the correct next character — that's why there's this one-character offset.

The results aren't going to look amazing, because it's a really simple model, but — since our training data was weight-loss articles — you can see that it has generated English-looking text that sounds vaguely like weight loss. We'll do a text generation example after this that's considerably better, but this is the simple model that's easier to understand.

On the question about dropout: if you do use dropout, you call model.train() during training and model.eval() for evaluation, and that turns off your dropout and so on. If you were doing both a training step and a validation step in this loop, what I usually do is have separate functions, one called train and another called validate: in the train one I call model.train() and in the validate one model.eval(), so that for the evaluation the dropout and everything is turned off. We left that out here for simplicity.
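A sketch of that train/validate split: model.train() enables dropout and similar layers during training, model.eval() turns them off for validation. The function and argument names are just illustrative.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, criterion, optimizer):
    model.train()                                  # dropout and similar layers are active
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

def validate(model, loader, criterion):
    model.eval()                                   # dropout is turned off
    total = 0.0
    with torch.no_grad():                          # no gradients needed for evaluation
        for inputs, targets in loader:
            total += criterion(model(inputs), targets).item()
    return total / len(loader)
```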
A short recap: we saw bag of words, a discrete representation that only captures which words occur in a text and how often; it knows nothing about synonyms. Then we talked about embeddings, which capture the meaning of a word by taking its context into account, i.e. where it normally shows up. We found that we can feed those embeddings into the RNNs you just saw and create meaningful vectors for a whole text: we feed in a sentence and the hidden state accumulates information about the words seen so far. Bag of words does not take the position of the words into account, it only records that, say, a word occurs three times; the RNN representation does. If we take the last hidden state, it contains the ordering of the previous words plus, because we fed in embeddings, the meaning of each word, so it captures meaning as well as position. We used that for language modeling, i.e. predicting the next word or character.

We can also build sequence-to-sequence models, which are useful for things like translation: take a sequence of English text and turn it into a representation that can be used to emit a translation in French or German. That is what we are going to talk about next. The simplest sequence-to-sequence model works like this: the input is a sequence of text, say English, and we feed it into an encoder, which you can imagine as a simple RNN or GRU. At each step it reads a new word and constructs the next hidden state from that word together with the previous hidden state, and this continues until the whole input sequence has been turned into the context vector you see here. That context vector is passed to the decoder, which generates the target one position at a time: given this context, which conveys all the information of the source sequence, it produces the translation step by step. Here it is an English-to-German translation, and both the encoder and the decoder can be recurrent neural networks, so you have two RNNs, one reading the input and one emitting the output.

Let's look at some applications of sequence to sequence. Machine translation is one, from text in one language to text in another. We can do question answering or chatbots, where someone asks a question and the model answers. We can do date formatting. We can also do speech-to-text: the audio is cut into small chunks with timestamps, and the model predicts the text for each chunk or group of chunks. At Scoutbee we do name-to-domain: we sometimes see a company name on a website, extracted with NER, and it is very important for us to know that company's website, so we use the name as the input sequence and predict its domain. You can see the examples here: the first is a translation from English to German, the next is QA ("How are you?" -> "I'm fine"), and another is date formatting, where a dataset contains dates written as free text in many different formats, like "10th of July 2019", and we normalize them all into one output format. In our case it is name to domain: given the name of a company, can we predict its URL, the domain of its website? If you train it well, it makes quite useful predictions. A minimal sketch of the basic encoder-decoder loop follows below.
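A minimal sketch of that encoder-decoder loop, assuming greedy decoding and illustrative vocabulary sizes, an sos_token id, and layer dimensions; none of these are the notebook's exact values.

```python
import torch
import torch.nn as nn

# Plain sequence to sequence: the encoder GRU compresses the source sentence into its
# last hidden state (the context vector), and the decoder GRU generates the target
# one token at a time starting from that context.
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embedding_dim=64, hidden_size=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embedding_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embedding_dim)
        self.encoder = nn.GRU(embedding_dim, hidden_size, batch_first=True)
        self.decoder = nn.GRU(embedding_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, src, max_len=20, sos_token=1):
        # Encode the whole source sequence; keep only the final hidden state.
        _, context = self.encoder(self.src_embed(src))
        # Decode step by step, feeding back the previously predicted token.
        token = torch.full((src.size(0), 1), sos_token, dtype=torch.long)
        hidden, outputs = context, []
        for _ in range(max_len):
            dec_out, hidden = self.decoder(self.tgt_embed(token), hidden)
            logits = self.out(dec_out[:, -1])            # (batch, tgt_vocab)
            token = logits.argmax(dim=-1, keepdim=True)  # greedy choice of the next token
            outputs.append(logits)
        return torch.stack(outputs, dim=1)               # (batch, max_len, tgt_vocab)

model = Seq2Seq(src_vocab=5_000, tgt_vocab=6_000)
src = torch.randint(0, 5_000, (4, 12))                   # a fake batch of source sentences
print(model(src).shape)                                  # torch.Size([4, 20, 6000])
```

In practice the decoder is usually trained with teacher forcing, feeding the true previous target token rather than its own prediction, just like the character model earlier.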
Here is an animation; there are many blogs out there with great animations that let you understand this in five to ten minutes. Each time step of the input is fed into the encoder, the RNN we talked about before. Say the input is French: from the first word the encoder emits a hidden state, then the RNN takes the second input together with the previous hidden state and generates a new hidden state, and so on. What you then do is take the last hidden state, which you assume contains all the information from the previous tokens, and give it to the decoder. The decoder proceeds one time step at a time, considering its new input, the previous word it produced, together with the context, to generate the translation. That is basically how it works, and you can picture both sides as RNNs and how the two models connect. This other animation shows it in more detail: hidden state one, hidden state two, hidden state three, and the last hidden state is the one used to emit the next word of the translation.

On top of this we want to add something called attention, and this attention mechanism boosts the model quite a lot. Say you have a plain LSTM for your encoder and an LSTM for your decoder. Although an LSTM captures long-term dependencies much better than a plain RNN, you still only feed the last hidden state to the decoder, and in that last hidden state the most recent words end up carrying the most weight. One technique people tried is reading the text bidirectionally, from both sides, and feeding both contexts in, but then it is the start and the end of the sequence that get captured best. Neither approach tells you, when I am at a given position of the decoder, which input word is the most important, i.e. the alignment: if my input is "I am Jeffrey" and I am at the second time step of the German output, which input word should I focus on most? That is the idea of attention.

There are two differences between attention and a simple sequence-to-sequence model built from RNNs or LSTMs. First, instead of feeding only the last hidden state, we take all the hidden states from the source, the outputs at every input position, and feed them all to the decoder. Second, at each time step, when the decoder is about to predict, it looks at all those source contexts, the vectors from each input time step, and decides which one is the most important. In the end this is an alignment problem: at my current output position I look back at all the input time steps and ask which input matters most for the prediction I am about to make. That is the idea of attention.

Here it is in more detail; don't worry for now about how the weights are actually calculated. The encoder stage is exactly the same as in the simple sequence-to-sequence model: at each time step you get an output, so if the input is three words, say "I love PyTorch", each word is represented by a hidden state h1, h2, h3. The decoder at each step takes the previous output token, gets a new hidden state from its own RNN, which is the representation of the previous target word, and compares that state with all the encoder hidden states. It keeps going until it decides that the most likely next token is the end token. When training, you wrap each target with a start token and an end token, so the model knows where a sequence starts and ends and can learn how long the output should be.
In practice, when training, we also set a maximum length, so the model does not run on forever if it never emits the end token. So this is what happens at each step: the decoder takes its hidden state, the representation of the previous target word, compares it with all the encoder hidden states, and computes a score for each, saying which ones matter most. In the visualization, the lighter a cell is, the more emphasis it gets. You then use those scores as weights: you scale each of the encoder vectors and sum them, so the context vector at each decoder time step looks different, even though the encoder always gives you the same outputs, because each decoder step puts its emphasis somewhere else. You feed that context vector together with the decoder hidden state through a feed-forward layer with a tanh, so you take the concatenation, which has two times the hidden size, project it down and then onto the vocabulary space, and a softmax tells you which next word is most likely. That is one step. The second step works the same way: it takes the previous hidden state, compares it against all the known source context vectors, decides which ones are most important at this position, and spits out the next word. That is the overall idea.

Why does this perform so much better? First, the decoder gets all the context from the input rather than a single compressed vector; second, at each step it can focus on the part of that context that is most relevant for the next prediction. This is the so-called encoder-decoder attention, and you can think of it as an alignment task: I take my decoder state at the current time step, compare it to all the encoder states, and ask which part I should pay attention to. Attention determines which part of the input sequence is most important for my current decoder time step, which is why it is called encoder-decoder attention: it compares the encoder sequence against the decoder sequence. So over here you have all the encoder positions and one decoder position, a score is computed for each pair, and a softmax over those scores says which of the encoder positions matters most at this time step. For instance, translating from English to French, at the first output step "the" might be the most important input word. This matrix of alpha_ts values is from the original paper showing that attention boosts the BLEU score of translations; it contains the attention weights. Each cell is scaled by the softmax to a value between 0 and 1, and each row sums to 1. When you look at the plot you can see how the alignment goes: it is not necessary that decoder time step i corresponds to encoder time step i; the model learns the alignment by itself, because the word order can differ between the two languages, so a word over here can correspond to a word at a different position on the other side. And that gives a big boost: the model can say that this particular token or context from the encoder is very important at this time step, which is why the decoder produces an accurate output at this position. So you take the weights, scale the encoder outputs by them, as we saw before, and sum them into the context vector for this specific time step; that is why you see c_t at the decoder, and it is used to generate the attention output that finally gives you the prediction. That is how the cell works.
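Here is a minimal sketch of one such attention step, using the concatenation-based score the talk refers to; the class name, layer shapes, and example tensors are illustrative assumptions, not the notebook's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One decoder step of encoder-decoder attention: score every encoder position against
# the current decoder state, softmax the scores into weights, build the weighted-sum
# context vector c_t, combine it with the decoder state through a tanh layer, and
# project onto the vocabulary.
class AttentionStep(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.attn = nn.Linear(2 * hidden_size, hidden_size)   # transforms [decoder; encoder] pairs
        self.v = nn.Linear(hidden_size, 1, bias=False)        # reduces each pair to one score
        self.combine = nn.Linear(2 * hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden:  (batch, hidden)           state for the previous target word
        # encoder_outputs: (batch, src_len, hidden)  one hidden state per source position
        src_len = encoder_outputs.size(1)
        expanded = decoder_hidden.unsqueeze(1).expand(-1, src_len, -1)
        energy = torch.tanh(self.attn(torch.cat([expanded, encoder_outputs], dim=2)))
        weights = F.softmax(self.v(energy), dim=1)             # (batch, src_len, 1), sums to 1
        context = (weights * encoder_outputs).sum(dim=1)       # weighted sum = context vector c_t
        combined = torch.tanh(self.combine(torch.cat([context, decoder_hidden], dim=1)))
        return self.out(combined), weights.squeeze(2)          # logits over vocab, attention weights

step = AttentionStep(hidden_size=128, vocab_size=6_000)
logits, attn = step(torch.randn(4, 128), torch.randn(4, 10, 128))
print(logits.shape, attn.shape)   # torch.Size([4, 6000]) torch.Size([4, 10])
```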
Okay, now to how you do this in PyTorch. On the previous slide you saw the scores before the weighting, and there are different methods for computing those scores: you can do a matrix multiplication between the decoder and encoder states, and so on; you can see the family of them here. We are using the most original one, from the paper, where you transform both the encoder output and the current decoder hidden state, concatenate them, and squash them with a tanh, which gives a score that tells you how well they are aligned. You feed those scores into a softmax to scale them between 0 and 1, which gives a distribution over the different input contexts, and you use that to scale the encoder outputs, the contexts at the different encoder positions, into the actual context vector at the current stage. That is how you do attention with sequence to sequence in PyTorch.

But a few researchers at Google found that attention is really what matters. A sequence-to-sequence model gives you a representation at each time step, but it is the attention that gives you all the alignments between the input and output sequences: for each output it looks at all the inputs and decides which is most important. That is why the paper "Attention Is All You Need" came out, about one and a half years ago now, and it really changed the game in NLP. What they said is: we can use a lot of attention. We use attention just to align the input text against itself, among its own words, and if we stack these attention layers together, that becomes an encoder; we stack attention layers in the same way to build a decoder. Can we use just that and get rid of RNNs altogether? So you stack many of these attention layers, which together capture all the alignments within and between the texts. One of the good things is that this is highly parallelizable, because the attention heads can be trained in parallel across GPUs. They also introduced something called multi-head attention, which initializes many different attention heads; you can think of it like an ensemble, where each head learns its own alignments within the text. They said it is much faster to train; in reality they trained it on very beefy machines, and that is what generated everything that came afterwards, like OpenAI's GPT-2 and the other Transformer-based models: extremely large machines and extremely large models, but very useful for transfer learning and so on. Okay, let's go a bit into the detail of how transformers work.
In the end a transformer is a stack of encoders and decoders, similar to before. The thing that changes is that each of those encoders is no longer an RNN; it is built from what is called self-attention. These self-attention layers are stacked together, and what they do is create an alignment of the input text with itself: say my input is English text like "I love PyTorch"; at each position, self-attention asks which of the other words in my own input this word relates to most, for the first position, the second position, the third position, and so on. That is what happens inside the encoder. The decoder does something similar: it takes the target output, feeds it through self-attention, adds the attention over the encoder's output, and from that determines which output it should generate at each position. So it still works in an encoder-decoder fashion, English to German.

What it actually looks like is that inside each encoder and each decoder there are two parts. One is the self-attention: you take the sentence and, for each word position, determine which of the other words it relates to most; you take that representation and feed it through a normal feed-forward network, and then you stack these blocks up, a stack of encoders and a stack of decoders, as on the previous slide. Imagine you have just one of each: you then have a representation from the encoder, which you send into the encoder-decoder attention, and on the other side you have the decoder's self-attention, the representation of the decoder's step. The encoder-decoder attention, similar to the attention we talked about before, takes both of those inputs and predicts the next word.

Okay, so this is what the equation looks like. You take your input and generate something called a query, a key, and a value. Why is it called self-attention? Because the query and the key come from the same sentence, the same sequence. In the encoder-decoder attention from the previous slide, the query is the decoder sequence and the key is the encoder sequence, and you try to align the decoder with the encoder; self-attention instead looks at a single sequence and says: my query is myself at this time step, my keys are all the positions of the same sentence, and which of them is most important? So it takes the dot product with the transpose, scales it, and applies the result to the values, and that gives you the encoder representation coming out of the self-attention. The decoder's self-attention works the same way: it takes its own sentence and asks which words correspond to which. On top of that they do multi-head attention: each head is one attention of this kind, and you create several of them, which acts like an ensemble and lets the model learn different alignments within the same text. I'm not going to go into all the details; that is what the overall model from the paper looks like, and I'll skip these two slides, which you can look at later.
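Here is a minimal sketch of that scaled dot-product self-attention equation, softmax(Q K^T / sqrt(d_k)) V; the projection layers, dimensions, and example tensor are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Scaled dot-product attention: each position's query is compared against every
# position's key, and the softmax-weighted mix of the values is returned.
def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5   # how much each position attends to the others
    weights = F.softmax(scores, dim=-1)
    return weights @ v                            # weighted mix of the value vectors

d_model = 64
to_q = nn.Linear(d_model, d_model)
to_k = nn.Linear(d_model, d_model)
to_v = nn.Linear(d_model, d_model)

# In self-attention Q, K and V are all projections of the same sequence.
x = torch.randn(2, 5, d_model)                    # (batch, seq_len, d_model), e.g. "I love PyTorch ..."
out = scaled_dot_product_attention(to_q(x), to_k(x), to_v(x))
print(out.shape)                                  # torch.Size([2, 5, 64])

# Multi-head attention runs several of these in parallel on smaller slices of d_model;
# recent PyTorch versions ship this directly as nn.MultiheadAttention.
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
out, attn_weights = mha(x, x, x)                  # query, key and value are the same sequence
```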
Let's see how it actually works in a notebook. Here we use an easier setup: a full transformer, with its many heads, does not train well at this small scale, so what we show is the sequence-to-sequence model with attention that I showed before, where the attention is the encoder-decoder attention.

You prepare the data in a similar fashion: you inherit from the dataset class and you pad the sequences so they all have the same length. There is a start-of-sequence token, an end-of-sequence token, and a token for unknown words. You pre-process your text into those tokens, map words you cannot handle to the unknown token, and generate the sequences of indices that represent your input and output sequences; a small sketch of this preparation appears at the end of this part.

Then here is the model definition, matching the diagram from before. You create an encoder, which takes the embedding of the input word and feeds it into an RNN, in this case a GRU; in the forward pass it takes the previous hidden state and spits out the next one, so it is a plain recurrent step. You define a decoder, which in its first part is again a GRU, a recurrent net taking the embeddings on the decoder side, and which also holds the attention weights; in its forward pass it takes the encoder outputs together with the current hidden state from the GRU, computes the scores, turns them into attention weights, and uses those to predict the output. Those are the two classes, encoder and decoder, and then you put them together: you instantiate a model, an attention encoder-decoder, in which you call your encoder and your decoder, and in each forward pass the data goes through the encoder to generate the hidden context, the context at that time step, and the decoder's hidden state together with that context generates the next word prediction. Training is done in a similar fashion: you instantiate your model and it is exactly the same procedure as before.

And here are a few examples it generated. I was really proud of that; I don't speak French, so I cannot tell how good it is, but this is only one layer, one layer of encoder and one layer of decoder, trained for only ten epochs. For the words I do know it looks fairly decent; if the decoder produced a completely random sequence of text, then maybe it was not learning anything, but it actually outputs something resembling French. The idea was to make very simple notebooks that are easy to understand rather than to train for production. For production, the first thing is that you would do more pre-processing, so that you narrow down to the domain you really want, and the other thing is that we would probably build beefier models that predict much better, because this was trained for only ten epochs on one layer; if you imagine a stack of three layers trained for several thousand epochs, the output would become much more convincing. And yes, to evaluate it properly you can use an existing dataset and compute a BLEU score on it.
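As a small illustration of the data preparation described at the start of this part, here is a sketch with assumed token ids and a tiny made-up vocabulary; the notebook's actual values will differ.

```python
# Special tokens for padding, start of sequence, end of sequence and unknown words,
# plus a helper that turns a sentence into a fixed-length list of indices.
PAD, SOS, EOS, UNK = 0, 1, 2, 3
vocab = {"<pad>": PAD, "<sos>": SOS, "<eos>": EOS, "<unk>": UNK,
         "i": 4, "love": 5, "pytorch": 6}

def encode(sentence, vocab, max_len=8):
    ids = [vocab.get(tok, UNK) for tok in sentence.lower().split()]
    ids = [SOS] + ids[: max_len - 2] + [EOS]       # wrap with start/end markers
    return ids + [PAD] * (max_len - len(ids))      # pad to a fixed length

print(encode("I love PyTorch", vocab))   # [1, 4, 5, 6, 2, 0, 0, 0]
```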
On the pre-processing question: when you pre-process you typically lowercase, remove stop words and so on, so those are stripped off. For a character-level model especially, it is easy to just leave in all the special characters and all the capitalization. The downside is that your sequences will be much longer, because you are now feeding single characters, but you no longer have the problem of unknown tokens, and you can keep the capitals because your vocabulary is just going to be capital and lowercase letters, some digits and so on. For translation at the word level I would still lowercase, because you don't want a different casing of the same word to become a separate vocabulary entry; you want those counts to be high and overlapping, so the model best learns at which position it should generate that word. You can then do some post-processing on the predictions to turn them back into properly formatted text. All right, I'll just fast-forward to the end slide and we'll leave it there.
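As a final, tiny illustration of that character-level versus word-level trade-off (the example text and cleanup steps here are made up, not from the notebooks):

```python
# A character-level vocabulary stays tiny, so capitalization can be kept and there are
# no unknown tokens; a word-level vocabulary is usually lowercased so that "Good" and
# "good" map to the same, well-trained entry.
text = "Not good, Not bad."

char_vocab = sorted(set(text))                 # every distinct character becomes a token
word_vocab = sorted(set(text.lower().replace(",", "").replace(".", "").split()))

print(len(char_vocab), char_vocab)             # small, includes capitals and punctuation
print(len(word_vocab), word_vocab)             # ['bad', 'good', 'not']
```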
Info
Channel: Coding Tech
Views: 32,479
Keywords: nlp, natural language processing, pytorch, machine learning
Id: 4jROlXH9Nvc
Length: 100min 17sec (6017 seconds)
Published: Tue Jul 23 2019