Sentiment Analysis on ANY Length of Text With Transformers (Python)

Captions
In this video we're going to take a look at how we can apply sentiment analysis to longer pieces of text. If you've done this sort of thing before, in particular with transformers, or even LSTMs or any other NLP architecture, you'll find there is an upper limit on the number of words we can consider at once. In this tutorial we're using BERT, which is a transformer model, and that means it consumes at most 512 tokens; anything beyond that is simply truncated, so we never consider anything past that limit. In a lot of cases, for example when you're analyzing the sentiment of tweets, that's not a problem, but when we start to look at news articles or Reddit posts, they can be quite a bit longer than 512 tokens. And when I say tokens, a token typically maps to a word or a piece of punctuation. So what I want to explore in this video is how we can remove that limitation and consume as many tokens or words as we'd like, and still get an accurate sentiment score while considering the full-length text.

At a high level, this is what we're going to be doing: we take the original tensor, which here is 1,361 tokens, and split it into chunks, so we have chunk 1, chunk 2, and chunk 3. All of these chunks will end up being 512 tokens long; chunks 1 and 2 are 512 already, but of course 1,361 can't be split evenly into 512s, so the final chunk will be shorter. Once we've split the text into chunks, we need to add padding, and we need to add the start-of-sequence and separator tokens; if that's new to you, don't worry, we'll explain it very soon. Then we calculate the sentiment for each chunk, take the average, and use that as the sentiment prediction for the entire text. That's essentially, at a high level, what we're going to be doing, but it's much easier said than done, so let's jump straight into the code and I'll show you how we actually do this.

What we have here is a post from the investing subreddit. It's pretty long, something like 1,300 tokens once tokenized, which is obviously far beyond the 512-token limit we have with BERT, so if we want to consider the full text we have to do something different. The first thing we want to do is initialize our model and tokenizer. Because we're using BERT for sequence classification, we import the BertForSequenceClassification class from the transformers library; that's going to be our model class, and then we also need the tokenizer, BertTokenizer. Those are our two imports, and then we need to initialize the tokenizer and the model. The tokenizer is straightforward: we load it with from_pretrained, since we're using a pre-trained model. If we open the Hugging Face models page, huggingface.co/models, we can search for the model we'd like to use. We're doing text classification, so we filter by text classification, and because the investing subreddit is basically full of financial talk, we want, if possible, a more financially savvy BERT model, which we can find with FinBERT. There are two FinBERT options here; I'm going to go with the ProsusAI/finbert model, and all we actually need is that model name, which we copy back into our code.
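As a minimal sketch of that initialization step, following the imports and model name described above:

```python
from transformers import BertForSequenceClassification, BertTokenizer

# load the FinBERT tokenizer and classification model from the Hugging Face hub
model_name = "ProsusAI/finbert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
```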
We pass that same model name, ProsusAI/finbert, to BertTokenizer.from_pretrained and to BertForSequenceClassification.from_pretrained, and that's all we need to do to initialize our model and tokenizer. Now we're ready to tokenize the input text.

When it comes to tokenizing input text, for those of you who have worked with transformers before, it typically looks something like this: we write tokens = tokenizer.encode_plus(...), we pass our text, and we add the special tokens, so the CLS, separator, and padding tokens. All of these tokens are used by BERT for different purposes. We have the padding token, which we use when a sequence is too short; BERT always requires 512 tokens in its inputs, so if we're feeding in 100 tokens we add 412 padding tokens to fill that empty space. The unknown token is used when a word is unknown to BERT. Then we have the CLS token, which appears at the start of every sequence, and its token ID is 101; we'll be using this later, so it's important to remember that number. We also have the SEP token, which indicates the separator between our input text and the padding, or, if there is no padding, simply marks the end of the text. Those are the only special tokens we really need to be concerned about. Typically we add them because BERT needs them, we specify a max length of 512 tokens, and we say anything beyond that should be truncated and anything below it should be padded up to the max length. The result is a dictionary with input_ids, token_type_ids (which we don't need to worry about), and the attention_mask.

In this case, though, we're doing things slightly differently. First, we don't want to add the special tokens immediately, because that would place a CLS token and a SEP token at the start and end of one big tensor, and we don't want that, since we're going to split that tensor up into three smaller tensors; we'll add those tokens manually later. Second, we don't want max length, truncation, or padding, because truncating our 1,300-token text down to just 512 tokens is exactly what we would normally do: we wouldn't be considering the whole text, only the first 512 tokens. So in our case we still use the encode_plus method and pass in the text, but we set add_special_tokens to False, and that's it, we don't include any of those other arguments. The only extra parameter we do want, whenever we're working with PyTorch, is return_tensors='pt', which tells the tokenizer to return PyTorch tensors, whereas otherwise we'd just get simple Python lists (if you're using TensorFlow, you'd switch that to 'tf'). Let's see what that gives us.
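A quick sketch of that tokenization call; `txt` here is a placeholder name for the long Reddit post loaded earlier in the video:

```python
# txt holds the long Reddit post as a plain Python string
tokens = tokenizer.encode_plus(
    txt,
    add_special_tokens=False,  # we add CLS/SEP to each chunk manually later
    return_tensors="pt",       # return PyTorch tensors rather than Python lists
)
```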
Okay, so here we get a warning about the sequence length, and that's fine because we'll deal with it shortly. We can also see that we now have PyTorch tensors rather than the lists we had before, which is exactly what we want. Next we want to split each of our tensors, the input IDs and the attention mask, into chunks of length 510; we don't need to do anything with the token type IDs, so we can get rid of those. The reason I'm using 510 rather than 512 is that at the moment we don't have our CLS and separator tokens in there; once we add those, that pushes 510 up to 512. Splitting into chunks is incredibly easy: we write input_id_chunks, access the input_ids in our tokens dictionary, and because that tensor is like a list within a list, we take its zero index and then call split(510), which is a PyTorch method. That's literally all we need to do to split our tensor into batches. We repeat the same thing for the mask, just changing input_ids to attention_mask; again, we don't need the token type IDs, so we ignore them. Then, to check we're doing this correctly, we print the length of each of our tensors: we get 510, 510, and, as explained before, a shorter final chunk of 325. That's exactly what we want, and now we can move on to adding in our CLS and SEP tokens.

Let me show you how this is going to work with a smaller tensor as a quick example; we also need to import torch for this. To add a value on either side of a tensor we can use the torch.cat method, which concatenates multiple tensors. We call torch.cat and pass a list of all the tensors we'd like to join. We don't have tensors for our special tokens yet, so we just create them inside that list with torch.tensor: if you remember from before, the CLS token corresponds to the token ID 101, so that goes at the start, in the middle we have our actual tensor, and at the end we append a tensor containing 102, which is the separator token. If we print that out, we can see 101, then our sequence, then 102 at the end.

After we add the CLS and separator tokens, we'll use the same method for our padding as well, but we want to write this logic inside a for loop that iterates through each chunk and processes each one individually. First I'm going to create a variable defining the chunk size, which is 512, our target size. We've already split our tokens into chunks up above, so we can just iterate through a range over the number of chunks, which will go 0, 1, and 2, and access each chunk using the index i.
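Here's a sketch of the splitting step and the small torch.cat example described above, continuing from the previous snippet; the example tensor values are just illustrative:

```python
import torch

# split the single long tensor into chunks of up to 510 tokens
# (510, not 512, because CLS and SEP will be added to each chunk later)
input_id_chunks = tokens["input_ids"][0].split(510)
mask_chunks = tokens["attention_mask"][0].split(510)

for tensor in input_id_chunks:
    print(len(tensor))  # e.g. 510, 510, then a shorter final chunk

# quick example of adding a value to either side of a tensor with torch.cat
example = torch.tensor([200, 300, 400])
wrapped = torch.cat([torch.tensor([101]), example, torch.tensor([102])])
print(wrapped)  # tensor([101, 200, 300, 400, 102])
```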
Inside the loop, we first add the CLS and separator tokens, just like we did above: we set input_id_chunks[i] to torch.cat of a list containing a torch.tensor with 101, the current chunk, and a torch.tensor with 102. Then we want to do the same for our attention mask, but of course the attention mask is just full of ones, and the only two values it can contain are one or zero. The reason for this is that wherever we have a real token that BERT needs to pay attention to, we have a one in the attention mask, whereas a padding token corresponds to a zero, so that BERT doesn't compute attention over the padding tokens in our inputs; it's essentially telling BERT to ignore the padding. In our case, neither the CLS nor the SEP token is padding, so both of them get a one. That gives us our sequences with the CLS, separator, and the added attention mask values in there.

Now we need to do the padding, and realistically we only need to pad the final tensor. To make sure we don't try to pad the other tensors, we check the length of each one first: we calculate the required padding length, which is just the chunk size minus input_id_chunks[i].shape[0], which is like taking the length of the tensor. For chunks one and two this will be zero, whereas for the final chunk it won't be; it will be something like 150 or 200. So we say: if the pad length is greater than zero, this is where we add our padding tokens. We take input_id_chunks[i] (note it's chunks, not chunk, and the mask variable should be mask_chunks, so let's fix those) and again use torch.cat: first the existing chunk, followed by the padding tokens. To create those, we use torch.tensor again and pass a list containing a single zero multiplied by the pad length, so if the pad length is 100, this gives us a tensor containing 100 zeros, which is exactly what we want. Then we copy that and do the same thing for our masking tensor.

Now let's print out the length of each of those tensors, so for each chunk in input_id_chunks we print its length, and we'll also print out the final chunk so we can see everything is in the right place. One thing to fix first: if we print input_id_chunks as it comes out of split, we see it's actually a tuple containing our three tensors, and tuples are immutable in Python, which means you can't change the values inside them. So before we start this whole process, we convert both of them to lists so that we can actually reassign the chunks. After adding the missing s on the variable name, we finally get there. At first I see 514, but that's only because I ran the cell twice, which added the special tokens twice; after rerunning from the top we get 512 for every chunk, and then we can see our tensor.
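Putting that together, here's a sketch of the whole chunk-processing loop; the variable names follow the ones used in the video, and it continues from the previous snippet:

```python
chunksize = 512

# split() returns tuples, which are immutable, so convert to lists first
input_id_chunks = list(input_id_chunks)
mask_chunks = list(mask_chunks)

for i in range(len(input_id_chunks)):
    # add CLS (101) and SEP (102) tokens around each chunk
    input_id_chunks[i] = torch.cat([
        torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
    ])
    # CLS and SEP are real tokens, so the attention mask gets a 1 on each side
    mask_chunks[i] = torch.cat([
        torch.tensor([1]), mask_chunks[i], torch.tensor([1])
    ])
    # pad the final (shorter) chunk up to the full chunk size with zeros
    pad_len = chunksize - input_id_chunks[i].shape[0]
    if pad_len > 0:
        input_id_chunks[i] = torch.cat([
            input_id_chunks[i], torch.tensor([0] * pad_len)
        ])
        mask_chunks[i] = torch.cat([
            mask_chunks[i], torch.tensor([0] * pad_len)
        ])

for chunk in input_id_chunks:
    print(len(chunk))  # 512, 512, 512
```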
Printing out the input ID chunks, we can see all the values: at the bottom of the final chunk we have the padding, at the top we have our start-of-sequence token, 101, and just before the padding we have the end-of-sequence separator, 102. Now we want to stack our input ID and attention mask tensors together. We create input_ids using torch.stack on the input ID chunks, and we do the same for the attention mask using the mask chunks. The format BERT expects us to feed in is a dictionary of key-value pairs: a key called input_ids that maps to our input IDs tensor, and another called attention_mask that maps to the attention mask tensor, so we define that dictionary here. As well as that, BERT expects these tensors to be in a particular format: the input IDs should be a long tensor, so we add .long() onto the end, and the attention mask should be integers, so we add .int() onto the end of that. We print out the input dictionary to check what we're feeding in, and that's exactly the format we need.

Now we can get our outputs. We pass the dictionary into our model using the double-asterisk syntax, which unpacks it into keyword arguments, so the model reads those keys as parameter names and assigns the tensors to them. In the outputs we have these logits, which are the activations from the final layer of the BERT model. What we want in the end is a set of probabilities, and these logits are clearly not probabilities: probabilities should lie between zero and one, and here we have negative values and values over one. To convert the logits into probabilities, all we need to do is apply a softmax function to them; softmax is essentially a sigmoid applied across a set of output classes. To implement that, we use torch.nn.functional.softmax, we access the output logits, which sit at index zero of the outputs, and we apply it over dimension -1. The -1 just means the final dimension of the tensor; negative indexing wraps around to the end, the same way it does with a Python list, so the softmax is applied across the output classes for each chunk. Printing the result, we now have our probabilities from the FinBERT model: the values in the first column are the predictions of each chunk having a positive sentiment, the second column is the negative sentiment prediction, and the third column is the neutral sentiment prediction.
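A sketch of the stacking, model call, and softmax steps as described above, continuing from the loop in the previous snippet:

```python
# stack the processed chunks into single batched tensors
input_ids = torch.stack(input_id_chunks)
attention_mask = torch.stack(mask_chunks)

# BERT expects a dict of tensors, with input IDs as long and the mask as int
input_dict = {
    "input_ids": input_ids.long(),
    "attention_mask": attention_mask.int(),
}

# unpack the dict into keyword arguments and run the model
outputs = model(**input_dict)

# convert the raw logits into probabilities across the three classes
probs = torch.nn.functional.softmax(outputs[0], dim=-1)
print(probs)  # one row per chunk: [positive, negative, neutral]
```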
So we see that the first and second chunks are both predicted to have a negative sentiment, particularly the first one, and the final chunk is predicted to have a positive sentiment. If we want the overall prediction, all we do is take the mean of the probabilities along dimension zero, so we take the mean of each column: the mean of the three positive scores, the three negative scores, and the three neutral scores. Printing that out, negative sentiment is winning here, but only just; it's pretty close to the positive score, so it's a reasonably difficult one to call, which makes sense given that the chunks came out as mostly negative, somewhat negative, and mostly positive. Still, negative sentiment does win out in the end. If you'd like to get the specific category as a number, we just take the argmax of the mean, which gives us a tensor, and if we want to pull the value out of that tensor we add .item() onto the end. And that's it: we've taken the average sentiment of a pretty long piece of text, and of course we can take this code and iterate over multiple long pieces of text; it doesn't really matter how long those texts are, this will still work. I hope this has been an interesting and useful video for you; I've definitely enjoyed working through this and figuring it all out. Thank you very much for watching, and I will see you again in the next one.
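And a final sketch of the averaging and argmax step, continuing from the previous snippet; the class order in the comment follows the column order described above:

```python
# average the chunk-level probabilities to score the whole text
mean_probs = probs.mean(dim=0)
print(mean_probs)  # tensor of [positive, negative, neutral] averages

# pick the winning class index (columns as described above: positive, negative, neutral)
prediction = torch.argmax(mean_probs).item()
print(prediction)
```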
Info
Channel: James Briggs
Views: 1,748
Rating: 5 out of 5
Keywords: sentiment analysis, transformers, deep learning, python transformers, sentiment analysis with transformers, machine learning, tensorflow, pytorch, natural language processing, nlp, language, artificial intelligence, artifical intelligence, sentiment classification, huggingface, huggingface transformers
Id: yDGo9z_RlnE
Length: 27min 10sec (1630 seconds)
Published: Wed Mar 10 2021