HANDLING IMBALANCED TEXT DATA - NLP | PYTHON

Video Statistics and Information

Captions
Hey guys, in this video we will implement a spam classifier using Python and natural language processing. Our dataset is imbalanced, so we will also look at three to four techniques for dealing with imbalanced textual data. Without wasting any time, let's dive straight into the implementation. As you can see, I have already uploaded my dataset to the Google Colab notebook; just make sure you have set the runtime type to the GPU accelerator and saved the instance. The first step is to import all the necessary libraries, so just press Shift+Enter. We are going to use the NLTK library, which is a standard for natural language processing, along with NumPy, pandas, and scikit-learn. For the model, we will use the classical Naive Bayes algorithm, which works really well on textual data.

The next step is loading our data into a DataFrame df, separating fields on the tab character, with no header row, and then printing df.shape and df.head(). As you can see, we have approximately 5,500 samples and two columns, where column 0 holds the label and column 1 the text content. Next I will rename the columns, because 0 and 1 don't really mean anything: this piece of code renames column 0 to "label" and column 1 to "text", using the in-place option, and as you can see the renaming succeeded. The next step is to count the labels and check whether the dataset really is imbalanced.
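The loading and renaming steps described above might be sketched as follows. The inline sample is a stand-in for the uploaded file: the real dataset is a tab-separated file with no header row, read the same way but from a file path.

```python
import io
import pandas as pd

# Stand-in for the uploaded dataset file: tab-separated,
# no header row, label in column 0 and message text in column 1.
raw = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp to win FA Cup\n"
    "ham\tI'm gonna be home soon\n"
)

df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)
print(df.shape)

# Rename the numeric columns to something meaningful, in place.
df.rename(columns={0: "label", 1: "text"}, inplace=True)
print(df.head())
```

With a real file, the first line would simply be pd.read_csv("path/to/file", sep="\t", header=None).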
pandas has a super easy utility for counting the labels: df['label'].value_counts(). As you can see, we have about 4,825 samples for the ham class and only 747 for the spam class, so this is a badly imbalanced dataset.

Before dealing with the imbalance, let's check whether we have any null values, i.e. any rows with missing text: df.isnull().sum() counts the null values per column, and in our case both counts are zero, so there are no null rows. Next, let's compute the length of each record in the text column. df['length'] = df['text'].apply(lambda x: len(x)) creates a new column, length, in our pandas DataFrame holding the length of each record; you can see the first text has a length of 111, the second 29, and so on.

These texts are unprocessed, so they still contain unnecessary words, punctuation, stop words, and stray symbols, which we need to remove before proceeding. For that I have written a few utility functions to clean the text data. The first converts every record to lowercase, which is pretty simple. The next removes all numerical characters, keeping only non-numerical ones. This one is important: it is a lemmatizing function, which reduces each word to its lemma, i.e. its base form. Removing punctuation is easy; removing stop words drops the words that contribute no meaning to the sentence; and the last helper strips extra white space.
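A self-contained sketch of cleaning helpers like the ones described above. The video's versions use NLTK (its WordNet lemmatizer and full English stopword list); this sketch uses a tiny hand-rolled stopword set and omits lemmatization so it runs without NLTK downloads.

```python
import re
import string

# Tiny stand-in stopword list; the video uses NLTK's full English list.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "of", "in"}

def to_lowercase(text):
    return text.lower()

def remove_digits(text):
    # Drop all numerical characters.
    return re.sub(r"\d+", "", text)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def remove_extra_whitespace(text):
    return " ".join(text.split())

def clean_text(text):
    # Apply each step in order, as done column-wise in the video.
    for step in (to_lowercase, remove_digits, remove_punctuation,
                 remove_stopwords, remove_extra_whitespace):
        text = step(text)
    return text

print(clean_text("WINNER!! You have WON 1000 pounds, claim in 2 days!"))
```

Applied with df['text'].apply(clean_text), this reduces each record to its meaningful words, which is why the lengths shrink after cleaning.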
The next step is to apply all these cleaning functions to the text column, the same way we computed the length earlier. Afterwards, let's compute the length again, in a column called length_after_cleaning, using the same lambda, and look at the head. As you can see, the lengths have actually decreased: those were all the unnecessary symbols and characters removed by our cleaning functions.

The next step is to convert the labels to integer values, because we cannot feed string labels into a machine learning model and expect it to learn anything. For that I will create a label map and apply it to the label column: ham is mapped to class 0 and spam to class 1. After applying the map, you can see the string labels have become integer labels.

The next, most important step is vectorizing. We now have cleaned text values and integer labels, but machine learning models work on numbers, not strings, so we need to vectorize the text. For that I will use a TF-IDF vectorizer, which I have named tf_wob for "without balancing", because right now we are working with the imbalanced dataset; later we will create another one for the balanced version. This vectorizer will fit on and transform the whole text column.
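The label mapping and TF-IDF steps above could be sketched like this, with a toy DataFrame standing in for the cleaned dataset:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the cleaned dataset.
df = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "text": ["ok lar joking wif u oni",
             "free entry wkly comp win cup final",
             "gonna be home soon"],
})

# Map the string labels to integers: ham -> 0, spam -> 1.
label_map = {"ham": 0, "spam": 1}
df["label"] = df["label"].map(label_map)

# Fit TF-IDF on the text column; the result is a sparse matrix.
tf_wob = TfidfVectorizer()  # "wob" = without balancing, as in the video
x = tf_wob.fit_transform(df["text"])
print(x.shape)  # (n_samples, n_features)
```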
The result is a sparse matrix, that is, a matrix in which the majority of the elements are zero. If you are not familiar with how the TF-IDF vectorizer works, I will provide some resources in the description; just hit Shift+Enter to execute the cell. Remember that we need to convert the sparse matrix to a dense array before feeding it to the model. Checking tf_wob's output shape, we have the 5,572 rows from earlier, and for each row 7,906 features; that is our sparse matrix.

Next I will split the data into train and test sets, giving x_train, x_test, y_train, and y_test, and we are ready to fit our Naive Bayes model: create it with GaussianNB() and fit it on the training data. With the model trained, the next step is to predict on the x_test values and calculate the accuracy. But we don't really look at accuracy when dealing with an imbalanced dataset; we look at metrics like precision, recall, and the F1 score instead. Keep that in mind, it is really important in data science: with imbalanced data, prefer those metrics over plain accuracy. Running the predictions, the accuracy comes out at 87.9 percent, but that number is misleading: the model is mostly fitting the majority class, ham, rather than spam, so let's look at the other metrics.
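The split-train-evaluate flow described above might look like the sketch below, with synthetic dense features standing in for the real TF-IDF matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Synthetic stand-in for the TF-IDF features and labels; a real
# scipy sparse matrix would need .toarray() before GaussianNB.fit().
rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = rng.integers(0, 2, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

nb = GaussianNB()
nb.fit(x_train, y_train)
preds = nb.predict(x_test)

# Accuracy alone is misleading on imbalanced data; the per-class
# precision/recall/F1 in the classification report matter more.
print(accuracy_score(y_test, preds))
print(classification_report(y_test, preds))
```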
The classification report tells us the precision, recall, and F1 score of the model, so hit Shift+Enter. As you can see, the precision is really bad for the spam class: for ham it is 0.98, but for spam only 0.58, and the F1 score looks similar. To improve this we need to augment our dataset, that is, artificially generate additional samples. We will look at two approaches. The first is the random oversampler, which I don't recommend for real datasets because it is a naive oversampling technique. The second is an augmentation technique that converts words into contextual embeddings from the BERT base uncased model and uses them to generate new sentences similar to the originals; that approach is quite useful and much better than plain oversampling. Let's look at both. We first need to install two libraries: nlpaug and, obviously, transformers. Once they have installed successfully, I import the necessary module from the nlpaug library and define our augmenter, which uses BERT base uncased word embeddings; this downloads the whole bert-base-uncased model, which is used to compute the contextual word embeddings of a sentence or word. Let's test how this augmenter works and what it produces. I will take a sample text from the dataset, say the 100th sample: "please don't text anymore nothing else". Now we use the augmenter's augment method to create another, artificial sentence similar to this one. Press Shift+Enter to execute it, and let's see what it generated.
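The augmenter setup might look roughly like the sketch below. It is wrapped in functions rather than run directly, because constructing the augmenter requires the nlpaug and transformers packages and downloads the bert-base-uncased model on first use; the function names here are mine.

```python
def build_bert_augmenter():
    # Contextual word-embedding augmenter backed by BERT, as in the video.
    # Requires: pip install nlpaug transformers
    import nlpaug.augmenter.word as naw
    return naw.ContextualWordEmbsAug(
        model_path="bert-base-uncased", action="substitute")

def augment_once(text):
    aug = build_bert_augmenter()
    # .augment() produces text similar in meaning to the input
    # (newer nlpaug versions return a list of strings).
    return aug.augment(text)
```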
It produced "please don't mean text anymore nothing else should say", which is quite similar to the original text, so we can use this approach to augment our dataset. I have now written a function that selects only the minority-class samples, takes a fixed number of them (200 for testing purposes), augments each one, appends the results to an augmented-text list, and finally appends that list to the original DataFrame. That is all this function does. Before feeding the original DataFrame into it, I drop the length and length_after_cleaning columns, since the augmented rows won't have them and keeping them would throw an error; the result is a new DataFrame, new_df. Now we can feed this DataFrame into the function to augment the dataset. I initially set it to generate 2,000 samples, but for speed I reduced it to 1,000; you can experiment with more, up to the number of samples in the majority class. Hit Shift+Enter to execute the cell. This was still taking a long time, even though I had selected the GPU runtime, so I reduced it to 200 samples; even then it took about 11 minutes to finish augmenting the dataset. Now let's check the class counts again with df['label'].value_counts(): for the ham class we have 4,825 labels, and for the spam class 947 samples, whereas earlier we had 747.
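A sketch of such an augmentation helper. The function name augment_minority is mine, and a stub augmenter stands in for the BERT-based one so the example runs without downloading a model; in the video, nlpaug's ContextualWordEmbsAug fills that role.

```python
import pandas as pd

def augment_minority(df, augment_fn, minority_label=1, n_samples=200):
    """Generate n_samples augmented texts from the minority class
    and append them to the original DataFrame."""
    minority = df[df["label"] == minority_label]
    augmented = []
    for i in range(n_samples):
        # Cycle through the minority samples and augment each one.
        text = minority["text"].iloc[i % len(minority)]
        augmented.append({"label": minority_label,
                          "text": augment_fn(text)})
    return pd.concat([df, pd.DataFrame(augmented)], ignore_index=True)

# Stub augmenter for illustration; the real one would call the
# BERT-based augmenter's .augment() method instead.
stub_aug = lambda t: t + " indeed"

df = pd.DataFrame({"label": [0, 0, 0, 1],
                   "text": ["a", "b", "c", "spam msg"]})
df_aug = augment_minority(df, stub_aug, n_samples=3)
print(df_aug["label"].value_counts())
```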
Earlier we had 747 examples for the spam class, and now we have 947, because we set the number of augmented samples to 200; the function generated 200 augmented samples for us. Now we repeat the same procedure to evaluate on this dataset, but remember one thing: we must not use the augmented data for testing. We always test the model on the original dataset, never on the augmented one, so we train our Naive Bayes model on the augmented data and test on the original data; that is why the code is set up this way. Let's again create a TF-IDF vectorizer, this time named tf_w_aug for "with augmentation" (previously it was "without"), and convert the text into vectors; the training matrix is again a sparse matrix, as we saw earlier. Then we create a model and fit it on the augmented x_train and y_train. With the model trained, let's make predictions and compute the accuracy, F1 score, precision, and recall as we did before. Note that I am predicting on the transformed x_test; I had actually forgotten to transform x_test with the new vectorizer, so I do that here, turning the test texts into a sparse matrix with the fitted vectorizer. Now we run the predictions and compute the metrics: the confusion matrix, the accuracy score, and the classification report. Hit Shift+Enter: the accuracy has increased to 91 percent, and you can see the confusion matrix.
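The retraining flow above can be sketched like this, with tiny hand-made texts standing in for the real data. The key detail is that the TF-IDF vectorizer is fitted on the augmented training texts only, and that same fitted vectorizer then transforms the original, un-augmented test texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Augmented training texts/labels, and original (un-augmented) test set.
train_texts = ["free prize win now", "ok see you at home",
               "win free cash prize", "dinner at seven tonight"]
y_train = [1, 0, 1, 0]
test_texts = ["free cash now", "see you tonight"]
y_test = [1, 0]

# Fit TF-IDF on the augmented training data only; reuse the same
# fitted vectorizer to transform the original test texts.
tf_w_aug = TfidfVectorizer()
x_train = tf_w_aug.fit_transform(train_texts).toarray()
x_test = tf_w_aug.transform(test_texts).toarray()

nb = GaussianNB().fit(x_train, y_train)
preds = nb.predict(x_test)
print(preds)
```

Transforming the test set with a vectorizer fitted elsewhere would silently misalign the feature columns, which is why x_test must come from tf_w_aug.transform, not a fresh fit_transform.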
The precision is still not great, because we only added 200 samples, but the recall has improved: about 0.9 for class 0 and 0.99 for class 1. Comparing with the earlier report, the recall and F1 score have increased tremendously. The results are not perfect yet, but you can make them better by increasing the number of augmented samples you generate; it is up to you to experiment with it. So that was handling imbalanced text data with nlpaug and transformer-based augmentation. I will cover the remaining techniques in a separate video, because this one is getting long. That's it for this video, and I hope you enjoyed it.
Info
Channel: The Artificial Guy
Views: 6,122
Keywords: nlp, natural language processing, python, nlp python, imbalanced dataset, machine learning, handle imbalanced dataset, machine learning imbalanced dataset, handle imbalanced text dataset, imbalanced nlp
Id: ubxfWPg2dJ0
Length: 20min 36sec (1236 seconds)
Published: Fri Oct 01 2021