Advanced NLP Tutorial for Text Classification with Hugging Face Transformers (DistilBERT) and ktrain

Video Statistics and Information

Captions
Hello everyone, welcome to this brand new lesson. In today's lesson we are going to talk about text classification with Hugging Face Transformers. We will basically be using a DistilBERT model here, but you can use any other Hugging Face transformer and still follow along with this video lesson.

Before that, let me explain a few things. The first thing: we are going to use the fetch_20newsgroups data. This dataset is available from scikit-learn at this address, so I am going to copy the link and paste it here as the dataset link. The page says the dataset has 20 classes, around 18,000 samples in total, a single dimension, and the feature is text data. If you want to see this dataset in more detail, you can look at it on Kaggle at this address; let me share that link as well. At that link you can see the details about the dataset. It lists the 20 newsgroups: under computers there are graphics, MS Windows, PC hardware, Mac and IBM; under sport it discusses baseball and hockey; under science it discusses electronics, medicine and space, and so on. After that come the actual text documents being discussed; for example, there is a complete document from the medical group in science which discusses medical information.

The overall architecture we are going to design is this: we pass the text data as input, there is a BERT-style model in between which does the tokenization and classification internally, and finally we get the target output on the test and validation data. Of course, we train the model on the training dataset. The pre-trained models, that is, the transformers we discussed earlier, can be found on the Hugging Face pre-trained models page; this is going to be the pre-trained models link.

Superb. So this is the dataset we are going to download from scikit-learn, the details about the dataset are on Kaggle, and the pre-trained models are available at huggingface.co/transformers. We can use any of these models, but for simplicity I will be using a DistilBERT model. You may ask why I don't use other models such as BERT; those models are pretty large, and a large model takes a lot of time to train, and I don't think you want to wait one or two hours in this video to get your model trained. That is why I am going to use a DistilBERT model here. With the help of DistilBERT we will learn the concepts, and once you know the concepts you can simply change the model name and use other models as well, although those models may take more time in training. So I am going to use DistilBERT here. Perfect. I think I will be using the DistilBERT base uncased model,
and this is the model I am going to use as my base model. Superb. So that was the introduction. Now let's go ahead and set up our Google Colab notebook. If you don't know how to work with Google Colab, you need to visit colab.research.google.com, then go to File and open a new notebook, and you will get something like this. You can change its name, the headings and other things. After that you also need to set a GPU hardware accelerator here, otherwise training will take a lot of time.

So let's go ahead and start working on our model. First of all we are going to install ktrain: pip install ktrain. ktrain is a wrapper library for TensorFlow Keras, and we will be using it for our Hugging Face transformers. The installation might take some time. At the time of making this video, ktrain version 0.23.2 is the version available, so if your ktrain is not working you can install this specific version to get it working.

While it is installing we can do some other settings. There are a few settings I already have, so I am just going to copy and paste them here and then explain. These are some settings for GPU usage and so that plots etc. are drawn inline in the notebook; these settings are needed to make everything work properly. Let's go ahead and run them. Once they have run, everything is in place and we can load our dataset.

Before that we are going to import the scikit-learn dataset and ktrain: import ktrain, then from ktrain import text, and then from sklearn.datasets import fetch_20newsgroups. Only these Python packages will be needed. Once we have imported all the necessary packages, let's go ahead and load the dataset. Remember, we are going to load this 20 newsgroups data from scikit-learn. In total it has 20 classes, but we are not going to use all 20 of them; we are going to use only a few. So I am going to define the categories here: alt.atheism, then soc.religion.christian, then comp.graphics, then sci.med. I think that is only four groups, so let me add one more; I think we can use one of the sport groups as well. So we will use only these five categories. That means we will pass this set of classes to the loader, so instead of downloading all 18,000 rows it will download only the documents that belong to these classes, and by doing that we save some time in the total classification time. So we have our categories; next, let's go ahead and fetch the dataset. A short code sketch of these setup steps follows below.
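For reference, here is a minimal sketch of the setup steps described so far. The exact fifth category is not spelled out clearly in the audio, so rec.sport.hockey below is an assumption; swap in whichever rec.sport group you prefer.

    # Install ktrain first (in Colab, run in a cell prefixed with "!"):
    #   pip install ktrain==0.23.2

    import ktrain
    from ktrain import text
    from sklearn.datasets import fetch_20newsgroups

    # Use only a handful of the 20 newsgroups to keep download and training fast
    categories = [
        'alt.atheism',
        'soc.religion.christian',
        'comp.graphics',
        'sci.med',
        'rec.sport.hockey',  # assumed fifth "sport" group; any rec.sport.* works
    ]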
First of all I am going to get the training dataset. The train dataset is fetch_20newsgroups, and to it I pass the subset I want, which is the 'train' subset. After that it asks for the categories, and there I pass the five categories we just defined, so it imports only the data for those five classes. Then I set shuffle=True, which randomly shuffles all the imported data, and I set random_state to 0. This random state is arbitrary; you can set any value, but I am setting it to 0 so that I can reproduce the same result whenever I run this cell. With this I have got the training dataset, and in exactly the same way I can get the test dataset, except that the subset I use is the 'test' subset; everything else stays the same.

So we have got the train dataset and the test dataset. Now if you try to print the test dataset, it says the notebook output data limit is exceeded, so it cannot be printed in full. It has .data and .target attributes; another thing you can do is call test.keys(), which tells you that we have data, filenames, target_names, target and the description. We are mainly interested in the data and target variables: data holds the documents and target holds the labels. If you print target_names you get those categories back; these are the five labels which we imported into our notebook, and you can check the same thing on the training dataset without any issue.

Now I set x_train = train.data, y_train = train.target, x_test = test.data and y_test = test.target. So these are the four arrays: x_train, y_train, x_test and y_test. That means we have got our data; let's go ahead and check its shape. Shape is not available because these are plain Python lists, so instead we can get the length of each. It says we have 2854 documents in x_train and 1899 in x_test, so x_test is a little smaller than x_train. These are the text documents; they do print, although it is a huge amount of output, so I am simply going to comment that line out. A short code sketch of these data-loading steps follows below.

So we have prepared our dataset; let's go ahead and build our transformer model. If you remember, I already told you that I am going to use the distilbert-base-uncased model; let's go ahead and copy its name. You can use any other model, but I am selecting DistilBERT because it is small and its training is fast. So I set model_name = 'distilbert-base-uncased'. "Base" means this is the smaller version of the model, and "uncased" means it is case insensitive: lowercase and uppercase don't matter, because all the data gets converted to lowercase.
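A minimal sketch of the data-loading steps might look like the following (the lengths in the comments are the ones reported in the video and will only match for the same category choice):

    # Download only the selected categories, for the train and test splits
    train = fetch_20newsgroups(subset='train', categories=categories,
                               shuffle=True, random_state=0)
    test = fetch_20newsgroups(subset='test', categories=categories,
                              shuffle=True, random_state=0)

    print(test.keys())          # data, filenames, target_names, target, DESCR
    print(train.target_names)   # the five category labels

    x_train, y_train = train.data, train.target
    x_test, y_test = test.data, test.target

    # These are plain lists, so use len() rather than .shape
    print(len(x_train), len(x_test))   # roughly 2854 and 1899 in the video

    model_name = 'distilbert-base-uncased'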
Let's go ahead and build the transformer. I create trans = text.Transformer(...), and to it I pass the model_name. Next is maxlen: the maximum sequence length supported by BERT-style models such as DistilBERT is 512 tokens, so I select maxlen=512. Then it asks for the class names, and the class names are simply the categories we selected earlier. So once you do that you have your transformer: the model name, the class names which are the targets, and the maximum tokenization length of 512.

Now let's go ahead and prepare the training data. The training data is built from x_train and y_train: train_data = trans.preprocess_train(x_train, y_train). Similarly, the test data is test_data = trans.preprocess_test(x_test, y_test). With this we get the training dataset and the test dataset prepared for the transformer model. Once that is done, I create the model with model = trans.get_classifier(), which gives us the DistilBERT classifier. Once we have the model we need a learner: learner = ktrain.get_learner(...), where I pass the model, the train_data, the validation data (I use the test data as validation data here), and the batch size, which I set to 16. Let's go ahead and get the learner.

Now that we have the learner we have reached the final stage, where we can train our model. But there is always one problem: how do you know which learning rate fits your task best? You can find that with learner.lr_find(), where I set show_plot=True and max_epochs=10. However, that would take a lot of time, so I am not going to run it fully in this lesson; I only wanted to show you that this is the code you can use to find the best learning rate. I have already run it and I know that a learning rate of around 10^-4 suits this problem best, so I am just going to select that and stop the cell, because as you can see it would take about three minutes per epoch, which means around 30 minutes just for the learning rate plot. If you have a different dataset, though, you should run this first to find out which learning rate suits you best. So I am going to fit with the one-cycle method, the learning rate I use is 10^-4, and I am going to train for just one epoch. A short code sketch of these model-building steps follows below.
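Here is a minimal sketch of the model-building and training steps just described, assuming the ktrain Transformer API shown in the video:

    # Wrap the Hugging Face model with ktrain's Transformer API
    trans = text.Transformer(model_name, maxlen=512, class_names=categories)

    # Tokenize / preprocess the raw text for the transformer
    train_data = trans.preprocess_train(x_train, y_train)
    test_data = trans.preprocess_test(x_test, y_test)

    # Build the classifier and wrap everything in a learner
    model = trans.get_classifier()
    learner = ktrain.get_learner(model, train_data=train_data,
                                 val_data=test_data, batch_size=16)

    # Optional: simulate different learning rates to find a good one
    # (slow on Colab; the video settles on roughly 1e-4)
    # learner.lr_find(show_plot=True, max_epochs=10)

    # Train with the one-cycle policy for a single epoch
    learner.fit_onecycle(1e-4, 1)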
Remember, I am only training it for one epoch, but you can train for multiple epochs to improve the accuracy; since we do not have much time, I will train for just one epoch. Let's go ahead and run it. You will see that training has started, and it will take around five minutes with the GPU on Google Colab, so we will just wait for it to complete. The training is done now, you can clearly see it there, and your learner is ready. With only a single epoch we have got about 91 percent accuracy, which is very good, and the validation accuracy is quite a bit higher than the training accuracy, which means there is a fair chance that training for more epochs would improve the overall accuracy. So overall you have got the idea of how you can use transformer models, and you can use other transformer models to train on your text data as well.

Now it is time to look at the evaluation metrics. Let's do that with learner.validate(). It takes a little time to produce the confusion matrix, because it runs the validation and classification internally to build it. So this is the confusion matrix, but I see we did not get the class labels; for those we can pass class_names=categories, since the class names are stored in the categories list. Now you have the labels as well, plus the overall accuracy, precision, recall and F1 score; the average accuracy is 91 percent. So that is how you get the confusion matrix.

Apart from that, suppose you want to know which examples the model performs worst on. For that you can use learner.view_top_losses(n=5, preproc=trans), where trans is our transformer preprocessor. It will evaluate the model and report the top five data points on which it performed worst. While it is processing, let's make our way towards predicting the final outcome, i.e. the target, on new data, so I will add a heading that says "predict on new data". The output shows the inputs which got a wrong prediction. For example, there is one with index 311, which you can look up with x_test[311]: this input was actually sci.med but it was classified as alt.atheism. There is another example that was actually comp.graphics but was classified as baseball. These are the top five; you can change n and you will get a list of that many badly classified rows.

Perfect, let's go ahead and predict on new data. I create predictor = ktrain.get_predictor(learner.model, preproc=trans), where trans again is the transformer preprocessor object. A short code sketch of these evaluation steps follows below.
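A minimal sketch of the evaluation steps, reusing the learner and trans objects from above:

    # Confusion matrix and classification report on the validation data
    learner.validate(class_names=categories)

    # Show the 5 validation examples with the highest loss
    learner.view_top_losses(n=5, preproc=trans)

    # Inspect one of the reported indices (311 is the example from the video)
    print(x_test[311])

    # Wrap model + preprocessor into a predictor for new, raw text
    predictor = ktrain.get_predictor(learner.model, preproc=trans)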
Then let's define some text to classify: x = "Jesus Christ is the central figure of Christianity". I want to classify this text and find out what its topic is according to our model, our classifier, so I call predictor.predict(x). In this case it gives a wrong classification: it says this is baseball, which of course it is not. Let's go ahead and try something else; let me take a science or medicine related sentence instead. I will pick a medical-related piece of text and put it inside double quotes. This one is medical related, so we should get sci.med as the result, and that is indeed what we get. Since we are only getting 91 percent accuracy, there is roughly a 10 percent chance that a given input gets misclassified; but as you saw, our training accuracy is only 82 percent while the validation accuracy is 91 percent, which means you can train for more epochs to get better accuracy.

Superb. Apart from that, I think we have covered almost everything you need to know. The final point is how you can save your model so that you can use it later. One more thing first: if you want an explanation of a prediction from your model, you can call predictor.explain(x); it will do the classification and then explain it, but I think the package it needs is not installed here, which is why it is not working, so we can ignore that one; it is not an issue. The final step: if you want to save the model, you call predictor.save() and provide the path where you want to save it; I am just calling it "model". Let's go ahead and run this, and now you should be able to see the model files there. Even if you give it another name, say "distilbert_model", you should see it there as well. If you try to download the whole folder at once from the Colab file browser you won't be able to, so you download the files one by one and then place them together in a single folder like this one. Overall, the size of the saved files for this DistilBERT model is around 200 to 300 megabytes, but remember that if you select other models the size may vary from one to two gigabytes. Perfect, superb. That is all for this lesson; thanks a lot for watching, and I will see you in the next lesson.
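For completeness, a minimal sketch of the prediction and saving steps. The save path below is illustrative (Colab's working directory), predictor.explain() additionally needs an explanation package such as eli5 (which is why it failed in the video), and ktrain.load_predictor() is how the saved folder can be loaded back later.

    # Classify new, raw text
    x = "Jesus Christ is the central figure of Christianity"
    print(predictor.predict(x))

    # Optional: word-level explanation (needs an extra package such as eli5)
    # predictor.explain(x)

    # Save the fine-tuned model + preprocessor to a folder
    predictor.save('/content/distilbert_model')

    # Later, reload it without retraining
    reloaded = ktrain.load_predictor('/content/distilbert_model')
    print(reloaded.predict(x))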
Info
Channel: KGP Talkie
Views: 10,564
Rating: 4.8778625 out of 5
Keywords: kgp talkie, kgp talkie videos, Text Classification with Hugging Face Transformers (DistilBERT) in Python, huggingface, huggingface transformers, huggingface transformers tutorial, transformers tutorial, transformers tutorial blender, transformers tutorial nlp, bert tutorial, bert tutorial python, distilbert tutorial, distilbert model, distilbert vs bert, huggingface tut, bert in python, huggingface gpt2 tutorial, huggingface bert tutorial
Id: s3LBdmZb00g
Length: 33min 20sec (2000 seconds)
Published: Mon Nov 02 2020