NLP Tutorial 17 - Multi-Label Text Classification for Stack Overflow Tag Prediction

Captions
Let's start this live video once again. Today I'm going to show you how to do multi-label classification. Since this is a live video, if you have any questions you can ask them in the comment box and I'll try to answer them.

I've already opened the dataset in my browser. It's called StackSample: it contains 10% of Stack Overflow questions and answers, and it's pretty huge. To download it from Kaggle you have to log in, but I've also uploaded a prepared version to my GitHub repository. In the repo you'll find the CSV ML data files download area, and there, uploaded just a few hours ago, is stackoverflow.csv. Click "View raw" and you can see the data in raw format; that raw link is what we'll use to read the data. The file has the text data and the tags.

Let me tell you how I processed it. I downloaded the full dump, which is around three gigabytes, did some pre-processing, and uploaded the result to GitHub so that everybody can use it; preparing this dataset straight from the Kaggle download is really quite difficult. The pre-processing code is also on my GitHub, in the preprocess_kgptalkie package. I made this package because many people, myself included, find pre-processing a tedious and difficult task, and this makes it very simple: you install the package, call its functions, and it automatically handles things like contraction expansion, e-mail removal, URL and HTML tag removal, and retweet removal. And if you want a very detailed course on NLP, I have one you can visit to learn NLP in depth.
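A minimal sketch of what that cleaning step might look like. The helper names below (cont_exp, remove_emails, remove_urls, remove_html_tags, remove_rt) are assumptions based on the author's other tutorials, not a verified API:

```python
# pip install preprocess_kgptalkie
# NOTE: helper names below are assumed from the author's other tutorials.
import pandas as pd
import preprocess_kgptalkie as ps

df = pd.read_csv('stackoverflow.csv', index_col=0)

df['text'] = df['text'].apply(ps.cont_exp)          # expand contractions: "don't" -> "do not"
df['text'] = df['text'].apply(ps.remove_emails)     # strip e-mail addresses
df['text'] = df['text'].apply(ps.remove_urls)       # strip URLs
df['text'] = df['text'].apply(ps.remove_html_tags)  # strip HTML tags
df['text'] = df['text'].apply(ps.remove_rt)         # strip retweet markers
```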
Let's come back to the StackSample code. I'm going to use Google Colab, because Colab provides a free GPU; we won't actually need the GPU, but for simplicity I'll show you everything there so you can follow along directly. First, create a new notebook (it may take a moment to start). While it's getting ready, a couple of settings: I'll rename the notebook "Multi-Label Classification" (it's multi-label, not multi-class) and click Connect, although it connects automatically. Then I'll add a text cell, set it to Heading 1, and title the notebook "Multi-Label Classification".

First of all we have to load the dataset, so import pandas as pd and numpy as np. (I've just received a message from Rahul; hi Rahul, thanks for watching, and if you have any question, go ahead and ask.) With pandas and numpy imported, we can read the data directly from the web with pd.read_csv, passing the raw GitHub URL. The data is fairly large, so it takes a little time to download. On the first try it prints a default numbering column alongside the file's own index, so we pass index_col=0, and now it prints properly. Read it into a DataFrame df and call df.head(): it has the text data and the tags.

The text comes from Stack Overflow itself. In the original dump there is a Body column; on stackoverflow.com a question has a title and a body, and when I prepared the dataset I concatenated the title and the body together into the single text column.

One more thing: let me share the notebook link so you can follow along. I'll change the sharing setting to "anyone with the link", copy it, and post it in the comments; if you can't open it, let me know and I'll fix the settings. (Hi Saltwine!) Let me also fix the heading to "Multi-Label Text Classification" and switch the notebook to full-screen mode.
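A runnable sketch of the loading step; the raw-file URL is an assumption based on the repo described above:

```python
import pandas as pd
import numpy as np

# Assumed raw URL for the prepared CSV in the author's repo.
url = ('https://raw.githubusercontent.com/laxmimerit/'
       'All-CSV-ML-Data-Files-Download/master/stackoverflow.csv')

# index_col=0 uses the file's first column as the DataFrame index
# instead of adding a second, default RangeIndex next to it.
df = pd.read_csv(url, index_col=0)
print(df.head())  # columns: 'text' (title + body) and 'tags'
```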
Now that we have our DataFrame we can start pre-processing. I'm going to use TF-IDF, that is, term frequency times inverse document frequency, but before that, notice that each entry in the tags column is actually a string. Let me show you why I say that: df['tags'].iloc[0] looks like a list, but type(df['tags'].iloc[0]) says it's a str. We want an actual list, so we have to evaluate the string, and for that we import ast. ast.literal_eval does a literal evaluation: if I call it on that first entry and print both versions side by side, you can see the difference clearly; one is a string and the other is a list. (Hi Prem Kumar Jha; yes, the notebook link is in the comment box.)

To convert the whole column I'll use a lambda with apply: df['tags'] = df['tags'].apply(lambda x: ast.literal_eval(x)). The lambda iterates over each row, the row's value lands in x, and literal_eval turns it into a list. Checking head() again confirms the tags are now lists.
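The string-to-list conversion as a short sketch, assuming df is loaded as above:

```python
import ast

first = df['tags'].iloc[0]
print(type(first))                    # <class 'str'>: looks like a list but isn't
print(type(ast.literal_eval(first)))  # <class 'list'>

# Evaluate every stringified list into a real Python list, row by row.
df['tags'] = df['tags'].apply(lambda x: ast.literal_eval(x))
print(df.head())
```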
So it's very clear we're dealing with multi-label text classification. Do you know the difference between multi-label and multi-class? Let me show you an image of multi-label versus multi-class. In multi-class, only one output can be true at any time. In multi-label, many outputs can be true together: with cloud, sun, and moon labels, a multi-class example activates exactly one, but a multi-label example can have sun and cloud true at the same time, or even all labels true at once. That's the beauty of multi-label. Another example: in multi-class we can select only one animal at a time, but in multi-label we can select multiple animals together. That's exactly what we're doing here; we can predict multiple tags for one question.

Next I'll show you how to convert the tags to one-hot encoding, and for that we have MultiLabelBinarizer. Let me import the packages we'll need: from sklearn.feature_extraction.text import TfidfVectorizer; from sklearn.preprocessing import MultiLabelBinarizer, which we need for multi-label classification; and from sklearn.model_selection import train_test_split. Then the machine learning algorithms for this lesson: from sklearn.linear_model import SGDClassifier and LogisticRegression, so a stochastic gradient descent classifier and logistic regression, and from sklearn.svm import LinearSVC for a linear SVM. We also need OneVsRestClassifier. Why? Because we're using binary classifiers like logistic regression on a multi-label problem, so we take one class at a time and classify that class against all the others: from sklearn.multiclass import OneVsRestClassifier.

Now let's convert our tags to one-hot encoding. I'll add a few code cells. First, y = df['tags']; at this point y is a Series of tag lists. Then create the binarizer, multilabel = MultiLabelBinarizer(), and y = multilabel.fit_transform(df['tags']). If you check y now, it's in one-hot form, like the example I showed. To see which classes y was encoded into, check multilabel.classes_.

Do remember: I kept only the top 20 tags from the Stack Overflow data, because the full dump is about three gigabytes and takes a very long time to train, and in a live class we can't wait an hour for the model. I'll show you the method here; afterwards you can run the full data offline, where you're free to wait an hour for training to complete.

If you want to see y as a DataFrame, do pd.DataFrame(y, columns=multilabel.classes_): you get columns like .net, android, asp.net and so on, all one-hot encoded. A multi-label classifier expects a target variable exactly like this.
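The one-hot encoding step with MultiLabelBinarizer, following the code dictated above:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

multilabel = MultiLabelBinarizer()
y = multilabel.fit_transform(df['tags'])  # binary indicator matrix, one column per tag

print(multilabel.classes_)  # the 20 tag names, e.g. '.net', 'android', 'asp.net', ...

# Optional: view the indicator matrix with labelled columns.
print(pd.DataFrame(y, columns=multilabel.classes_).head())
```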
So our y is ready; now let's set up the TF-IDF vectorizer. How does it work? TF-IDF is term frequency multiplied by inverse document frequency. So tfidf = TfidfVectorizer(...), and let me show you a few of the input parameters. By default analyzer='word' is selected, though you can change it to character level: a word analyzer tokenizes word by word, 'char' tokenizes character by character, and 'char_wb' also works character by character but only inside word boundaries. I'll keep the default word analyzer; you're free to experiment.

Then max_features makes sure the dictionary doesn't grow beyond the selected size. I'm selecting 1000; once again this is a live video and we can't wait an hour for training, so I'm keeping the feature count small so it trains in a few minutes. You're free to change it to 2000 or 10000; the larger the max_features, the larger the dictionary representation of your TF-IDF.

Then X = tfidf.fit_transform(df['text']); remember the text column holds the title and the body of each question. Let's run it. (By the way, I shared the notebook link but I'm not sure my code updates are propagating to your copies; if you can see the updated code, please let me know.)

TF-IDF is done and we have X, which is a sparse matrix. If you type "tfidf." and pause, autocomplete suggests the methods and attributes available on the vectorizer: stop_words_ shows the stop words in effect, and vocabulary_ shows the features it has learned. (Thanks a lot for confirming the notebook updates are visible!) The vocabulary_ lists the words the vectorizer tokenized, limited here to 1000 entries. Let me comment that line out to save space; it prints a pretty large output.

If you check X.shape you get (48976, 1000): our df has 48,976 rows and each row has 1000 features, capped by max_features. If you change max_features, this will change accordingly.
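The vectorizer setup as dictated, sketched:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# analyzer='word' tokenizes word by word; 'char' / 'char_wb' work per character.
# max_features=1000 caps the vocabulary so the live demo trains in minutes.
tfidf = TfidfVectorizer(analyzer='word', max_features=1000)

X = tfidf.fit_transform(df['text'])  # sparse matrix, shape (48976, 1000)
print(X.shape)
# print(tfidf.vocabulary_)  # learned term -> column-index map (large output)
```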
At the same time, y.shape is (48976, 20): the row count matches, and the 20 columns are the top 20 Stack Overflow tags. Our X and y are ready, so now we do the train-test split. train_test_split takes X as the input features and y as the target, then test_size=0.2, random_state=0, and stratify=y. Stratification is meant to divide each label in the same ratio between the train and test samples, and the function returns X_train, X_test, y_train, y_test.

But we get an error: "the least populated class in y has only 1 member". What happened? With stratify=y it tries to put 20% of each label in the test set and 80% in the train set, but if a label has only one instance it cannot split that single instance 0.2/0.8. So we have to remove it; do remember, you need at least two instances of every label if you want to use stratify. With stratify removed, X_train and y_train are done.
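The split as described, with stratify dropped because some labels occur only once:

```python
from sklearn.model_selection import train_test_split

# stratify=y raises "The least populated class in y has only 1 member",
# so we split without stratification.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, y_train.shape)  # 80% train
print(X_test.shape, y_test.shape)    # 20% test
```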
Sisir Maharaj is asking: does the TensorFlow/Keras tokenizer do the same kind of word conversion? Yes, the Keras tokenizer does a similar thing. I'm not using it here because the TF-IDF vectorizer doesn't preserve the ordering of the text, and once the ordering is lost we can't use deep learning or Keras: in deep learning the order of features matters, while in classical machine learning it doesn't. One more comment: if you don't want to use the built-in TfidfVectorizer, you can write your own TF-IDF algorithm; it's pretty simple, term frequency multiplied by inverse document frequency, and I have about three hours of lectures on YouTube on complete text pre-processing where I show how to generate a TF-IDF vector with your own code.

We have X_train and y_train, so let's build our models. We already imported three from sklearn: SGDClassifier, LogisticRegression, and LinearSVC. So sgd = SGDClassifier(); if you hover over it, the docstring says it's a linear model fitted by stochastic gradient descent. Then lr = LogisticRegression(), and here let me show you an important parameter, the solver. The default is 'lbfgs', and several others are available. The docs say: for small datasets 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones; for multiclass problems only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes. I'll use the default, so solver='lbfgs'. There are more parameters I won't touch here, like C, the inverse of the regularization strength; you're free to play with those. (Thanks a lot, Sagar, I really appreciate it!) Finally, svc = LinearSVC().

Now we iterate over the models one by one: for classifier in [sgd, lr, svc]. Inside the loop, clf = OneVsRestClassifier(classifier). Remember, when I imported OneVsRestClassifier at the start, I told you we'd need it so the model can classify one class at a time: we have 20 output classes, and it takes one class and treats the other 19 as the negative class, repeating that for every class.

(Let me check something; I think I've lost the internet. My computer says we're not connected and YouTube isn't receiving. Guys, please let me know if you can still see me... OK, you're getting the stream, the connection is back. Not a problem, let's continue.)

Rishav Kapoor says he'd like to know more about SGD. Rishav, today we're talking about multi-label text classification, and going algorithm by algorithm would take a lot of time, but I'll keep it in mind and try to make a separate video on the SGD classifier.
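The three estimators and the one-vs-rest wrapper as a sketch; the full fit-and-score loop follows after the Jaccard helpers below:

```python
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

sgd = SGDClassifier()                    # linear model fit with stochastic gradient descent
lr = LogisticRegression(solver='lbfgs')  # the default solver, spelled out
svc = LinearSVC()

# OneVsRestClassifier fits one binary problem per tag:
# each of the 20 tags vs. the other 19.
clf = OneVsRestClassifier(lr)
```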
So clf = OneVsRestClassifier(classifier), where the estimator is the classifier from the loop, and then clf.fit(X_train, y_train) starts training the model. I did run it, but let me stop it and explain a few more things first, because there's a subtlety: we're dealing with multi-label text classification, and evaluation works differently.

Let me take you to a reference on multi-label classification. It says that a multi-label classification problem must be assessed with different performance measures than a single-label classification problem. What does that mean? We can't just use metrics.accuracy_score and the other usual methods; we need something else, namely Hamming loss and Jaccard similarity.

So what is Jaccard similarity? The Jaccard similarity, or Jaccard index, is the size of the intersection of the predicted labels and the true labels, divided by the size of their union. Let me search the term and show you some images. (Sorry, I seem to have a really slow internet connection; it was fine and suddenly dropped.) Say a sample's true labels are sql and asp.net. The score asks: how many labels overlap correctly between the prediction and the truth, divided by how many distinct labels appear in the prediction and the truth combined. It's intersection over union, computed on the label sets. It can be a little confusing at first, but the idea is simply overlap divided by total.
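For reference, sklearn also ships a ready-made jaccard_score (a viewer points this out below); a toy sanity check of the intersection-over-union idea:

```python
import numpy as np
from sklearn.metrics import jaccard_score

# One sample; columns stand for the tags java, python, sql.
y_true = np.array([[1, 1, 0]])  # true labels:      {java, python}
y_pred = np.array([[0, 1, 1]])  # predicted labels: {python, sql}

# intersection = {python} -> 1, union = {java, python, sql} -> 3
print(jaccard_score(y_true, y_pred, average='samples'))  # 0.333...
```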
So let's implement that Jaccard score: we first find the intersection, then divide by the union. Write a function def j_score(y_true, y_pred) (and yes, the correct spelling is "Jaccard"). Inside, compute np.minimum(y_true, y_pred).sum(axis=1) divided by np.maximum(y_true, y_pred).sum(axis=1): on the 0/1 indicator rows, the element-wise minimum counts the intersection and the element-wise maximum counts the union, summed along axis 1, per sample. (My connection keeps coming and going, sometimes excellent and sometimes dropped; sorry about that. Guys, please let me know if you're still connected.) Then we return the result.

Next, another helper: print_score(y_pred, clf), which takes the predictions and the classifier so we can print which classifier produced the score. The first print shows the classifier via clf.__class__.__name__, and the second prints the Jaccard score with 'Jaccard score: {}'.format(j_score(y_test, y_pred)); note the first argument to j_score must be y_test, the true labels, and the second the predictions. Then print a blank line.

Back in the loop: after clf.fit(X_train, y_train), we do y_pred = clf.predict(X_test) and then print_score(y_pred, classifier). (Yes, Prem Kumar, a Jaccard score function is available in sklearn, but I wanted to show you why we need it and how it works, so that you're comfortable with how it's computed.)

Let's run it and see whether it works or we get an error... Something is wrong; it's not printing the Jaccard score correctly. Oh, sorry, I missed something: j_score should take the mean over samples and multiply by 100.
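The two helpers and the training loop, reconstructed from the dictation above:

```python
import numpy as np

def j_score(y_true, y_pred):
    # Per-sample intersection over union on 0/1 indicator rows,
    # averaged over samples and scaled to a percentage.
    # (Assumes every sample carries at least one true tag, so the union is never 0.)
    jaccard = (np.minimum(y_true, y_pred).sum(axis=1) /
               np.maximum(y_true, y_pred).sum(axis=1))
    return jaccard.mean() * 100

def print_score(y_pred, clf):
    print('Clf:', clf.__class__.__name__)
    print('Jaccard score: {}'.format(j_score(y_test, y_pred)))
    print()

for classifier in [sgd, lr, svc]:
    clf = OneVsRestClassifier(classifier)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print_score(y_pred, classifier)  # pass the inner estimator so its name is printed
```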
So that's the mean; let's run it again and see what we get. The first classifier gives a Jaccard score of around 44 (100 is best), and with a multi-label classifier it's genuinely hard to reach the accuracy we usually see from single-label, multi-class classifiers. LinearSVC gives us around 52 percent, which is really not bad for this dataset.

Now let's see how to test the model on real data. Say x is ['how to write ml code in python and java. i have data but do not know how to do it']. To get a prediction from the model we just trained, first do xt = tfidf.transform(x); this reuses the fitted TF-IDF vectorizer and gives a sparse matrix. Then clf.predict(xt) returns a binary vector in which two entries are 1, meaning two classes fired. To find out which classes those are, call multilabel.inverse_transform(clf.predict(xt)), and it shows the predicted labels: java and python. Congratulations, you've trained the model and seen how prediction works.
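The inference step as a sketch:

```python
x = ['how to write ml code in python and java. i have data but do not know how to do it']

xt = tfidf.transform(x)  # reuse the fitted vectorizer: sparse 1 x n_features matrix
pred = clf.predict(xt)   # binary indicator row, e.g. 1s in the java and python columns

# Map the indicator row back to tag names.
print(multilabel.inverse_transform(pred))  # e.g. [('java', 'python')]
```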
Let me show you another TF-IDF feature we can play with: the ngram_range. By default it's (1, 1), but let's set it to (1, 2), and I'll also increase max_features to 5000. Let's run it and see what accuracy we get with the larger n-gram range; based on that we can identify which settings work best for our purpose. Of course, as I said at the start, a larger feature set takes longer: X.shape now shows 5000 features, and 5000 features will really take time to train. The scores are coming in, and the SGD classifier has improved; I don't remember exactly, was it 42 or 43 percent earlier? Let me know if you remember. Logistic regression warns that we should increase the maximum number of iterations so the solver can converge. And superb: we've gained almost eight points, from about 51 percent earlier to about 58 percent with LinearSVC now. Guys, let me know if you like this kind of playing with the n-grams, and I can do some more testing to show how to push the accuracy further.

In the meantime, Prem Kumar Jha asks: can you explain clf.__class__.__name__? Definitely. Look at OneVsRestClassifier in the sklearn docs: it takes an estimator as input and has various internal attributes. The point is this: OneVsRestClassifier wraps the estimator we feed it, which here is SGD, logistic regression, or SVC, and __class__.__name__ on an object returns the name of its class, so printing it tells us which classifier is being used inside the one-versus-rest wrapper.

Next let's try the character analyzer. I'll set analyzer='char' with an n-gram range of... 2 is OK, but let me use 3; I think 3 will work better. It's going to take some time again, but it's really great to test how these models perform. While it trains, let me explain how character n-grams work. Say your text is "lets". With a character analyzer, the 1-grams are the characters l, e, t, s (and spaces); the 2-grams are pairs like "le", "et", "ts"; and the same idea continues for trigrams and 4-grams.

OK, the results are in: accuracy is decreasing with the character analyzer at the same 5000 features, so for this particular problem character-boundary tokenization is not working well. (Prajwal asks: can you explain transformers? Prajwal, I have a very good course on advanced NLP algorithms, BERT, transformers, DistilBERT and so on, at https://bit.ly/bert_nlp; in that course I explain transformers and other quite advanced algorithms.) Once again the solver warns that it isn't converging; let me finish this training and show you. For logistic regression the Jaccard score here is about 40.
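A small sketch of what the different analyzers produce, on a toy string:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character trigrams restricted to word boundaries.
chars = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
chars.fit(['lets learn nlp'])
print(sorted(chars.vocabulary_))  # e.g. ' le', 'let', 'ets', 'ts ', ...

# Word unigrams + bigrams for comparison.
words = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
words.fit(['lets learn nlp'])
print(sorted(words.vocabulary_))  # ['learn', 'learn nlp', 'lets', 'lets learn', 'nlp']
```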
Let's wait for the linear SVM as well and see how much accuracy we get; LinearSVC comes in at 58.86 percent, better in this case. After that we'll go back to the word analyzer with the n-gram range up to 2, and one more thing I want to show you: how the accuracy changes if we remove stop words.

Let me copy the loop down and keep only SVC, so training and testing stay fast; SVC is consistently producing the best result at 5000 features, so there's no need to rerun the others. We've tested the analyzer, max_features, and ngram_range; next is stop words. The auto-suggestion isn't showing me the TfidfVectorizer signature here, so let me look at the input parameters directly: encoding, decode_error, the analyzer we already tested, and here, stop_words. If we set stop_words='english', the dictionary is built without stop words (the stop words are discarded), and I'll show you how the accuracy improves. Let's run it... and yes, removing stop words increased the accuracy.

Guys, if you're enjoying this parameter testing, let me know, otherwise I'll stop the video here. You might also be wondering about methods like grid search; grid search is a very good way to automate all of this, and if you're interested in algorithm optimization I have a course at https://bit.ly/intro_nlp, the introductory NLP course, where these things are explained in pretty much full detail, including fine-tuning of machine learning algorithms. Remember, we started at 51 percent accuracy, and by testing a few settings we've increased it to 61 percent, which is a huge improvement.
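The stop-word variant, retraining only LinearSVC for fast iteration; a sketch under the same assumptions as above:

```python
tfidf = TfidfVectorizer(analyzer='word', max_features=5000,
                        ngram_range=(1, 2), stop_words='english')
X = tfidf.fit_transform(df['text'])  # vocabulary built without English stop words
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

svc = LinearSVC()
clf = OneVsRestClassifier(svc)
clf.fit(X_train, y_train)
print_score(clf.predict(X_test), svc)
```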
There are many more ways; we've played with just a few parameters here, but in that course I also show hyperparameter tuning and grid search, plus other topics like deploying ML models, word embeddings and GloVe vectors, and resume parsing, with working code and downloadable resources for every video.

So we're at 61 percent; let's change more. I'll set ngram_range to (1, 3) and increase max_features to 10000. It will definitely take longer with trigrams and 10000 features, but let's see how much accuracy we get. (Thanks, Prem Kumar, thank you so much.) The TF-IDF step takes a while because of ngram_range=(1, 3) and max_features=10000... now training has started... and superb: we were at 61.2 percent and we've reached 62.42 percent.

That's one way; we can also play with the parameters of LinearSVC itself. There's C, the inverse of the regularization strength: the default is 1, but let's try 2. (If you want automated testing, see the fine-tuning ML algorithms lesson, where I show how to run these sweeps.) With C=2 we get 62.54 percent. Let's try 3... accuracy drops to about 61 percent, so let's check C=1.5. Also, the default penalty is 'l2' but we can test 'l1'; remember C is inversely proportional to the regularization strength. And we can try class_weight='balanced'. C=1.5 increases the accuracy a little, to 62.80. Now class_weight: is it 'balance' or 'balanced'? It's 'balanced'. Let's see what it does while it trains... for 'balanced' the accuracy decreases a little, so let's remove it.
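The larger feature space plus a small sweep over C, sketched:

```python
tfidf = TfidfVectorizer(analyzer='word', max_features=10000,
                        ngram_range=(1, 3), stop_words='english')
X = tfidf.fit_transform(df['text'])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# C is the inverse of the regularization strength: larger C, weaker regularization.
for C in [1, 1.5, 2, 3]:
    svc = LinearSVC(C=C)
    clf = OneVsRestClassifier(svc)
    clf.fit(X_train, y_train)
    print('C =', C)
    print_score(clf.predict(X_test), svc)
```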
One more test and then I'll stop the video. The default penalty is 'l2', but let's try 'l1'. If I just pass penalty='l1', I get an error: penalty='l1' with loss='squared_hinge' is not supported when dual=True. So we have to set dual=False to use this penalty.

Now we wait while it trains; it's taking some time. Meanwhile, guys, if you have any questions, ask away; after this test I'll stop, it's been almost an hour. Keep watching... OK, it's done, and wow, this is really superb: we started at 51 percent accuracy and we're almost at 64. Do you believe it? We started from the lowest accuracy and kept increasing it, and there's still a lot of scope; with the methods I've shown, you can definitely go beyond 64 percent.

So guys, thanks a lot for watching this video. It's really been fun working with all of you. Bye bye, and see you in another lesson!
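And the final variation, sketched:

```python
# penalty='l1' with the default squared-hinge loss requires dual=False.
svc = LinearSVC(C=1.5, penalty='l1', dual=False)
clf = OneVsRestClassifier(svc)
clf.fit(X_train, y_train)
print_score(clf.predict(X_test), svc)
```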
Info
Channel: KGP Talkie
Views: 14,897
Rating: 4.9133129 out of 5
Keywords: machine learning, artificial intelligence, natural language processing, multi label classification, multi label classification keras, multi label text classification, stack overflow, stack overflow explained
Id: DkCF5xb0840
Length: 75min 1sec (4501 seconds)
Published: Mon Aug 03 2020