Complete NLP Machine Learning In One Shot

Captions
Hello guys, I'm quite excited to start this NLP series — natural language processing for machine learning. In this video we are going to see the entire roadmap: how you should go ahead and prepare for natural language processing. NLP is an amazing domain with respect to machine learning and deep learning, and a lot of research is happening in it. So let me share my screen and explain the roadmap.

To begin with: if you know machine learning, you have seen that we solve two different kinds of problems — supervised and unsupervised machine learning. In most supervised machine learning use cases we solve two kinds of problem statements: classification and regression. In all of these we have some set of features like f1, f2, f3, f4 (it can be any number of features). These are usually called the independent, or input, features, and we also have an output feature — the dependent feature. Our aim in supervised machine learning is this: we have lots of data points over these features, and in the output we have either continuous values (regression) or class labels like ones and zeros (binary or multiclass classification). We create a model, train it on this data set, and the trained model is then capable of making predictions whenever we give it new input data.

Now in machine learning, f1 can be a continuous feature, f2 a categorical feature, and so on. But consider the scenario where some of our features are made up entirely of text. One basic example I really want to give is spam classification: my aim is to detect whether an email that comes to me is spam or not. Here one feature may be the email subject, another the email body, and the output feature is whether the mail is spam or ham (ham means not spam). For example, the email subject might be something like "Billion dollar lottery" and the email body "You won a lottery of a billion dollars" — I think you get these kinds of emails. In the real world we would classify this record as spam, so the category would be one, or simply "spam".

Now notice one thing. When our input features are continuous or categorical variables, we have different techniques to convert the categorical variables into numbers — one-hot encoding, target encoding, ordinal encoding and so on — which we do in feature engineering. But if an entire feature is a text sentence like this, it is not that easy for a model to understand, because a model cannot understand human language. Right now I have written in English; tomorrow it could be Chinese, or some other language. So what should we do? We have techniques to convert all this text into meaningful vectors. Vectors are just numbers, but those numbers represent meaningful information about the text. Whenever the input data is in the form of text or sentences, we use natural language processing to process that data and make the model understand it, so we can solve use cases like spam classification.

That is the entire context behind NLP and why it is so popular. Nowadays you see lots of examples: Alexa, Google Home, automated devices — say the AC is running and you say "hey, switch off the AC"; how is that machine able to understand you? It is all because of natural language processing. Google is doing some amazing research in NLP and coming up with amazing things. Yes, we will be learning NLP both with respect to machine learning and, as we go ahead, deep learning; but in this video we are going to understand the roadmap. I'm going to draw it as a pyramid and go bottom to top, and I'll also mention the different libraries we'll use.

At the base of the pyramid, to begin with, you need to know one programming language. I'm going to go with Python — super important — and with Python we'll be able to solve a lot of NLP use cases. Then comes step one, which is called text preprocessing.
Text preprocessing part one starts with the basics: whenever we have this kind of text data, what preprocessing do we need to do, and how can we clean that text data? The techniques we apply here include tokenization (a concept where you convert a paragraph into sentences, and sentences into words), lemmatization, stemming, and concepts like stop words. All of these are covered in text preprocessing part one — that is step one. The detailed syllabus I will obviously cover video by video, but at a thousand-foot overview there are a lot of topics inside this that we need to be familiar with.

The second block is text preprocessing part two. This is still text preprocessing, but we increase the complexity a bit and it solves more problems: here we focus on converting the text data into vectors. We will learn topics like bag of words, TF-IDF, unigrams and bigrams — there are a lot of concepts like this. Put simply: step one focuses on cleaning the input, and step two focuses on converting the input text into vectors. This is a super important step, because the vectors should capture the context of the statement. At the end of the day, whatever NLP techniques exist right now — Transformers, BERT, which are quite advanced — if you can convert the input text into meaningful vectors, you can solve your use cases in a better manner.

There are still more advanced text preprocessing techniques, which brings us to the third block: text preprocessing part three. Here we use more advanced techniques like Word2Vec and average Word2Vec, which are also techniques for converting input text into vectors. You may be thinking: "Krish, in the second step you wrote the same thing, and in the third step you wrote the same thing again." Yes — but as we go from the second step to the third, the conversion of text to vectors gets better than the part-two approaches (bag of words, TF-IDF, unigrams/bigrams). As learners we really need to know all these steps. Word2Vec and average Word2Vec are really a kind of deep learning technique, but we will learn how they work.

Continuing up the roadmap, we also need to understand RNNs — LSTM RNN and GRU RNN. These are deep learning techniques used for handling and solving text-related use cases like spam classification, text summarization and many more. These are neural networks you should be familiar with before we enter the deep learning part; since I'm laying out the roadmap, I need to mention them.

Next, there is a technique called word embeddings, which is another amazing way to convert input text into vectors. Word embeddings internally use some amount of Word2Vec, but we can also train our own word embeddings. Then, similarly, we have Transformers and BERT. These are advanced techniques; I mention them so you can see the whole picture. If you really want to become a pro, follow this pattern and complete everything up to BERT and see its applications. As we go from bottom to top, the accuracy of the model keeps increasing — and remember one more thing: the size of the model also increases. Whichever machine learning or deep learning models you create to solve NLP use cases, they keep getting bigger as you move towards Transformers and BERT.

As I said, initially we are going to learn NLP with respect to machine learning, so we will focus on the first three blocks; when we start deep learning we will focus on the top three. For the machine learning part we will use libraries like NLTK — there is also a library called spaCy that can perform all the tasks in the bottom three blocks, but we'll focus on NLTK; if you know NLTK, learning spaCy is also very easy. For deep learning we will usually go with TensorFlow or PyTorch; both are quite amazing — TensorFlow is open source and was created by Google, PyTorch is open source and was created by Facebook. At the end of the day the main aim is the same: your input data is text, and you perform text preprocessing to convert the
input text data into vectors, so that you can solve amazing NLP use cases using both machine learning and deep learning. That is a brief idea of the entire roadmap and how we are going to prepare.

Hello guys, we are going to continue the discussion with respect to NLP. In this video we are going to look at some amazing use cases — use cases we rely on every day in our day-to-day activities. Let me share my screen and talk about them one by one; this will give you a clear idea of what natural language processing is.

Here I have opened my Gmail tab. Let's say I want to write an email — "Hello Krish" — and I make a spelling mistake in "clarify". You can see the spelling gets corrected automatically, and this is because of NLP. Also, when we receive an email, Gmail can suggest an automated reply: it auto-generates text for us and we can just send it. Similarly, as I type "I wanted to clarify about the data science…" it suggests the word "course", and I can just select it. This is an amazing use case we use every day.

Now let's go to LinkedIn. Rudra Pratap Singh, one of my students, has posted, and you can see two suggested automated replies right there in LinkedIn — another amazing application of NLP. You can just click one of the two suggestions and reply back; in short, it saves you a lot of time.

Another application is Google Translate, which I hope everybody uses. Say I write "how are you" and want to convert it into Arabic — it comes out in Arabic script. If I instead convert it into Hindi, the Hindi for "how are you" appears automatically. This is also happening because of NLP. The same applies on most websites: on LinkedIn, if somebody posts in a different language, you get a "see translation" option below the post, and clicking it converts the post to English.

Not only that — let's say I search for "Krish Naik" (that's my name; I'm a YouTuber and also a co-founder of iNeuron). If I go and select Images, you can see the search engine has detected my image: it understands the text of the query, knows it is talking about me, and retrieves the matching images available on the web. And if I select Videos, you'll see all the videos from my YouTube channel and other public platforms accessible on the web. Text-to-image and text-to-video are also active research areas across different companies. There is an amazing company called Hugging Face, which has created models and solutions for question answering, summarization, text classification, translation and more — for question answering alone it hosts over two thousand models, and for summarization 588 models — and you can see which companies use it: Google AI, Intel, Microsoft, Grammarly and others. All of this is an application of NLP, because these applications take text data and perform different kinds of tasks on top of it.

One more thing I want to show you is Alexa and Google Assistant. I have some Alexa devices configured at home — I can control my ACs and lights — but let me open a simple Google Assistant example: "Hey Google, do I have any doctor appointments tomorrow?" — "Sorry, I can't find anything on your calendar that matches that." You can see it is able to retrieve from my calendar and check whether I have a doctor's appointment tomorrow. It is really easing our most routine day-to-day tasks. So we use NLP extensively every day, and this is what we are going to learn in this course: how to build these kinds of models, what kind of text preprocessing happens, and how these applications work.

Hello guys, we are going to continue the discussion with respect to natural language processing. In this video we are going to cover some basic terminology required in NLP. You really need to understand these terms because I am going to repeat them again and again in later topics. The topics covered in this video are corpus, documents, vocabulary, and words, with some basic examples. A paragraph is usually called a corpus. Sentences are usually called documents. What about vocabulary?
Vocabulary is all the unique words present in the paragraph. Think of a dictionary: when we talk about the vocabulary of a dictionary, we mean all the unique words in it (or the count of those unique words). And "words" are simply all the words present in the corpus, each treated separately. These are the basic terminologies you really need to understand.

In this video we are also going to discuss tokenization, a very important step whenever we solve any kind of NLP use case. What exactly is tokenization? Let's say I have a paragraph: "My name is Krish. I have an interest in teaching ML, NLP and DL. I am also a YouTuber." This entire text is my corpus — a paragraph of words. Tokenization is the process where we take a paragraph (or sentences) and convert it into tokens. If I tokenize this paragraph into sentences, the tokens that are generated are sentences, also called documents. My first document is "My name is Krish"; the next, because of the full stop, is "I have an interest in teaching ML, NLP and DL"; and the third is "I am also a YouTuber". When we tokenize a paragraph into sentences, the tokenizer looks for characters like full stops and exclamation marks — I'll show you practically how this is done with Python.

There can also be a scenario where I have sentences and perform tokenization on top of them again; this time the tokenization converts the sentences into words. Each and every word becomes a separate token: "my" is a separate word, "name" is a separate word, "is", "Krish", "I", "have", "interest", "in", "teaching" — all separate words. This process is also called tokenization. In short, words can be tokens and sentences can be tokens. Why is this required? Because it is part of text preprocessing: each word in NLP eventually needs to be converted into a vector, so we need to take each word and do this kind of preprocessing (there are also cleaning steps, which I'll show later).

Now let's understand vocabulary — the unique words — with an example. Take this paragraph: "I like to drink apple juice. My friend likes mango juice." How many sentences are there? Two, because there is a full stop. If I tokenize it, the tokens are the two sentences: "I like to drink apple juice" and "My friend likes mango juice". Now count the words: 1, 2, 3 … 11 in total. But if I count the unique words: "like" and "likes" are two different words, while "juice" is repeated, so the total number of unique words is 10. That is my vocabulary for this paragraph — all the distinct words I have. (If the second sentence had used "like" instead of "likes", the vocabulary would be 9, because then both "like" and "juice" would be repeats.) I hope you understand the basic differences between corpus, documents, vocabulary and words — this is super important when learning tokenization. If somebody asks you for the definition of tokenization, you can just say: tokenization is the process of converting a paragraph or sentences into tokens; tokenizing a paragraph can mean converting it into sentences, or into words.
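To make the counting concrete, here is a minimal plain-Python sketch of the example above (the punctuation handling is deliberately naive; it is just to verify the 11-word / 10-unique-word count):

```python
# The two-sentence paragraph from the example above.
corpus = "I like to drink apple juice. My friend likes mango juice."

# Naive cleanup: drop the full stops and lowercase everything,
# then split on whitespace to get the individual words.
words = corpus.replace(".", "").lower().split()

print(len(words))          # 11 -> total number of words
print(len(set(words)))     # 10 -> unique words, i.e. the vocabulary size
print(sorted(set(words)))  # note 'like' and 'likes' count as different words
```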
Hello guys, we are going to continue the discussion with respect to NLP. In the previous video we saw tokenization and understood the basic terminologies — corpus, documents, vocabulary, words. Now let's do something practical; you know how much theory we've covered so far. First of all I'll open my Python 3 notebook file. For this we are going to use a library called NLTK. NLTK is an amazing library — as its site says, "NLTK is a leading platform for building Python programs to work with human language data". So if you want to do NLP tasks like tokenization, or convert sentences into vectors, it can easily be done with NLTK. There is also one more library called spaCy; if you search for "spaCy NLP" you'll see it is also a completely open-source library, and I will show you how to do different things with spaCy as well. One assignment I really want to give you: find out the differences between NLTK and spaCy, and let me know in the comment section of this video or in the upcoming ones. Take that task and try to understand the difference between these two open-source libraries.

Since we are starting with NLTK, first I'm going to install it. You can install it directly from the notebook, or open a command prompt and run `pip install nltk`; once you do this the installation happens automatically. (It mentions that a new release of pip is available — I won't update it right now, as the current pip covers everything we need.) Now I'm going to show you a tokenization example: if I have a paragraph, how to convert it into sentences, and then how to convert it into words — including multiple ways of doing tokenization. Let me make a few cells.

First of all I'll define my own corpus — remember, a corpus basically means a paragraph. To create a multi-line string I use triple quotes: "Hello, welcome to Krish Naik's NLP tutorials. Please do watch the entire course! to become expert in NLP." Notice I'm deliberately using different characters — a comma, a full stop, an exclamation mark — so you can see how they are handled. So I've defined a simple corpus, a paragraph with a couple of sentences. If I just display the corpus, you can see the `\n` that indicates the new line; if I print it, you see the text laid out as it actually appears.

Now, tokenization. The first step is to convert the paragraph into sentences, and with NLTK that is easy: `from nltk.tokenize import sent_tokenize`. Inside the `nltk.tokenize` package there is a function `sent_tokenize` that converts a paragraph into sentences. So I call `sent_tokenize(corpus)`, and it gives us a list of sentences: "Hello, welcome to Krish Naik's NLP tutorials." is the first, and it has split at the exclamation mark too, so I'm getting three. As soon as it finds a full stop it starts a new sentence, and wherever an exclamation mark is present it also creates a new sentence — that is what sent_tokenize does. If you want the details, check its signature: you can pass different languages, and you can look up which languages it supports. If the notebook help isn't enough, just search for "nltk sent_tokenize" and you'll find the documentation page directly; there are different tokenizers there — sent_tokenize, word_tokenize (which we'll discuss next). In short, it "returns a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer", which is currently the Punkt sentence tokenizer — that is why, along with full stops, exclamation marks also start a new sentence.

Let me save this result in a variable. These sentences are also called documents, as I discussed in the previous session. If I inspect `documents`, it is a list — you can check with `type(documents)`.
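As a runnable sketch of the steps just described — the corpus wording is taken from the video, so treat it as approximate, and note that sent_tokenize needs NLTK's 'punkt' tokenizer data downloaded once:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer data

corpus = """Hello, welcome to Krish Naik's NLP tutorials.
Please do watch the entire course! to become expert in NLP."""

documents = sent_tokenize(corpus)
print(type(documents))  # <class 'list'>
for sentence in documents:
    print(sentence)
# Splits at '.' and '!', so this corpus yields three sentences.
```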
If I want to iterate through the documents, I can write `for sentence in documents: print(sentence)` — first sentence, second sentence, third sentence. So we are done with sentences; now let's move on to word tokenization. The next technique: I can convert a paragraph into words, and I can also convert a sentence into words. Let's start with the paragraph. Again I import from NLTK — `from nltk.tokenize import word_tokenize` (watch the spelling). If I call `word_tokenize(corpus)` directly, you can see every word has been separated — "Hello", "welcome" — and the punctuation characters like the comma and the full stop are treated as separate tokens altogether. One thing to notice: the 's in "Naik's" comes out as a single token with the apostrophe still attached, while full stops, exclamation marks and commas are all separate tokens. So converting a paragraph into words is very simple. Why do we do this? Because each word will have a different importance, and we need to perform preprocessing on top of it — identify which words are important, clean them — and that is why we focus on each and every word.

That was word_tokenize on the whole corpus. For sentences, you already know how: in the loop over sentences, after you get each sentence you can directly apply `word_tokenize(sentence)` and print everything — "Hello", "welcome", …, "Please", "do", "watch", "the", "entire", "course", and so on. So we have also seen how to convert a sentence into words.

There is one more tokenizer I want to talk about: `wordpunct_tokenize`. If I apply wordpunct_tokenize to my corpus, try to spot
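A short sketch of the word-level tokenization on the same corpus, contrasting word_tokenize with wordpunct_tokenize (again assuming the 'punkt' data from the previous snippet is downloaded):

```python
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize

corpus = """Hello, welcome to Krish Naik's NLP tutorials.
Please do watch the entire course! to become expert in NLP."""

# Commas, full stops and exclamation marks come out as separate tokens.
print(word_tokenize(corpus))

# wordpunct_tokenize splits on every punctuation character, so the
# apostrophe in "Naik's" also becomes a token of its own.
print(wordpunct_tokenize(corpus))

# Sentence-by-sentence word tokenization, as in the loop above.
for sentence in sent_tokenize(corpus):
    print(word_tokenize(sentence))
```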
the difference. One difference is clearly visible: the apostrophe-s has now been split apart — before, the apostrophe stayed attached — because this tokenizer treats every punctuation character as a separate token.

There is also one more technique, called the Treebank word tokenizer, and again I'll let you find the difference yourself. I import it — `from nltk.tokenize import TreebankWordTokenizer` — initialize it into a `tokenizer` variable, and then call `tokenizer.tokenize(corpus)`. Now compare this output with word_tokenize. There is a very minute difference; pause for a moment and look, then let me tell you the answer. With the Treebank word tokenizer, a full stop in the middle of the text is not treated as a separate token — it stays attached to the previous word. With word_tokenize, every full stop was a separate token. With Treebank, only the last full stop — the one that closes the final sentence — is split off as a separate token; the sentence-internal full stop is considered part of the preceding word. That is the difference. It can be handy in some use cases, but not all; generically, we mostly use word_tokenize and sent_tokenize. So that was it for the tokenization examples — I hope you liked this video.
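And a sketch of the Treebank behaviour just described — note how the sentence-internal full stop stays attached to the preceding token while the final one is split off:

```python
from nltk.tokenize import TreebankWordTokenizer

corpus = """Hello, welcome to Krish Naik's NLP tutorials.
Please do watch the entire course! to become expert in NLP."""

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(corpus))
# 'tutorials.' keeps its full stop (it is mid-text), while the closing
# full stop at the very end is emitted as a separate token.
```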
Hello guys, we are going to continue the discussion with respect to natural language processing and move on to more text preprocessing techniques. In the previous video we saw tokenization — converting a paragraph into sentences, a paragraph into words, and sentences into words — with the help of NLTK. In this video we focus on stemming, which is a very important process. I'll give you the definition, we'll see a lot of examples, and we'll look at the different types of stemmers.

First, the definition (taken from Wikipedia): stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes, or to the roots of words, known as a lemma. Stemming is important in natural language understanding and natural language processing. What does this mean in practice? Let's say I'm solving a classification problem — a simple one: given comments or reviews on a product, find out whether each review is positive or negative. In the data set we'll have the reviews, which are text data, and we need to build a model that classifies positive versus negative. Now, in these reviews there may be words like "eating", "eat", "eaten" — different words, but at the end of the day they all represent the same thing. "Eat" is the root — the word stem — of all of them. Similarly I might have "going", "gone", "goes"; at the end of the day they are all about "go", and "go" is the word stem of all those words. It is not necessary to keep all these similar word forms, because it increases the number of input features — remember, after text preprocessing each word contributes to the vector representation. Instead of several variants I can keep just one form, like "go", and this variety of forms won't matter much for predicting the positive/negative output. Stemming helps us do exactly that: finding the word stem. (There is also a related concept called lemmatization; we'll see the difference as we go ahead.)

Let's see how to perform stemming with NLTK. I'll take some example words: "eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "finally", "finalized". The first stemming technique we'll use is the Porter stemmer — there are different types of stemmers, which I'll show one by one. Applying it is very simple, since the functionality is already present in NLTK: `from nltk.stem import PorterStemmer`. PorterStemmer is a class, so we initialize an object — let's call it `stemming` — and then apply the stemming process to each word: `for word in words: print(word + " ----> " + stemming.stem(word))`. The `stem` function stems whatever word we pass in — for "eating" it may give "eat", for "eats" it may give "eat", and so on. Executing it: "eating" becomes "eat", "eats" becomes "eat", "eaten" stays "eaten", "writing" becomes "write" (good), "writes" becomes "write", "programming" becomes "program", "programs" becomes "program" — but "history" becomes "histori", h-i-s-t-o-r-i, which is a major issue: the entire meaning of the word has changed. "Finally" becomes "final" and "finalized" becomes "final". So mostly it looks good, but for some words, like "history", the form — and hence the meaning — of the word changes, and that is the major disadvantage of stemming. Let me show you more examples: `stemming.stem("congratulations")` gives "congratul" — it should have given something like "congratulate", but instead the output is not a meaningful word at all. On the other hand, `stemming.stem("sitting")` gives "sit", a very good result. So stemming works well for a good number of words, but for some words it does not give us a good answer.
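Here is the Porter stemmer walkthrough as one runnable sketch, with the outputs discussed above noted in comments:

```python
from nltk.stem import PorterStemmer

words = ["eating", "eats", "eaten", "writing", "writes",
         "programming", "programs", "history", "finally", "finalized"]

stemming = PorterStemmer()
for word in words:
    print(word + " ----> " + stemming.stem(word))
# eating -> eat, writing -> write, programs -> program ...
# but history -> histori: the stem need not be a valid word.

print(stemming.stem("congratulations"))  # congratul (meaning lost)
print(stemming.stem("sitting"))          # sit (works fine here)
```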
This is the major disadvantage of stemming, and it all gets fixed by lemmatization. Still, whenever you have a problem statement like review classification, or detecting whether an email is spam or ham, you can definitely go ahead with stemming — some words will not come out in the right form, but instead of the Porter stemmer we have other kinds of stemmers we can use to improve things.

The second technique is the RegexpStemmer class. This regular-expression stemmer class lets us easily implement a regular-expression stemming algorithm: we provide a regular expression and it applies stemming based on it. It takes a single regular expression and removes any prefix or suffix that matches the expression. Let me show you an example. First initialize it: `from nltk.stem import RegexpStemmer`. If I construct it with no arguments, it gives an error saying a regular expression is required — this is important. Press Shift+Tab and you can see the first parameter is the regular expression, and there is also a `min` parameter. The docstring says: a stemmer that uses regular expressions to identify morphological affixes; any substrings that match the regular expressions will be removed. So the "morphological affixes" are, in short, whatever the regular expression matches in the words we give; the matching parts get removed. The `min` parameter is the minimum length of the string to stem — if it is 4, stemming is only applied to words of length at least 4.

Now let's use it. I create `reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)` — I've given the patterns ing, s, e and able, each anchored with a dollar sign, which is super important. Pause the video and tell me: what output do you expect for "eating"? "ing" is in the expression, but the dollar is also there. If I execute it, I get "eat". In short, the expression says: if the word ends with "ing", "s", "e" or "able", remove that ending. Next example: take the word "ingeating". What do you think the output will be — "eat", or something else? If I execute it, you get "ingeat" — why? Because the dollar anchor says to match only at the end. If I remove the dollar and execute again, every occurrence of the pattern gets removed. And if I instead anchor at the start with a caret, only a leading "ing" is stripped, so "ingeating" becomes "eating". So this is how we control where affixes are removed. You can definitely try different words — with "able", "e", "s", or whatever regular expression you want to write, go ahead and check it. That was the regular expression stemmer class.
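The RegexpStemmer experiments above, as a runnable sketch:

```python
from nltk.stem import RegexpStemmer

# Strip the listed affixes; '$' anchors each pattern to the end of the
# word, and min=4 means only words of length >= 4 are stemmed at all.
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

print(reg_stemmer.stem('eating'))     # eat    (trailing 'ing' removed)
print(reg_stemmer.stem('ingeating'))  # ingeat (only the final 'ing' matches)

# Without the '$' anchor, every occurrence of the pattern is removed:
print(RegexpStemmer('ing', min=4).stem('ingeating'))   # eat

# With '^' instead, only a leading 'ing' is stripped:
print(RegexpStemmer('^ing', min=4).stem('ingeating'))  # eating
```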
The next stemmer is the snowball stemmer, which is also an amazing technique, and a better one: it performs better than the Porter stemmer. That is the reason the snowball stemmer came about — initially we had the Porter stemmer, we saw that a lot of words were getting messed up, so the snowball stemmer was created to give a better form of the word. To use it: `from nltk.stem import SnowballStemmer` (it is in NLTK, not sklearn — my slip). Looking at its parameters, the snowball stemmer supports different languages — Arabic, Danish, English, Finnish, French, German, Hungarian, Italian and so on. For now I'll pass 'english' in quotes, assign the stemmer to a variable, and execute. Then the same loop as before: `for word in words: print(word + " ----> " + snowball_stemmer.stem(word))` — the formatting is just so it reads better for you. You can see I'm getting "eating" → "eat", "eats" → "eat", "eaten" → "eat"… but for "history" it is still not able to give the correct form. You may be thinking, "Krish, then what is the difference?" So let's look at words where the Porter stemmer gives a bad form and snowball gives a better one. When I applied the Porter stemmer to "fairly" and "sportingly", the outputs were "fairli" and "sportingli" — that trailing "li". But if I apply the snowball stemmer to the same words, I get some amazing answers: "fair" and "sport". So altogether, applied across various words, the snowball stemmer performs better than the Porter stemmer: Porter finds the word stem, but for many words — "fairly" and "sportingly" among them — it does not give good answers, and the snowball stemmer overcomes that disadvantage. Again, understand: these are techniques that help you find the word stem, used in text preprocessing — you need to clean the data and get it ready so we can convert it into vectors efficiently.

One major disadvantage remains: even though the snowball stemmer performs exceptionally well, some words still break. We saw "history"; there is also "goes". If I try `snowball_stemmer.stem('going')`, "going" comes out fine, but for "goes" it gives a bad word — and `stemming.stem('goes')` does the same. In short, however hard we try, for some words the form of the word changes; this is an inherent disadvantage of stemming.
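And the Porter-versus-Snowball comparison as a sketch:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball_stemmer = SnowballStemmer('english')  # also supports Arabic, French, German, ...

for word in ['fairly', 'sportingly']:
    print(word, '->', porter.stem(word), '|', snowball_stemmer.stem(word))
# fairly -> fairli | fair ; sportingly -> sportingli | sport

# But stemming has no dictionary, so some words still come out mangled:
print(snowball_stemmer.stem('goes'))  # goe
```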
For that we have to go ahead with something called lemmatization, because lemmatization solves this particular problem: it has a dictionary of all the root words, so whatever word you give, it returns a good grammatical form. If I give 'goes' it will be 'go', if I give 'fairly' it will be 'fair', if I give 'eating' it will be 'eat'. That disadvantage gets removed completely with the help of lemmatization, and we will see that part in the next video. I hope you have understood till here; please make sure you practice with different examples.

Hello guys, so we are going to continue the discussion with respect to natural language processing, and in this video we are going to discuss lemmatization. In our previous video we saw stemming: we understood that stemming is the process of reducing a word to its word stem, and we understood its disadvantage, namely that for some words, when we perform stemming, we do not get the correct form of the word, and the entire meaning of the word gets changed. Please remember this term, word stem: stemming is just a kind of algorithm that tries to find the word stem, and for some words it works absolutely fine. We saw different stemming techniques (the Porter stemmer, the RegexpStemmer, the Snowball stemmer) and we found that the Snowball stemmer was better than Porter. All of that was covered. Now, in this video, we are going to discuss lemmatization, and the technique we are going to use is called the WordNet lemmatizer. I have already told you the disadvantage of stemming: the words we get are not in the correct form, and the meaning of the word changes; to prevent that we use a lemmatizer, and here that is the WordNet lemmatizer. So what is a lemmatizer? The lemmatization technique is like stemming, but the output we get after lemmatization is called a lemma, which is a root word, rather than the root stem, which is the output of stemming. Understand this: it is a root word. If I have 'eating', 'eat' is the root word; for 'history', 'history' itself is the root word; if I talk about 'goes', 'go' is the root word. In stemming, an algorithm is applied and it tries to find the word stem; here we get the exact root word. So the main aim of the lemmatizer is to give you the exact form of the word, which is meaningful, and it does not change the meaning, as was happening in stemming. I will repeat the definition: lemmatization is like stemming, but the output is a lemma, a root word rather than a root stem, and after lemmatization we get a valid word that means the same thing. So if there are words like 'eating', 'eats', 'eaten', it becomes 'eat' only: one meaningful word representing many words. NLTK provides a WordNetLemmatizer class, and I will show you how it works. And understand, guys: this lemmatization gets performed with the
help of the WordNet corpus reader. There is a dictionary of words it compares against, and that is how the lemmatization is done properly. So first of all, let's see how to implement it. I import: from nltk.stem import WordNetLemmatizer. This class helps us perform lemmatization. First I create an object, lemmatizer = WordNetLemmatizer(), and execute it; this has executed perfectly. Now let's try an easy word. There is a function called lemmatize(), and I give it my word, say 'going'. The answer I get is 'going', which means it is trying to find the root word for it. But look at the functionality of lemmatize(): it takes two important parameters, the word and something called the POS tag, which by default is 'n'. 'n' means the word I am passing is treated as a noun. You may be thinking, Krish, how many POS tags are there? Let me write a comment: for noun we give 'n', for verb 'v', for adjective 'a', for adverb 'r'. By default the POS tag is 'n', and with 'n' I get this output; let me change it to 'v' to show you the difference. When I give the POS tag as 'n', I am saying: consider 'going' as a noun. But 'going' is not really a noun here. If I consider it as a verb, you see I get a good lemmatization: 'go'. Similarly, with the adjective tag 'a' I get 'going' (and obviously 'going' is a kind of verb, not an adjective), and with the adverb tag 'r' I also get 'going'. So for the word 'going', the verb tag is the correct one to select. Now let me take all the words I had copied earlier and apply the lemmatizer to the entire list: I write the same for loop, and instead of stemmer.stem I use lemmatizer.lemmatize. Now see: 'eating' stays 'eating', 'eats' becomes 'eat', 'eaten' stays 'eaten', 'writing' stays 'writing', 'writes' stays 'writes', and 'programs' has become 'program', because, remember, by default the POS tag is 'n', so all these words are being treated as nouns. For 'programming', 'history', 'finally' and so on, you can see lemmatization has not really occurred, because everything is considered a noun.
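A small sketch of the lemmatize() calls being tried here. The one-time nltk.download('wordnet') is needed first (some NLTK versions also ask for 'omw-1.4'):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')   # the corpus the lemmatizer looks words up in

lemmatizer = WordNetLemmatizer()

# pos defaults to 'n' (noun); 'v' = verb, 'a' = adjective, 'r' = adverb
print(lemmatizer.lemmatize('going'))            # 'going' -- treated as a noun
print(lemmatizer.lemmatize('going', pos='v'))   # 'go'    -- treated as a verb
print(lemmatizer.lemmatize('going', pos='a'))   # 'going'
print(lemmatizer.lemmatize('going', pos='r'))   # 'going'
```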
And for nouns, lemmatization does not really happen. Suppose all my words are names: if I have 'Krish', if I have 'Sudhanshu Kumar', these kinds of words are names, in short nouns; famous place names like the Taj Mahal, India and so on will also be considered nouns. That basic difference you really need to know in order to understand this. Let's say I change the tag to adjective: here you can see 'eating', 'eats', 'writing', everything stays the same. But now let me try with verb, because most of these words are actually verbs. Now 'eating' has become 'eat', 'eats' has become 'eat', 'writing' becomes 'write', 'writes' becomes 'write', 'programming' becomes 'program', and look at 'history': it stays 'history', whereas in stemming we used to get 'histori'. This is quite amazing. 'finally' stays 'finally', and so on. And remember that in stemming some words did not perform well: 'fairly', 'sportingly', even 'goes'. So if I write lemmatizer.lemmatize('goes'), I get 'go', which is good, and I can also play with the POS tag; with 'v' I still get 'go'. Let me also try 'fairly' and 'sportingly': this lemmatizer is super amazing, because it gives you the good word form and the meaning of the word is maintained; 'fairly' and 'sportingly' come back as they are. With the noun tag it may not give you much, and with the verb tag I get the same answer, which is fine. This is just to show you how good the lemmatizer is. But one question I really wanted to ask you: which will take more time, the WordNet lemmatizer or stemming? The answer is simple: the WordNet lemmatizer. Why? Because NLTK provides the WordNetLemmatizer class as a thin wrapper around the WordNet corpus: it uses the morphy() function of the WordNet corpus reader class to find a lemma, so it has to compare against the corpus, and that takes time. Right now I have only a handful of words, so it runs very fast, but if you have a paragraph or a bigger document, lemmatization is going to take time. So this is the basic difference between stemming and lemmatization. For which use cases can you use this? Q&A chatbots are an amazing example; text summarization is another, and it is used in many, many companies. You can implement the WordNet lemmatizer there, because you get the exact, meaningful root form of the word.
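Here is the same word list run through the lemmatizer as verbs, as a sketch (the list is the one used in the stemming section):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["eating", "eats", "eaten", "writing", "writes",
         "programming", "programs", "history", "finally", "finalized"]

for word in words:
    # pos='v' because most of these are verb forms
    print(word, '---->', lemmatizer.lemmatize(word, pos='v'))
# eating -> eat, eats -> eat, eaten -> eat, writing -> write,
# programming -> program, history -> history (intact, unlike 'histori')

print(lemmatizer.lemmatize('goes', pos='v'))    # 'go'
print(lemmatizer.lemmatize('fairly', pos='v'))  # 'fairly' -- unchanged
```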
Hello guys, so we will be continuing the discussion with respect to natural language processing. We are still in text pre-processing techniques: we have seen tokenization, we have seen stemming and its different types, and along with that we have seen lemmatization. Now we are going to take up a topic called stop words. In this video I am going to discuss stop words and their importance, and again I will show you everything with the help of NLTK. Text pre-processing is a very important step in natural language processing, because you really need to clean the data and get it into the right format; later on we will convert all this text data into vectors, and only then will we be able to train the model. Whenever we train any machine learning model, internally we are really training mathematical equations, and for that the input data has to be in the form of numerical or floating-point values. So let's understand what exactly stop words are. I have opened a new notebook file, and here I have one amazing speech from Dr. A.P.J. Abdul Kalam, a former president of India. You can read it completely over here, and it is given in the materials; it is a very motivational speech, and if you are learning I would suggest you read it, you will get a lot out of it. Now, what I am going to do is talk about stop words and why it is important to remove them. In this speech (and I can call the whole thing a paragraph, or a corpus) there are a lot of sentences like 'I have three visions for India. In 3000 years of our history, people from all over the world have come and invaded us.' And there are words like 'I', 'the', 'have', 'of', 'to', 'there', 'why': these kinds of words will not play a big role when we are doing a task like spam or ham classification, which I have already talked about, or deciding whether a review is positive or negative. (Some words like 'not', though, can actually play a very important role; keep 'not' in mind.) So, with the help of stop words, we remove those words: in these kinds of use cases, where you are focusing on some of the important words to determine the outcome, words like 'I', 'the', 'he', 'she', 'of', 'there' are not required at all. We can pass this entire paragraph through that stop-word list and see which words can be removed; that, in short, is the importance of stop words. Let me execute this and make some cells so it is easy for you to follow. Along with stop words I will also apply stemming; I will show you the combination, which will be super important for everyone. First of all I need my imports: for stemming you obviously know what to import, from nltk.stem import PorterStemmer, and I execute it. Along with this, I obviously also need to import
stop words, because for English there is an entire list of words like 'the', 'he', 'she' and so on. So I write: from nltk.corpus import stopwords. Now, these stop-word lists also have to be downloaded, so let me also import nltk and write nltk.download('stopwords'); there are stop-word lists for many different languages, and we will look at that too. When I run this, you can see 'downloading package stopwords' to a particular location (the package is already up to date) and I get True. In short, I have downloaded the stop-word lists for all the different languages present in the NLTK library. Perfect. Now let's see which stop words are available in English. I take stopwords and write stopwords.words(), and here I just need to give the language, like 'english', or something else like 'german'. If I execute it with 'english', you can see the whole list of English stop words, and all of them can be removed. Now you may be thinking: Krish, this may depend from data to data, and you are right. This is just a list, and you can also create your own English stop-word list, because some of these entries, like "aren't" and "couldn't", can actually play a very important role in finding out whether a statement is positive or negative; 'not' is in there too, search for it and you will find it. So it is always a good practice to curate your own stop-word list and remove only those kinds of words from the paragraph. I hope everybody is able to understand. Now, with respect to other languages, you can always check the documentation, but let me show you German: German has its own specific stop words, and you also have French, so for text in a different language you can apply the stop words for that language. You may be wondering about Hindi or Arabic. For Arabic, yes, it is there; Hindi I could not find, I guess. Again, you can check it out in the NLTK documentation; all the information is given there. Now, my text is already in English, and I am going to perform two important tasks: I will apply stemming, and before applying stemming, wherever I find stop words I am going to remove them from this paragraph, so that the entire paragraph gets shortened.
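A quick sketch of the stop-word calls shown here (the first few English entries are roughly as listed; the exact list can vary by NLTK version):

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')   # one-time download of all the language lists

print(stopwords.words('english')[:10])
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

print(stopwords.words('german')[:5])   # other languages are bundled too
print(stopwords.words('arabic')[:5])   # Arabic is there; Hindi is not
```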
So now, see: whatever we have learned from the start is all going to come together. First things first: from nltk.stem import PorterStemmer, execute it, and then write stemmer = PorterStemmer(); we really need to initialize it. The next step is to perform tokenization on the entire paragraph. For that I can use nltk.sent_tokenize and give it my paragraph. Now see this, guys: I get my entire set of sentences ('I have three visions for India', then 'In 3000 years...', the second sentence, the third, the fourth), all the sentences in the form of a list, just by using sent_tokenize. This is the tokenization process where we take a paragraph and divide it into sentences. Let me save this in a variable called sentences, which will be a list; and if you check type(sentences), it is indeed a list. Perfect, till here we have done it amazingly well: we have initialized the stemmer and we have tokenized. Now understand what we are going to do: I am going to traverse all the sentences, first apply the stop words (whichever words are not present in the stop-word list, we take only those) and then apply stemming to them. So the steps are: first apply stop words and filter, then apply stemming. Now see this; very simple, very important. I write a for loop: for i in range(len(sentences)). I could also loop directly over sentences, but then I would not get the indexes; with range, whatever length I give becomes the index range, 0 up to that length. For each i, I write nltk.word_tokenize(sentences[i]), because I have sentences and I need each and every word; sentences[i] picks the sentence by index, and from it I get the words. So I make a list called words: in short, the list of words inside the sentence. Till here we are done. After this comes one very important thing: for each and every word I need to check whether it falls in the stop-word list or not, and only if it does not fall in the stop words do I do the stemming. Understand the task step by step; this is super important. So here I write a list comprehension: stemmer.stem(word)
for each word, because words is a list of words and I have to take each and every one. So inside the list comprehension I write: for word in words, if word not in; that is, if the word is not present in the stop words, then only apply stemming. And I wrap stopwords.words('english') in a set. Why a set? Because some of the words may get repeated, and I don't want that. Through this, I take every word that is not present in the stop words, and stemming gets applied only to those words. Perfect. I save the result back into the variable words (I hope that is clear: after the stemming, everything goes back into words), and then finally I take sentences[i] and replace it, at the same index, with these words. But once I have the words, I need to join them back together, with a space in between obviously, so I use ' '.join(words), which joins them back into a sentence. So that line converts the list of words back into a sentence; very simple, and we have done it. Let me repeat what we have done: I am iterating through each sentence; I do word_tokenize, so from every sentence I get the list of words; from that list I check each word against the stop words; if it is not present, I do the stemming; after stemming I store everything back in the same list; and then I convert the list of words back into a sentence. Now, once I execute it and look at my sentences, you can see: 'I three vision india', 'In 3000 year histori' (see, h-i-s-t-o-r-y became h-i-s-t-o-r-i), 'peopl world came invad us', 'captur land conquer mind', and so on. All the stop words are gone: 'have' is gone, and you will not find 'the' anywhere. So whatever stop words were present got removed, and only then did we perform the stemming. Now you may be saying: Krish, the stemming does not look very good. For that I have already taught you the Snowball stemmer. So I import SnowballStemmer (it is very simple, the same task again) and initialize it with 'english'; after this we should get better sentences. I copy the same loop, label it 'apply snowball stemming', and instead of the Porter stemmer object I use the Snowball stemmer; that's it. Before running it, let me re-execute the sentences cell, because the sentences had already been modified by the previous run; okay, that has executed, and now I run the Snowball loop.
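Putting the whole loop together, as a sketch. The video runs this on the full Kalam speech held in a `paragraph` variable; the short stand-in string below is just so the snippet runs on its own, and swapping PorterStemmer() for SnowballStemmer('english') is the only change needed for the second run:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')       # models behind sent_tokenize / word_tokenize
nltk.download('stopwords')

stemmer = PorterStemmer()

# stand-in for the full speech text
paragraph = ("I have three visions for India. In 3000 years of our history, "
             "people from all over the world have come and invaded us.")
sentences = nltk.sent_tokenize(paragraph)

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # keep only the non-stop-words, stemming each survivor
    words = [stemmer.stem(word) for word in words
             if word not in set(stopwords.words('english'))]
    sentences[i] = ' '.join(words)   # join the words back into a sentence

print(sentences)
```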
Now if I go and see my sentences, you can see it is better. One more important thing: notice that over here I still have capital letters. Some sentences may contain the same word in lowercase too, and then it becomes a repeated word, because a capitalized word is treated as a separate word by the model. So one more thing we want is to make all the letters lowercase. Also, for some words the stemmer is still not giving a good result ('poverty', for instance, still comes out mangled), but if you do this with lemmatization you can get a good word. So let's try it with lemmatization. I hope everybody has understood the Snowball stemmer part; now I go back to my lemmatization code: from nltk.stem I import the WordNetLemmatizer and initialize it (we are just repeating things so that you also get practice). Perfect, this is done. Now I copy the same loop, and instead of the Snowball stemmer I paste in the lemmatizer. Before executing, let me re-run the sentences cell, because I need the original sentences again; I think it is somewhere here, paragraph, and here is the sentences cell, perfect. Now I execute the same thing and, okay, I am getting 'has no attribute stem'; sorry, it should be lemmatize: lemmatizer.lemmatize. Now you see it took some time, because it is checking against the entire corpus. If I now look at the sentences, the words are coming out correctly; with this you get something good. One more thing you can do: after the word you can also pass the POS tag. If you set the POS tag to 'v', I think you will get an even better output, because most of these words would be considered verbs; anyhow, we will look at POS tags in more detail separately. So: 'I three visions India', 'In 3000 years history people come', 'world come invade us capture land'. All the stop words are deleted, and we are getting a very good result, at least better than stemming and the Snowball stemming. So this was the entire process of text pre-processing; here we have discussed stop words and how you should go about cleaning text, and I hope everybody got the idea. Now, in lemmatization you can also see that it is not lowercasing anything, so what you can do is lowercase all the sentences. Let's say I write sentences[i] = sentences[i].to_lower(); let's see whether that works. I am not sure whether .to_lower() exists, but let's try. I execute this, go down, apply the lemmatizer, and I get: 'str' object has no attribute 'to_lower'. Okay, not a
problem, not a problem. What I will do instead is comment that out and apply the lowering to the word itself, inside the loop. You have to definitely try different things, and it's all about Google, you know; just Google it and you will get it. So I execute again and still get 'str' object has no attribute 'to_lower'. Why? Let me search for 'str lowercase python'... right, the method is .lower(), not .to_lower(). Let's see: .lower(); perfect, it has worked. Now if I go and see the sentences, it is done. I have no regrets about searching on Google, and you should also search on Google. So now you can see everything is in lowercase, along with the lemmatization.
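Here is the lemmatizer variant with the .lower() fix folded in, again as a sketch on the stand-in paragraph. Lower-casing each word before the stop-word check (rather than lower-casing the whole sentence) is one reasonable way to do what is described above, not necessarily the exact cell from the video:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

paragraph = ("I have three visions for India. In 3000 years of our history, "
             "people from all over the world have come and invaded us.")
sentences = nltk.sent_tokenize(paragraph)

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # str has .lower(), not .to_lower() -- the error hit in the video
    words = [lemmatizer.lemmatize(word.lower(), pos='v') for word in words
             if word.lower() not in set(stopwords.words('english'))]
    sentences[i] = ' '.join(words)

print(sentences)
```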
So, we are going to continue the discussion with respect to NLP, and in this video we are going to discuss something called parts-of-speech tagging. You saw in lemmatization that this POS tagging plays a very important role: depending on whether we pass a word as a verb or a noun, we get a different root form, and we saw a lot of examples of that. In this video we will try to understand how many different types of POS tags there are, and we will also see a practical example. Let's say I get a sentence like 'Taj Mahal is a beautiful Monument'. With the help of NLTK we can analyze it, and the output will be something like: 'Taj Mahal' considered a proper noun, 'beautiful' considered an adjective, 'Monument' considered a noun. I am just giving an outline; we will definitely run this example, and we will also see a more extensive one. Now, with respect to parts of speech, how many tags are there? You have CC coordinating conjunction, CD cardinal digit, DT determiner, EX existential 'there', FW foreign word, IN preposition, JJ adjective, JJR comparative adjective, and so on, which basically means that if I give a sentence, NLTK will automatically categorize every word into these different parts of speech. And there are many more: PRP is a personal pronoun, for words like 'I', 'he', 'she'; RB is an adverb, for words like 'very' and 'silently'; RBR is also an adverb but the comparative form, for example 'better'; then RBS, and a lot of others. Now, I am going to take a very good example, and one assignment for you: take that simple sentence above, find the tags yourself, and write them in the comment section. Here I have the speech of Dr. A.P.J. Abdul Kalam, the same example I showed you for stop words, so I will take that. In this POS-tagging exercise I don't want to perform any stemming, because I want to see, for each and every word, what POS tag it gets; so there is no need to import all those things. I will just focus on importing nltk: import nltk, and then nltk.sent_tokenize, which converts the paragraph into sentences. Let me execute this, and you can see all the sentences. In this example I am showing you how, for each and every word, we can find the POS tag. Next: should we apply stop words? Initially I said we would not remove anything, but let's remove the stop words after all, because some of them will not play an important role. To do that I import the stop-word library (I had already written this earlier, so let me press Ctrl+Z): from nltk.corpus import stopwords, and apply the same thing we did last time, for word in words, if the word is not in the English stop words, take all those words. Then I comment out the join code, because I am not going to join the words back into sentences; I really need to see what POS tag each word gets. So I will simply write a print statement, and just before it I use something called nltk.pos_tag. You can apply pos_tag to words, and you can also apply it to sentences; right now I am applying it to the words. I create a pos_tag variable so everything gets stored there, and then I print it. Let me revise what we did: I am iterating through each sentence, converting that sentence into words, and for each word applying stop words; initially I thought
I would not apply them, but let's apply the stop words, because at the end of the day they are small words like 'is', 'the', 'he', 'she' that I don't want. So I remove the stop words and take the list of remaining words; I do not apply stemming, because NLTK needs to see the words exactly as they are to find the different parts of speech; and then I apply this one simple line, nltk.pos_tag(words), and finally print it. When I execute it I get an error (and if there is an error, I am not going to edit it out of the video): it says nltk.download('averaged_perceptron_tagger'). I definitely require this tagger to apply pos_tag, so I copy and paste that line; you will have to do the same, because you will face the same error. Now it has downloaded, and if I run it again, for the first sentence you can see 'I' is PRP. What is PRP? If you look it up: personal pronoun, and I had given the examples 'I', 'he', 'she'. Similarly for the other words: 'three' is CD, 'visions' is NNS, 'India' is NNP. And what is NNP? Proper noun, singular: a name, a place, a monument, that kind of thing; NNPS is proper noun plural, like 'Americans' or 'Indians'. So all those tags are there, and for all the sentences you can see how easily NLTK simply shows you the POS tags; that is how powerful NLTK is.
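The POS-tagging loop on the speech, as a sketch (same stand-in paragraph as before; the tagger download is the one the error message asks for):

```python
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')   # the model nltk.pos_tag needs

paragraph = ("I have three visions for India. In 3000 years of our history, "
             "people from all over the world have come and invaded us.")
sentences = nltk.sent_tokenize(paragraph)

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [word for word in words
             if word not in set(stopwords.words('english'))]
    pos_tag = nltk.pos_tag(words)   # list of (word, tag) pairs
    print(pos_tag)
```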
Now let me show you the assignment sentence: 'Taj Mahal is a beautiful Monument'. I copy it and try nltk.pos_tag on it directly, and look what happens: every single character has been considered separately, which should not happen; I try the other variant and the same thing comes out. The simple way in this scenario is a for loop: for i in the sentence, and if I call .split() on the string, you can see what I get: 'Taj', 'Mahal', 'is', 'a', 'beautiful', 'Monument'. So I iterate through each word and print nltk.pos_tag on that specific word, but there is still an issue, because I am iterating through each item one at a time: when I hand pos_tag a single word like this, it again treats it character by character, and every single character gets a tag. I really need to provide the whole list of words at once. So I copy that part, put it over here, and remove all the extra bits; and see, I am not editing out these errors, because you really need to see them. In short, when I pass it like this, I am passing a list of words, and if I execute now you can see: 'Taj' NNP, 'Mahal' NNP, 'is' VBZ, 'a' DT, 'beautiful' JJ, 'Monument' NN. So what did we learn? Whatever parameter we give to pos_tag should be given in the form of a list of words, and then we get the right answer. So guys, I hope you are able to understand parts-of-speech tagging; I definitely got some errors, but I really wanted to show you all the errors that come up.
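The fixed version of the assignment sentence, as a sketch. The point is simply that pos_tag wants a list of tokens, not a raw string:

```python
import nltk
nltk.download('averaged_perceptron_tagger')

sentence = "Taj Mahal is a beautiful Monument"

# a raw string would be tagged character by character -- split it first
print(nltk.pos_tag(sentence.split()))
# [('Taj', 'NNP'), ('Mahal', 'NNP'), ('is', 'VBZ'),
#  ('a', 'DT'), ('beautiful', 'JJ'), ('Monument', 'NN')]
```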
Hello guys, so we are going to continue the discussion with respect to natural language processing, and in this video we are going to discuss named entity recognition. This is an amazing topic, so let me open a file over here. Let's say we have a sentence like this: 'The Eiffel Tower was built from 1887 to 1889 by French engineer Gustave Eiffel, whose company specialized in building metal frameworks and structures.' From such a sentence we already know how to get the parts of speech, which word is a noun and so on, but along with that, with the help of NLTK, we can also get something called named entity tags. What are some examples of named entity tags? One tag is Person; another can be Location or Place; a third can be Date or Time. For example, 'Eiffel Tower' may come out as a place or location; 'Gustave Eiffel' will get tagged as a person's name; the year numbers are dates; and if a money value like '$1 million' were present somewhere in the text, it would get tagged as Money. So those are named entity tags; let's see how named entity recognition is done with the NLTK library. To begin with, I take this particular sentence and execute it over here, going just the way we did in the past videos; let me create some more cells. First I import nltk, and as you know I can use nltk.word_tokenize and give it the complete sentence. Once I execute it you can see 'The', 'Eiffel', 'Tower', 'was' and so on; everything comes out as a list of words. Now, as usual, for parts-of-speech tagging I write nltk.pos_tag and give it this entire list of words, so each word gets assigned a tag; let me store this in a variable, I will call it tag_elements. Now, to get named entity recognition, all I have to do is use nltk.ne_chunk. There is a function called ne_chunk, and if you look at its definition it says: use NLTK's currently recommended named entity chunker to chunk the given list of tagged tokens. So I pass my tag_elements into it. We may get an error, because we need to download something: nltk.download('maxent_ne_chunker'), since we are using a chunker model to get the named entities and it has to be downloaded first; this may take some time, because it can be a fairly large download. Once that is done, I execute again and still get an error: it says you also need nltk.download('words'). So I make one more cell; and please don't worry when you get errors, I have seen people panic; just read the error, it is very simple, and execute what it asks for. Once the words download finishes, all I have to do is call nltk.ne_chunk(tag_elements).draw(), and when I execute it, see what an amazing graph I get. Observe this; I don't know whether you can see it properly, but it presents most of the information very clearly: in this sentence, 'Eiffel Tower' is recognized as an ORGANIZATION; 'was' is VBD; 'built' is there; '1887' is CD; you can see GPE, with 'French' as JJ; and the person information is captured, 'Gustave' as a PERSON. All this information comes out very nicely.
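The whole NER snippet in one place, as a sketch. The two extra downloads are the ones the errors ask for, and draw() opens the tree window shown in the video:

```python
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')   # the chunker model behind ne_chunk
nltk.download('words')               # the word list it also needs

sentence = ("The Eiffel Tower was built from 1887 to 1889 by French engineer "
            "Gustave Eiffel, whose company specialized in building metal "
            "frameworks and structures.")

words = nltk.word_tokenize(sentence)
tag_elements = nltk.pos_tag(words)    # NER runs on POS-tagged tokens
tree = nltk.ne_chunk(tag_elements)    # Tree with PERSON / GPE / ORGANIZATION chunks
tree.draw()                           # pops up the parse-tree window
```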
So whichever entities get recognized, you will find them in the tree: PERSON is there, GPE is there, ORGANIZATION is there, and S at the root is the entire sentence; with this graph you can take in all the information at once. And if you go and check, there are other entity types you can find too: person, place, location, date, time, money and so on. So I hope you understood named entity recognition, how to run it, and how to see the diagrams. That was it for this video, but understand that NLTK provides this amazing feature, and you should definitely use it wherever it is required.

Hello guys, so we are going to continue the discussion with respect to NLP. We have already finished text pre-processing, where we performed stemming, lemmatization and stop words and cleaned the data. Now, what is our next step? We really need to convert the text into vectors, and there are multiple ways to do it. The first one we are going to discuss is called one-hot encoding, and we will try to understand how, with its help, we convert words into vectors. Let's take a specific example: here I have a text column and an output column, and our main aim is to take this text and convert it into vectors. (I am not going to lowercase all the sentences again here; let me focus on how you can implement one-hot encoding, the theoretical intuition behind it, and how the vectors are created.) To begin with, this is my text: document D1 is 'the food is good', D2 is 'the food is bad', and D3 is 'pizza is amazing'. Now, how do you find out how many unique words there are? The unique vocabulary plays a very important role while creating your vectors. If I combine all three documents, it becomes a paragraph or corpus, so let me write down all the unique vocabulary: 'the' is one unique word, 'food' is there, 'is' is there, and 'good' is there; in the second sentence 'the' and 'food' repeat, so I only add 'bad'; and in the third document 'pizza' is a new word, 'is' repeats again, and finally we have 'amazing'. So these words are my unique vocabulary: the unique words available in this entire paragraph, in this particular dataset. Now, based on these unique words, what does one-hot encoding mean? Let's consider document D1. If we apply one-hot encoding,
we convert every word into a vector representation. Say I consider the word 'the': its vector representation will be [1 0 0 0 0 0 0], with the one at the first position because 'the' sits first in the vocabulary. For the next word, 'food', the representation is [0 1 0 0 0 0 0]. So each word is represented by a vector of dimension V, where V is the vocabulary size: count them, one, two, three, four, five, six, seven words, so the unique vocabulary size is seven, and each word is represented by a seven-dimensional vector with a one at that word's position and zeros everywhere else. Now, how do I represent D1, 'the food is good', with these vectors? This document has four words: 1, 2, 3, 4. The first word, 'the', is [1 0 0 0 0 0 0], a one followed by six zeros. The second word, 'food', is [0 1 0 0 0 0 0]. The third word, 'is', is [0 0 1 0 0 0 0]. And the fourth word, 'good', becomes [0 0 0 1 0 0 0]. So this is the one-hot representation of document one: if I apply one-hot encoding to this text, this is what I get. Now, if I talk about the shape of this D1, which has these four word vectors, the shape is 4 x 7: four words by a vocabulary size of seven. Next we do the same thing for D2. I would suggest you pause the video, try it yourself, and then continue; anyhow, I will show the entire representation. In D2, 'the food is bad', the first three words are the same as in D1, so I write the same representations: [1 0 0 0 0 0 0], [0 1 0 0 0 0 0], [0 0 1 0 0 0 0]; the first three are just a replica of D1. Coming to the last word: in D1 it was 'good', but here it is 'bad', and whenever we have 'bad', that particular position becomes one,
so the vector is [0 0 0 0 1 0 0]. This is how D2 is represented, and again the shape comes out as 4 x 7. So this was the simple technique of one-hot encoding: how we convert a text into vectors. Each word is represented by a one-hot vector based on the vocabulary size; here the vocabulary size is seven, so a seven-dimensional one-hot representation is given to every word. Going forward, we will discuss all the advantages and disadvantages of this, and note that one-hot encoding is not really used in NLP use cases; we have techniques like bag of words and TF-IDF. We will understand the advantages and disadvantages, see why we should not use this, and then move on to bag of words and TF-IDF.

Hello guys, so we are going to continue the discussion with respect to one-hot encoding and NLP. I hope you have understood how we convert words into vectors; I explained the entire process above. In this video we are going to talk about the advantages and disadvantages of using one-hot encoding, so let me write them down: here are the advantages, and here are the disadvantages. First of all, the basic advantage is that it is very easy to implement with the Python programming language. Why do I say that? Because in sklearn we have a specific class with which we can easily implement this, called OneHotEncoder, and in pandas, if you are familiar with the pandas library, we have something called pd.get_dummies.
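A small sketch of both points: building the one-hot vectors by hand for D1, and the pandas helper just mentioned (the vocabulary order is the one written down earlier):

```python
import pandas as pd

vocab = ["the", "food", "is", "good", "bad", "pizza", "amazing"]

def one_hot(word):
    # a length-7 vector with a single 1 at the word's vocabulary position
    return [1 if v == word else 0 for v in vocab]

d1 = [one_hot(w) for w in "the food is good".split()]
for word, vec in zip("the food is good".split(), d1):
    print(word, vec)
# the  [1, 0, 0, 0, 0, 0, 0]
# food [0, 1, 0, 0, 0, 0, 0]  ... shape of d1 is 4 x 7

# pandas' get_dummies builds the same kind of encoding from a label column
print(pd.get_dummies(["the", "food", "is", "good"]))
```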
The get_dummies function will help you create this entire one-hot encoding based on the words. We will see examples of this as we go ahead, but I am not going to implement one-hot encoding separately as its own video, because we mostly don't use this technique in NLP, and I will tell you why: there are a lot of disadvantages. Let's talk about them. The first disadvantage: at the end of the day, for 'the food is good' I am getting [1 0 0 ...] for one word, [0 1 0 ...] for the next, and so on; so many zeros and ones. In linear algebra we call this a sparse matrix: a matrix (or, if we convert to arrays, a sparse array) made up mostly of zeros with a few ones. And what is the problem with a sparse matrix? Whenever we apply a machine learning algorithm after converting the text into vectors like this, in most machine learning algorithms it leads to something called overfitting. What is overfitting? It is a situation where you get very good accuracy on the training data, but on any new data the model does not give good accuracy. Sparse matrices usually lead to overfitting, so this is one of the major disadvantages. Now the next disadvantage. Take the same documents: 'the food is good', 'the food is bad', 'pizza is amazing'. In any machine learning algorithm, this vocabulary (the, food, is, good, bad, pizza, amazing) acts as my features. Every word converted into a vector has size seven, and the document shapes depend on the number of words: D1 has four words, so I get 4 x 7; D2 also has four words, so again 4 x 7. But consider the third sentence, D3, and start creating its one-hot encoded form: 'pizza' has its one at this particular position, so [0 0 0 0 0 1 0]; 'is' is [0 0 1 0 0 0 0]; and the third word is 'amazing', which is the last one in the vocabulary as far as I remember, so [0 0 0 0 0 0 1]. That is how D3 looks in one-hot encoded form, but note the size: it is 3 x 7. And in machine learning you need to understand one thing: whenever we perform NLP, or any machine learning use case, the number of input features should be fixed in length. But here this one is 4 x 7, this one is 4 x 7,
CR 7 this is 3 cross 7 so I cannot train I cannot train my this particular data for a machine learning algorithm because over here we still don't have a fixed text size right so this is one of the major major disadvantage unless and until and with the help of one notot encoding we are not getting a fixed text size right over here I got 4 cross 7 4 cross 7 if this was also 4 cross 7 I could have trained it right so over here I'll say that for ML algorithm we need we need fixed size we need fixed size input right and right now it is not there so this is again a major disadvantage right we'll be seeing in the upcoming lectures how with the help of bag of words DF IDF will'll be getting a fixed size of words okay now the third one that you'll be seeing most of the times we are finding zeros and ones right zeros and ones in zeros and ones most of the times see if if a specific word is there that will become one and remaining all will be zero but if I talk about the semantic meaning between these two words like the and fo right we are not able to exactly calculate like how far how equal it is how similar that specific word is and this process is basically called as semantic so here I will say that no semantic meaning is getting captured no semantic meaning is getting captured now let me talk about this in a very good example okay let's say I have something like this food pizza burger okay now you know that let's say that in my vocabulary I have three words so for the foot representation I will basically write it as 1 0 0 for Piza let's say I'm going to write it as 0 1 0 and for Burger I'm just going to write it as 0 01 Now understand these are my vectors and right now since there are three features I'm basically having three vectors now if you have probably heard of something called as cosine similarity or if you really want to find out the distance between this Vector to this vector and then this Vector to this vector if I probably consider this let's say I'm just going to draw a three dimension okay let's say this is being determined by something like burger and this is being determined by something like food and this is being determined by something like pizza now if I talk about all these things right like let's say in case of food it is 1 0 0 so in this axis I will probably getting one so this will be represented by 1 comma 0a 0 in three dimension and if I talk about p is a 0 1 0 so this will be my another Point let's say I'm just going to denote this by another Point 0a 1 comma 0 and let's let's talk about the third point over here the third point will be somewhere here burer it'll be in the same distance so this will be 0a 0a 1 now if I probably calculate the distance between this all these things it will almost be equal right so it is not being able to tell the exact difference between food p and burger it is obviously considering that okay all this words are in equal distance so we cannot understand that how this particular word is different from this right this is super important to understand so in short what I am actually trying to say over here is that no semantic meaning is basically getting captured that basically means in this particular sentence I'm not able to understand which is the most important word how this word is related to this word or how this word is much more similar to this word so that information is not getting captured because at the end of the day I'm getting one zeros okay so this was the uh third major disadvantage now talking about the fourth disadvantage and 
this is also very super important and this particular concept is something called as out of vocabulary out of vocabulary and what does this basically mean o okay now what what is this all about let's say that right now I have this many vocabularies of word let's say for the after I train my model now I want to test it for my new data set so for testing on my new data set this will be my test data let's say I will say Burg is bad and I need to predict this so this is my test data and I need to predict this now you know that over here with respect to this particular sentences okay Burger is nowhere present in this particular uh vocabulary so what will be the problem we will not find out a way we will not have have any way to represent this burger in the form of vectors right we'll not be able to form a vectors over here because anyhow we don't have a vocabulary word so what will happen in this particular case and obviously we will not be able to perform this one hot encoding because in our vocabulary I just have this many number of words so this will not work when a new word is actually coming wherein it is not present in the vocabulary with respect to the test data and obviously this is again a major disadvantage right so this technique is basically called as out of vocabulary so this was in short about the advantages and disadvantage of this understand sparse metric basically means many ones and zeros it leads to overfitting with respect to various machine learning algorithm then for any machine learning algorithm I definitely require a fixed size input right now we are not able to get it that basically means all my sentences should be of fixed size it will not work no semantic meeting sorry meaning is getting captured I've already told you because if you try to calculate a distance all this words words are equidistance it is not saying that whether Pizza is very much similar to food or Burger how much similar it is to food right that that similarity is not there and based on that semantic meaning is not getting captured because here either we are getting ones or zeros so we are not able to provide right over here you can see that much more maximum information which is the most most important word that information is also not getting captured so this was in short about the advantage and disadvantage I hope uh you have understood this and again the next uh technique with respect to converting word to vectors is something called as bag of wordss we'll try to see that how it is fixing some of the disadvantage from here and uh yes one more one more major disadvantage I'll say you that let's say that if you're right now I just have three sentences let's say I have just seven vocabulary let's say tomorrow I will be having 50k unique vocabulary size vocabulary size now what will happen in this particular case right I'll be getting so many number of ones and zeros so in short again this is actually leading it to sparse MC but in a real world scenario I'll just not be having three sentences right I'll be having bigger sentences and I'll also be having many sentences as such so this is also one of the thing in any interview if they ask you you probably need to talk about or explain like this with respect to Advantage and disadvantage hello guys we are going to continue the discussion with respect to NLP in our previous video we have already seen that we have discussed how one hot encoding works right and uh if I probably talk with respect to different types of ways how we can convert a text into a 
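To make the first two disadvantages concrete, here is a minimal sketch of word-level one hot encoding built on the pandas get_dummies function mentioned above; the vocabulary and sentence names (d1, d3) are just the toy examples from this lecture.

```python
import pandas as pd

# Toy vocabulary from the lecture: 7 unique words
vocab = ["the", "food", "is", "good", "bad", "pizza", "amazing"]

def one_hot(sentence):
    # Each word becomes a 7-dim 0/1 row, so a sentence of n words
    # becomes an n x 7 matrix: the size varies with sentence length.
    words = pd.Categorical(sentence.split(), categories=vocab)
    return pd.get_dummies(words).astype(int)

d1 = one_hot("the food is good")   # 4 words -> (4, 7), mostly zeros
d3 = one_hot("pizza is amazing")   # 3 words -> (3, 7), no fixed size
print(d1.shape, d3.shape)          # (4, 7) (3, 7)
```

The sparse, variable-size output is exactly why this representation cannot be fed directly to a classical ML model.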
Hello guys, we are going to continue the discussion with respect to NLP. In our previous video we saw how one hot encoding works; of the different ways of converting text into vectors, we've now completed one hot encoding, so we move to the second technique: bag of words. Let's go ahead and understand how bag of words actually works. This is important, because with this technique you can already solve simple text classification tasks, like sentiment classification, or the famous application of deciding whether a mail is spam or ham.

Let's say I have a dataset of positive or negative statements containing three sentences: "He is a good boy", "She is a good girl", "Boy and girl are good". All of these are positive statements, so the output for each is 1 (in supervised machine learning we really need the output as well). Let me go step by step and show how bag of words is implemented. Step one is the dataset itself. In step two we preprocess the text, and we'll do these same steps for all the other techniques too: first we lowercase all the words, then we remove stop words (which I've already shown you). I'm only taking the text here; we don't have to worry about the output column, because once the text is converted into vectors we can apply it to a machine learning algorithm. Why lowercase? Because there may be repeated words: in this data "Boy" with a capital B and "boy" with a small b are the same word, but with different case they would be treated as two separate words, so we lowercase everything first. Then we apply stop word removal: words like "he", "she", "is", "a", "and", "are" get deleted, because for a task like sentiment analysis we don't require them; the important words are things like "good boy" and "good girl". After these two steps, sentence one reduces to "good boy", sentence two to "good girl", and sentence three to "boy girl good". (I'm showing you this step by step; when we do it in Python with a library it is very simple, basically one library call, as sketched below.) Perfect, so I have my three cleaned sentences. Now what we do next is calculate the vocabulary: how many unique words are there?
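A minimal sketch of those two preprocessing steps (lowercasing, then stop word removal) using NLTK's English stop word list; it assumes you have run nltk.download("stopwords") once.

```python
from nltk.corpus import stopwords   # needs nltk.download("stopwords") once

stop = set(stopwords.words("english"))
sentences = ["He is a Good Boy", "She is a Good Girl", "Boy and Girl are good"]

# Lowercase every word, then drop the stop words
cleaned = [[w for w in s.lower().split() if w not in stop] for s in sentences]
print(cleaned)  # [['good', 'boy'], ['good', 'girl'], ['boy', 'girl', 'good']]
```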
The unique words are: good, boy, girl. Alongside each word I'll also track its frequency, meaning in how many sentences it appears. "Good" appears in all three sentences, so its count is 3. "Boy" appears in sentence one and sentence three, so its count is 2. "Girl" also appears two times, so 2. Notice that I've ordered the vocabulary by descending frequency, from maximum to minimum (boy and girl are tied, so those two could swap positions). So the vocabulary size here is 3. In a bigger dataset I'd have far more words, each with its frequency, and one important step: it is not necessary to use all of them. Say the vocabulary has 100 unique words and some of them appear only once; we may simply not take those. When we do the coding for bag of words, we'll have an option to select only the top 10 or top 20 most frequent words as features, and that is exactly the importance of this frequency.

Now the next step is very simple: based on the top frequencies, these words become my features, with good, boy and girl as columns. Sentence one is "good boy": wherever "good" is present that entry becomes 1, wherever "boy" is present that becomes 1, and the rest become 0. So the entire sentence gets converted into the vector [1 1 0]. Similarly for sentence two, "good girl": good becomes 1, girl becomes 1, everything else 0, giving [1 0 1]. I'll talk about the advantages and disadvantages of doing it this way shortly, but notice the contrast: in one hot encoding we created a vector for every word, whereas here one vector comes out for the entire sentence, and that has a lot of advantages. Sentence three contains boy, girl and good, so every entry becomes 1: [1 1 1]. These are my vectors, and obviously I will also have my output variable (1, 0, whatever it may be if I'm solving sentiment analysis or something like it). This is how bag of words converts an entire sentence into a vector; we can take these vectors, train a machine learning model, and get our outputs.

One more important point. Suppose a sentence contains the word "good" twice, say "good girl good". What happens then? In that case there are two variants: binary bag of words and normal bag of words. In a normal bag of words, since "good" is repeated two times, I increase its count to 2 (and to 3, 4 and so on for more repetitions); the count gets updated based on frequency. In a binary bag of words, even though the count is two, it is forced back to 1: however many times the word appears, the value is either 1 or 0, indicating only presence or absence. That is the basic difference between the two: one gives you only ones and zeros, the other gives counts. So I hope you've understood how bag of words converts text into vectors; in the next video I'm going to discuss its advantages and disadvantages, just like we did for one hot encoding.
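A minimal sketch of the whole pipeline with scikit-learn's CountVectorizer; binary=True would give the binary bag of words variant, and max_features lets you keep only the top-k most frequent words, as discussed above. Note the columns come out in alphabetical order (boy, girl, good), not frequency order.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["He is a good boy", "She is a good girl", "Boy and girl are good"]

# Lowercasing is on by default; stop_words="english" drops he/is/a/and/...
cv = CountVectorizer(stop_words="english", binary=False)
X = cv.fit_transform(corpus)        # a sparse matrix, one row per sentence

print(cv.get_feature_names_out())   # ['boy' 'girl' 'good']
print(X.toarray())                  # [[1 0 1], [0 1 1], [1 1 1]]
```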
Hello guys, we are going to continue the discussion with respect to bag of words. We've already understood the intuition, how it converts text into vectors, so as usual let's discuss the advantages and then the disadvantages. And since we did the same exercise for one hot encoding, we'll compare against that and see which problems actually get fixed.

First advantage: again, it is simple and intuitive to implement. Second advantage, and remember this was a major problem for one hot encoding: for ML algorithms, bag of words gives fixed-size inputs. Take any statement; one sentence may have three words, another five, another ten, but at the end of the day every sentence gets converted into a vector whose length equals the vocabulary size. The vocabulary is fixed, so the inputs are fixed, and that problem is solved. In one hot encoding we did not have fixed-size inputs, since we were creating a vector for every word. A fixed-size input helps enormously when training ML algorithms. Those are the two major advantages.

Now the disadvantages. First, the sparse matrix (or sparse array) problem is still there, and I've already told you what that means. If your vocabulary size is 50,000, every sentence gets converted into a vector of that size, mostly zeros, and this can again lead to overfitting. The second major disadvantage: the ordering of the words is thrown away. A sentence's vector only records which vocabulary words occur in it, not where; "good boy girl" and "boy girl good" would produce exactly the same vector. But when the ordering of words changes, the meaning of the sentence can change, so some of the semantic information is not captured. This is super important. Third disadvantage: out of vocabulary, same as before. Suppose I add a new word, so my test sentence is "boy girl good school", and I need a prediction for it. The first step will be to preprocess the test text and convert it into bag of words using the same technique and the same vocabulary we built on the training data; but my training vocabulary has no entry called "school", so that word simply gets ignored, and the vector only records where good, boy and girl occur. "School" may have been an important word for that sentence, but it is removed because it isn't in the vocabulary. So OOV still persists, and it's a major problem.

One more important point: semantic meaning is still not really captured, and there are multiple things to explain here. First, since the values are just ones and zeros, "good" and "boy" receive the same importance; a one only indicates whether a word is present. But which is the most important word, what the key context of the sentence is, none of that is captured, and if that isn't captured, the semantics in turn won't be. (A small amount of semantic information is captured compared to the one hot encoding format, at least we have whole-sentence vectors and counts, but not much.) Second, and this is very important, consider two sentences in my dataset: "the food is good" and "the food is not good", and suppose this time I don't go ahead and remove the stop words. The unique vocabulary is: the, food, is, good, not, so five words. The first sentence becomes vector 1 = [1 1 1 1 0], since "not" is absent, and the second becomes vector 2 = [1 1 1 1 1]. Only a single value differs between these two vectors. Now try to measure how similar they are, say by reducing the dimension to two with PCA and plotting the points, or by computing cosine similarity directly: the two vectors come out extremely close to each other, the angle between them is very small, so I would conclude that both sentences are almost the same, or at least very similar. But do you think these two sentences are similar? They are the complete opposite of each other! Yet because only one word changed, only one of the values changed.
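A minimal sketch of that failure case, assuming the five-word vocabulary [the, food, is, good, not] with stop words deliberately kept: cosine similarity rates the two opposite sentences as highly similar.

```python
import numpy as np

v1 = np.array([1, 1, 1, 1, 0])  # "the food is good"
v2 = np.array([1, 1, 1, 1, 1])  # "the food is not good"

# Cosine similarity: dot product over the product of the norms
cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(float(cos), 3))     # ~0.894, i.e. "very similar" vectors
```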
And so when we plot them, the two sentences look like near-duplicates, even though they should not: they mean the complete opposite of each other. This kind of situation is also not handled well by bag of words; the techniques we'll learn later, like Word2Vec, will be solving all these problems. So I hope you've understood the advantages and disadvantages of bag of words. This is super important with respect to interviews, and if your basics here are strong, trust me, you'll pick up Word2Vec and average Word2Vec in a very easy manner, and the embedding techniques coming in deep learning, the word embeddings, will all fall into place in a very easy way too.

Hello guys, we are going to continue the discussion with respect to natural language processing. In our previous video we covered bag of words; now we are going to see one more efficient way of converting words into vectors, called TF-IDF, which is nothing but term frequency and inverse document frequency. We'll understand how TF-IDF converts sentences into vectors, and I've taken the same example we used for bag of words, after lowercasing all the characters and removing the stop words: sentence one is "good boy", sentence two is "good girl", sentence three is "boy girl good".

There are two components in TF-IDF: term frequency (TF) and inverse document frequency (IDF). The formulas are:

TF(word, sentence) = (number of times the word is repeated in the sentence) / (number of words in the sentence)

IDF(word) = log_e(number of sentences / number of sentences containing the word)

This is super important, but don't get confused by the formulas right now; I'll explain everything step by step. Let's start with term frequency. I'll make a table with rows S1, S2, S3 and one column per vocabulary word: good, boy, girl (we already know only three words are present in the vocabulary; I'm keeping the example simple so you can see exactly how TF-IDF works).

Take S1, "good boy". The term frequency of "good" is how many times that word is repeated in the sentence, which is once, divided by the number of words in the sentence, which is two: so 1/2. "Boy" is also present once out of two words, so again 1/2. "Girl" is not present in sentence one, so 0/2 = 0. For S2: good is present one time out of two words, so 1/2; boy is present zero times, so 0/2 = 0; girl is present one time, so 1/2. For S3, "boy girl good": each word is present once and the total number of words is now three, so good, boy and girl are all 1/3. That's how simply we calculate term frequency. (I'll tell you why we're doing this when we reach the advantages; it's the key to why TF-IDF performs better than bag of words.)

Now let's find the inverse document frequency. For IDF I make a second small table, one row per vocabulary word, in the same order: good, boy, girl. For "good": the number of sentences is 3, and "good" is present in all three sentences, so IDF = log_e(3/3) = log_e(1) = 0 (you can verify this with a calculator). For "boy": the number of sentences is 3, and boy is present in two of them, sentence one and sentence three, so IDF = log_e(3/2). "Girl" is likewise present in two sentences, so its IDF is also log_e(3/2).

So I have independently calculated term frequency and inverse document frequency. Whenever we say TF-IDF, in short, we are multiplying the two: for each sentence and each word, TF times IDF. Let me write the final vectors, with the vocabulary good, boy, girl as columns (this will of course differ from dataset to dataset):

Sentence 1: good = 1/2 x 0 = 0; boy = 1/2 x log_e(3/2); girl = 0 x log_e(3/2) = 0.
Sentence 2: good = 1/2 x 0 = 0; boy = 0 x log_e(3/2) = 0; girl = 1/2 x log_e(3/2).
Sentence 3: good = 1/3 x 0 = 0; boy = 1/3 x log_e(3/2); girl = 1/3 x log_e(3/2).

So we've got all the calculations, and this is how the vectors look: each row is the vector for one entire sentence. In short, "good boy" has been converted into the vector (0, 1/2·log_e(3/2), 0), and similarly for sentences two and three. Obviously I'll also have an output column for whatever classification I want to do, and then I can train my model by passing in these vectors. You can check the values with a calculator; this is the phenomenon TF-IDF uses for converting words into vectors. Now you may be thinking: what is so special about this, we just got some values? That is exactly what I'm going to talk about in my next video: the advantages and disadvantages of TF-IDF.
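A minimal sketch of this hand calculation, using exactly the raw TF x log_e(N/df) definition from this lecture (scikit-learn's TfidfVectorizer uses a smoothed variant, so its numbers differ slightly).

```python
import math

sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = ["good", "boy", "girl"]
N = len(sentences)

def tf_idf(word, sent):
    tf = sent.count(word) / len(sent)          # term frequency
    df = sum(word in s for s in sentences)     # sentences containing the word
    return tf * math.log(N / df)               # TF x IDF

for sent in sentences:
    print([round(tf_idf(w, sent), 3) for w in vocab])
# [0.0, 0.203, 0.0]
# [0.0, 0.0, 0.203]
# [0.0, 0.135, 0.135]
```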
Hello guys, we are going to continue the discussion with respect to TF-IDF. I've already shown you the formulas, term frequency and inverse document frequency, and worked through an example, so everything up to here is fine. Now let's talk about the most important thing: the advantages and disadvantages, and why this is probably better than bag of words.

First advantage: again, this is quite intuitive, and the implementation is also quite intuitive. Second advantage: like bag of words, our inputs are of fixed size, based on the vocabulary size; bag of words had this too. But the third advantage is the major one, and it's something they may well ask you about in interviews: word importance is getting captured. Let me explain what exactly that means. Look at the entire little corpus, good boy / good girl / boy girl good, and the TF-IDF numbers we got; I've also written it out with bag of words, where I used to get either ones or zeros: wherever a word is present it comes out as one, otherwise zero, so equal importance is given to both "good" and "boy". TF-IDF does not work like that. Considering the entire corpus, it focuses on two things, term frequency and inverse document frequency, and the principle is: if a word is present in all the sentences, it should be given less importance. Understand this: if every sentence in the corpus contains that word, it is not playing any distinguishing role, and word importance needs to be captured per sentence; that is what we want. Take "good": it is present in all three sentences, so after we calculate TF-IDF its entire column is zeros. Not an issue but a good thing: we are deliberately ignoring the word "good" because it is present in every sentence. "Boy" and "girl", on the other hand, are repeated in only one or two sentences, not in every sentence, so we value those words within the sentences where they occur. In sentence one, "boy" plays a very important role, and indeed it gets a non-zero value there. In sentence two, boy is absent, so its value became zero, but "girl" gets a non-zero value, which means that in that particular sentence the word "girl" is super important and the context is based on that word. And the third sentence talks about both boy and girl, so both get values. In short, word importance is getting captured based on context. This is a super important point: through this, the machine learning model will be able to understand that a sentence is specifically about something, the mathematical model can make better predictions, and through this the accuracy increases.
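A minimal sketch with scikit-learn; note that TfidfVectorizer applies a smoothed IDF and L2-normalizes each row, so unlike the hand calculation "good" is not zeroed out exactly, but the pattern holds: the word that appears everywhere gets the lowest weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["good boy", "good girl", "boy girl good"]

tv = TfidfVectorizer()
X = tv.fit_transform(corpus)

print(tv.get_feature_names_out())   # ['boy' 'girl' 'good']
print(X.toarray().round(3))         # in each row, 'good' has the smallest weight
```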
Now let's talk about the disadvantages. Obviously in this case too you have a large number of zeros, so sparsity still exists (we'll see how we can solve sparsity using Word2Vec). The second thing is what we keep discussing: OOV, out of vocabulary. Here too, if I add any new word in the test data, it is going to get ignored, because here too all my features are built from the training vocabulary. So those are the advantages and disadvantages of TF-IDF; but just by comparing the two lists we can definitely say that TF-IDF performs better than bag of words. In the next video we'll see a practical implementation with the help of NLTK and Python, and again, you really need to practice this on different datasets; we'll try to provide as many assignments as possible so that you can.

Hello guys, we are going to continue the discussion with respect to natural language processing for machine learning. In this video we are going to discuss word embeddings, a topic I probably should have covered long back; I'm deliberately keeping it for this point, because we've already discussed so many techniques focused on converting words into vectors, so now you'll get a very clear idea of what exactly word embeddings are. Here I've given you the Wikipedia definition (the entire credit goes to Wikipedia): in natural language processing, word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word, such that the words that are closer in the vector space are expected to be similar in meaning. Let's say I have two words, "happy" and "excited". With the help of a word embedding technique we convert each word into a vector, and suppose I try to plot those vectors on a two-dimensional graph (if we really want to get down to two dimensions, we have techniques like PCA, an unsupervised technique for dimensionality reduction). If "happy" and "excited" come near each other based on their vectors, that indicates both are similar words. Now let's say I have one more word, "angry". Happy is essentially the opposite of angry, so with a good embedding, angry will land somewhere far away: the distance between happy and angry will be quite high, whereas the distance between happy and excited will be quite small. This indicates that the close pair are similar words, whereas the far pair are opposites. And all of this is possible only because of an efficient conversion of the words into vectors, which is exactly what word embeddings do.

Now, the techniques we have learned till now, one hot encoding, bag of words and TF-IDF, are all part of word embeddings. Let me clearly show you the division: word embedding techniques are of two types. The first type is based on count or frequency, and that is where the three techniques we've learned live: one hot encoding, bag of words and TF-IDF. The second type is based on deep learning trained models, and please pay close attention to this, because these deep learning trained models will give you very good accuracy. All of these techniques, at the end of the day, convert words (or sentences) into vectors, but we have seen that the count/frequency family carries a maximum number of disadvantages, and all those disadvantages get solved by the deep learning trained model, which is called Word2Vec. It's not that we cannot create it from scratch; we definitely can, but it requires a huge amount of data. In the upcoming videos we are going to understand what exactly Word2Vec is, how it converts a word into a vector, and how it solves all the disadvantages that were there in the earlier techniques. For now, just understand this: Word2Vec is a word embedding technique that efficiently converts a word into a vector, making sure that words similar in meaning end up close in the vector space; it gives a very good representation of the words, sparsity will not be there, and there are many more points I'm going to discuss in the upcoming videos. Word2Vec itself is of two types, because the entire deep learning architecture is built in two different ways. One is CBOW, continuous bag of words, which is super important; we're also going to see how those models get trained, and for that you really need prerequisite knowledge of how an ANN works, what a loss function is, and what optimizers are. The second technique is skip-gram, a different architecture, but again a type of Word2Vec. At the end of the day we can use either CBOW or skip-gram to get an efficient conversion of words into vectors. We're also going to see some pre-trained Word2Vec models, for example the one created by Google, which is somewhere around 1.5 GB in model size; we'll try to download it and execute it.
But what we'll see first, in the upcoming videos, is how Word2Vec word embeddings work and how they make sure that all the disadvantages present in the earlier techniques get removed. I hope you got the idea: at the end of the day, whatever techniques we have discussed till now for converting words into vectors fall under word embeddings.

Hello guys, we are going to continue the discussion with respect to natural language processing. In this video we are going to cover Word2Vec. We've already seen what it is at a high level: a deep learning trained model, a kind of word embedding technique whose focus is to convert words into vectors while making sure the meaning of different words is maintained, so that similar words get vectors very near to each other and we can also see which words are completely opposite based on those vectors. In the upcoming videos I'll show you how Word2Vec models are actually prepared, architecture-wise, as deep learning models; for that you really need to have knowledge of ANN models, if you want to understand how to train Word2Vec from scratch. For now, let's go through the definition and see what problems it actually fixes. Word2Vec is a technique for natural language processing published in 2013, by Google, an amazing company that, as you already know, has done tons of work and research in NLP. The Word2Vec algorithm uses a neural network model to learn word associations (please make sure you register those words: word associations) from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. And as the name implies, Word2Vec represents each distinct word with a particular list of numbers, called a vector.

Now let's understand what exactly that means. In bag of words and TF-IDF we have already seen that, based on the vocabulary size, we get mostly ones and zeros, in short a sparse matrix (in TF-IDF we may also get decimals like 0.25 or 0.6, but plenty of zeros remain). In Word2Vec it will be a little different. Let's say I have a vocabulary, that is, the set of unique words I have in my corpus (corpus basically means the paragraph or body of text), with words like boy, girl, king, queen, apple, mango. Now I'm going to put up one very important term: feature representation. Please listen to this very, very carefully; it is a very important topic. Each and every word present in the vocabulary will be converted into a feature representation, which basically means we are going to convert all these words into vectors based on some features. Now, what could those features be? Let me give you an intuitive example, but understand that when we train a really big Word2Vec model, you will not get a clear, nameable view of all the features; this is just to make the intuition clear. On one side, imagine I have lots of features: a feature called gender, a feature called royal, a feature called age, a feature called food, and so on up to some nth feature, with the total number of features being, say, 300, i.e. a 300-dimensional representation. (If I take the example of Google: they have come up with an amazing Word2Vec model trained on around 3 billion words coming from news feeds, and each and every word in it is represented by a 300-dimensional vector. I'll try to show you that practical example too.)

Now, with respect to this vocabulary, each word gets assigned one numerical value per feature, and that value is assigned based on the relation between the vocabulary word and the feature. With respect to "boy" and the feature gender, let's say I'm having -1. With respect to "girl" and gender, the value can be +1, because girl is the opposite of boy with respect to gender: hence -1 and +1. Now, with respect to "boy" and royal: we obviously can't say there's a strong relationship; we don't really have a sentence like "he is a royal boy"; he can be a royal prince or a royal king, so there is no proper relationship, and in this case there will be a small value like 0.01 (I'm just giving you an example). Similarly with respect to boy and age: not much relation, so I'm just going to put 0.03, very near to zero. And so on for all the other features. (All of these values come through properly trained models, Word2Vec trained with deep learning techniques like an ANN; I'll show you in the next video how those models are trained.) So for "boy" I will be having one complete vector, and similarly with respect to "girl": if boy has -1 for gender, girl will have +1, the complete opposite; with royal, again no relationship, so say 0.02; with age, also no specific relationship, say 0.02; and so on. Now see, with respect to "king": there is a relationship between gender and king, so say -0.92, and with respect to "queen" it can be +0.93, opposites again. King and royal are strongly related, so I can have 0.95, and a queen can also be royal, so the value can be 0.96 or 0.97, very near to each other. Age will also have some relationship with king (we say "the old king"), so suppose 0.75, and with queen, say 0.68. Because of these vectors, similar words end up very close to each other, as you'll see if we start subtracting vectors; I'll give you an idea about that shortly. Now, with respect to "apple": it will obviously have no relationship with gender, so a value very near zero like 0.01, and the same for mango. Apple and royal: again no relationship, so say -0.02. Apple and age may actually have a direct relationship (if an apple is kept outside for 10 days it may rot and lose its nutritional value), and mango too, so those two will have similar, larger values there. And with respect to food, yes, apple belongs to the food items, so a good value like 0.91, and mango 0.92. (I'm just putting in plausible values; once we train these models we get the real ones. Here I'm giving you the crux of how each vector may look.) So what we have done is: each and every vocabulary word is represented based on this feature representation. And the feature representation need not be 300 dimensions; it can be 100, or any other number of dimensions.
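A minimal sketch of that toy feature representation; every number below is invented purely for illustration (a real Word2Vec model learns these values, and its dimensions carry no readable names).

```python
import numpy as np

# Hypothetical features: [gender, royal, age, food]
vectors = {
    "boy":   np.array([-1.00, 0.01, 0.03, 0.01]),
    "girl":  np.array([ 1.00, 0.02, 0.02, 0.01]),
    "king":  np.array([-0.92, 0.95, 0.75, 0.02]),
    "queen": np.array([ 0.93, 0.96, 0.68, 0.01]),
    "apple": np.array([ 0.01,-0.02, 0.85, 0.91]),
    "mango": np.array([ 0.02, 0.01, 0.80, 0.92]),
}

# king - boy + girl should land near queen in this toy space
result = vectors["king"] - vectors["boy"] + vectors["girl"]
print(result)             # compare against vectors["queen"]
```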
Again, with a model the size of Google's, we won't know exactly what the features are; I'm just giving you the intuition that, based on some relationship between each word and a set of latent features, you get these specific vectors. Now, there are a lot of advantages to this. The famous calculation, which is also written in Google's research paper, is: king - man + woman, and if you do this calculation, the vector you get out is closest to queen. Equivalently, in terms of our toy vocabulary: king - boy + girl lands near queen, because what you subtract is the "male" component and what you add back is the "female" component, while the royal component stays put. Here I'm just doing vector calculation, and this is the kind of relation we're able to get purely from the vectors provided by Word2Vec, which is an amazing thing. I'm not going to grind through the arithmetic here, since I've randomly stuffed in some values, but in the real Word2Vec use case, the Google model trained on 3 billion words, you'll see that this kind of calculation genuinely works, and I'll show you practically as we go ahead.

Let me give you another example where, instead of using 300 dimensions, I represent every word by just two. Say king is (0.95, 0.90), man is (0.90, 0.20), woman is (0.92, 0.25), and queen is (0.97, 0.95); I'm just giving some values, but meaningful ones, so you get the idea. Then if we do the calculation king - man + woman, we get (0.97, 0.95), which is exactly where queen sits.
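You can check the famous analogy against Google's pretrained vectors using gensim; a minimal sketch, assuming gensim is installed (the first call downloads the roughly 1.5 GB model mentioned above).

```python
import gensim.downloader as api

# Downloads Google's 300-dimensional news vectors on first use
wv = api.load("word2vec-google-news-300")

# king - man + woman: the closest remaining word should be 'queen'
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

print(wv.similarity("happy", "excited"))  # high score -> similar words
```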
It's super important to understand. Cosine similarity is simply the cosine of the angle between two vectors. Say the angle between these two vectors is 45 degrees: then the similarity is cos 45, which is 1/sqrt(2), approximately 0.7071 (I've done the calculation; if it's wrong, let me know). The distance between the two vectors is then 1 minus 0.7071, about 0.29. With a distance of roughly 0.29, I'd say these two words are fairly similar. Now take two different vectors, one here and one here, with an angle of 90 degrees between them: the distance is 1 minus cos(theta), cos 90 is 0, and 1 minus 0 is 1, so I can definitely say those two vectors are completely different, because the distance between them is 1. If the distance is near zero, they are almost the same vector; at 0.29, somewhat similar. And if I have one more vector pointing in exactly the same direction, the angle between them is 0, cos 0 is 1, and the distance is 1 minus 1, which is 0; there's no angle at all, which means the two words are effectively the same. This is super important because recommendation also works in exactly this way. Say I have a movie like Avengers over here: where do you think Iron Man will land? Right next to it, based on its feature representation, whether it's a comic movie, an action movie, or a comedy; those are the features, and the movie names are the vocabulary. So I hope you're getting an idea of how Word2Vec works: at the end of the day we are creating a feature representation of every word, and from that we can measure similarity. That was Word2Vec at a high level. What you need to understand next is how this feature representation, these vectors, actually get created; here I simply wrote values by hand, like boy-to-gender is minus one and girl-to-gender is plus one, because I said they're opposites. You'll see how a simple deep-learning neural network trains this entire Word2Vec model, which is what I'm going to discuss in the next video, and then I'll also show you the practical implementation. So I hope you've understood this very good, amazing model and architecture developed by Google.
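Since we'll use this distance again and again, here is a minimal NumPy sketch of the formula above; the 2-D vectors are made-up placeholders like the hand-drawn ones, so only the geometry matters, not the exact numbers.

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    # the distance used in the text: 1 - cosine similarity
    return 1.0 - cosine_similarity(a, b)

# made-up 2-D word vectors, like the hand-drawn plot
king = np.array([0.95, 0.96])
queen = np.array([0.93, 0.95])       # points almost the same way as king
unrelated = np.array([0.96, -0.95])  # roughly 90 degrees away from king

print(cosine_distance(king, queen))      # close to 0 -> very similar words
print(cosine_distance(king, unrelated))  # close to 1 -> very different words

The same two functions cover all three cases from the sketch: 0 degrees gives distance 0, 45 degrees gives about 0.29, and 90 degrees gives 1.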
We'll pick all of that up in the upcoming video.

Hello guys, we're going to continue the discussion on Word2Vec. We already know Word2Vec comes in two types: CBOW, continuous bag of words, which we saw previously, and Skip-gram. In this video we're going to understand how the Word2Vec model is actually created: what the deep-learning model looks like, what the inputs and outputs are, and how the model is trained. One important thing is that you really need prerequisite knowledge of ANNs, loss functions, and optimizers; if you don't have that, I'd suggest you first make sure you cover it before going through this. Also, with Word2Vec we have pre-trained models. Google, for instance, has a pre-trained Word2Vec model trained on the huge Google News corpus, and we can also train a model from scratch. The reason I'm covering this is that you really need to understand how that feature representation gets created. So let me take a simple corpus, and let's start with CBOW, continuous bag of words, and see how this type of Word2Vec model is built. To solve any problem I'll definitely have a dataset, so let's say this is my corpus; and remember, models like Word2Vec are trained on huge datasets, billions of words in the case of the Google pre-trained model. For our purposes I'll take a single simple line as the corpus, a statement or paragraph: "ineuron company is related to data science". In a real use case you'd have a much bigger corpus with millions of words, but this one-liner is enough to show how a CBOW Word2Vec model is created and trained with deep learning. The first step is that whenever we have a corpus, we need to work out what our input data and output data are, because Word2Vec is trained like a supervised machine learning problem. So first we select a window size; I'll explain shortly why it's so important. Let's say I select a window size of five. This window size determines how your input and output data get created: it tells you how many words to select at a time. So I select the first five words, and from those five words I take the center word. Now watch how I turn these five windowed words into input and output data; say this column is my input data and this column is my output data.
The central element I've taken here is "is". In the input I'll have "ineuron" and "company" on the left-hand side, and "related" and "to" on the right-hand side. Why am I taking both the forward and the backward words? Because if "is" is the central word, and it's going to be my output word, then the model should learn which words appear in its forward context and which in its backward context; that's what gives it some idea of the context of the sentence. That's the first step: taking the window of five, the initial five words get divided into input and output. The next step is to slide the window forward by one and take the next five words. So what I had is sample one; for sample two the center word is "related", the input is "company", "is", "to", "data", and the output is "related". Similarly, I push the window one step more, and for the third sample the central word is "to": the input is "is", "related", "data", "science" and the output is "to". Now you may be thinking: Krish, do we have to take the window size as five? No, you can take any value, and I'll come back to why the window size plays an important role; but don't take an even number. Take an odd number, so that the central element you use as the output has the same number of words in its forward context as in its backward context. So this became my input and output, and what I'm going to do now is train my model with it, very simple; the pairs are built exactly as sketched below.
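Here is a small plain-Python sketch of that pair construction; the helper name make_cbow_pairs is my own, not from any library, but it reproduces exactly the three (context, target) pairs built above.

def make_cbow_pairs(tokens, window_size=5):
    # slide a window of window_size over the tokens: the center word becomes
    # the target and the surrounding words become the context
    half = window_size // 2
    pairs = []
    for i in range(half, len(tokens) - half):
        context = tokens[i - half:i] + tokens[i + 1:i + half + 1]
        pairs.append((context, tokens[i]))
    return pairs

corpus = "ineuron company is related to data science".split()
for context, target in make_cbow_pairs(corpus, window_size=5):
    print(context, "->", target)
# ['ineuron', 'company', 'related', 'to'] -> is
# ['company', 'is', 'to', 'data'] -> related
# ['is', 'related', 'data', 'science'] -> to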
But how will the training actually happen? One very important thing to understand: these inputs and outputs, ineuron, company, related, to, are text, and I cannot send text directly into a neural network; I first need to convert it into vectors. So first, count the number of words in the vocabulary: ineuron, company, is, related, to, data, science. That's seven words. Now I use the one-hot encoding technique, and this is super important. Consider the first sample: I have ineuron, company, related, and to. For each of these words, the one-hot encoding puts a 1 in that word's position and zeros everywhere else. So for ineuron it's a 1 followed by six zeros; for company the second position becomes 1 and the rest are zeros; related is the fourth word, so it becomes 0 0 0 1 0 0 0; and "to" is the fifth word, so it's 0 0 0 0 1 0 0. This should be pretty clear so far: it means that if I really want to pass ineuron, I give this vector of seven dimensions, 1 0 0 0 0 0 0; if I'm sending company as the next word, its vector goes in instead; and similarly every word gets converted into its own seven-dimensional vector using one-hot encoding.
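A minimal sketch of that one-hot step, assuming the seven-word vocabulary above:

vocab = ["ineuron", "company", "is", "related", "to", "data", "science"]

def one_hot(word, vocab):
    # a vector of zeros with a single 1 at the word's index in the vocabulary
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("ineuron", vocab))  # [1, 0, 0, 0, 0, 0, 0]
print(one_hot("related", vocab))  # [0, 0, 0, 1, 0, 0, 0]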
Now let's go to the next step, which is super important: what does CBOW, continuous bag of words, look like as a model? It is a fully connected neural network, and now you'll understand how these models are created. First, work out how many inputs I'll be giving. Since the window size is fixed at five, every training sample has exactly four context words as input, so my input size is fixed. When I pass the first sample, ineuron is represented by its seven-dimensional one-hot vector, company by its vector, and so on. So in the fully connected network, the input layer looks like this: for the first word I draw seven input nodes, one per vocabulary position, 1 2 3 4 5 6 7, and similarly for the second, third, and fourth words, four groups of seven input nodes in total. Don't read the circles as values; they're just input nodes, exactly the way we draw a simple ANN. I'm basically designing what the neural network looks like when we train a Word2Vec model. So when I pass the first sample, ineuron goes in over here as 1 0 0 0 0 0 0, and the second word, company, is represented by a different vector, 0 1 0 0 0 0 0, again of length seven; the other words follow the same pattern. That's the input layer: every input word is given by a vector of seven dimensions, because every word is represented against the vocabulary size using one-hot encoding. Now the middle layer, the hidden layer: this is super important, so pause the video and guess what its size will be. Our window size is five, and in this construction I'm making the hidden layer that same size, so I'll have five hidden nodes, 1 2 3 4 5. And with respect to the output: I have just one word in the output, and that word is again represented by a seven-dimensional one-hot vector, so the output layer has seven nodes, 1 2 3 4 5 6 7. That's how the fully connected neural network will look. Each node is connected to every node in the next layer, just as in an ANN, and all these connections carry initialized weights that we need to train; the hidden layer HL1 then connects to the output layer in the same way. Now, one very important thing: we train this with a loss function, doing forward and backward propagation. Let's pass this particular sample, ineuron, company, related, to, through the network (let me zoom out a bit). Once I pass all of this, I'll get some values at the output, and I already know what the real output should be.
The real output is "is", the third word, so the true y is represented in vector format as 0 0 1 0 0 0 0. But while training the model with the current weights, the predicted y-hat will be different values, something like 0.25, 0.33, and so on. So we calculate the loss, and based on that loss we do backward propagation, and we repeat this until the difference between y and y-hat is minimal. That process is continuous and very simple. But now you need to understand one very important thing. Since I made the middle layer match the window size of five, every word here gets converted into a vector of size five; and when I said the Google Word2Vec model gives 300 dimensions, it's because the corresponding layer there has 300 nodes, so every word comes out as a 300-dimensional vector. (Strictly speaking, this embedding dimension is its own hyperparameter and doesn't have to equal the window size; in this drawing I've simply set them both to five.) And broadly, the bigger this dimension, and the more data you train on, the better the model can perform. Now look at the weight matrices: from the input side, each word's seven input nodes connect to the five hidden nodes, so for each of the four input words there is a 7 x 5 weight matrix, and from the hidden layer to the output there is a 5 x 7 matrix. What does that give us in the end? When the loss has been reduced, look at the connections for the first word of the vocabulary, ineuron: the five weights joining its input position to the hidden layer form a five-dimensional vector, something like 0.92, 0.94, 0.25, 0.36, 0.45 depending on the training, based on some feature representation, and that is how ineuron will be represented.
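To tie the pieces together, here is a tiny NumPy sketch of the CBOW forward pass. This is my own toy code, not the actual Word2Vec implementation, and it uses the standard CBOW simplification of averaging the context-word embeddings into one hidden vector instead of drawing four separate input blocks.

import numpy as np

vocab = ["ineuron", "company", "is", "related", "to", "data", "science"]
V, N = len(vocab), 5   # vocabulary size 7, hidden/embedding size 5

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))    # 7 x 5 input weights: rows become the word vectors
W_out = rng.normal(size=(N, V))   # 5 x 7 hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_words):
    # average the input-weight rows of the context words, then project to the vocab
    idx = [vocab.index(w) for w in context_words]
    h = W_in[idx].mean(axis=0)   # hidden layer: 5 values
    return softmax(h @ W_out)    # y-hat: one probability per vocabulary word

y_hat = cbow_forward(["ineuron", "company", "related", "to"])
print(vocab[int(np.argmax(y_hat))])  # after training, this should be "is"

# once trained, a word's five-dimensional vector is just its row of W_in
print(W_in[vocab.index("ineuron")])

Training would repeatedly compare y-hat against the one-hot y for "is", compute the loss, and backpropagate into W_in and W_out, exactly as described above.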
The second word in the vocabulary, company, gets connected in exactly the same way: the five weights attached to its input position form the vector for company, a different vector, but again of dimension five, because our hidden layer size is five. This training by forward and backward propagation continues until the loss is minimal; only then do we take the vectors, and those vectors are what get used as the feature representation for each and every word.

So guys, we're going to continue the discussion on natural language processing, and in this video we'll discuss the second architecture, Skip-gram. I've already shown you how CBOW, continuous bag of words, works and how its neural network gets trained. So what is the difference between CBOW and Skip-gram, between the two architectures? It's very simple; just focus. I'm going to take the same dataset, "ineuron company is related to data science", with the same input and output construction and the same window size of five. If I'm using Skip-gram, everything stays the same except one thing: the inputs and outputs swap. Before, the context words were the input and the center word was the output; now the center word becomes the input and the context words become the output. All the steps are the same; we've only changed the direction (see the sketch after this paragraph). So when we build the neural network, the input is now a single center word, say "is", "related", or "to", represented as a seven-dimensional one-hot vector, because the vocabulary size is 1 2 3 4 5 6 7, seven. In the middle I again have the hidden layer of five nodes, just nodes, matching the size five from before. And in the output you can see I now have four words, one word, two words, three, four, each again represented over seven dimensions, so the output side mirrors what the input side was in CBOW. The input layer's weights again form a 7 x 5 matrix, initialized randomly, which we then need to train.
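A matching sketch of the Skip-gram pair construction, the same window as before, just flipped; the helper name is again my own.

def make_skipgram_pairs(tokens, window_size=5):
    # the center word becomes the input and each surrounding word
    # becomes a separate (input, output) training example
    half = window_size // 2
    pairs = []
    for i in range(half, len(tokens) - half):
        for j in range(i - half, i + half + 1):
            if j != i:
                pairs.append((tokens[i], tokens[j]))
    return pairs

corpus = "ineuron company is related to data science".split()
print(make_skipgram_pairs(corpus)[:4])
# [('is', 'ineuron'), ('is', 'company'), ('is', 'related'), ('is', 'to')]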
From the hidden layer to the output there are 5 x 7 weight matrices, one 5 x 7 block for each of the four output words. When I give the input "is", the third word in the vocabulary, the vector 0 0 1 0 0 0 0 goes in; the 7 x 5 input weight matrix is initialized randomly, and forward propagation happens just as you know from an ANN: between the input and the hidden layer, the inputs are multiplied by the weights, a bias is added, and an activation function is applied on top. In the output layer we apply a softmax function, so that we can compare y and y-hat, where y-hat is the prediction and y is the real data. In the first case y is "ineuron", so y is 1 0 0 0 0 0 0, one 1 and six zeros, and y-hat will be whatever values the softmax computes. Then we calculate our loss function and make sure we keep doing forward and backward propagation until the loss value decreases, and finally each connected word comes out represented by its five-dimensional vector once the loss is fully minimized. It's the same process, so you should definitely know how an ANN and its optimizers actually work; that was a brief idea of Skip-gram. Now the basic question: when should we apply CBOW and when Skip-gram? The simple answer, according to the research: whenever you have a small dataset, a small corpus, you can basically go with CBOW, continuous bag of words; if you have a huge dataset, you should definitely go with Skip-gram. That's been demonstrated in many research papers, so I'm just giving you the observation directly, so you can use it. Now, suppose you want to improve CBOW or Skip-gram; how can you do it? One thing is to increase your training dataset: the more the training data, the better the accuracy. The second thing is that you can also increase the window size, and with it the dimensionality of the vectors, which is super important. So to improve CBOW or Skip-gram: increase the training data, and increase the window size and vector dimension.
Instead of five I could make it 100; as you increase the window size and the vector dimension and keep experimenting, you'll see that we generally get better performance. Now, in the next example we'll use a pre-trained model from Google, the Google Word2Vec. It is trained on the Google News corpus, about 100 billion words according to the model's description, covering 3 million words and phrases, and it gives a feature representation of 300 dimensions, 300-dimensional vectors. That means if I have a word like "cricket" (and cricket is always in the news; this corpus comes from Google News, and Google is a very big company, so that amount of data is easy for them), the model converts it into a 300-dimensional vector. I'll show you all these examples when we use it: in the upcoming session we'll use the pre-trained model, and we'll also make sure we train on a new dataset from scratch with Word2Vec, and we're going to do that with the gensim library. So yes, that's what we'll do in the next video, and I hope you've understood both architectures, CBOW and Skip-gram.
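If you want to try both architectures right away, gensim exposes them through a single class. Here is a minimal sketch on the toy corpus; the hyperparameter values are purely illustrative, and note that in gensim the embedding dimension (vector_size) and the context window (window) are separate parameters.

from gensim.models import Word2Vec

# toy corpus: in practice you would use a far larger one
sentences = [["ineuron", "company", "is", "related", "to", "data", "science"]]

# sg=0 selects CBOW, sg=1 selects Skip-gram
cbow = Word2Vec(sentences, vector_size=5, window=2, sg=0, min_count=1)
skipgram = Word2Vec(sentences, vector_size=5, window=2, sg=1, min_count=1)

print(cbow.wv["ineuron"])            # the learned 5-dimensional vector
print(skipgram.wv.most_similar("data"))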
Hello guys, we're going to continue the discussion on NLP, and in this video we'll discuss average Word2Vec, a super important topic, because with Word2Vec alone you won't be able to solve a classification problem; you really need to perform something called average Word2Vec. Let's take a simple example. For now we'll discuss the theoretical intuition; as we go ahead we'll cover the practical part and how to implement average Word2Vec as well. Here my text data is "the food is good" and similar lines, with outputs 1, 0, and 1, and each line is a document. As you know, with Word2Vec we take every word and convert it into a vector. Let's say I'm using the Google pre-trained Word2Vec model. Then for the sentence "the food is good", the word "the" gets converted into some vector of 300 dimensions, as I've already shown you for this model's feature representation; the next word, "food", also gets converted into 300 dimensions; and the same happens for "is" and "good": every word present here becomes its own 300-dimensional vector. Up to here I think you're completely clear, because we've already discussed all of this in our previous sessions. But now notice one thing: for this whole sentence, call it sentence one or document one, what I actually want is one single 300-dimensional vector, so that I can take it as my input, with the label as my output, and train my model. Instead, because we applied Word2Vec word by word, every word has its own separate 300-dimensional vector: 300 for "the", 300 for "food", 300 for "is", 300 for "good". That's a problem. So in order to solve it, we take all these vectors and find their average, and we write that averaged vector down for the entire sentence; that one vector is then considered for the whole document. The same goes for the second document, we go ahead and calculate the average and write it down, and likewise for every document: all the word vectors get averaged, and that is why we say average Word2Vec. And why does this work? Because at the end of the day we just require one vector for the entire sentence; averaging the per-word vectors gives one vector whose length is obviously still 300, and against that vector I have my output label, so I can pass it through my model and train. So if I have a sentence like this, after applying average Word2Vec with the pre-trained model I'm going to get a single 300-dimensional vector, the average of all those word vectors, together with my output variable, and the same happens for the second and third documents. Average Word2Vec simply says: whatever vectors the words get converted into, average them out, line by line, for every document, because we really want only one vector per sentence.
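A minimal sketch of that averaging step, assuming wv is an already-loaded gensim model such as word2vec-google-news-300:

import numpy as np

def average_word2vec(tokens, wv, dim=300):
    # average the vectors of the words that exist in the model's vocabulary
    vecs = [wv[w] for w in tokens if w in wv]
    if not vecs:
        return np.zeros(dim)      # no known words: fall back to a zero vector
    return np.mean(vecs, axis=0)  # one dim-sized vector for the whole sentence

# sentence_vec = average_word2vec("the food is good".split(), wv)
# sentence_vec.shape  -> (300,)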
That single set of vectors for the entire sentence is exactly what any text classification needs, so this is the way we solve it. Now, in the upcoming tutorials I'm going to use a library called gensim; there is also a separate library for GloVe embeddings, but if you properly understand gensim you'll be able to do everything. First we'll work with the pre-trained Google Word2Vec, and then, in the second instance, we'll train a Word2Vec model from scratch: we'll take a dataset and train our own model, again with the help of the gensim library. So I hope you've got an idea of average Word2Vec and why we do it: for the entire sentence we want one vector, and since we're combining all the words, the semantic information is also maintained. I will see you all in the next video with the practical implementation.

Hello guys, in this video we're going to see the Word2Vec practical implementation. I really want to show you the Google pre-trained model and give you an idea of how Word2Vec creates a vector. For this tutorial I'm using a library called gensim, so let's install it: you just write pip install gensim, and here you can see the requirement is already satisfied. Then I import gensim, and from gensim.models I import Word2Vec and KeyedVectors; I'll talk about why these two are required. Now, one very important thing: to show you the practical implementation of Word2Vec I'm taking a Google pre-trained model here; in the upcoming video I'll show you a different model that can be trained from scratch. So I'm taking word2vec-google-news-300: pre-trained vectors trained on the Google News dataset of about 100 billion words. The model contains 300-dimensional vectors for 3 million words and phrases, the phrases obtained using a simple data-driven approach, and the research paper and all the details are given on the model page. This same model is what we're going to use, to see how easily it creates vectors for whatever words we give it. In gensim there is a downloader API for this, which we'll use to fetch the model.
You just write import gensim.downloader as api and then api.load with the model name, which is word2vec-google-news-300. Once you do this, whenever you pass any word to the resulting wv variable, which is an instance of the loaded model, it will give you that word's vector. I'm not going to execute this line of code on camera, because I've already done it: the model is about 1,662.8 MB, so the download takes some time; I downloaded it in advance so I could record the video directly, and you can just go ahead and download it yourself. Now let's see what the "king" vector looks like: this is how the word gets converted into a vector, and if you count the dimensions you'll see there are 300, so vector_king.shape comes out as (300,). With this model you can use any kind of word and you'll get a vector. Let me give you some examples: using the same wv Word2Vec object, give it any word of your choice, say "cricket", and the vector is automatically generated, again of shape (300,). This wv word-vector object also has some useful functions. One is most_similar: if I ask which words are most similar to "cricket" in this corpus, it first converts the word into its vector and then checks which other words have similar vectors. Executing it, the most similar words to "cricket" come out as "cricketing" with about 0.83 similarity, "cricketers" with 0.81, "Test cricket" at about 0.80, Twenty20 cricket around 0.80, then "cricketer" and so on, all showing strong similarity in this Google News space. Similarly, if I ask for the words most similar to "happy", I get glad, pleased, ecstatic, overjoyed, thrilled, satisfied, proud, delighted, and these really are similar words to "happy", each shown with its similarity score; you already know how that distance is calculated, through the cosine-similarity concept. I can also provide two words and ask how similar they are.
If I execute wv.similarity for "hockey" and "sports", they come out around 0.53, roughly 53 percent similar. Now, this is very interesting: I'm going to take the vector of king, subtract the vector of man, and add the vector of woman, and see what kind of output vector I get. In short I'm computing king minus man plus woman, and obviously the answer should be queen, but I really want to prove it to you through the vectors themselves, whether we get queen or not. So I execute this and I get my vector, the entire 300-dimensional result. Then I use wv.most_similar and pass this whole vector in. When I execute it, king comes first, as it obviously should, being the most similar vector to the input, but right after it I get queen, then monarch, princess, crown prince, prince, kings, sultan, and so on. So the vector we get after all that subtraction and addition matches most closely, after king itself, the vector for queen, and this is what we're able to get with the help of Word2Vec. Do try this, guys; it's an amazing model altogether, and word2vec-google-news-300 can help you solve many of your problems. This was just a brief idea of what a pre-trained Word2Vec model looks like; you can also take your own text and train from scratch, but that's a different process altogether. Note that I executed this entire thing in Google Colab because the model size is quite large. In the upcoming videos I'm also going to show you a Word2Vec workflow using gensim where you train the model from scratch. So yes, that was it for this particular video; I will see you all in the next one. Thank you.
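For reference, here is the whole session from this video collected into one runnable snippet; the printed similarity values are approximate, as read off the run shown on screen.

import gensim.downloader as api

# downloads roughly 1.6 GB the first time, so run it once and reuse
wv = api.load("word2vec-google-news-300")

print(wv["king"].shape)                   # (300,)
print(wv.most_similar("cricket")[:3])     # cricketing, cricketers, Test_cricket, ...
print(wv.most_similar("happy")[:3])       # glad, pleased, ecstatic, ...
print(wv.similarity("hockey", "sports"))  # roughly 0.53

# the famous analogy: king - man + woman is closest to queen
vec = wv["king"] - wv["man"] + wv["woman"]
print(wv.most_similar([vec])[:2])         # king itself first, then queen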
Info
Channel: Krish Naik
Views: 88,809
Keywords: yt:cc=on, complete nlp machine learning tutorials, complete deep learning tutorials, word2vec tutorials, avgword2vec, skipgram tutorials, natural language processing tutorials
Id: ENLEjGozrio
Length: 233min 10sec (13990 seconds)
Published: Mon Nov 27 2023