Natural Language Processing in Artificial Intelligence using Python - Full Course

Captions
Natural language processing is the branch of computer science, and more specifically of artificial intelligence (AI), concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. In this tutorial you will start from the absolute basics of NLP and proceed step by step, learning how to build a recommendation engine and a chatbot using the widely used programming language Python. If this interests you, fasten your seatbelt, because you are going to dive deep into the world of artificial intelligence.

If you haven't subscribed to our channel yet, please hit the subscribe button and turn on the notification bell so that you don't miss any new videos from Great Learning. If you enjoy this video, show us some love and like it; knowledge increases by sharing, so share it with your friends and colleagues, and comment with any query or suggestion and I will respond to your comments.

So let's have a look at what natural language processing actually is. NLP stands for natural language processing, which is basically used to make a machine understand and interpret human language. In short, it is the automatic way to manipulate natural language, like speech and text, by software, for further analysis and to extract the required information from it.

Now let's look at some examples of NLP, the fields where we currently use it. We use NLP for predictive text: in Gmail, when you are writing a mail, it automatically predicts what your next words will be, and that is done with natural language processing. Next, email spam filters: how does Gmail decide whether a mail is spam or an important message? Again, by using NLP. Then there is data analysis and language translation. Suppose you don't know a particular language, say Bengali or Hindi, and you only understand English; you can use Google Lens to see what is written in that language and have it converted to English. How is that done? Again with NLP, and we will see that this is a really interesting part of it; you can do really cool stuff with NLP. Last but not least, smart assistants: Siri, Google Assistant, Amazon Alexa, and many more. How do they understand what we are saying to them? By using NLP. This is probably the most widely used example of natural language processing, and in this course you will learn a lot, including how you could build your own smart assistant.

Now let's look at the roadmap to learn natural language processing. First you need to do data preprocessing, and from there you move on to the modeling techniques.
Under data preprocessing there are a few steps: you have to do tokenization, then stop word removal, then stemming and lemmatization. We will see what tokenization is, how to remove stop words, and how stemming and lemmatization work; these are the basic data preprocessing steps and we will do all of them in this course. After that come the modeling techniques: bag of words, TF-IDF, word embedding, and sentiment analysis. Sentiment analysis is one of the best ways to understand your customers' sentiment, and we will do a project on it as well. So these are the data preprocessing steps and the modeling techniques we will work on.

Now, to use and implement these techniques you need a language, and we will use Python; let's talk about why. What is Python? Python is a popular high-level, object-oriented, interpreted language. What do I mean by that? Suppose you have a good command of English but have never written a program: you don't know the syntax, you don't know how to write your first program. Even then I can suggest you start with Python, because Python has very little syntax to memorize; you mainly have to maintain the indentation (we will talk about what indentation is later). For other languages you really need to know a lot of syntax. And what is syntax? Think of a human language: you need grammar. You won't understand a sentence if the words are placed without any order. If I say "hey, I am going somewhere", that follows grammar; if I suddenly say "or going we are somewhere", the words are all there, but without proper grammar can you understand what the sentence means? No. To understand a sentence you need a good understanding of grammar, and for a programming language that grammar is its syntax. Your computer will understand a command only if you write the syntax properly; with Python, you mostly just need to maintain proper indentation to make the interpreter understand what you have written. So Python is very easy to start with.

Now let's look at the benefits of Python; I will give you a short introduction to what you get if you start learning it. Suppose you want to make your own web page.
Python is a great option for that, because it has lots of web frameworks to work with. Second, the length of the program: when we think about a programmer, the first thing that comes to mind is someone working with thousands of lines of code, which sounds difficult. Programmers do work with lots of code, but Python really makes your life easy. A quick example: to write a "Hello, World" program in Java or C/C++ you have to write at least three or four lines of code, whereas in Python you need just one line, so you can see how easy it is to work with. Next, Python is simple and beginner friendly: as I already mentioned, even if you have no programming background, Python is for you. Next, mathematical computation can be done easily: Python has lots of libraries (we will look into them), and because the mathematical functions are already implemented in those libraries, you just call a function and your work is done. This is one reason Python is among the most used languages these days. Last but not least, the graphical user interface: if you are a gamer who wants to make your own game, or you want to build your own GUI, Python again has lots of options. So now you understand the benefits of Python, and I will show you how its many libraries make your life easy; that is why we chose Python over any other language.

Now let's look at the important libraries we have in Python. The first is NumPy, which stands for Numerical Python: if you have lots of numerical problems and want a library that makes your life easy, that is NumPy. Then comes pandas: as we said, we need to do a lot of data preprocessing, and you are not going to write every bit of that yourself; pandas has lots of functions for this and is known as one of the most used tools for data preprocessing. If you want to look at your data and see what it looks like, Matplotlib and Seaborn are the best choices for visualization. Then come NLTK and spaCy; we are going to use these two libraries to actually implement our natural language processing techniques. So these are the most important libraries we have in Python.

In this part we are going to talk about data preprocessing and the steps you have to follow to do it. First, the basic thing you have to know before starting: what is data? Data is a bunch of raw information on which operations will be performed; you take that raw information and build a proper report out of it. Now let's move forward and see how you can start working with the different types of documents in Python.
So, how can you read and manipulate those documents using Python? We work with different types of documents, and most of the time we use PDF files, text files, or CSV files; CSV stands for comma-separated values. Now we will see how to read a PDF file, a text file, and a CSV file using Python, so let's get started.

Here we are going to use Google Colab, so let me give you a short intro to it. Google Colab is a platform by Google for AI researchers: if you are someone enthusiastic about artificial intelligence, this platform is for you. You don't need to install anything; you just open Google Colab, click on "New notebook", and you can get started. There are a few features Google provides here. Suppose you don't have a powerful computer but you want to work with natural language processing; I will show you later that when you work with a large dataset it can take a long time, and a large dataset needs good computational power. If your PC or laptop doesn't have it, you can still get your work done with Google Colab: if you go to Runtime and change the runtime type, you can choose GPU or TPU. TPU stands for Tensor Processing Unit and GPU for Graphics Processing Unit. You can use these two runtime types in Colab, although for now we are not going to use either, because we are not working with a huge dataset yet. You can see the overall layout of Colab: these are the code cells, and if you want to add text, notes, anything you like, you can click and add a text cell. You can also share your code with anyone by clicking on Share. So Google Colab makes our life easy; that is the basic intro, and that is how you share.

Now, coming back to the notebook we have made for natural language processing: first we are going to read a PDF file using Python, then a text file, then a comma-separated values (CSV) file. You can see we already have a few files here. If you want to import files into Google Colab, there is an option to upload to session storage. One thing to keep in mind is that whatever you upload to Colab will be removed after 12 hours, so don't forget to save your work. If you click the upload button, it redirects to your local system and you can upload anything from there. You can see we have a reddit_data.csv file, a demofile.txt, and a PDF file about Python. To read PDF files we are going to install PyPDF2, which is a Python library for exactly that.
So after installing it, what do you need to do? You have to import it, so we write import PyPDF2. Let me do the rest: in a variable called data, I have called the open() method and passed it the path of my file; this creates the PDF file object. Next, you can see we assign another variable called reader and write PyPDF2.PdfFileReader(data): we are creating a PDF reader object that reads the file we have stored in the data variable. So this is our PDF file object and that is our PDF reader object. Now, to print the number of pages in the PDF, you use the numPages attribute; you can see I have 52 pages in my PDF, the library has read it correctly, and I know my PDF really does have that many pages. So you can load a PDF file using Python and start working with it very easily, the way we have done here.

Now, going further, let's create a page object: if I want to get individual pages, I call the getPage() function. Then, with page.extractText(), I get the text of the first page, which is "Python installation and data structures". This is how you can read each and every page in your PDF file. We used index zero for the first page; now I want to go to page 2, so let's have a look. If I do the same thing... okay, I'm getting an error saying it's a closed file. Let me open it once again and then get page 2: you can see "Introduction to Python" is on my page 2.
Now, if I go to page 1, let's see what I have... okay, I think the problem was that I had closed the file, which is why I got that error. So one thing to keep in mind: whatever file you are working with, do not close it until you really don't need it anymore. And why do we close files at all? Because we don't want to waste memory. You can see that we had the agenda and these other items in our PDF: the pages start from zero, so the first page is the agenda, and then we have "Introduction to Python" and "Installing Anaconda". When you close the data you are closing the file: you are saying we don't need this file anymore, so it should no longer take up memory. So this is how we load and read a PDF file using Python.

Now, coming down, let's see how to read a text file. Here again we use the open() function and give it the path of our demofile.txt. If you look at the file I made, it contains "hello, welcome to Great Learning". Let's read it and see what we get: with text_data.read() you can directly read the text file, and you get "hello, welcome to Great Learning", exactly what the file contains. Reading a text file with Python is really easy.

Now let's work with CSV. Of the libraries we discussed, you don't need NumPy for this, you need pandas, because to read a CSV file we use the read_csv() function, which comes from pandas. That's why we have written pd.read_csv() followed by the path. Let me show you how to get the path: just click on the file you have uploaded, click the three dots, and there is an option called "Copy path"; you can easily copy the path of your data and paste it here. If I execute this and then write data.head() (we will talk later about what head() and these other functions do), for now understand that we have a DataFrame which takes all the data from the CSV file and stores it in the variable called data, and .head() returns the first five rows. You can see it loaded easily. So this is how we read a CSV file; let's see what comes next.
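If you want to put those three steps together, here is a minimal sketch, assuming the PyPDF2 and pandas libraries; it uses the older PdfFileReader API shown in the video (newer PyPDF2 releases rename it to PdfReader), and "python.pdf" is a placeholder for whatever PDF you uploaded.

import pandas as pd
import PyPDF2

# --- PDF file ("python.pdf" is a placeholder for the PDF uploaded in the demo) ---
pdf_file = open("python.pdf", "rb")          # open in binary mode
reader = PyPDF2.PdfFileReader(pdf_file)      # older PyPDF2 API; newer releases use PdfReader
print(reader.numPages)                       # total number of pages (52 in the demo PDF)
page = reader.getPage(0)                     # pages are indexed from zero
print(page.extractText())                    # text of the first page
pdf_file.close()                             # close only once you are done with the file

# --- Text file ---
with open("demofile.txt") as text_file:
    print(text_file.read())                  # prints the whole file contents

# --- CSV file ---
data = pd.read_csv("reddit_data.csv")        # comma separated values
print(data.head())                           # first five rows of the DataFrame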
Now we will see what data preprocessing is. Data preprocessing is a way to develop informative data from raw data by removing noise and unwanted attributes; what exactly I mean by removing noise and unwanted attributes, I will show you when we do the demo. Let's look at the types of data preprocessing we have. First, remove or fill null values: I will show you how to remove null values if you have them in your dataset. After that, count the unique values in a column: if you have null values you can simply remove them, but if you have repeated values you need to look at how many values are actually unique, that is, not repeated. After that, drop the irrelevant columns. Suppose you are working with a dataset where you need to predict the price, and there is a column for your customers' hobbies; that column is not related to the sale or the price of your product, so you can easily drop it. We do this preprocessing to make your data ready for whatever modeling technique we are going to work with.

So let's move on and look at the first type of data preprocessing, removing or filling null values, which is known as imputation: we remove or fill the null values in the data to get appropriate, informative data. In the example shown here you have the raw data and the converted data: the symptoms column has a lot of NaN values, and in the converted data you can see how they have been handled; we will see in the demo how to remove them.

So let's start the demo. Again we are going to use the same dataset as before, reddit_data.csv. If you want to check which columns you have in your data, you can use the .columns attribute, which lists the columns of your dataset: here you have clean_comment and category. If you want to check the shape of your data, you use .shape, which returns how many rows and how many columns the dataset has: here we have a total of 37,249 rows and two columns. If you call data.describe(), it gives you the total count of your data and a few statistical properties, like the mean, standard deviation, and interquartile range.

Now, the null values: how can you check them? You use .isnull(). The isnull() function checks whether you have any null values, and if you add .sum() it gives you the total per column. Why the sum? Because isnull() alone returns a boolean for each and every row, and I don't want to look through all of that; I just want the total number of null values in the dataset. If I execute that, you can see that the clean_comment column has 100 null values. If you don't want the summation and just want to check whether any column has a null value at all, you can use data.isnull().any(): it returns a boolean, True for any column that has a null value. You can see clean_comment really does have null values, so it shows True, while category does not, so it shows False. Now, how do we actually solve the problem? We take a new variable called data_part and write data_part = data.dropna().
What does dropna() do? It basically drops all the rows that contain null values. If you then write data_part, you will see all the data, and as you can see, there are no null values anymore. If you want to recheck, write data_part.isnull().sum(): it tells you whether any null values remain after using dropna(), and you can see we get zero for clean_comment and zero for category. So one option is to drop the null values; the other is to replace them, for example with a mean value or anything else you want. This is how we check whether our data has null values, and we have now removed the null value problem.

Going back to the presentation and moving further: we have covered removing null values (or imputing data) and checking the count of unique values. Next, removing unnecessary columns: as I said, if we are trying to predict sales, there is no point in keeping a column with your customers' hobbies, so we can remove that unnecessary column. This is also part of data preprocessing. These are the basic preprocessing steps you need to look at whenever you work with any technique, whether it's machine learning, data science, artificial intelligence, or NLP: you need to look at these very basic aspects of your data.
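Collected as one short pandas sketch (the file and column names follow the reddit_data.csv demo; the "hobbies" column is only an illustration of an irrelevant column, not something in this dataset):

import pandas as pd

data = pd.read_csv("reddit_data.csv")

print(data.columns)              # which columns do we have? (clean_comment, category)
print(data.shape)                # (number of rows, number of columns)
print(data.describe())           # count, mean, standard deviation, quartiles and so on
print(data.isnull().sum())       # how many null values per column
print(data.isnull().any())       # True/False per column

data_part = data.dropna()        # option 1: drop every row that contains a null value
print(data_part.isnull().sum())  # re-check: should be zero for every column now

# option 2: fill the nulls instead of dropping them, e.g. with an empty string or a mean
# data["clean_comment"] = data["clean_comment"].fillna("")

# dropping an irrelevant column (a hypothetical "hobbies" column, just for illustration)
# data = data.drop(columns=["hobbies"])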
In this part we are going to talk about what tokenization is. We know it is part of data preprocessing, so let's look at its actual meaning and how to implement it in Python. Tokenization is a method used to split a phrase, a sentence, a paragraph, or an entire text document into smaller units; by doing that we get the individual words or terms, and each of these smaller units is called a token. To analyze human language and understand it properly, we have to do tokenization; it is the very basic step after the basic data preprocessing, like removing null values and irrelevant columns. As the name suggests, you tokenize your whole dataset into individual tokens.

So we have seen what tokenization is; now, why do we need it? It is the most basic part of natural language processing: it helps to interpret the meaning of the text by analyzing the words present in it, and it lets you count the number of words in the text. For example, take the sentence "this is a cake": after tokenization you get the tokens "this", "is", "a", "cake", so each word becomes a token and the total word count is 4.

Now let's see how to implement tokenization in Python; here also we are going to use Google Colab, and there are two types of tokenization I am going to show you. We import the library known as NLTK, the library used for natural language processing, so we write import nltk, and our data will be a string: "welcome to great learning". Let me add one more sentence: "I am very happy to be part of great learning". So now you have one whole string, and if you look at it you have two lines in total. First, our tokens will be line (sentence) tokens, not word tokens: if you want to tokenize according to the lines in your document, you can absolutely do that. You write tokens = nltk.sent_tokenize(data): sent_tokenize is the function from NLTK, you pass it your data variable, and then you display the tokens. After you execute this, you get "welcome to great learning" as the first token and "I am very happy to be part of great learning" as the second token. That is how line (sentence) tokenization works.

Now, to see how word tokenization works, you just call nltk.word_tokenize(). If you execute that, you get "welcome", "to", "great", "learning", "I", "am", and so on: every word in the string becomes one token. Next, if you want to check the token count and you call tokens.count(), you get an error saying count() takes exactly one argument. Before counting, check what type the tokens are: let me write one more line to check the type, since we got that error. The type is list, and to get the total count of a list you call the len() function, not count(). If you write len(tokens), you get the total number of word tokens in your document. I hope it is now quite clear what tokenization is, why we need it, how it is done, and how you can implement it in Python.
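A minimal sketch of the sentence and word tokenization demo with NLTK (the punkt download is needed the first time; the sample string here has punctuation added so sent_tokenize can split it):

import nltk
nltk.download("punkt")                   # tokenizer models, needed once per environment

data = "Welcome to Great Learning. I am very happy to be part of Great Learning."

sent_tokens = nltk.sent_tokenize(data)   # line (sentence) tokenization
print(sent_tokens)                       # two tokens, one per sentence

word_tokens = nltk.word_tokenize(data)   # word tokenization
print(word_tokens)                       # every word becomes one token
print(type(word_tokens))                 # <class 'list'>
print(len(word_tokens))                  # total number of tokens, via len(), not count()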
In this part we are going to talk about what stemming is. Stemming is the way to reduce a word to its word stem by stripping suffixes and prefixes. In simple terms, the algorithm works by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. So we know what stemming stands for, but how does it work? Suppose you have the word "cats": stemming cuts the "s" from "cats" and gives you "cat". Suppose you have "birds": again it cuts the "s" and you get "bird", the basic form of your word.

But why do we need stemming? Because it reduces the input dimension: when you cut words down to their basic form, your input dimension is smaller, machine learning techniques work better, and the training data becomes denser. It also reduces the size of the dictionary and helps to normalize the data in the document. We will see in the actual project how stemming helps us; when you work with large datasets you always try to cut down to fewer dimensions so you don't need as much computational power, and the technique still works fine with less data. That's why we use stemming.

Now let's see how to implement stemming in Python; let's go to Google Colab. To do stemming we import NLTK, and from nltk.stem we import PorterStemmer. We create the Porter stemmer, naming the variable port_stem for example, by calling PorterStemmer(). After that we have a word list: "like", "liking", and "likes", and I want to see how stemming handles it. We iterate through the list with a for loop, so i goes through each word, and we print i and port_stem.stem(i). If I execute this, for "like" I get "like", for "liking" I also get "like", and for "likes" it is again "like". So you can see how stemming works: whatever suffix you add, "-ing", "-s", and so on, it gives you back the base part of your word. That is what stemming stands for; we have covered what it is, why we need it, and how to implement it.
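The PorterStemmer demo, roughly as shown in the video (the extra words "cats" and "birds" repeat the slide example):

from nltk.stem import PorterStemmer

port_stem = PorterStemmer()
words = ["like", "liking", "likes", "cats", "birds"]

for w in words:
    print(w, "->", port_stem.stem(w))   # like, like, like, cat, bird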
Now let's look at what lemmatization is in NLP. We know it is also part of data preprocessing, but what does it do? Lemmatization performs a morphological analysis of the words; it is important to have detailed dictionaries the algorithm can refer to in order to link the form back to its lemma. But what is a lemma? Let's look at an example. Take "helps": this is the third person singular present tense, so its lemma will be "help". Now take "helping": it is the "-ing" form of the word, and the base word will again be "help". So you can understand that lemmatization and stemming are quite similar; we will see the difference between them as well, and also how to implement lemmatization.

Before implementing it, let's look at the difference between stemming and lemmatization. The goal of both is to reduce inflectional forms. Stemming refers to a crude heuristic process that chops off the ends of words to achieve that goal, whereas lemmatization reduces inflectional forms by doing things properly, with the help of a vocabulary and a morphological analysis of the words. So the key point for lemmatization is that it actually uses morphological analysis, which stemming doesn't bother with. As for implementation, stemmers are typically easier to implement and run faster, whereas lemmatization is harder to implement and obviously takes more time, since it works with morphological analysis.

So let's see how to implement lemmatization in Python; again we come back to Google Colab. We have already imported NLTK; now we run nltk.download('wordnet'), which downloads the WordNet package, and you can see WordNet has been downloaded. Then we import the WordNet lemmatizer from NLTK: we make a variable, say lemma_t, and call WordNetLemmatizer(). Now we give it the words "socks" and "songs" and see how it works: you write lemma_t.lemmatize("socks"), where lemma_t is the WordNetLemmatizer object you created, and then lemma_t.lemmatize("songs"). As a result you get "sock" and "song": it chops off the last character, the "s", and gives you the base words. So you can understand what lemmatization is: stemming is the basic way to reduce a word, whereas lemmatization does the same thing but using morphological analysis. We have now talked about what stemming is, what lemmatization is, the difference between them, and how to implement them in Python.
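And a minimal sketch of the WordNet lemmatizer demo (the pos="v" line is an extra illustration, not from the video, showing how the verb "helping" maps back to "help"):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")                        # the dictionary the lemmatizer refers to, needed once

lemma_t = WordNetLemmatizer()
print(lemma_t.lemmatize("socks"))               # sock
print(lemma_t.lemmatize("songs"))               # song
print(lemma_t.lemmatize("helping", pos="v"))    # help (pos="v" marks the word as a verb)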
Now we are going to talk about stop words: what they are, why we need to remove them, why that counts as data preprocessing, and how to find out which stop words you have in your document. Common words that occur in sentences but add little weight to them are known as stop words; they act as a bridge and ensure that sentences are grammatically correct. In simple terms, words that are filtered out before processing natural language data are called stop words, and removing them is a common preprocessing method. We will see how to do it in Python using the NLTK library. So stop words are nothing but common words that occur in sentences but are not that important when we do any modeling technique, which is why we can easily remove them.

Let's see the demo of removing stop words from your documents with NLTK. We are back in Google Colab. You basically just need to import stopwords from nltk.corpus, then write import nltk and nltk.download('stopwords'): this downloads the stop word lists that ship with NLTK, and we will look at the whole list. Now we have a variable called data containing: "data science is one of the most trending fields to work with; it needs the data to give predictions by using past scenarios". Then we write stop_words = stopwords.words('english'): we want stop words in English, so we pass 'english' in the brackets. Then we print the stop words to see what the library has. If you drop the 'english' argument, you get stop words from other languages that I cannot read, because the library ships stop word lists for many languages; slicing the list, say from index 620 to 680, just shows a piece of that combined list. I particularly want the English one, and you can see "your", "yours", "yourself", "he", "him", "his", and so on all come under stop words.

Now, to see which stop words appear in your data, you first have to tokenize it. We have talked about tokenization: it turns every word into a separate token. Let me do the tokenization... okay, I'm getting an error because I need to download punkt; these are a few things to keep in mind when working with NLTK. Now you can see it has tokenized the data: "data" is one token, "science" is one token, "is", "one", "of", "the", "most", everything becomes a token. Now, to check which tokens are stop words, you write: for each word in the data, if the word is in the stop word list, print it. The stop words found are "is", "of", "the", "most", "to", "with", "by": these words are not important for the modeling we are going to do with natural language processing, so you can easily remove them. So this is how you remove stop words: you can check the full list and easily see which stop words you have in your document. If you don't remove them, they add extra weight to your data and can make your predictions worse, so we prefer to remove stop words from our documents. I hope you understood what stop words are and how to remove them in Python.
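A sketch of the stop word removal demo; the list comprehension at the end is one simple way to filter the tokens (the video prints the matches in a for loop):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
nltk.download("punkt")

data = ("Data science is one of the most trending fields to work with. "
        "It needs the data to give predictions by using past scenarios.")

stop_words = stopwords.words("english")    # list of English stop words
print(stop_words[:10])                     # 'i', 'me', 'my', 'myself', 'we', ...

tokens = nltk.word_tokenize(data)
print([w for w in tokens if w.lower() in stop_words])        # stop words found in the text
print([w for w in tokens if w.lower() not in stop_words])    # the text with stop words removed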
Now we are going to talk about the modeling techniques in NLP: bag of words, word2vec, TF-IDF, and so on. Let's have a quick look at what these modeling techniques are about and the main idea behind them.

Let's start with bag of words. The bag of words model is used to preprocess text or documents: it converts the documents into a "bag" of words that keeps a count of the total occurrences of the most frequently used words. Bag of words is one of the most used methods to transform tokens into a set of features: after tokenization you have tokens but no features yet; with bag of words (in practice built with the count vectorization method) you keep a count of the total occurrences of the most frequent words, which helps with data modeling and prediction. We will implement bag of words in our project, also using Python.

Then, what is TF-IDF? TF-IDF stands for term frequency–inverse document frequency. It helps to compute a score used in information retrieval (IR for short) and summarization, and it is used to reflect how relevant a term is in a given document; checking how relevant a term is helps us with data preprocessing and data modeling. The procedure to calculate TF-IDF is to multiply two metrics: how many times a word appears in a document, and the inverse document frequency of the word across the set of documents. So there are two metrics in TF-IDF, and we will implement it in Python when we do the project. To summarize: bag of words keeps a count of the total occurrences of words in the whole document and turns your tokens into features, and TF-IDF works much the same way, with the added concept of the inverse document frequency of the word across a set of documents.

But why do we need TF-IDF? TF-IDF helps to establish how important a particular word is in the context of the document corpus. It takes into account the number of times the word appears in the document, offset by the number of documents in the corpus it appears in. TF is the frequency of the term divided by the total number of terms in the document, and IDF is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient; TF-IDF is the product of these two values, TF and IDF. I hope you understood what TF-IDF is and why we need it; we will do the implementation in our project.
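A minimal sketch of both modeling techniques with scikit-learn's CountVectorizer and TfidfVectorizer, on a tiny made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["this movie is great",
          "this movie is terrible",
          "great acting and a great story"]

# Bag of words: a count of each vocabulary word per document
cv = CountVectorizer()
bow = cv.fit_transform(corpus)
print(cv.get_feature_names_out())   # the vocabulary, i.e. the features (get_feature_names() on older scikit-learn)
print(bow.toarray())                # word counts per document

# TF-IDF: term frequency weighted by inverse document frequency
tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(corpus)
print(scores.toarray())             # a higher score means the term is more relevant to that document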
Then we will see what word embedding is; the main types of word embedding we have are word2vec, GloVe, and fastText. So what is word embedding? Word embedding vectors are one of the most common ways to encode words as vectors of numbers; those vectors can be fed into machine learning models for inference, and they also help to establish the distance between two tokens. If you want to measure the distance between two tokens, you can use word embeddings. Again, the types of word embedding are word2vec, GloVe, and fastText.

In this part we are going to talk about what machine learning is. Machine learning is a subset of artificial intelligence, as we all know it comes under AI, that allows a system to automatically learn from experience without being explicitly programmed. What does that mean? Let me explain with a diagram, in layman's terms. Suppose you have a dataset and some algorithm to build a model. You use the training dataset and the algorithm to train your model for whatever prediction you want to make. Then you have a test dataset to check how your model performs on new data. That, in layman's terms, is what machine learning is; there are lots of algorithms, and different types of machine learning as well.

Now let's look at the life cycle of building a machine learning model. First, you have to understand the business problem. For example, suppose a food delivery app wants to make more sales through the app and keep its customers happy. You could use all their data to cluster the customers, and after clustering you could assign a particular coupon to each group: some customers love Chinese food, some love Thai, some just love Indian food, so you create a few coupons and send each to the right customers. Your customers will be happy to get a coupon, they will order more from your app, and you easily increase sales. That is one small example; the basic point is that you first have to understand your business problem. After that comes data acquisition, which is basically collecting your data: you have to collect data to do the analysis, make predictions, and build the model. Then you have to do data cleaning. Why is my data not clean? Because in the past data used to be stored in a very structured way, but these days we generate a lot of data that is unstructured or semi-structured, so to clean it you follow the data cleaning process, the data preprocessing we talked about earlier. After you clean your data, you do exploratory data analysis, using a few statistical tools to understand your data. After that you can easily choose your machine learning algorithm, because you now have a good idea about your data and you know
which algorithm will work best for it. After choosing the algorithm, you build your model using the data and the algorithm. Once the model is built, you check its predictions: is it going to work well on new data it did not see during training? If the model gives good predictions there as well, you can say it is ready to work with lots of new data and will give good predictions on that too. That is the whole life cycle of machine learning.

Now let's look at the types of machine learning we have: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning, as the name says, means there is a supervisor for your machine, and your labeled data acts as that supervisor. If you have labeled data, it comes under supervised learning, and you can use a supervised learning algorithm on your dataset. Suppose instead you have data without labels, you don't know what label each point has, and you need to group similar customers or similar data points into clusters; then you go for unsupervised learning. Last but not least, I hope you all have an idea about Tesla's self-driving cars: they use reinforcement learning. In reinforcement learning, if your model performs well it gets a reward, and if it does not perform well it gets a penalty. That is pretty much it about the types of machine learning. So we have talked about what machine learning is, its types, and the way to build a machine learning model.

One more thing we will talk about is logistic regression, because we are going to use logistic regression concepts in our project. Logistic regression is a supervised learning classification algorithm. It is used to predict the probability of a target variable, and the nature of the target (dependent) variable is discrete: your target variable will be zero or one, true or false. So there will be only two classes present in the output; the dependent variable is binary in nature, so it can be one or zero, yes or no, true or false. Basically your label will have two classes. Logistic regression is based on the sigmoid function, which is 1 / (1 + e^(-x)); it uses this function to squash its output into a probability, and that is all about logistic regression. After this you can easily do the project we are going to build on sentiment analysis.
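Since the project below uses logistic regression as its classifier, here is a minimal, generic scikit-learn sketch; the feature matrix X and labels y are placeholders standing in for the vectorized reviews and sentiment labels built later in the project:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# placeholder features and binary labels (1 = positive, 0 = negative);
# in the project these would come from the vectorized review text
X = np.array([[3, 0], [2, 1], [0, 3], [1, 2], [4, 0], [0, 4], [3, 1], [1, 3]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)           # internally squashes its output with the sigmoid 1 / (1 + e^(-z))
print(model.predict(X_test))          # predicted class, 0 or 1
print(model.predict_proba(X_test))    # probability of each class
print(model.score(X_test, y_test))    # accuracy on the held-out data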
Now we are going to talk about sentiment analysis. From the name itself you can tell we are going to talk about sentiment, the emotion of a particular person. Sentiment analysis is a technique commonly used to understand the positive, negative, or neutral sentiment about a particular topic. The sentiment of a text is represented as a value between minus one and one, referred to as the polarity. So if you have some new movie review data and you want to understand whether a particular review is positive or negative, you use the sentiment analysis process. We will do a demo: classifying movie reviews using natural language processing, all the modeling techniques and data preprocessing, and also machine learning techniques. So let's go to the project and see how to do that.

We are now going to work on a project on sentiment analysis and see how to apply everything we have learned so far. Let's go to Google Colab, look at how to get started, and see what dataset we are going to use. First we will go through the dataset description, and then I will show you step by step how to do sentiment analysis with natural language processing, including all the data preprocessing techniques we have covered. So let's get started: we are going to analyze movie reviews, and I will show you where to get the data and what it is all about.

Here is the dataset description. We are going to use the IMDB dataset, which has 50,000 movie reviews for natural language processing or text analytics. It is meant for binary sentiment classification and contains substantially more data than previous benchmark datasets: a set of 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict the number of positive and negative reviews using classification; you can also use deep learning algorithms if you want, but here we are going to use the modeling techniques and data preprocessing we have learned in this natural language processing course.

First we need to import a few libraries. For now we are going to use NumPy and pandas: NumPy stands for numerical Python, and pandas is a data manipulation tool; we have talked about both before. So we write import numpy as np and import pandas as pd. Then we are going to use the file IMDB Dataset.csv. Where do you get it? It is an open-source dataset available on Kaggle; let me show you how to search Kaggle for it. You can see the dataset "IMDB Dataset of 50K Movie Reviews" on Kaggle, and if you scroll down there is a download option. It is a fairly large dataset, so downloading it may take a little while. Further down you can see the license and usability information, and under IMDB Dataset.csv you get the details of everything inside the dataset. So this is how you can download the IMDB dataset from Kaggle.
Now let's look at what we are going to do with this dataset. You can see we take a variable called data and write pd.read_csv(); we talked earlier about how to read different types of files with Python and pandas: to read a CSV (comma-separated values) file you use the read_csv() function, so we write pd.read_csv() and give it the path. Let me show you again how to upload the dataset in Google Colab: there is an upload option, which redirects to your local system, and from there you can upload your dataset into Colab. Then, to get the path, click on the three dots next to the file and choose "Copy path", and paste it into read_csv(); it is something like /content/IMDB Dataset.csv.

Now, to check whether your dataset has loaded, we use data.head(): it returns the first five rows of the dataset. I write data.head() and it shows me the first five rows, indices 0 to 4, with the review and sentiment columns. If you want to check the last five rows, you write data.tail(): as the name suggests, it returns the last five rows, and you can see the last row index is 49,999. Next, the columns: here we can see only two, review and sentiment, but other datasets will have many more, and to check them you just write data.columns, which lists every column in your dataset; executing it shows review and sentiment. Then I want to check the shape of the data. Why do we check the shape? It returns how many rows and how many columns you have: data.shape shows 50,000 rows and 2 columns. Then, if you want a description of your data, you use data.describe(); we will talk more about it later, but basically it gives you things like the total count of your data and the number of unique values. If you write data.describe(), you can see the count is 50,000 in total
The describe output also shows 49,582 unique reviews, which means a few reviews are repeated. The next step is always to check whether there are any null values in the dataset. If there are, you have to remove them or replace them, because training on data full of nulls will not give you a good prediction; you do not need to understand outliers yet, but keep that rule about nulls in mind. To check, write data.isnull().any(), which returns a boolean per column: review is False and sentiment is False, so there are no null values. If you prefer a count instead of booleans, write data.isnull().sum(): the review column has zero nulls and so does the sentiment column. Next I want to check the total value counts for positive and negative, but before starting the analysis, let me show you what the data actually looks like. I have opened this IMDB dataset in Excel, and you can see the review and sentiment columns. Let me read out one review (this is a long one): roughly, "if you like the original gut-wrenching laughter you will like this movie; whether you are young or old, people love this movie, hell, even my mom liked it." The sentiment for that row is positive, and you can tell just by reading it. Now a negative one: roughly, "no one expects these movies to be high art, but fans still expect a movie as good as some of the best episodes; unfortunately this one had a muddled, implausible plot that just left me cringing, by far the worst of the nine." By reading just the first couple of lines you can tell it is not a good review, so the sentiment for it will be negative. That is exactly what we want the model to do: given a review, predict whether it is positive or negative. We will use natural language processing for that, plus one machine learning concept that we will come to shortly. So let's go back to Colab and start the data analysis. I want to see how many positive and how many negative reviews there are in total, so we call value_counts() on the sentiment column.
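As a quick sketch, the null checks and the class-balance check look like this, continuing with the same data frame:

```python
# check for missing values
print(data.isnull().any())   # boolean flag per column
print(data.isnull().sum())   # count of nulls per column (0 and 0 here)

# class balance of the target column
print(data['sentiment'].value_counts())
```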
The result: 25,000 negative and 25,000 positive, so out of the 50,000 reviews, fifty percent are negative comments and fifty percent are positive. We have already looked at data.shape, so now we move on to text normalization, which covers the tokenization, stemming and lemmatization we did before. For that we import a few more libraries: seaborn and matplotlib for data visualization, then nltk, then CountVectorizer for the bag-of-words step, then TfidfVectorizer, then the stopword list, plus what we need for stemming and lemmatization; we also import spaCy. These are the basic things you need to start the project. As a quick reminder of the earlier demo, to use the stop words you have to run nltk.download('stopwords'). Now the tokenization: we create a tokenizer object, and for the stop words we call nltk.corpus.stopwords.words('english'), because we want the English stop words. Next I want to show you how to clean the text: removing stop words, and also removing brackets and HTML tags if the reviews contain them. So first we write a noise-removal function. We use BeautifulSoup; it is normally used for web scraping, but it is also handy for stripping HTML tags out of your data. As we saw, some reviews contain HTML markup and special characters, and I do not want those in my text, so in the preprocessing we clean them out. We write soup = BeautifulSoup(text, 'html.parser'), where text is the review string, then soup.get_text() to drop the HTML, then re.sub with a pattern for the special characters we do not want, and finally the function returns the cleaned text. Then we use a lambda and apply to run it over the whole column: data['review'] = data['review'].apply(noise removal), so the function is applied to every review.
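A sketch of that noise-removal step, assuming a bracketed-text pattern for the "special characters" part; the exact regex in the original notebook may differ.

```python
import re
from bs4 import BeautifulSoup

def remove_noise(text):
    """Strip HTML tags and anything inside square brackets from a review."""
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    text = re.sub(r'\[[^]]*\]', '', text)   # drop [bracketed] fragments
    return text

# apply to every review; this takes a while on 50,000 rows
data['review'] = data['review'].apply(remove_noise)
```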
Applying that function removes the HTML markup and the special characters from every review. If you then run data.head() it takes a little while, because the dataset is large, but once it finishes you get rows with no HTML and no special characters in them; the data is partially clean. Next comes stemming. Again we write a function: it takes the text, creates nltk's PorterStemmer, and returns ps.stem(word) for every word in text.split(), joined back together. Apply this function to the data and the prefixes and suffixes get cut off every word. Then we declare one more function, this time for removing stop words. It takes a text parameter (you can call it data if you like) and a flag is_lower_case=False. Inside, we call tokenizer.tokenize(text), because as we saw we need to turn the document into tokens, and then we strip each token. If is_lower_case is set, we keep only the tokens that are not in the stopword list; otherwise we check token.lower() against the stopword list, so stop words are caught regardless of capitalisation. In other words, we tokenize the data and keep only the tokens that are not part of nltk's stopword list, then return the filtered text joined back together. We apply that function as well and run data.head() again; it takes a few seconds because the dataset is big, and then you can see the cleaned result.
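Here is a compact sketch of the stemming and stop-word-removal functions described above; the specific tokenizer class is an assumption, and any word tokenizer would work the same way.

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize.toktok import ToktokTokenizer

nltk.download('stopwords')
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

def simple_stemmer(text):
    """Stem every word in the text, cutting prefixes and suffixes."""
    ps = PorterStemmer()
    return ' '.join(ps.stem(word) for word in text.split())

def remove_stopwords(text, is_lower_case=False):
    """Drop tokens that appear in nltk's English stopword list."""
    tokens = [token.strip() for token in tokenizer.tokenize(text)]
    if is_lower_case:
        filtered = [t for t in tokens if t not in stopword_list]
    else:
        filtered = [t for t in tokens if t.lower() not in stopword_list]
    return ' '.join(filtered)

data['review'] = data['review'].apply(simple_stemmer)
data['review'] = data['review'].apply(remove_stopwords)
```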
Now we need to split the dataset, because we are going to use a particular machine learning technique on it. How do we do the splitting into a training part and a testing part? We decide that 30,000 reviews go to training and the remaining 20,000 to testing, so train_reviews is the first slice of data.review and test_reviews is the rest. Once the review split is done, we apply the bag-of-words technique we talked about. We create cv = CountVectorizer(), the class that implements bag of words, then cv_train = cv.fit_transform(train_reviews), so the training reviews get vectorized, and cv_test = cv.transform(test_reviews) for the testing reviews. Then we print the shapes of cv_train and cv_test to confirm which part is for training and which for testing. Note that at this point we have only split the reviews, not the sentiment labels; we will split those when we do the label encoding. Next is TF-IDF, which we have also discussed: we call TfidfVectorizer(), store the object as tv, then tv_train = tv.fit_transform(train_reviews) and tv_test = tv.transform(test_reviews), so TF-IDF is applied to the training reviews as well as the testing reviews. Checking the shapes again, training has 30,000 rows and testing has 20,000, and I will show you shortly why we need this split and which machine learning model we are going to use. Finally, label encoding: the sentiment column holds negative and positive, and I want them as 0 and 1, so we call LabelBinarizer() and then sentiment_data = lb.fit_transform(data['sentiment']); we want the label encoding on the sentiment column, not on the reviews.
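Put together, the split, the two vectorizers and the label encoding might look like this; the 30,000/20,000 positional split follows the walkthrough above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

# positional split: first 30,000 reviews for training, last 20,000 for testing
train_reviews, test_reviews = data.review[:30000], data.review[30000:]

# bag of words
cv = CountVectorizer()
cv_train = cv.fit_transform(train_reviews)
cv_test = cv.transform(test_reviews)

# TF-IDF
tv = TfidfVectorizer()
tv_train = tv.fit_transform(train_reviews)
tv_test = tv.transform(test_reviews)

# label encoding: negative/positive -> 0/1
lb = LabelBinarizer()
sentiment_data = lb.fit_transform(data['sentiment'])
train_sentiments = sentiment_data[:30000]
test_sentiments = sentiment_data[30000:]
```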
After that you can check the shape of sentiment_data, and if you print it you can see what the label encoding produced: ones and zeroes, with every negative review mapped to 0 and every positive review mapped to 1. That is the label encoding done. Then I split the sentiment labels the same way as the reviews: 30,000 for training and 20,000 for testing. Now we are going to use logistic regression. I hope you already know what logistic regression is; you need some idea of machine learning first, and we covered logistic regression earlier in this course, because that is the model we are going to use for this analysis. So we write lr = LogisticRegression(). One of the nice things about Python is that almost everything is already implemented in a library, and scikit-learn provides a LogisticRegression class with the whole algorithm built in. For the bag-of-words features we fit the model: lr_bow = lr.fit(cv_train, train_sentiments), where cv_train is the vectorized training reviews and train_sentiments is the training labels. Let me cut the TF-IDF part out for now, because we will fit that separately afterwards. Give it some time: first we train on the bag-of-words features with logistic regression, then we will see how TF-IDF performs, and based on that you can pick one or the other. It takes a little while to execute. Once it is done, I want to see how the prediction works for bag of words, so we write bow_predict = lr.predict(cv_test); we pass only the test features, not the sentiment, because we want to compare what the model predicts against the actual labels. Looking at the output I mostly see negative predictions here, so let's also compute the accuracy. For that you call the accuracy_score function, passing the test labels and the bag-of-words predictions from the model.
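A sketch of the model fitting and evaluation, covering both the bag-of-words run described here and the TF-IDF comparison that comes next:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(max_iter=500)

# bag-of-words features
lr_bow = lr.fit(cv_train, train_sentiments.ravel())
bow_predict = lr_bow.predict(cv_test)
print("BoW accuracy:", accuracy_score(test_sentiments, bow_predict))

# TF-IDF features
lr_tfidf = lr.fit(tv_train, train_sentiments.ravel())
tfidf_predict = lr_tfidf.predict(tv_test)
print("TF-IDF accuracy:", accuracy_score(test_sentiments, tfidf_predict))
```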
For bag of words the accuracy comes out around 59%, which is pretty low; it is not working that well. So let's see what happens if I use TF-IDF: we use a variable lr_tfidf, write lr.fit with tf_train and the training labels, and again it takes a little while because we are working with a big dataset. Doing the prediction on the TF-IDF features gives an accuracy of about 0.74, so TF-IDF works much better with this particular data. To recap the pipeline: first we did all the data preprocessing, removing the noisy parts of the reviews, then tokenization, stemming, lemmatization and stop-word removal, then we built bag-of-words and TF-IDF features, and then we used logistic regression, a supervised machine learning technique, on top of them; with this data TF-IDF came out ahead. That is how we have done sentiment analysis using natural language processing. I hope you liked it, and if you have any doubt do let us know; this is pretty much the project, and it uses all the techniques we learned in this course. Now let's come to TextBlob. TextBlob is an open-source Python library used for NLP activities such as lemmatization, stemming, tokenization, noun-phrase extraction, POS tagging, n-grams and sentiment analysis. What are n-grams? We have not discussed them yet, so let's take an example: "This course is about NLP and TextBlob." A unigram tokenizes the sentence into single words, "this", "course", "is" and so on, collected into a list. A bigram changes the window size from one to two: "this course" becomes one item, "course is" the next, then "is about", "about nlp", "nlp and", "and text", "text blob". In the same way you can create trigrams, quadgrams, however many grams you want to extract. With those n-grams you can pull out phrases, for example "this course" or "nlp and" or "text blob", so they help in getting phrases out of a sentence. TextBlob is fast and lightweight, but it does not provide the vectorization and dependency parsing functionality we discussed on the previous slide; those features are not available with TextBlob. Text classification and sentiment analysis are the kinds of activities it is good for. How do you install it? If you have Python, just run pip install textblob; it takes a couple of minutes, hardly five, based on the speed of your internet, and it gets installed on your system.
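Once it is installed, the n-gram idea from the example above looks like this in TextBlob:

```python
from textblob import TextBlob

blob = TextBlob("This course is about NLP and TextBlob")
print(blob.ngrams(n=1))   # unigrams: one word per window
print(blob.ngrams(n=2))   # bigrams: ['This course', 'course is', ...]
print(blob.ngrams(n=3))   # trigrams, and so on for larger n
```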
Then you can import TextBlob and use all of its functionality. For that, let's jump into our Jupyter notebook. For this exercise we are using Jupyter Notebook and Anaconda, so if you do not have them installed I would highly encourage you to set them up; if you have any difficulty, Great Learning has videos on installing Anaconda and running programs in Jupyter. We have already discussed what NLP is, what TextBlob is and what kinds of activities it supports, so without wasting any time let's get started. This is the official TextBlob website; you can go there, read about it, and find plenty of information and examples of how to use it, and I encourage you to go through the documentation. To use TextBlob you need NLTK on your system as well, so: pip install nltk and pip install textblob. Since I already have these installed, I am not going to do anything there. Then write import nltk and nltk.download('popular'); that downloads the popular corpora for natural language processing, which will help with whatever activities you perform with NLTK going forward. You can also download 'averaged_perceptron_tagger', which helps with the POS-tagging task. In this video we will look at language detection, spelling correction, counting words and getting word frequencies, phrase extraction, POS tagging, tokenization, pluralization of words, lemmatization and n-grams, all with TextBlob. First, language detection. What TextBlob does is take your input, hit the Google Translate API, get the translated text back in the response, and show that text to you. If you do not know how APIs work or how to call one, Great Learning has a video on APIs you can watch; for this course you do not need to call any API yourself, but it is worth knowing how TextBlob works under the hood. Suppose we have the text "Hey John, how are you?" First we detect which language the text is in, then we translate it into Spanish; each language has its own extension code. To execute a cell, press Shift+Enter; it takes a moment and then you see "detected language is en" and the input text in Spanish ("Hola...", and I am not good at Spanish, so I will not try to pronounce it). That is how you can translate a sentence from English to Spanish.
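For reference, those two calls look like this; note that newer TextBlob releases have removed the Google-backed detect/translate helpers, so this sketch only applies to the older versions the walkthrough is using.

```python
from textblob import TextBlob

blob = TextBlob("Hey John, how are you?")

# these proxied to the Google Translate API in older TextBlob versions;
# recent releases have dropped them, so expect an error there
print(blob.detect_language())    # 'en'
print(blob.translate(to='es'))   # Spanish translation
```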
One disclaimer, though: when you run language detection or translation with TextBlob you may get a 404 error. Why does this happen? Because Google recently changed the address of its translate API, and that URL has not yet been updated inside TextBlob; I think they will update it in some future release, but for now you can patch it yourself. Go to C:\Users, then your username, then anaconda3, then your environment folder (ours is called rythm), then Lib, then site-packages, then textblob, and open translate.py. In that file, replace the value assigned to the url variable with the updated address shown here, and you will be able to execute your code. Now let's move on to spelling correction. Whenever we write anything on any platform we commit typos, and for a programmer it is tedious to read each and every word, so TextBlob provides a feature to convert a wrong word into the correct one. It uses lexical rules to do so; we are not going to discuss those rules in detail, so I will just execute it. First, from textblob import TextBlob. Then take a text such as "ABCD corp ... values their employees" where several of the words are spelled wrongly, pass it into TextBlob to create a blob object, and print it to see the input. blob.correct() is the function that does the spelling correction, and the output reads "always values their employees"; some words get corrected, but some get misplaced too. Here is a hint: if you use a short text you will usually get the right output, but if you provide a long string the result is not as good. For instance, someone wrote "u are" instead of "your" and it got converted to "or", which is wrong. Sometimes it is correct and sometimes it is wrong, so apply it based on your use case and on how badly the spellings are off. Now let's look at word counts. In any machine learning task you need to understand the variability of your documents or your data, and for text that means knowing word frequencies. Suppose you have 50,000 words and each of them occurs only once: the text vectors become very sparse, with far more zeros than ones, and it becomes very difficult for the machine to understand the text, learn the patterns, and then do the classification, the recommendation, or whatever you want to achieve.
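A small sketch of the correction and word-count calls; the misspelled sentence is just an illustrative input, not the exact one from the notebook.

```python
from textblob import TextBlob

blob = TextBlob("ABCD corp alwayes valuess thier employes")
print(blob.correct())   # rule-based spelling correction; works best on short text

text = TextBlob("sentiment analysis is fun. sentiment scores help. sentiment matters.")
print(text.word_counts['sentiment'])                        # 3 (keys are lowercase)
print(text.words.count('Sentiment', case_sensitive=True))   # 0 with exact casing
```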
So you need to know whether the words in your text occur often enough, which is why counting word frequencies is important. Let's take an example with word counts. If I search for the count of "Sentiment" with a capital S, the count is zero, but the lowercase "sentiment" has a count of three; the text really does contain "sentiment" three times, but the counts are keyed in lowercase, so searching with an uppercase term throws back zero. The same happens with "Analysis" versus "analysis": the frequency is one, but with an uppercase search you get a zero count. So be mindful: whenever you count word frequencies, provide the term in lowercase. Now let's do POS tagging. Suppose the text is "My name is Adam. I like to read about NLP and I work at ABCD Corporation." Asking for the tags gives you a part of speech for every word: a possessive pronoun, a noun, a verb, a VBP, a TO and so on; there are plenty of POS tags in there. Fine, so we can extract the POS of every word, but what will we do with those tags? Suppose you are doing review sentiment analysis and you find that prepositions, or certain verbs, are not adding anything. I am not going to go into whether that process is right or wrong, because whether any process is right or wrong depends on the dataset; until you have the dataset you cannot say. But we can at least check that what we are doing makes sense, so let's look at a hypothetical example: say I want to remove every token tagged VBP from the sentence. How can we do that? Use a for loop: for i in text.tags, if 'VBP' not in i[1], keep it. Walking through it, i is the tuple, i[0] is the word and i[1] is the tag, so we keep the tuple only when its tag is not VBP and build a new list of tuples. Then I join the remaining words back into a sentence, and the output reads roughly "My name is Adam I to read about NLP and I at ABCD Corp"; the VBP verbs are gone. Whether that sentence still makes sense is not the point; the intention here was just to show how to remove a particular POS and how that can help in cleaning your data.
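That tag-filtering loop, sketched out:

```python
from textblob import TextBlob

text = TextBlob("My name is Adam. I like to read about NLP and I work at ABCD Corp.")
print(text.tags)   # list of (word, POS-tag) tuples

# keep only the tokens whose tag is not VBP, then rebuild the sentence
kept = [word for word, tag in text.tags if 'VBP' not in tag]
print(' '.join(kept))
```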
Now, moving on to tokenization. Before that, what is a corpus (or corpora)? A corpus is nothing but a collection of text data; what we call a dataset in machine learning we call a corpus in NLP. A corpus may be in one language or a combination of multiple languages, that is not a concern, but it will contain text data. The next term is token, which we have discussed before: we convert words into tokens, and a token is the smallest unit of a sentence. For example, "This is a book" gives four tokens: "this", "is", "a" and "book". You can tokenize on various separators. Take "abc_123_defg": if you tokenize on the underscore, "abc" is one token, "123" another and "defg" another; or you can keep the normal space and tokenize on that. So how do we do tokenization in TextBlob? Take the text, create a blob object from it, and call blob.words; that gives word-level tokens, say 40 of them. You can also tokenize at the sentence level with blob.sentences; here there are three sentences. So you can tokenize at the level of words, of a particular character, or of sentences. Next, pluralization of words with TextBlob. For that we use the Word function: from textblob import Word, pass the word or string into Word, store the result, and call .pluralize(); "platform" becomes "platforms". But if you pluralize "platforms" it just adds one more s, which is wrong. So let's do something smarter: pluralize a word only if its POS tag is NN. Then in "Great Learning is a great platform to learn data science", "platform" becomes "platforms" and "science" becomes "sciences", "community" becomes "communities", which is right, but "etc" becomes "etcs", which is not. So again, whenever you perform any transformation on text, check whether the result actually makes sense. Now lemmatization: we create a TextBlob object, take its tokens, and call word.lemmatize() and word.stem() on each to compare lemmatization with stemming. Running it, "great" gives "great" for both; for "learning" the lemma is "learning" while the stem is "learn". When we lemmatize a word as a noun (and since "Great Learning" is a name, we treat it as a noun) it stays "learning", but if we say it is a verb it is reduced to its base form, "learn". Similarly "people's" becomes "people".
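The tokenization, pluralization and lemmatization calls above, in one short sketch (the example sentence is illustrative):

```python
from textblob import TextBlob, Word

blob = TextBlob("Great Learning is a great platform to learn data science. "
                "It helps the community. It also offers courses.")
print(blob.words)       # word-level tokens
print(blob.sentences)   # sentence-level tokens

print(Word("platform").pluralize())     # 'platforms'
print(Word("learning").lemmatize())     # treated as a noun -> 'learning'
print(Word("learning").lemmatize("v"))  # treated as a verb -> 'learn'
print(Word("learning").stem())          # stemming -> 'learn'
```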
Finally, n-grams. With the blob ready we call blob.ngrams: with n=1 every token comes back on its own, the unigrams; with n=2 you get bigrams such as "great learning", "learning is", "is a"; with 3 you get trigrams and with 4 quadgrams. So those are the functions we can perform with the help of TextBlob. Now let's move on to our next topic, sentiment analysis with TextBlob, and for that let's jump into our Jupyter notebook. For this exercise we are using the IMDB sentiment-analysis dataset in CSV format; you can get it from kaggle.com by searching for the IMDB dataset (sentiment analysis) in CSV format. We use a couple of packages here: pandas, textblob, nltk, the re package, and spaCy; if you want to know more about spaCy, check the Great Learning tutorials on it. In TextBlob, the way to find the sentiment of a text is TextBlob(text).sentiment, and when you do this you get two things: polarity and subjectivity. Polarity tells you whether the text is positive or negative and ranges from -1 to 1; subjectivity ranges from 0 to 1 and tells you whether the text is an opinion, with values closer to 1 meaning more of an opinion. For "He is a very good boy" the polarity is positive and the subjectivity is high: the sentiment is positive, and it reads like an opinion from people, so the subjectivity score is higher. "He is not a good boy" gives a polarity around -0.35, a negative sentence, with a subjectivity of about 0.6, so it is still an opinion. The next sentence, "Everybody says this man is poor": the polarity is again negative, because they are saying the man is poor, while the subjectivity score is high, since "everybody says" marks it as an opinion from a group. Now let's load the data using pd.read_csv. The file is large, around 40,000 rows, so we sample it down to a smaller subset to keep things fast; after sampling we are working with roughly 10,000 rows.
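The three example sentences above, scored with TextBlob; the exact numbers depend on the TextBlob version you have installed.

```python
from textblob import TextBlob

print(TextBlob("He is a very good boy").sentiment)            # positive polarity, high subjectivity
print(TextBlob("He is not a good boy").sentiment)             # polarity around -0.35
print(TextBlob("Everybody says this man is poor").sentiment)  # negative and opinionated
```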
There are two labels: 0 stands for negative and 1 stands for positive. The first data-preprocessing step is to check whether the text has null values, so we apply isnull().sum() to the train data; there are none here. But be careful: in text data, an entry that contains only a space is not counted as a null value even though it is blank and useless to us, so we use numpy and a regex to replace whitespace-only strings with NaN and then drop them. Next, escape sequences: \t, \n and \r are whitespace characters that get added when text is encoded; we do not need them, so we strip them out. Then we remove the non-ASCII characters, using str.encode('ascii', 'ignore') followed by decoding back to ASCII. Next, punctuation. First let's see what punctuation is: import string and print string.punctuation and you will see the punctuation characters. We write a remove_punctuation function that imports the string package, loops with a for loop, and replaces each punctuation mark with nothing, then we apply that function to the data. The text itself is the same afterwards, but characters like the period in "Mr. Lincoln" are now removed. Then come stop words. These are some of the well-known stop words and we can remove them, but since we are doing sentiment analysis, take an example: "he is not a good boy" versus "he is a good boy". If we remove "not" from the first sentence, its meaning flips from a bad boy to a good boy, so we need to take care of this kind of thing; as I mentioned, which stop words to remove is use-case dependent, and here we take "not" and "no" out of the stop-word list before filtering. We write a pretty straightforward function for that and apply it; it takes a little time, and afterwards the filler stop words are gone from the text. Next we use regex to remove special characters, including HTML fragments, and sometimes we get URLs in the text, which we can also remove with the re package. Then we remove the numbers and integers from the text.
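Here is a condensed sketch of those cleaning steps, plus the TextBlob scoring and the polarity-threshold labelling described just after this; the data-frame name `train`, its `text` column, and the exact stop-word handling are assumptions.

```python
import re
import string
import numpy as np
from textblob import TextBlob
from nltk.corpus import stopwords

# blank or whitespace-only entries are not NaN by default, so convert and drop them
train['text'] = train['text'].replace(r'^\s*$', np.nan, regex=True)
train = train.dropna(subset=['text'])

# strip escape sequences and non-ASCII characters
train['text'] = train['text'].str.replace(r'[\t\n\r]', ' ', regex=True)
train['text'] = train['text'].str.encode('ascii', 'ignore').str.decode('ascii')

# remove punctuation
train['text'] = train['text'].apply(
    lambda t: t.translate(str.maketrans('', '', string.punctuation)))

# drop stop words, but keep negations such as "not" and "no"
stop_words = set(stopwords.words('english')) - {'not', 'no'}
train['text'] = train['text'].apply(
    lambda t: ' '.join(w for w in t.split() if w.lower() not in stop_words))

# score each cleaned text and label it by a polarity cut-off (0.03, as in the walkthrough)
train['polarity'] = train['text'].apply(lambda t: TextBlob(t).sentiment.polarity)
train['sentiment'] = np.where(train['polarity'] >= 0.03, 'positive', 'negative')
```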
Finally, we remove alphanumeric tokens: there might be an ID like sk012397j6 or someone's PAN number in the text, and we do not need those, so we identify the words that contain digits, and wherever letters and numbers appear together in one token we replace it with a space. Execute that and the text is clean. The next step is to apply lemmatization, and once that is done we are finished with all the preprocessing of the data. Now we use TextBlob's sentiment function to get the sentiment of each sentence. We apply it to the text column with a lambda, so each text goes into TextBlob and comes back as a tuple of polarity and subjectivity, as we discussed earlier, which we store in a sentiment column. This takes some time, so be patient while it executes. Once it is done, we have a tuple for every row; we store those values in a list and create a new data frame called df1, with polarity as one column and subjectivity as another, and then concatenate train with df1. We drop the raw sentiment tuples and use the loc function: if the polarity is greater than or equal to 0.03 we store "positive" in a new sentiment column, and if it is less than that we store "negative". We can also map the original label across: where the label is 1 the sentiment label becomes 1, where it is 0 it becomes 0. So now we have the sentiment label, the sentiment, the subjectivity and the polarity, and that is how you can perform sentiment analysis using TextBlob. Now let's switch topics: what exactly is your understanding of U-Net? What is happening in a U-Net? Fine, here is what we are doing. Let us say we have an image, and I want to convert that image into some kind of encoded structure which is smaller in size yet retains all the information in it. Whenever we have that kind of problem, we go with convolutions: I take the larger image, apply convolutions and max pooling, and make it smaller; just imagine the height of each block as the dimension, being reduced step by step. I stack multiple layers of convolution and pooling in such a way that I can say the information stored in the full image has been shrunk down to this much. That process is called encoding.
What is the use of this, and where do we use it? First of all, it can be used for storage: if you want to compress images and store them, you can do this. Second, there are applications where a larger image, or a huge amount of data, cannot be processed directly, so we feed in the encoded version and try to recreate something from it. That was nothing but autoencoders, if you remember: with the digit images we had the original image, and the answer we got back was a rather vague, down-sampled version, yet just by looking at it we could still make out the number, so it was not so bad. Whenever you want dimensionality reduction and so on, that approach is possible. But in today's case that is not the application we are looking for: what we will do now is try to recreate the image back. Let me put up the figure first, so you can see what we are doing: we compress all the way down and then decompress until we get back to the original dimension, so I can say this side is encoding and that side is decoding. Now my question to all of you is: how can we do this, how is it possible? The convolution part you already understood, we saw how that works, but how do you go back up? Let me show you something visually. Say I originally had a triangle. When I encoded it, it became a smaller triangle. My aim now is to bring that triangle back such that it is almost similar to the original one. How to do that? Very simple: you start expanding pixels. Say there are some pixels; you say the nearby pixels will take the same values, so that could be one operation, and on top of that you do one more operation that expands it further, and you keep expanding, going in the reverse direction, until you reach the original shape. Now look at it from the image point of view. Say this is an n by n image; remember, an image contains only numbers, nothing else. After encoding, after going through multiple layers of convolution, suppose we end up with a 2 by 2 encoded block whose pixel values are, say, 4, 6, 2 and 1. How do I recreate it back? I create something like a pad around it: when I add one layer of padding, the new positions need values, so I say the points near the 4 get 4, the points near the 2 get 2, the points near the 6 get 6, the points near the 1 get 1, and I keep growing the values layer by layer. That looks convincing, but are you able to find the problem here? Because I am using the same pixels, I am replicating the same colour everywhere. It looks like an issue, but yes, we have a solution, so keep this concept in mind and we will come back to it.
There are certain times when you do not need to identify everything that is in an image. Look at this street scene: if this image were passed through YOLO, which is your week 4 content coming up, YOLO would give you many object detections, the houses, the sky, all the cars; you have seen this kind of thing on YouTube and LinkedIn, people walking while YOLO or SSD picks them out in the image. You can also see houses, a pole, a building, a lot of things. But sometimes the application does not need that kind of sophistication. Take the parking-slot problem I have explained to you earlier: some parking spots have sensors, but putting a sensor there is costly, and maintenance and hardware add a lot of issues, so what if I remove the sensor, put a simple camera over there, and connect the camera to one of our deep learning networks? The camera captures an image of the slot. Say there is a car parked in it: the only intention here is to detect whether the slot is full or empty, and display that on a board down the line so that nobody comes, wastes time, and has to turn back. Our job is just to identify whether there is some object in the slot or not; what the object is, what colour, what company, what type of car, we are not interested in. For applications where you do not need detection like that, but you do need to separate objects by colour or by some other grouping, segmentation is what you use. This is called semantic segmentation: you give one colour to similar types of objects. You can see that all the cars are almost purple, the sky is one colour, the road is one colour, and the ground is another. Now try to bring the encoding and decoding concept over here. We will make the algorithm in such a way that, while decoding, if an encoded pixel is a particular colour, the neighbouring pixels also get the same colour; in this case our segmentation gives purple, so if this is purple I keep giving purple to the pixels around it until the whole object is covered. You will also see that near the borders the colours come apart a little. So the image can be encoded down, and when you decode it back there is a good chance you get a very similar colour around each simple object. That is the application of encoding. Apart from the parking slot, think about the Google self-driving car you might be working on. What does the car need to see? It has to do two things: first of all it has to drive, and to drive it obviously has to find the road. Now, if this image is given to the car there could be a lot of confusion, there could be another road going off over here, and because of the many objects present in the image the car may not be sure, so what I will do for
that is to solve the problem with an algorithm which says: wherever I find a road, follow that. Google will see an image like this, and whenever there is a road you will find one particular colour, so follow that colour and do not care about any other object around; the only thing we care about is the road. There are algorithms in which we can black everything else out: say whatever is not road is painted red and the road stays black, so imagine a picture where there are only two colours, one red and one black. The self-driving car will then come to know exactly where its road is, and wherever it keeps finding that black colour it will keep following. There is one more application on top of that: if there is something on the road, the car has to stop or manage its speed, and for that it is enough to say "I have detected a person" or "I have detected a car"; it does not have to be as complicated as the driving part. So that is one example of semantic segmentation. Now, there are various types of segmentation: we have instance and we have semantic. Let's start with semantic. If you observe here there are two buses. When the image was shrunk down and we got the image back, the pixels that belong to the same family get the same colour: say this was my first pixel, I draw a bracket around it, find more pixels of the same colour, more and more, and keep increasing, and finally wherever there is a bus you observe green, and wherever there is no bus you see the background colour. So when I look at this I can easily come to know there are two buses. That is semantic segmentation. Then there is something called instance segmentation: if you also want classification, because this is one type of bus and that is another, you can say this one is a double-decker and this one is a single-decker and separate them individually. All we need for this is a very good corpus, that is, your training data, together with the target data: we have to train the network with original images and target images, telling it "wherever you find these kinds of buses, colour them like this". Is anything else possible here apart from these two? Yes: if you train it saying "I just want to detect this one, not the double-decker", then the other bus will also become black and only the one you asked for will be identified. Once you progress to that point, a lot of other applications open up. Now, if you look at the U-Net structure, what are we doing? If this is my original image and this is my convolved, encoded image, what can you say about the encoded one, what is present in it? We have a list of filters available with us, and what are these filters? Just imagine the image broken down into a very simple encoded part, a small combination of convolution filter outputs. So can I say these
are some of the most important features from the image, and the remaining detail is something we do not much care about? Correct. Next, what we do is upscale: we pick up each filter and, as I showed you, expand it; if you observe the size before and the size after, there is definitely an increase. Instead of max pooling we do the reverse: by padding we replicate each pixel's value into its neighbours, so if the pixel's number is 1, meaning it represents some colour, the padded positions around it also get 1. That is the way we expand. Now go back to the bus example. Say this was one of the convolution filters: the colour represented here would simply be replicated, but from the training part we have learned what colour this region should be, so green gets painted onto the complete region where the bus is, and it becomes completely green. Then I go to the next filter and do the same job, and the next, and so on. When I come to a filter where it has been learned that this is not the object we are looking for, that it belongs to the background, whatever is there gets replaced by black. That is how they do it. This is a very idealised picture I have shown you; in today's case study you will come to know it is not this accurate, the boundary will not be an exact clean line, but yes, you will be able to pull out the separate objects. Moving on: so that is U-Net. Again I would say, give some time to this; these are not straightforward things, and if they were, you would not find only a handful of people who can design a U-Net without any references or issues. Also, this is not the only possible architecture: you can mix and match a lot of convolutions to make your own U-Net. But please be careful: sometimes the image that goes in is of one size and the image that comes out is of a different size, so if you are expecting the output to be a perfect replica, you need to design the network in such a way that the input and output dimensions exactly match each other, which is what we did in autoencoders. It is also not always necessary to use only the U-Net blocks; you can use ResNet or something like that, with a lot of repetitions, and finally make sure you get the image of the same size, or in the other case you can use the logic from autoencoders, which is very simple: whatever shape and size goes in is the same shape and size that comes out. Now, let us say a doctor wants to understand which of these cells are cancerous and which are not. I can use these techniques to locate the particular part of the image that is cancerous and turn everything around it into the black zone; there are definitely other cells around it, but we are not interested in where they are.
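To make the shrink-and-expand idea concrete, here is a toy Keras encoder and decoder in the spirit of this discussion. It is only a sketch: a real U-Net also has skip connections between encoder and decoder, which are omitted here, and the three output channels stand in for three hypothetical pixel classes.

```python
from tensorflow.keras import layers, Model, Input

inp = Input(shape=(128, 128, 3))

# encoder: convolution + pooling shrinks the image
x = layers.Conv2D(16, 3, padding='same', activation='relu')(inp)
x = layers.MaxPooling2D()(x)                        # 128 -> 64
x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D()(x)                        # 64 -> 32 (the encoded representation)

# decoder: upsampling replicates pixels, convolution refines them
x = layers.UpSampling2D()(x)                        # 32 -> 64
x = layers.Conv2D(16, 3, padding='same', activation='relu')(x)
x = layers.UpSampling2D()(x)                        # 64 -> 128
out = layers.Conv2D(3, 1, activation='softmax')(x)  # per-pixel class scores

model = Model(inp, out)
model.summary()   # input and output are both 128 x 128, as the discussion requires
```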
In those cases, yes, exact replication is needed; sometimes you don't need it, so whenever you are designing this, take the decision depending on your application. Good. Now, coming to MobileNet: what I will do is first look at the case study and then come to MobileNet. MobileNet is again a different topic; we will do a lot of math today, and I'll show you how we can reduce the number of multiplications compared to VGG or ordinary convolutions.

So, to start: today we have a set of images like this. Let me reload it. The whole intention today is, say this is one image, I want to convert it into some kind of mask, some kind of segmentation. If you observe, there are many objects here: a car, some buildings, some vegetation, pathways, a road, the sky, a lot of things. The image is also not that clear, simply because I have used 128 x 128; if I used 1024 x 1024 it would look better. At the end of the day, my intention is to convert it to something like this, very simple. Why do I need this? Say the forest or environment department wants to know how much vegetation is present along each road; this output clearly gives that overview, for example that there is a good amount of trees here. Or: what is the current traffic level at a particular time of day? I don't want to look at anything else, just the cars and the road. So that is the purpose of today's case study: convert this image into this image, as simple as that. Or, if I go down and show you some of the outputs, I want to convert this particular image into this one.

Now, from one of my other batches I got a very good question: how are these colours coming up? These colours come because this is my original image and this is my mask image, my tagged image, which somebody has already provided; the mapping comes from that. So the purple colour for the road comes from the learned mapping to the mask. They asked whether we can find some kind of relationship between the colours in the original and the colours in the segmentation; yes, I will show you something on that next week or the week after, I'll start coding it up. But overall, in this case study everything depends on the masked, tagged image.

All right, let's start. This is the place where you can download the dataset from, and for the Colab people, you can use the code above. I will start from here. The first thing is to define the shape and size. Now, one question for all of you: say I have defined 128 x 128, someone else has done 256 x 256, and one of you with a GPU has done 1024 x 1024. Will it make some kind of difference to the quality of the output or not?
Yes: the extra spill you see here would disappear at a higher resolution, because when the image has more pixels the differences between colours are resolved more finely. So please remember to choose an optimal image size. For ease, and because my epochs would otherwise take forever, I have chosen 128; when you get this code, try a higher dimension as well, nothing wrong with that.

Next, I am listing all the images present in my training folder, that is, the original images. If you look at it, this is how it looks. Why is this important? Because the person who prepared the data has given the same name to the original image and to the corresponding mask image. If you print the original image names against the mask image names, you will see the names line up, so we know which mask is the target, the reference, for each original. The first thing I do, just to keep a check that everything is good, is sort both lists: this is my independent data, this is my target data. If I sort both of them, common sense says the same names come to the same positions, so they sit in the same order.

Next, we are using OpenCV, a library that can be used for importing and processing images; for display I am simply using matplotlib. The first thing I have to do is to differentiate X and Y. And guys, one more thing: there are various ways in which you can do segmentation; there is not just one way. I am showing you an easy way, not a heavy manual one. The videos you see on your Olympus portal show one possible way, this is another, and in future you may read blogs where it is done very simply in seven or eight lines of code; that is also fine. We take you through various approaches so that you are comfortable with almost everything, but I still prefer the simple version: it is good to understand the fully manual route, inputting the data and concatenating everything yourself, but once you have that understanding, on the industry side write simple, straightforward code.

So the first thing I do here is say: this is my independent data, this is my target data. I create empty NumPy arrays full of zeros. What should their length be? Exactly the number of images we have in the mask and original folders: there is a folder called original and a folder called mask, all the images are stored there, and I want that many matrices. Inside each matrix the rows and columns should be 128 x 128, and do I want colour? Yes, so three channels, and the data type should be float. So now I have two arrays of zeros of this dimension, something like an HDF5 store holding a stack of 128 x 128 images.
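A minimal sketch of that set-up, assuming the two folders are simply called original/ and mask/ (the actual folder names and paths in the notebook may differ):

```python
import os
import numpy as np

# Hypothetical paths; use the folders from the downloaded dataset.
ORIG_DIR = "data/original"
MASK_DIR = "data/mask"

# List and sort both folders so original[i] and mask[i] refer to the same scene.
orig_files = sorted(os.listdir(ORIG_DIR))
mask_files = sorted(os.listdir(MASK_DIR))
assert len(orig_files) == len(mask_files), "every original image needs a mask"

n = len(orig_files)

# Pre-allocate zero arrays: n images, 128 x 128 pixels, 3 colour channels, float.
X = np.zeros((n, 128, 128, 3), dtype=np.float32)   # independent data (originals)
Y = np.zeros((n, 128, 128, 3), dtype=np.float32)   # target data (masks)
```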
Moving on, I now have to loop over the files by index. The first index is 0, and for each index I build the path to where the image is located and read it from that path; I do this inside a for loop that runs once for every image present in the original folder. Inside the loop I use a try block: resize the image with OpenCV to 128 x 128 and normalise it. Then comes the mask: we have read the original image, now we read the corresponding mask and apply the same idea, read it, resize it, normalise it. And if an exception is raised, we skip that iteration and move on to the next file. So far we did not hit any exception, but any idea where we could get one? For example, say there are 200 original images, your independent data, but only 199 mask images: that is going to fail for sure, it will raise an error, and since there is an exception handler it will print the exception e along with the path, so you know exactly where the problem occurred.

Now, looking at the shape of X and Y: we have 200 images in total, each 128 x 128 x 3. Let's pull out one image, say the first one; this is how it looks, and the corresponding entry in Y looks like this. Good. By the way, if you did not want to use try/except, what else could we do from the coding side? We could simply check the image after loading: if the image was actually read, and so is not empty, process it; if not, skip it. If anybody wants to simplify the code, you can do that.

I also wanted to show you how the original image looked: it was 375 x 1242 x 3, which is why we standardise everything to 128 x 128. If you observe, we do lose some information: comparing the two images, one car near the edge has effectively been cut, this patch of ground looks different, and the image has been squashed so a tree near the border is distorted. So please be careful: try to find the optimal shape and size first and then choose 128 or 256. How do you find it? Loop through each and every image, track the maximum and minimum sizes (for example with a max over the shapes), and then decide whether you will pad the smallest images or crop from the largest ones. And just ignore this last cell, I was only experimenting there.
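A minimal sketch of that loading loop, continuing from the arrays and sorted file lists in the previous snippet (OpenCV reads images as BGR; that detail is ignored here):

```python
import os
import cv2

for i, (orig_name, mask_name) in enumerate(zip(orig_files, mask_files)):
    try:
        # Read, resize to 128 x 128 and scale pixel values to [0, 1].
        img = cv2.imread(os.path.join(ORIG_DIR, orig_name))
        X[i] = cv2.resize(img, (128, 128)) / 255.0

        msk = cv2.imread(os.path.join(MASK_DIR, mask_name))
        Y[i] = cv2.resize(msk, (128, 128)) / 255.0
    except Exception as e:
        # e.g. a missing or unreadable file: report it and move to the next one.
        print(e, orig_name, mask_name)
        continue

print(X.shape, Y.shape)   # (n, 128, 128, 3) for both

# The try/except can also be replaced by an explicit check:
#     if img is not None: ...
```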
Now comes the concept of transfer learning for our U-Net, basically for our segmentation model, and we get it from this particular pip command. It will take some time, so please be patient, and sometimes it might show errors; for me the first run did give an error, so I restarted my kernel and did a fresh pip install. You can also use pip3 or conda, whatever you prefer; more or less this should work, and if it does not work for any of you, let me know and we will sort it out, you can ping me during the break.

So what are we getting? First of all we have the segmentation_models package, and from there we import Unet. Next, from segmentation_models we import the backbones. This is a new concept for you: a backbone is a network that runs behind your main network. For example, my backbone could be ResNet while my main network is a combination of U-Net and fully connected layers; we will see how the architecture fits together further down. I usually also use Jaccard loss and IoU, so I have copied them in, but we are not actually using them here. Briefly: Jaccard loss is a type of loss function used with the optimizer, and IoU, intersection over union, is very simple. Say this is an image containing a triangle, and the person who prepared the corpus has tagged it, let's say perfectly. Now you run your algorithm and it tags the object with its own box. How do you measure how good your tagging is? You take the area where the two boxes intersect and divide it by the total union area. Say the intersection area is 90 and the union area is 120 because of the extra region your box covers; that ratio tells you how well the model performed. This is the accuracy score we use for object detection; it is not part of this module, but you will see it in next week's module.
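A minimal sketch of the intersection-over-union idea just described, for two axis-aligned boxes given as (x1, y1, x2, y2); the numbers are made up for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes, each (x1, y1, x2, y2)."""
    # Coordinates of the overlapping rectangle (if any).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Ground-truth tag vs. the algorithm's slightly bigger tag.
print(iou((10, 10, 50, 50), (12, 12, 56, 56)))   # ~0.69: decent overlap, not perfect
```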
Next we have Keras: we import Input, the convolution layers and Model so we can define our model, and then we split the data into train and test, as simple as that. The first thing I am doing here is bringing in the ResNet preprocessing. Can I say there is a chance of getting early outputs, where your signal does not have to travel through every single layer but can take a faster route through the skip connections? That is what makes this a complicated yet easy network to drop in, and it gives me some speed. And tomorrow, if you don't like ResNet-34, you can put in ResNet-50. What does ResNet-34 mean? You have 34 layers; 50 means 50 layers, with the skip connections bypassing blocks in between. So my backbone is currently ResNet, and whatever output the U-Net gives me will work together with that backbone for the final classification, very simple. If you don't want to use a backbone, that is okay; I am just showing you the option so you know you can do this rather than using a single network. If you are not comfortable with it for now, remove it, and later, when you are comfortable with U-Net, bring the backbone concept back.

Next comes the training part: whatever preprocessing ResNet defines, its preprocess_input, I apply to my training and validation data. We are saying that we do not want to manipulate the weights there, because it is an already trained ResNet-34 backbone network. Look at the shape of the training data: 170 images, and the validation data: 30 images, each 128 x 128 x 3.

Now coming to the U-Net itself: guys, this is the simplest way you can define it, and it works perfectly well. You will find architectures out there where this gets pretty complex; I welcome that, no issues, but don't jump to that immediately. Start with something simple and then move up. Anyway, in today's case study we only get an accuracy of around 54 percent, so I will ask you to improve on this and see how far you can push it.

The first thing I do is take the shape of my first training image, store it as N, and later bring that in as the input shape of my U-Net. After the backbone, I define my base model. What is my base model? Unet. Where did I get Unet from? From segmentation_models, the package we just installed; that is how the pieces connect. From Unet I am saying my backbone name is Inception V3 (if you remember, Inception was one of the classic architectures), and the weights we load onto it are the ImageNet weights. Then I give my input. Next comes layer one: a two-dimensional convolution, Conv2D, with three filters of size 1 x 1, applied to the input. Then the output of the base model: I apply my U-Net to layer one. After that, layer two: again a convolution with three 1 x 1 kernels, one per output channel. Finally I build my Keras Model, saying this is my input and this, L2, is what I want out; the model that actually runs inside is nothing but the U-Net base model. Then I print the summary. If you look at it, the input layer has no parameters as such and the output layer has 12 parameters, while the U-Net with Inception V3 brings in a huge number of parameters. From those we can see how many are non-trainable, fixed, and how many are trainable. If you ask me how that split was decided: since we are using a ready-made pretrained network, I have no control over which parameters are marked non-trainable; that comes with the ready-made part.
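A minimal sketch of that wiring, assuming the segmentation_models package (imported here as sm) with an Inception V3 backbone and ImageNet weights, and an X_train array from the earlier split; the layer sizes follow the description above, but the exact notebook code may differ:

```python
import segmentation_models as sm
from tensorflow.keras import layers, models

# Input shape taken from the first training image, e.g. (128, 128, 3).
n_shape = X_train.shape[1:]

# Base model: a ready-made U-Net with an Inception V3 encoder and ImageNet weights.
base_model = sm.Unet(backbone_name="inceptionv3", encoder_weights="imagenet")

inp = layers.Input(shape=n_shape)
l1 = layers.Conv2D(3, (1, 1))(inp)        # 3 filters of 1x1 on 3 channels -> 12 parameters
out = base_model(l1)                       # run the pretrained U-Net on layer one
l2 = layers.Conv2D(3, (1, 1))(out)         # map back to 3 output channels

model = models.Model(inp, l2)
model.summary()
```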
So this is how the network, the summary, looks, and if you want the summary of the base model, the base model is just the U-Net; look at its size, it is very heavy. It would be practically impossible for us to hand-code and train that from scratch, which is exactly why we use transfer learning.

Now coming to the optimizer and how we define things: we are using binary cross-entropy with Adam, and I use TensorFlow operations to do some calculations in our metric function. If you remember, when we compile we give the name of the optimizer and then a loss function. What I have done here is define our own loss function; you can use the standard ones, no issue, but just to show you it is possible, we define it ourselves. Second, we define our own dice coefficient, which plays the role of accuracy here. Remember the IoU concept, intersection over union: in the same spirit we multiply predicted against actual, sum it up with tf.reduce_sum, and divide by the union of the two. (As an aside, if you ever want to trim decimals, say turn 3.2468 into 3.24, you use a round function; here the TensorFlow operation we need is tf.reduce_sum.) We will come back to these special kinds of accuracies and losses once we go deeper, probably in week five and six, but for now just take this as our metric, and the loss is defined from the same quantity. You may ask what epsilon is: epsilon is just 10 to the power minus 7, a small constant to avoid dividing by zero. So this is my loss, this is my coefficient, and I pass them in when I compile, that's it. Finally I fit the model, giving the validation data as well, for 100 epochs; it took a lot of time for me, and on Colab I think you will be faster. The batch size I have taken is around 13; use a larger one to go faster.

If you observe, instead of accuracy we now report the dice coefficient: about 54 percent on training and 53.9 on validation, so again roughly 54. The model is stable, but I feel it is not that accurate. There could be a couple of reasons. The first is that we used 128 x 128, so a lot of pixel detail has been shrunk away; try increasing the size and report back what accuracy you get. The second could be the number of epochs; I agree there is not much change towards the end, but try running more epochs and you should be able to move ahead. So this is the problem I am giving you: try to improve it and let me know how far you can pull it up, but do it only on Colab, otherwise on Jupyter you will wait a long time. Even though each epoch shows something like three seconds, wrapping up at the end takes a while, and the last epoch in particular takes a long time.
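A minimal sketch of a dice coefficient and loss of that kind, plus the compile and fit step, assuming the model and the X_train/X_val/Y_train/Y_val names from earlier; the exact smoothing constant and batch size in the notebook may differ:

```python
import tensorflow as tf

EPSILON = 1e-7   # small constant so we never divide by zero

def dice_coef(y_true, y_pred):
    # Overlap of predicted and actual masks over the sum of both (union-like term).
    y_true = tf.cast(y_true, tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + EPSILON) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + EPSILON)

def dice_loss(y_true, y_pred):
    # Loss is "one minus the coefficient": perfect overlap means loss 0.
    return 1.0 - dice_coef(y_true, y_pred)

model.compile(optimizer="adam", loss=dice_loss, metrics=[dice_coef])

history = model.fit(
    X_train, Y_train,
    validation_data=(X_val, Y_val),
    epochs=100,
    batch_size=13,
)
```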
Have you noticed that on Jupyter the ETA for the last epoch shows zero seconds, yet the cell keeps running and does not come out of the execution? If you have not noticed, observe it; there is some delay at the end even though we already have the final numbers. One more thing, in case anybody is interested in timing things properly: there is a function called tqdm, I have shown it to you before; it is a very effective way of tracking how much time each step takes.

Finally, we predict. I will come back to saving the weights in a moment. When I run the prediction on our model, here is the original image and here is the predicted one. If your intention is only to find the road, you can say we have a very good result, but the problem I see is that there are vehicles on the road and, because of our low accuracy, they end up a similar colour to the road. That is going to be a problem: if a self-driving car were running on this, say a Google car, it would not identify those vehicles and would drive straight into them. So try to improve it and see.

Now an important concept: storing your weights. I have covered most of this already; the new bit I am showing you is JSON. You have four options: joblib, pickle, JSON and HDF5; you can use any of these to store the model or its weights so that tomorrow, when you open the code again, you do not need to retrain, you can load the model or the weights directly. All you do next time is create, say, new_model by loading from the saved file, which is our model.h5, and from then on you can call new_model.predict on whatever image you give it. That is the whole point: you do not have to keep retraining on your data. So this was a simplistic version of an industry-style U-Net.
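A minimal sketch of that save-and-reload flow in Keras, assuming the model, the dice functions defined earlier, and a sample image X[0:1]; the file name is just an example:

```python
from tensorflow.keras.models import load_model

# Save the trained model once (architecture + weights) to an HDF5 file.
model.save("model.h5")

# Later, in a fresh session: reload it instead of retraining.
# Custom objects (our dice functions) must be passed back in by name.
new_model = load_model(
    "model.h5",
    custom_objects={"dice_loss": dice_loss, "dice_coef": dice_coef},
)

# Predict on a single image; keep the leading batch dimension.
pred_mask = new_model.predict(X[0:1])
print(pred_mask.shape)
```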
Let's move on to a new type of network and try to reduce some of the computation in our convolutions. To motivate it, a question: say you have a phone, and on this phone you click a picture and the phone can detect certain objects; for example it detects your face, and from then on every picture you take it recognises as you. Almost every phone has this now. What kind of algorithm do you think is running on the phone? Remember, the phone has limited RAM, limited storage and only a small graphics processor. Do you think we would be able to load VGG onto it? How many parameters does VGG16 have? Roughly 138 million. Even if I pickle the weights into a file so they can simply be loaded and run, it is still not practical to carry that on a handheld device. Yet these phones do it, so there is a need for an algorithm that is a lightweight form of convolution: still a convolution, just a much lighter one. Say we could reduce the number of parameters and multiplications by around 95 percent; then, yes, the phone would be capable of loading it. That need gives us a new network, MobileNet, a compressed, simpler version of our complex models.

Let's build this up slowly, revisiting the concept of ordinary convolution. Say this is my input image; it has some height and width and a number of channels. On top of that we multiply some filters. Let's define the shapes: the image has height DF and width DF, and M channels; this is how an image looks in Python. The filter is DK x DK, always square. Now, when I take this through convolution, max pooling and so on, my output is a diminished version with a lot of feature maps; say each feature map is DG x DG, and with N filters I get DG x DG x N feature maps in total (this is in our week-one slides, keep it handy). Now let's build an equation. What does one convolution multiplication look like, at one output position for one filter? It is DK x DK x M, so DK² · M multiplications. How many of these do we do in total? Once for every output position and every filter, so the total is DG² · DK² · M · N multiplications. Look at how heavy that is: if the spatial dimension is 1024, imagine the cost. This is where mobiles and handheld devices struggle or become slow, because convolution is nothing but multiplication, and multiplication at this scale is heavy. So if multiplication is heavy, what lighter mathematical operation could we bring in? Think about it. Do you agree that the multiplications are the heavy part, and that with a megapixel image this is not going to be instantaneous?
Now, what if I replace the multiplications with something else? Can addition be one option? If I have to add 1024 + 1024, fine, that is essentially one cycle; but 1024² multiplied by several hundred further factors becomes far too heavy, and the parameter count only grows from there. So the idea is: can we restructure things so that a big product turns into a much smaller sum of cheaper pieces? For this we come to a new concept called depthwise separable convolutions. It is still a convolution, no change there, but the way we do it is altered, and it is divided into two steps.

Step one is called the depthwise convolution. My input is DF x DF with M channels. The convolution operation itself stays the same; the only change is in how I define the filters. Each filter is DK x DK, but instead of one filter spanning all M channels, I use one DK x DK filter per channel: each filter convolves only its own channel. So I run it once per channel, M times, and the output is DG x DG x M, one feature map per input channel. Now the math. One multiplication step at one output position on one channel costs DK²; per channel that is DG² · DK²; and across all M channels the total is DG² · DK² · M. Notice what happened: we eliminated the large factor N that was multiplying everything before. That is the whole aim of the depthwise step.

Now let's do the second step, which is called the pointwise convolution; once we build this one you will see the full picture of why we are doing this. Let me redraw the image: there are two steps, and the output of step one goes in as the input to step two. So my input is now that DG x DG x M stack, and what I want out of it is DG x DG x N, the same kind of output the ordinary convolution would have given, with N feature maps. But this time I am not saying one filter per channel.
So here I redefine my filter: now it is a 1 x 1 filter, but it reaches across all M channels, and I have N such filters. That is the pointwise step: 1 x 1 filters only, but N of them, each spanning the full depth M. When I apply these to the depthwise output, I get DG x DG x N; careful, it is N now, not M. What does one multiplication look like here? At one output position, one 1 x 1 filter across M channels costs 1 · 1 · M = M; over the whole map that is DG² · M per filter; and with N filters the pointwise total is DG² · M · N.

As I said, this is a sequential process, step one followed by step two, so the total computation is depthwise plus pointwise: DG² · DK² · M + DG² · M · N. Take out the common factor M · DG² and you are left with M · DG² · (DK² + N). Is everybody clear so far? Now let's compare this with the ordinary convolution. This is the cost of the depthwise separable convolution, and DG² · DK² · M · N was the cost of the full convolution; to compare them we simply divide one by the other, the same way you would compare a new salary against your current one by taking the ratio. So the numerator is the depthwise separable cost and the denominator is the full convolution. Cancel DG² with DG², cancel M with M, and you are left with (DK² + N) / (DK² · N), which, splitting the fraction, is 1/N + 1/DK². The equation that was full of multiplications has turned into this small sum; I am not saying we have transformed multiplication into addition, it is just a ratio, but look how small it is. Now take an example: say the number of output feature maps N is 1024 and the filter dimension DK is the usual 3. How much do you get? About 0.11, roughly 0.1. What does that mean? The numerator is about 10 percent of the denominator, so we have reduced the computation by roughly 90 percent by following these two steps. This is one way of designing an efficient network.
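A minimal sketch that just counts the multiplications from the formulas above, with made-up sizes chosen for illustration; the mapping to Keras layers is noted in the comments:

```python
# Multiplication counts for one layer, using the symbols from the derivation.
# Standard convolution : DG^2 * DK^2 * M * N   (e.g. keras Conv2D)
# Depthwise step       : DG^2 * DK^2 * M       (e.g. keras DepthwiseConv2D)
# Pointwise step       : DG^2 * M * N          (a 1x1 Conv2D)

DG, DK, M, N = 112, 3, 64, 1024   # example sizes, chosen for illustration

standard  = DG**2 * DK**2 * M * N
depthwise = DG**2 * DK**2 * M
pointwise = DG**2 * M * N
separable = depthwise + pointwise

print(f"standard : {standard:,}")
print(f"separable: {separable:,}")
print(f"ratio    : {separable / standard:.3f}")   # equals 1/N + 1/DK^2, about 0.112 here
```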
So tomorrow, when you move to the industry side, if you follow these rules about matching up the sizes, you can definitely mix and match almost anything and work out a solution. Why did I go through this in so much detail? Because I did not find a really good reference that explains it simply; the slide deck is fine, but it is not very detailed, so I went a bit from scratch. This final slide is exactly what we just did; if I had shown you these diagrams directly at the start, they are not very easy to visualise. And this is roughly what a MobileNet looks like: if you feed in a 224 x 224 x 3 image, this is the sequence of layers it goes through.

Now coming to something very interesting, because this is what we actually do in industry: if you work in computer vision, you will use these algorithms very often. Now comes the real CV: object detection, and eventually Faster R-CNN. My intention is this: if my device is supposed to focus on the face, my convolutional network should give me a neat box around the face. Using some generic rule I could of course draw a much bigger box, but then it is going to cover the shoulders and half the person; I don't want my algorithm to spread out like that, I want it to be precise and tag only the face, not a generic region. Why do we need this? There could be multiple applications: checking who is coming in and going out, facial detection, and so on. My classifier can tell me that there is a face in the image, but how do I actually draw the box around it? If you look at this frame, there are a lot of people and each one has been tagged with a box; how do we get those lines? One answer is segmentation, and on top of that we could say that, on average, a face occupies roughly this much area for a human being.

A simple approach is this: I randomly choose one box size (not a filter this time, an actual window) and I slide that box all over the image. When my window sits at some position, I know its dimensions and its location, so I know exactly which part of the image, which cells, I am talking about; I have the geography. I take that crop, treat it like the feature map going into my network, and pass it to a fully connected classifier: are you able to find a face in this crop? If it says one, then whatever I tagged there is a face; all the other crops go through the same classifier and come out as zeros. Simple. But there is one challenge, if anybody can spot it: how many of these windows do we end up with? An image can easily be scanned with something like two thousand of these windows.
So the challenge for us is, first, to maintain the locations and dimensions of those two thousand windows, and second, to push every one of them through the classifier; but yes, it works very well. This approach of proposing regions and classifying each one is the basis of R-CNN, the region-based CNN. I will just give you the idea, because we are not going deep into it in this content, but it will make the videos easier to follow. In R-CNN, this is my input image and we propose roughly 2,000 candidate regions, saying there could be an object in each of them. It is not necessary that every region you pass through the CNN and the fully connected layers will turn out to contain an object; some of them will fail, but for the ones that do pass, the good thing is that we already know the location, so we know exactly where to put the box.

Now we have found the problem: this is slow, and I am eager to improve the time. So let me show you Fast R-CNN. Instead of thousands of small windows, I use a handful of larger region filters: say one region here, another marked in red, and a third that covers almost the whole image, so I have knowledge of three zones. I take one zone and push it through the convolutions, and after the convolution I split into two heads: a softmax to decide whether there is a face or not, and a regressor to give me the x and y dimensions of the box. If both come out positive, I put the results back together and draw a box, a circle, whatever, around that particular region, because the softmax has said it is a face. The larger regions you see here are nothing but regions of interest. Now, a question for all of you: can I say the speed has increased? I would say yes, because I no longer have to do two thousand computations, only a few region filters.

What if I want to improve it still further? In Faster R-CNN, rather than keeping the region proposals as a separate parallel step, the proposals are generated from the network itself: this is my image, these are my convolutions, these are my feature maps, and the candidate squares are computed from those feature maps and collected into a proposal stage; then we classify and keep the ones with a correct output. There is not a huge conceptual difference between the two, but if you look at the standard timings, Faster R-CNN is, as the name says, faster. We will see that down the line.
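Before switching topics, a minimal sketch of the sliding-window idea described above: slide a fixed-size box over the image, crop each window, and score it with some face classifier. The classifier here is a stand-in function, not a real model:

```python
import numpy as np

def score_face(crop):
    """Stand-in for the CNN / fully connected classifier: 1 for 'face', 0 otherwise.
    In a real pipeline this would be a model.predict call on the resized crop."""
    return int(crop.mean() > 0.5)   # dummy rule, illustration only

def sliding_windows(image, win=64, stride=32):
    """Yield (x, y, crop) for every window position, so we keep the 'geography'."""
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield x, y, image[y:y + win, x:x + win]

image = np.random.rand(256, 256, 3)          # placeholder image
detections = [(x, y) for x, y, crop in sliding_windows(image) if score_face(crop)]
print(len(detections), "windows flagged as faces")
```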
So now we are going to look at what machine learning is. If you don't have an idea of it: machine learning is basically a subset of artificial intelligence that allows a system to learn and improve automatically from experience without being explicitly programmed. What does that mean? After you build your model, you check whether it is predicting well; if it is, you do not need to write the program from scratch every time, you just provide new data to the model and it predicts. That is what machine learning is about: we train our model once, and when new data comes in we simply test the model on it.

Before the types, let's look at the difference between traditional programming and machine learning. In traditional programming, every time you want a prediction you have to write the code again from scratch; with machine learning, once your model is built and you know it works well on new data, you do not rewrite anything, you just provide the new data and the model does all the work for you. That is why machine learning is gaining popularity over traditional programming.

Now, the types of machine learning: we have supervised learning, unsupervised learning and reinforcement learning. In supervised learning the model learns under a supervisor, which tells it whether its answers are right or wrong, and we work with labelled data. In unsupervised learning there is no supervisor and the data is not labelled; the model has to work out by itself how the data groups together. In reinforcement learning, if the model predicts wrongly it receives a penalty, and if it predicts correctly it receives a reward. These are the three basic types, and within each there are several algorithms. Today we will mainly work on the unsupervised side.

Now comes the recommendation engine, but first a couple of questions from the chat. Akalpita suggests it would be better to have a Q&A after the lecture: yes, sure, we can do that. Shikhu Lali asks how we train a model and why we have to split our dataset: we split the dataset because we do not want to give the machine all of the data; the machine learns from the training set, and the test set stays unknown to the model so we can check how it performs on unseen data. Ananda asks whether machine learning is related to data science: yes, machine learning is basically a subset of data science; if you look at the data science life cycle, at the end, when we need to build our model, that is where we need machine learning algorithms.
So, what is a recommendation engine? A recommender system predicts the choices of users and helps them discover new products or content based on their past behaviour. We are all familiar with Amazon's recommendations: suppose you are buying a mobile phone; Amazon will suggest a phone cover or headphones that go with that phone. You don't have to search for those things explicitly, they appear automatically in the "you may also like" section. That is how a recommendation system works: it improves the user's shopping experience and it helps the company boost its sales.

Now let's see which companies use recommendation systems; LinkedIn, Netflix and Instagram are just a few examples, and many more companies use them today. LinkedIn's job-matching algorithm improved its performance by around 50 percent using a recommendation system, and Netflix has valued its recommendation system at about half a billion dollars to the company. We are all used to Netflix or Amazon Prime Video: after you search for or watch a movie, it tells you that these other titles are also available and you might like them as well. Instagram switched to an algorithmic feed: based on what you engage with, it decides which posts and advertisements to show you, effectively saying "these are products you might like". So recommendation systems are used by almost every company to boost their sales.

Now, the types of recommendation system: we have user-based collaborative filtering and content-based filtering. Let's look at what these two mean and then walk through a use case of how each works. User-based filtering builds a model from a user's past behaviour as well as similar decisions made by other users. Suppose I watched a movie and another person watched the same movie; there is a chance we have the same taste, so the machine calculates how similar our tastes are and, based on that, recommends movies I might like. Note that this model is based on the similarity between users, not on what the movies themselves contain. That model is then used to predict the items the user is likely to be interested in.
So, as we said, if two people keep watching the same movies, the recommendation system can assume they have similar taste, and whatever movies one of them has watched become good candidates for the other. That is how user-based filtering works. Now, what is content-based filtering? Content-based filtering does not look at what other users are watching at all. Instead, suppose you have watched some movies: it looks at the content of those movies, the actors, the director, the type of movie (maybe it is a horror movie, maybe an animated one), the ratings, the other hidden features of the movies, and recommends titles with similar characteristics. So these are the two filtering approaches used in recommendation systems, and that is the main difference between them.

Now, how does Netflix work, which system do they use? Netflix uses a hybrid recommendation system: they combine both concepts, UBCF and CBF, user-based collaborative filtering and content-based filtering, to build a more effective recommender. That is how Netflix and Amazon Prime give you "you may like these movies" suggestions based on your past choices.

Let's look at user-based collaborative filtering more closely. The algorithm finds a large group of users, searches for the users whose taste is most similar to yours, looks at the different things they liked, and combines all of that to create a ranked list of suggestions. Someone asked which algorithms we can use for UBCF: we can use k-nearest neighbours, here used simply to find the most similar users, and Pearson correlation; with these two you can build your own user-based recommendation system.

Now a use case. Suppose we have a girl called Lisa, and Lisa has just watched the movie Joker; we are all quite familiar with Joker, and if you know it, let me know in the comments. Before that, a question from the chat: what do you mean by the hidden features of an item, will they be part of the features in the dataset or unknown features? By hidden features I mean this: suppose you have the data but you do not know how it behaves. For example, I have online transaction data but I do not know how the data behaves when a transaction is fraudulent; finding that out, maybe there are outliers, maybe the spread of the data changes when a transaction is a fraud, that is what I mean by hidden patterns. And a question from Abdullah, asking for the notebook:
yes, you will get the notebook and the presentation as well on the Great Learning Academy; you just need to register for the course and you will find all the material there. A question from Raghavendra: is it necessary to learn data structures and programming for every language? Data structures are a basic part of starting to program, so you should have a basic idea of them, not necessarily competitive-programming depth, but the basics, yes. Vikas asks how much time it takes to learn machine learning: that depends on you; if you want to understand the mathematics behind all the algorithms it will take some time, but if you only want the coding part it will not take long. Ajay asks whether we are going to analyse some data in this video: yes, we will analyse some data as well. Amani asks whether we will see Python code for building a recommendation system: yes, we will build one in a Python notebook. Vanita points out that k-NN is a supervised algorithm: right, but in this case we are using the nearest-neighbour idea in an unsupervised way. And for the question of which category a recommendation system comes under: here it basically comes under the unsupervised learning part.

Okay, let's go back to the presentation. Lisa has just watched Joker; let's see how the recommendation engine recommends her next movie. The machine generates a list of users who have watched the same movie: Sam, Joy and Ratan have all watched Joker. Dave has not watched Joker, but the machine considers him anyway because his previous movies suggest he has similar taste. So these four users are listed, along with the movies that all of them have already seen, and the question is which of those movies should be recommended to Lisa next. Lisa has watched Joker, but she has not seen any of the other three movies: The Book of Life, Avengers, or the third one. So we go for voting: take the list of Lisa's watched movies, find the users with the same taste, and look at the probable movies for Lisa. Avengers gets two votes, while The Book of Life and the other movie get one vote each, so Avengers gets recommended to Lisa as the movie she is most likely to enjoy. That is how user-based collaborative filtering works; let me know if you have any doubts about that.
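A minimal sketch of that voting idea, with a small made-up watch-history table (the user names follow the example above, but the data itself is invented for illustration):

```python
# Who has watched what (hypothetical data mirroring the example).
watched = {
    "Lisa":  {"Joker"},
    "Sam":   {"Joker", "Avengers", "The Book of Life"},
    "Joy":   {"Joker", "Avengers"},
    "Ratan": {"Joker", "Coco"},
    "Dave":  {"Avengers", "Coco"},
}

target = "Lisa"
votes = {}

for user, movies in watched.items():
    if user == target:
        continue
    # Similar taste = at least one movie in common with Lisa.
    if watched[target] & movies:
        for movie in movies - watched[target]:
            votes[movie] = votes.get(movie, 0) + 1

# The movie with the most votes from similar users gets recommended.
recommendation = max(votes, key=votes.get)
print(votes)             # e.g. {'Avengers': 2, 'The Book of Life': 1, 'Coco': 1}
print(recommendation)    # 'Avengers'
```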
Now content-based filtering: it works with the data that the user provides, either explicitly or implicitly, and based on that data the suggestions are made. The engine's accuracy increases with the amount of input it gets: the more input it has, the more effective the system becomes, and if you do not have much data the system will not perform well.

Let's take the same example with Lisa. Lisa has watched Joker; how does the recommendation system use the content-based method to recommend her next movie? It generates a list of features, things like the actors, the directors, the theme, the story, the characters; this time we are looking at the internal attributes of the movies. We then compare the feature columns of each candidate movie with the columns of Joker: hero, horror movie, theme, IMDb rating above 8, comedy. Joker has a hero, it is not horror, it has a strong theme, its IMDb rating is most probably not above 8 (I would have to check, I am not sure), and it is not a comedy. Now, how many of those match The Book of Life? It has a hero, it is not horror, it has a theme, the IMDb rating matches, and it is a comedy. The Book of Life shares the most similarities, so the system says this can be your next recommended movie.

Okay, now let's do a demo of a movie recommendation system. I told you we are going to use Google Colab, and I will show you how. Go to your favourite browser and type "google colab"; the first link is Google Colab, click on it, and then all you need to do is click on New Notebook. That is all it takes to use Colab for any notebook. Someone was saying he has a Mac and does not know how to install Python; in that case you can use Google Colab, you do not need to install anything, you can just start writing your code there. And a question about the easiest way to learn Python: Python is a very easy language, there is no special trick, just start learning it and you will love it.

So here you can see I have opened our notebook, which I prepared earlier: a movie recommendation system using machine learning. I took the data from Kaggle and uploaded it to Dropbox, and there are two datasets, ratings.csv and movies.csv. You can get the datasets from Kaggle itself, or you will get them along with the notebook; you do not need to download anything manually, the notebook fetches the data directly from Dropbox. I have used the wget command and given it the path of the dataset, so it fetches the files straight from Dropbox; let me just execute that. Now I want to see what we have in ratings.csv, and for that we will use a pandas DataFrame. Pandas is a Python library; with it you can do a lot of data manipulation and cleaning, and it is one of the most popular libraries in Python.
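A minimal sketch of those first notebook cells, assuming MovieLens-style ratings.csv and movies.csv files; the Dropbox URLs are placeholders, not the real links used in the session:

```python
# In Colab, fetch the two CSV files first (placeholder URLs; substitute the real links):
# !wget -O ratings.csv "https://www.dropbox.com/<...>/ratings.csv?dl=1"
# !wget -O movies.csv  "https://www.dropbox.com/<...>/movies.csv?dl=1"

import pandas as pd

ratings_details = pd.read_csv("ratings.csv")
movie_details = pd.read_csv("movies.csv")

print(ratings_details.head())      # first five rows: userId, movieId, rating, timestamp
print(movie_details.head())        # first five rows: movieId, title, genres

print(ratings_details.shape)       # (100836, 4), as shown in the session
print(movie_details.shape)         # (9742, 3)

print(ratings_details.describe())  # count, mean, std, min, 25%, 50%, 75%, max per column
```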
Next we use the pandas read_csv function to read ratings.csv. Nikhil asks whether this coding is done in a Jupyter notebook; you can do exactly the same thing in Jupyter, it's just that Google Colab is handy for me, so it's up to you which environment you prefer. I store the contents of ratings.csv in a variable called ratings_details and call the head function, which returns the first five rows of the data: userId, movieId, rating and timestamp, so I can see which user gave which rating to which movie. Then I fetch movies.csv from Dropbox in the same way, store it in movie_details, and its head shows movieId, title and genres for each movie.
Now that we have both datasets, let's look at their shapes; to see the shape of a dataset you just use the shape attribute. ratings_details has 100,836 rows and 4 columns. Riya asks me to explain the wget method: wget is a command-line tool that fetches data from a link, so if you have the URL of the data it downloads the file directly. Ankit asks what pandas is used for: pandas is for data manipulation, so if you want to clean your data or explore it easily, pandas is one of the most important libraries in Python. The shape of movie_details is 9,742 rows and 3 columns.
Next we use the describe function to get a good understanding of the data. describe returns the basic statistics: the count, the mean and standard deviation of each column, the 25th, 50th and 75th percentiles, and the minimum and maximum, and you can view the result as a DataFrame. If you want to add a new code cell in Colab, just click the plus sign. I ran describe on the movies data too; I hit an error at first because I had mistyped the variable name, fixed it, and got the statistics. Prashant asks where I got this data: from Kaggle, you don't need to go anywhere else. Gulam asks whether I can share the data: just go to the Great Learning Academy, register for this same course, and you will get the presentation and the notebook as well, free of cost. The loading and inspection steps are sketched below.
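A short sketch of the loading and inspection steps just described. The file names and column names follow the MovieLens-style dataset used in the demo; the variable names ratings_details and movie_details match the ones mentioned in the session.

```python
import pandas as pd

# Read both CSV files into DataFrames.
ratings_details = pd.read_csv("ratings.csv")
movie_details   = pd.read_csv("movies.csv")

# .head() returns the first five rows of each DataFrame.
print(ratings_details.head())    # userId, movieId, rating, timestamp
print(movie_details.head())      # movieId, title, genres

# .shape gives (rows, columns); .describe() gives the basic statistics.
print(ratings_details.shape)     # roughly (100836, 4) for this dataset
print(movie_details.shape)       # roughly (9742, 3)
print(ratings_details.describe())
print(movie_details.describe())
```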
So we also have the describe output for movie_details: you can see the statistics for its movieId column. Let me execute that once more. Now we are going to merge the two datasets. We have ratings_details and movie_details, and to merge two DataFrames we use the pandas merge function. We have to say on the basis of which column we want to merge, and that is movieId; movieId acts as the primary key when we combine the two datasets. Once the data is merged you can look at the last five rows with the tail function; tail gives you the last five rows of a dataset, just as head gives you the first five. In the merged data we now have userId, movieId, rating, timestamp, the title of the movie and its genres.
The next thing we need to do is convert the timestamp into a proper date-time. For that we use the pandas to_datetime function and pass it the timestamp column, so that column is converted into a readable date and time. Let me execute that and check the shape again: because we added a datetime column we now have about 100,836 rows and 7 columns.
Prashant asks when we would use scikit-learn in this scenario: scikit-learn is what we use to split a dataset and to build machine learning models; all the predefined algorithms are there in scikit-learn. Avinash asks why we can't use concatenate instead; you can use concat if you want, I used the merge function, it's up to you. Now let's use nunique, which returns how many unique values each column has; maybe the same user rated several different movies, so I want to know how many unique customers we have. For userId nunique gives 610 and for movieId it gives 9,724, while the dataset itself is much larger, so there is repetition: one user can watch and rate a lot of movies. Finally, let's run describe on the merged dataset to see its statistics; the merge, datetime conversion and nunique steps are sketched below.
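A sketch of the merge and datetime steps, assuming the ratings_details and movie_details DataFrames from the earlier sketch; the variable name data is illustrative.

```python
import pandas as pd

# Merge on the shared key movieId.
data = pd.merge(ratings_details, movie_details, on="movieId")
print(data.tail())                              # last five rows of the merged data

# MovieLens timestamps are seconds since the Unix epoch, hence unit="s".
data["datetime"] = pd.to_datetime(data["timestamp"], unit="s")
print(data.shape)                               # now seven columns

# How many distinct users and movies are actually in the data?
print(data["userId"].nunique())                 # 610 in the demo dataset
print(data["movieId"].nunique())                # 9724 in the demo dataset
print(data.describe())
```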
For the merged data the count is 100,836, and we can see the mean, standard deviation and the 25th, 50th and 75th percentiles for each numeric column. That completes the data manipulation part; now we need to group the data to see the average rating for each movie. Suppose I take the movie Joker: I want to see the average rating our dataset gives that particular movie. I store the full merged dataset in a variable and then group it. The groupby function groups your data according to a column: for the title Joker it gathers all the ratings given to Joker, and it does the same for every title. So I take the title and rating columns, group by title, and take the mean of the ratings to get the average for each movie. For example, Hellboy has an average rating of 4, and xXx (2002) has an average of about 2.77.
Next we sort the data: I want to see which movies average five, which average four, and so on, in descending order from five down to one. To sort in descending order you just pass ascending=False, and the result goes into a sorted-ratings-by-movie variable. You can see the top title averages 5.0 while something like The Beast of Hollow Mountain sits at 0.5, so the sort runs from the highest-rated movies down to the lowest.
Now we look at the total number of ratings per movie. A movie might have a high average but very few raters: if ten customers rate one movie 4 and only two customers rate another movie 5, the second movie's average looks better even though far fewer people rated it. So we need the count of users for each movie, and for that we use the same groupby technique on title and rating, this time with the count function. For example xXx (2002) has 24 ratings, State of the Union has 5 and eXistenZ has 22; I only printed the last five rows, but it gives the count for every movie. Both the average-rating and count steps are sketched below. Next we will build a new DataFrame.
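A sketch of the two group-by steps, assuming the merged DataFrame data from the previous sketch; avg_ratings, sorted_ratings and rating_counts are illustrative names.

```python
# Average rating per title, sorted from best to worst.
avg_ratings = data.groupby("title")["rating"].mean()
sorted_ratings = avg_ratings.sort_values(ascending=False)   # 5.0 ... down to 0.5
print(sorted_ratings.head())

# How many ratings each title actually received.
rating_counts = data.groupby("title")["rating"].count()
print(rating_counts.tail())                                  # e.g. xXx (2002): 24
```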
The new DataFrame will have the title of the movie, the average rating and the count of total ratings, which makes it much easier to analyze the data. Avinash asks whether we can use print inside the groupby syntax; I didn't quite get the question, can you explain it a bit more? Likith asks whether we should use mode here instead of mean; that's up to you, if the mode helps you understand your data you can use it, I've used the mean. To make a new DataFrame we call pd.DataFrame: the new frame gets a column called Average Rating, holding the average rating of each movie, and then I add the Count of Total Ratings column, so it ends up with the title, the average rating and the count. Looking at it, '71 (2014) has a rating of 4.0, and 'Round Midnight has a rating of 3.5 with a count of 2.
Now we want to visualize the data, and for that we use the matplotlib library together with the seaborn library. First let's check the columns: we have Average Rating and Count of Total Ratings, and the titles have become the index of the DataFrame. Then I plot the count of total ratings. From the output you can see that most movies have received fewer than 50 ratings, and the number of movies with more than 100 ratings is very low; even more than 50 is quite rare. Almost six thousand movies fall between 0 and 50 ratings. This is how you can judge your data; I'm not using a big dataset here, so if you want a more efficient model you can use a bigger one. Next I plot a histogram of the average rating using the hist function. You can see that the integer values have taller bars than the floating-point values, since most users assign ratings as whole numbers, one to five, and it is also evident that the data has a weak normal distribution. The DataFrame construction and both plots are sketched below. Using this kind of data
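A sketch of the summary DataFrame and the two histograms, assuming the avg_ratings and rating_counts series from the previous sketch; ratings_summary and the bin count are illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Combine average rating and rating count per title (title becomes the index).
ratings_summary = pd.DataFrame({"Average Rating": avg_ratings,
                                "Count of Total Ratings": rating_counts})
print(ratings_summary.head())

# Most movies have fewer than 50 ratings; very few cross 100.
ratings_summary["Count of Total Ratings"].hist(bins=70)
plt.xlabel("Count of Total Ratings")
plt.show()

# Whole-number ratings dominate, giving a weakly normal shape around 3.5.
ratings_summary["Average Rating"].hist(bins=70)
plt.xlabel("Average Rating")
plt.show()
```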
visualization work, you can understand your data much more efficiently: the data has a weak normal distribution with a mean of around 3.5, and there are a few outliers. If you remove those outliers your model will usually start performing better. Data visualization is one of the most important parts of building a model; the graphs themselves start telling you a lot about the data.
I'm also going to plot one more thing, a joint plot, which is part of the seaborn library, with the average rating on the x-axis and the count of total ratings on the y-axis. This graph shows that, in general, movies with higher average ratings also have a larger number of ratings compared with movies that have lower average ratings: if a movie is rated around 1, the number of ratings it has received is small.
Now let's actually build the recommender; everything up to this point was about understanding the data and how it is spread. We use a pivot table with userId, title and rating: the rows are the users, the columns are the movie titles, and the values are the ratings. If I print the columns of this movie matrix you can see all the movie names. Then I take the column for xXx (2002) from the movie matrix, which gives me every user's rating for that movie, for example 3.5 and 2.0. To find movies similar to xXx (2002) we use the corrwith function, which returns the correlation of every other movie with the one we pass in, and of course a movie's correlation with itself is always 1. The joint plot, pivot table and correlation steps are sketched below.
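A sketch of the joint plot, the user-by-title pivot table and the correlation lookup, assuming the data and ratings_summary DataFrames from the earlier sketches. The exact title string "xXx (2002)" is a placeholder: it must match the title as it appears in your dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Higher average ratings tend to go with larger rating counts.
sns.jointplot(x="Average Rating", y="Count of Total Ratings", data=ratings_summary)
plt.show()

# Rows = users, columns = movie titles, values = the rating given.
movie_matrix = data.pivot_table(index="userId", columns="title", values="rating")

# All user ratings for one movie (placeholder title).
xxx_ratings = movie_matrix["xXx (2002)"]

# Correlation of every other movie's rating pattern with this one.
similar_to_xxx = movie_matrix.corrwith(xxx_ratings)

corr_xxx = pd.DataFrame(similar_to_xxx, columns=["Correlation"])
corr_xxx.dropna(inplace=True)                     # drop titles with no overlap
print(corr_xxx.sort_values("Correlation", ascending=False).head(10))
```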
You can see that xXx (2002) has a correlation of 1 with itself, so the correlation check is working as expected when we look for movies related to a particular title. We can also use a heat map, with matplotlib and seaborn, just to look at the correlations in a visual way: all the diagonal entries are 1, because rating correlated with rating is always 1 and userId with userId is always 1; that is how the correlations are calculated.
Next we put the correlations into a DataFrame: we take the series of movies similar to xXx (2002), give it a Correlation column, and drop all the NaN values from the DataFrame using the dropna function with inplace=True. In the result you can see that (500) Days of Summer has a correlation of about 0.83 with xXx (2002), and one title from 2016 shows a correlation of 1, so that could be the next recommended movie for a user who has watched xXx (2002), with (500) Days of Summer as another strong candidate. I printed only the head first; if you print the whole DataFrame you will also see some negative values, and a negative value basically means there is no positive correlation between the two titles.
Now let's get started with a quick "did you know" about Python. Python is a fantastic programming language, and it should not come as any surprise that it is one of the most sought-after languages today; it is in fact often considered the world's number one programming language. There are many reasons for that: whether you talk about data science, artificial intelligence, data analysis, web development, game development or user-experience development, Python powers all of those domains, and millions and millions of people use it today. Since you are watching this on YouTube in a browser, you may also have watched shows on Netflix or platforms like Amazon Prime, Zee5 and Hotstar; all of these work with Python in the back end, even though you never see Python code directly, because of the user-experience layer in front. If you use the Mozilla Firefox browser, it contains somewhere around 230,000 lines of code written in Python, and that is part of what gives Firefox its functionality. Similarly, YouTube uses a fair amount of Python and Netflix uses a ton of it; Prime, Hotstar and the rest, wherever they build recommendation systems or push a product across, they are definitely using
Python, because it offers a ton of benefits for relatively little time that you have to invest in it. One fascinating question I always get asked is why the inventor of Python named it after the snake. Well, he didn't: the founder of Python, Guido van Rossum, was a big fan of a famous comedy TV show, Monty Python's Flying Circus, a sketch-comedy programme from roughly fifty years ago. Van Rossum wanted to pay tribute to the show he loved so much, and that is why he named his programming language after it; Monty Python, not the snake, is the reason you see Python as the name of this language. How fantastic is that?
Now let's quickly discuss the third point on your screen: for the domain of natural language processing, or NLP, Python is extremely popular. If you clicked on this video because you want to build your own chatbot, you probably already know a little about NLP: natural language processing is how machines understand what goes on when we converse in a natural language, be it English or any other language. Chatbots depend entirely on natural language processing. I'm not reading out zeros and ones as instructions; when I say "Hey Google, what's the weather like tomorrow?" I'm using English, a language that is natural to me, and the computer has to understand the language, the question and everything the question asks for, and then give me a valid answer. That is another reason Python is the usual choice here: it has multiple libraries that give you uninterrupted access to some amazing functions for building chatbots really easily. I hope this quick "did you know" helps you get aligned with what we are about to learn in today's session.
Now let's get started with an introduction to what chatbots are. I actually gave you an example in the previous slide: think about Siri, think about Alexa, think about Google Now; there are a lot of chatbots governing today's world, and you might not even notice that you are talking to one in real life. So what are chatbots? Chatbots are basically an attempt to simulate a human being on the other side of our display: we are trying to get a machine to understand, process and give us useful
answers and perform specific tasks by understanding human language. This is a fascinating concept that has been around for a long time; we will get to the history of chatbots in a moment, but you can already see the extent to which chatbots have an impact around us. The first chatbot was created by Joseph Weizenbaum in 1966 and was named ELIZA. ELIZA is in fact still around: if you search for "ELIZA chatbot" on Google there are websites where you can actually talk to ELIZA and see what it was like back in the late sixties; that's a fun thing you can try.
An interesting thing to understand about chatbots is that it is no longer a simple transaction where I say hi, the chatbot says hello back, and we are done. They have become so sophisticated that in many conversations you will not be able to tell you are talking to a chatbot at all; they have become so human in the way they converse that sometimes even the best of us are fooled into thinking a chatbot is actually a human being on the other side. We are making our computers intelligent enough that they are hard to distinguish from a human, and that is both a pathway towards artificial intelligence and something that opens up a hundred doors for the future; the world of chatbots is definitely contributing to that.
So we know what chatbots are and we have used them in the form of Siri, Alexa and Google Now, but what about their history? We saw that ELIZA came out in 1966, but even before the first chatbot there was the fantastic mathematician Alan Turing. He had a very strong impact on the course of the Second World War, which is a story of its own; there is even a movie based on Alan Turing, I think it is called The Imitation Game, and you should definitely watch it. What Turing did was ask a question: can machines think like humans, and if they can, would we be able to tell whether we are conversing with a human or a machine? That led to a test called the Turing test. In the Turing test a human is on one side and a computer is on the other, but the person does not know it is a computer; the human asks questions, and if the computer can answer to a degree where the person believes a human is answering, the machine passes the Turing test: it was intelligent enough to fool the human sitting there into
thinking they were talking to another human being when in fact it was a machine. That is when people started thinking hard about how best to make machines align their thinking with human beings. Passing the Turing test is fairly straightforward for modern computers and machine learning algorithms, but back in the day, when something first passed the test, things got really exciting, because people realised you could bring together a whole world of technology that never existed before.
For example, there is an amazing application of this by the people at Google: they have a chatbot that can call a restaurant and book a table. Maybe you want to go out to dinner with your partner or family; you can say "Hey Google, book my reservation at that restaurant at that time on that day," and it is not just going to remind you to do it yourself, Google will actually call the restaurant, talk to the manager, make the booking, ensure it is done, and even remind you later that it booked it for you. They actually did a public demo of this: it orders food and books restaurant tables, and you would never know that what is on the other end could basically be a simple piece of code. That is how fascinating the world of chatbots has become, and we are still at the stepping stone, still working out how best to apply and use them.
This domain is evolving very rapidly. There was a time when you could only say "Hey Siri, tell me the time" or "Hey Siri, set my calendar"; now you can ask Siri to guess whatever song is playing in the background, or click an image and ask Siri, Google or Alexa to look at it and tell you what it thinks of it. We could go on about these applications, but at this point it should be clear that the growth in the chatbot domain, and the number of future applications, shows a very positive trend, especially if you are serious about building a career around chatbots in Python.
All right, now let's look at the history of chatbots and line everything up from the sixties to the latest. As we saw, ELIZA came out in 1966; then we had a chatbot called PARRY in 1972, ALICE in 1995 and SmarterChild in 2001. There is a good chance you have not heard of those first four, but you will certainly have heard of the next three: Siri, Apple's chatbot, came out in 2010,
Google Now came out in 2012, and Alexa came out in 2015. I have been using Siri since day one, from 2010 until today, so it has been about eleven years, and it has been a fantastic ride watching how they developed the platform. The person who originally voiced Siri is also quite famous: her name is Susan Bennett, the US female voice, and nowadays multiple people voice these chatbots, so you no longer get a robotic artificial voice, it is literally a human voice on the other side. The process is interesting too: it is not that a human records every word in the dictionary; they record certain syllables and certain words and build the voice from those, but that is something we can discuss some other time.
Now that we have had this introduction, know a bit about Python and know the history of chatbots, it is important that we talk about the various types of chatbots we have today. Chatbots can be divided into five or six different categories, and it can get confusing, so to keep it simple for the whole audience we will focus on two very important types that every other chatbot is based on: text-based chatbots and voice-based chatbots. The chatbots you have probably used, Siri, Alexa, Google, are both text-based and voice-based: you can either talk to them or type to them with the keyboard open. In a sense they all work as text-based chatbots, because whenever you say something the message is converted into text, and the text is what the chatbot actually understands; but of course there are also chatbots that only recognise what you say, that is, they know how to recognise voice, understand who is talking, what is being said, what is being asked and what needs to be provided.
If you shop or order online a lot you will come across these chatbots all the time, whether you are ordering food on Swiggy or Zomato or chatting with an Amazon customer-support portal. At the start of such a conversation, 99 percent of the time you are talking to a chatbot, and some of these applications will even tell you so; it will help you with everything that might solve your problem. If the chatbot runs out of options it will not just say "sorry, I cannot help you"; it will offer to connect you to a human representative who can help with that particular query. So it is a well-rounded world with no jagged edges: whatever questions you ask, a well-built text-based or voice-based chatbot can give you well-rounded answers.
The text-based chatbot gives you a textual output, while the voice-based chatbot, as the name suggests, reads the answer out to you. Those are the important user-facing types, but when you are designing a chatbot you have to split things into two more types: rule-based chatbots and self-learning chatbots. A rule-based chatbot, as the name itself suggests, has certain rules it follows to answer every question the user asks. Every answer it gives is already available in a body of details, let's call them rules, and those rules are defined by the developer, so the chatbot can use all the data present there to give you anything from simple to fairly complex answers. If you have a large set of rules and provide a lot of information, the chatbot becomes quite capable on its own.
Then you have the self-learning chatbot, which is an even more interesting game to play, because self-learning chatbots are mostly based on machine learning models: they are created using machine learning algorithms. This is a bot that has the capability to communicate, to learn new things and to understand new things using the power of machine learning; it uses that to assess what is happening, what is required of it, and then gives you the answer to the question you are asking.
When you talk to a non-technical person about types of chatbots you would probably talk about text-based and voice-based ones, but since you are on the verge of creating your own chatbot you definitely need to understand these two further divisions, the rule-based chatbot and the self-learning chatbot. Both are very popular, both have their own advantages, disadvantages and niche applications. At first glance the self-learning chatbot might look more fascinating and more sophisticated just because it uses machine learning, so is it more powerful? In some cases, yes, it is, but that does not mean it works best in every case. Rule-based chatbots are also fantastic: they are very efficient, and if you provide a good amount of data there is no beating the efficiency of a rule-based chatbot. So those are the two other types of chatbots I wanted to discuss now that we are about to build one.
Now that we have had the introduction, looked at the history and understood how chatbots work and their types, we can go ahead and discuss some of the top applications of chatbots.
When I was researching for this video a couple of weeks ago, every application of chatbots I looked at was fascinating, so it is genuinely difficult to say "these are the top applications"; I found a use for pretty much everything you see on your screen right now. If you have ever used smart homes or smart lighting, you will understand that you can do almost whatever you want by bringing together the world of chatbots and the Internet of Things: most chatbot applications in that space involve a smart IoT device that also works with one of your chatbots, and that combination is what makes the whole thing "smart".
Coming back to the top applications: yes, there are hundreds if not thousands of applications today, but the ones we are going to discuss are things you have probably already used. The first one is the help desk assistant. As I mentioned, maybe your food is running late on Swiggy or Zomato, or your order is delayed on Amazon; you usually open a chat portal and talk to a "representative". It asks for the details, usually the same templated questions, and you pick from multiple options while the chatbot tries to solve your problem itself. There is only so much functionality a chatbot has, so if it cannot solve your problem the help desk assistant will raise a ticket so that a human expert on the other side can connect with you and help you out.
Then we have the email distributor. Email distributors are great if you like to keep your emails structured, labelled and nicely ordered; there are multiple chatbots working on that. Those are the newer applications, but there are also classic ones, such as spam filtering: a bot reads through the mail arriving in your inbox and makes an assessment, this is an important mail, this is junk, this is spam, this one has a virus, and so on.
Then there are home assistants. I have a vacuum cleaner called a Roomba, which is basically a robot vacuum that goes around the house and cleans; eventually it learns where the walls are and where the unclean spots are, it cleans them and then docks all on its own. You might have seen videos on Instagram of cats or dogs riding around on these little robot vacuums. These are very intelligent appliances, home assistant entities where you can do a lot of programming, add a lot of
functionality and use them to their fullest. Then there is home automation, which is extremely popular these days. Say you have a home theatre: all you have to do is walk in and say "Hey Google" or "Hey Siri, turn on the home theatre", and it will turn on the air conditioner, the projector, the speakers and the computer, so that by the time you walk into your theatre the temperature is taken care of, the lighting is taken care of and there is music playing in the background; it gives you a complete, well-rounded experience. Home assistants come in many varieties.
Then we have the operations assistant, which relates to the same kind of example we discussed for Zomato; you can also have a chatbot where you ask technical questions and it gives you technical answers on how to solve something. Then the phone assistant: every other assistant we have been discussing either is an assistant built for your phone or trickles down from that technology, so Siri, Google and Alexa are all phone assistants, and I don't think I need to explain much more about them. Finally, the entertainment assistant, which is also important in today's world; I have heard this is especially popular in football and basketball, where you can have conversations with chatbots about what they think of the match, their predictions, who will score, how the match will end, so you are not just talking to your assistant but getting it to predict the future for you. Think about sport, about predicting the outcome even before the match has begun, and that is just one example; there are tons of other assistants in practically every part of the entertainment sector, but I wanted to highlight one very popular application you might already know about.
We have discussed a lot about chatbots themselves; now we are going to get right into the heart of the concept and look at the architecture of chatbots, that is, what makes a chatbot what it is. Whenever we look into the architecture of chatbots, there are certain elements that build a chatbot which are common regardless of the type of chatbot and regardless of the application it is trying to solve. When we discuss the architecture I am going to present the general picture, a bird's-eye view of the concept, because if you dive right into the
details of how they work, Siri works very differently from Google Now, and Google Now works very differently from Alexa, so instead of confusing you with all those convoluted specifics let's look at the overall flow of how a chatbot works. The architecture consists of several entities, so let's take it from the front end to the back end. First, you have the chat window or session: this is where you communicate with the chatbot. Maybe you pick up your phone, swipe up and Google Now opens; that is your chat window, the session that just got created; or you are talking to an Amazon chat service, or to Zomato or Swiggy. You have a user interface through which you talk to the chatbot, and as soon as the chatbot becomes active it creates an active session that it uses. For example, some chatbots, if you do not talk to them for a while, will ping you to check whether you are still around. Tesla's cars do something similar on autopilot: the driver always has to keep their hands on the steering wheel, and if not, the car starts sounding alarms and effectively asking whether you are awake or asleep. It is a funny example, but that is what the chat window or session means. Then there has to be an interface, and this is not the user interface we just talked about: it is a bridge between the NLP model and the chat window.
Now let's talk about the NLP model, so we can see why we need that interface at all. NLP stands for natural language processing, and this is a model built either with a neural network or with machine learning algorithms, in such a way that, first, it has data points it can use to give out answers, and second, it has the programming necessary to understand the question, pick the relevant data and give you an answer. That is the heart and brain of the whole thing; without this entity the chatbot world would not be as sophisticated as it is today. Yes, you can still build chatbots without a very fancy model, and in fact we will demonstrate that you do not need a super-complex model when you are programming one from scratch, but since we are covering everything it is worth having it here.
Next, when you expect a model to train, what does it need? Data. In the world of chatbots we usually call that data the corpus, or the dictionary; there are a lot of names for it, but
whenever you hear those terms, corpus, dictionary, data, understand that they basically mean a repository of information: the data the chatbot uses to give you answers. We will discuss the corpus in detail in the next slide. Then you see the application DB, the database that hosts your application and the details needed to bridge the gap between the application and the NLP model. Finally, the NLP model uses the interface to push the answers out, either in textual form or in voice form, to the chat window. That is the entire working of our chatbot architecture. Of course, if you wanted to look at the NLP model itself in detail there is a lot going on inside it, and at this point you might be thinking, "okay, so we don't know how the NLP model works, can you explain?" Don't worry, that is exactly what we are going to do in the next section, where we try to understand the process that takes us from a corpus all the way to a functioning chatbot that works as required.
Before we do that practically, we have to look at how a chatbot works theoretically. On the screen there are seven very important steps, and we will go through them one by one, from the first, importing your corpus, to the last, one-hot encoding. These are fairly standard steps, a standard methodology, for using natural language processing, and not just NLP but Python and the foundational concepts from the chatbot architecture, to see how things work and eventually to build one ourselves.
So what is step number one? Importing the corpus. The corpus, as I just mentioned, is the training data your chatbot needs in order to learn; it is also called the knowledge base. Knowledge base, corpus, data, dictionary, they are all names for the same thing, and without it your chatbot cannot learn anything or give you anything useful in return. If your chatbot does not know what an apple is and you keep asking it questions about apples, can it give you an answer? No, because it simply does not know. We are intelligent enough to go and research: as soon as we want to know something we head to Google, type it in and learn it on our own. Computers do not learn that way; they can execute a process, but the way they learn is very different from how we as humans do. So this is important: we require a corpus, we require the data, compulsorily, without which a chatbot will not work, because it will have no answers to the questions it gets
asked. Think of it as an examination: suppose I have never attended a single class and I walk into the exam with a textbook in my hand that I have never opened; do you think I can clear the examination without any knowledge of the subject? A few of you might say "maybe", but most of you will say no, because I have not learned anything that I could go on to show. That is exactly the analogy for how the corpus works.
Once we have a corpus, meaning we have textual data, voice data or data in a database that the chatbot needs, remember that data is a raw, messy entity, and it takes a lot of effort to pre-process it and get it ready for the later stages. If you do not do that, then your NLP model, your machine learning algorithm, whatever kind of intelligence you are trying to impart to your chatbot, will not work well, because if you feed it inaccurate data it will not be able to make sense of it and it will simply give you inaccurate answers back, which will throw you off.
Pre-processing can be done in many ways, and one very important step we will be doing is text case handling. The textual data you have will contain many sentences and paragraphs, and if you copy it all out you will see capitalization scattered everywhere: lowercase characters in most places, but uppercase characters here and there that need handling. To avoid any misrepresentation or misinterpretation of words, we take all the data in the corpus, those hundreds and hundreds of sentences, and convert everything either to lowercase or to uppercase; Python keeps track of which was done, so if everything is made lowercase the chatbot can still capitalize the first word of its reply. You might wonder whether all of this is really necessary, but when you yourself talk to a chatbot from Zomato, Swiggy or Amazon you will notice it feels as close to a human as it can get, and without text case handling processing a chatbot's input becomes extremely difficult, so this step is vital. A tiny sketch of the case-handling step is shown below.
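A minimal sketch of the case-handling step, using a made-up two-sentence corpus: normalising everything to lowercase means "Apple", "APPLE" and "apple" are later treated as the same token.

```python
# Illustrative corpus text; a real corpus would be hundreds of sentences.
raw_corpus = "This is a Blog. Chatbots USE Natural Language Processing."

corpus = raw_corpus.lower()    # the whole text-case-handling step in one call
print(corpus)                  # 'this is a blog. chatbots use natural language processing.'
```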
After converting all the text to lowercase or uppercase, the next process, shown in the sample on your screen, is tokenization. Tokenization takes the hundreds of sentences in the corpus and breaks them down into individual words, because without that your chatbot could only hand back whole sentences as answers, and especially when you care about user experience you want it to answer the specific question rather than spit out something generic. Look at the example on the screen: the statement "this is a blog" has been split into four separate entities, this, is, a, blog. That is the structured process we use to convert sentences into collections of individual words; once you have individual words they can be recombined into new sentences, which works quite well in English, and your chatbot can give good answers based on that.
Once we finish with tokenization we come to another concept called stemming. Stemming is fantastic, and if you are a fan of etymology there is a good chance you have come across the idea. Take a look at the chart: the original words jump, jumped, jumps and jumping all share one common root word, jump, and from that root word you can form all four variants. With stemming, instead of dealing with millions and millions of word forms, we just try to find the root word. If you have ever prepared for competitive examinations like the GRE or GMAT verbal sections, there is a very good chance you used the same technique: learn one root word and you have effectively covered the ten or twenty words derived from it. So stemming finds similarities between words, maps them to the same root, and teaches the chatbot that these forms belong together; it makes the chatbot both more efficient and more intelligent. A short sketch of tokenization and stemming with NLTK follows below.
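A sketch of tokenization and stemming using NLTK, the library used later in the demo. The example sentence and word list mirror the ones discussed above; downloading the "punkt" tokenizer models is a one-time setup step.

```python
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)     # tokenizer models, needed once

sentence = "this is a blog"
tokens = nltk.word_tokenize(sentence)
print(tokens)                          # ['this', 'is', 'a', 'blog']

# Stemming maps inflected forms back to a common root.
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["jump", "jumped", "jumps", "jumping"]])
# ['jump', 'jump', 'jump', 'jump']
```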
Once we are done with stemming, there is one more concept called BoW, or bag of words. It sounds funny, but it is quite literal: imagine I take a handful of English dictionary words and drop them all into a bag. That is a bag of words. The point is that after stemming and breaking everything down into single words, the order no longer matters. If you remember the game Scrabble, take all those individual letter blocks, put them in a bag and give it a good shake: the order of what is happening inside the bag is not important, we just want the words to be there. Now take the same sentence we broke into single words, "this is a blog". A machine learning or deep learning algorithm that takes over at this stage does not understand English words as they are, so the words must be turned into mathematical vectors for the algorithm to work with. That is why you see "this is a blog" split into four words with zeros and ones assigned to them: when the statement is passed to the machine learning algorithm, it is really this vector, a one followed by zeros, inside an overall matrix, that gets used. Later there is a further step where vectors are stitched together using a dot operation, a matrix multiplication that brings two vectors together; the internals can get complex, but implementing it in Python stays pretty simple. So that is generating the bag of words: take the words, stem them to their roots, cut the text into individual words, put them all in a bag, and assign numbers so that any sentence has a vector the machine learning algorithm can understand. After bag of words there is one last process called one-hot encoding. One-hot encoding takes these categorical values and converts them into a form the machine learning algorithm can use directly, as in the bag-of-words step: the vector is created and passed to the algorithm so that it can tell this sentence from that sentence, even though the order of words is not important and all the individual words still are. That is why we need one-hot encoding. If you are ever curious how a chatbot decides to reply with hi, hello, how are you or what's up, a lot of those responses rest on this step, because at the end of the day the chatbot has to go through the entire data, pick up something meaningful and return it as a logical reply to your question. All of that works in the back end through this process of one-hot encoding. A rough illustration of the bag-of-words vector with scikit-learn is shown below.
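Here is one common way to build such count vectors with scikit-learn's CountVectorizer; the demo later builds its vectors differently, so treat this only as a sketch of the idea.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is a blog", "This blog is about NLP"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the vocabulary (get_feature_names() on older scikit-learn)
print(bow.toarray())                       # each row is the count vector for one sentence; word order is lost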
Now that we have had a chance to look at the theoretical aspects and understand a fair amount about chatbots, it is the right time to check all of this out practically. The goal of this demo is to show that you too can build a chatbot in Python from scratch without putting in a ton of effort. You might think a decent chatbot needs hundreds of thousands of lines of code; if you are thinking on the lines of Siri or Alexa, maybe yes, but for a simple chatbot we do not, and that is exactly what I am going to demonstrate. Let's open Google Colab, which is simply a Python Jupyter notebook hosted on the Google Cloud Platform. To access it, just search for "colab" on Google; it is called Colaboratory, and it is a fantastic tool I use all the time, whether I am at home or, before Covid, travelling. I have already written down all the pieces of code we need so we do not have to type them out as we go, but let me explain how each of them works, and let me zoom in so you can see the screen better. Whenever we work with a chatbot, the first and most important step is to bring in all the libraries we are going to use. In this case we need four: numpy, which is used for numerical computation in Python and is extremely popular for data science; nltk, a superb library for natural language processing and one of the most useful ones if you are serious about chatbots; string, to process and handle strings; and random, and you will see why we need random as we go ahead. As soon as I click the play button, numpy, nltk, string and random are imported with a single click; I can do this on my phone, on my iPad, anywhere I have an internet connection and a browser.
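For reference, the import cell looks roughly like this; the two nltk.download calls fetch data packages that the tokenizer and lemmatizer used later depend on.

import numpy as np   # numerical computation
import nltk          # natural language toolkit: tokenizing, lemmatizing, WordNet
import string        # handy constants such as string.punctuation for cleaning text
import random        # used later to pick a random greeting response

nltk.download('punkt')    # pre-trained punkt tokenizer models
nltk.download('wordnet')  # the WordNet dictionary used for lemmatization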
The next step is another four or five lines of code. The first two lines, roughly f = open(...) and raw_doc = f.read().lower(), simply bring in our corpus: the corpus file lives on the computer, and I will show you how we created it in a moment; we just make sure our working environment has access to it and can read it. The next few lines do the pre-processing I described: with the .lower() call I take the entire document and convert it to lowercase. Once it is lowercase, we use the tokenizer concept we discussed, and here we use a particular tokenizer from the nltk library called the punkt tokenizer. Punkt is a pre-trained tokenizer that already has the capability to do what we need out of the box. You might wonder why we use punkt when there are other tokenizers available, such as the tweet tokenizer or the regexp (regular expression) tokenizer. If we were importing our corpus from Twitter, say analysing hundreds of thousands of tweets, we would use the tweet tokenizer; here we use punkt because it is pre-trained, easy and pleasant to work with. The next couple of lines convert the document into sentences and those sentences into lists of words: we create two objects, sent_tokens and word_tokens, where in sent_tokens each element is a sentence and in word_tokens each element is a word, as simple as that. If I click run on this right now it will give me an error, because it does not yet have access to the corpus file, which is on my desktop, so let me upload it. While it uploads, here is how I created the corpus: I searched for "data science wikipedia", copied everything on that Wikipedia page from the first line to the last, pasted it into a text file and called the file chatbot.txt. That is literally it; this is a very simple chatbot, but I wanted you to see what the corpus itself looks like. Now let me run the cell again, and it runs absolutely fine, because this time it has access to the file.
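A sketch of that corpus-loading cell, assuming the imports and nltk downloads from the previous snippet and a local file named chatbot.txt containing the copied Wikipedia article:

with open('chatbot.txt', 'r', errors='ignore') as f:
    raw_doc = f.read().lower()             # convert the whole document to lowercase

sent_tokens = nltk.sent_tokenize(raw_doc)  # each element is a sentence
word_tokens = nltk.word_tokenize(raw_doc)  # each element is a single word

print(sent_tokens[:2])  # first two sentences of the corpus
print(word_tokens[:2])  # first two words, e.g. ['data', 'science']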
Next, let's print a couple of sentence tokens to see whether our tokens are working. You can see one sentence, "data science is an interdisciplinary field that uses scientific methods, processes...", and below it another, "data science is related to data mining, machine learning and big data", so with that piece of code I printed two sentences from the sentence tokens. Similarly I print two word tokens from the corpus, and I simply get the words "data" and "science". The point of printing these is not to show that printing works; it is to verify that the lists of sentences and lists of words we created earlier actually came out right. Once that is done, the next step is to pre-process the text, and this is where lemmatization and the other ideas we looked at come in. The pre-processing works like this: everything is converted to lowercase (or uppercase), and then we want a normalized vector to come out of it, so we have to strip away a number of things. There might be sentences that have not been broken into words, and once they are broken into words, what happens to the punctuation marks, the commas and full stops? You have to handle those too. To process all of this we use WordNet, which is a dictionary that comes built in with the nltk library; if it were not for WordNet and these couple of helper functions, we would have to write a lot of code for this pre-processing segment. As soon as I run the cell, the data is pre-processed. Next comes the fun part: a greeting function. When I say hello to my chatbot I want it to reply, but I will not always say hello; I might say hi, sup (the millennial short form of what's up), what's up, hey, any number of things, and it should understand that this person is greeting it. When I type any of hi, hello, greetings, sup, what's up or hey, it gives back one of the responses: hi, hey, *nods*, hi there, hello, "I am glad you are talking to me". This is why we imported random at the start: when I give it a greeting input, it uses random.choice to pick one of the greeting responses and give it back to me.
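The pre-processing helper and the greeting function look roughly like this; the exact names are illustrative, it reuses the imports above, and the logic follows what was just described (WordNet lemmatization, punctuation stripping, and a random greeting reply).

lemmer = nltk.stem.WordNetLemmatizer()
remove_punct = dict((ord(ch), None) for ch in string.punctuation)

def lem_normalize(text):
    # lowercase, strip punctuation, tokenize, then lemmatize each token with WordNet
    tokens = nltk.word_tokenize(text.lower().translate(remove_punct))
    return [lemmer.lemmatize(tok) for tok in tokens]

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad you are talking to me"]

def greeting(sentence):
    # if any word of the user's sentence is a greeting, reply with a random greeting
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)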
As soon as I run this, the bot knows how to greet people, and with maybe twenty or twenty-five lines of code we have already reached a point where our chatbot can say hi and bye to us. The next thing is to get responses to the questions we ask, and for that we use two concepts: TF-IDF vectorization and cosine similarity. Quickly, without going into too much detail: TF-IDF stands for term frequency-inverse document frequency. Term frequency, as you might guess, is about how many times each individual word is repeated in the corpus, the frequency of occurrence of words. Then there is the IDF, the inverse document frequency, which is a fantastic metric, because finding term frequency is simple, just list the words, count how often each appears, ten lines of code, but the IDF part is more interesting: instead of just saying this word is repeated ten times, it attaches a component of how rare the word is, how rarely it occurs in the corpus. Some words occur very often; if I search the corpus for the word "data" it appears around 120 times, so it is not rare at all, but a less popular word such as "explorer" has been used only once, so it is very rare. The IDF picks this up: it recognises which words are rare and which are repeated many times. The cosine part does the rest: once we have the bag of words, everything ready in ones and zeros, it takes all of that and gives us a normalized output. If the technicalities do not fully land, do not worry; what we are doing is finding how often a word is repeated, mapping its rarity, and producing normalized vectors the machine can understand. On the same lines of response generation, the next step is to make sure that when we give the bot some input it gives us an answer; we have already provided the corpus, so it has to pull the particular answer out. But you also have to handle the situation where you type something the chatbot does not understand: if the TF-IDF score is zero, it means the bot did not understand your input, and it will tell you "I'm sorry, I don't understand what you are saying". So at this point the bot can greet you, and it also knows what it can and cannot understand. A sketch of this response function is shown below.
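Here is a sketch of such a response function using scikit-learn's TF-IDF vectorizer and cosine similarity; it assumes the sent_tokens list and the lem_normalize helper from the earlier snippets, and the function name itself is just illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def response(user_input):
    sent_tokens.append(user_input)                        # temporarily add the question to the corpus
    tfidf = TfidfVectorizer(tokenizer=lem_normalize, stop_words='english').fit_transform(sent_tokens)
    sims = cosine_similarity(tfidf[-1], tfidf)            # compare the question with every sentence
    idx = sims.argsort()[0][-2]                           # best match other than the question itself
    scores = sims.flatten()
    scores.sort()
    sent_tokens.pop()                                     # remove the question again
    if scores[-2] == 0:                                   # nothing in the corpus matched at all
        return "I am sorry, I don't understand you."
    return sent_tokens[idx]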
Now let me zoom out a little for the last piece of code, and notice that we are still not writing hundreds and hundreds of lines. This is where we define the start and end of the conversation. The start is basically a loop in which the bot keeps waiting for input indefinitely if I do not talk to it; you could extend it, for example closing the bot if the user has not responded for 30 seconds, but this is the most basic bot, just to show how far Python can take you. All it does is: when the user says hi it starts talking back, when the user says bye it quits, and in between it gives the responses that make the chatbot feel intelligent; a rough sketch of the loop is shown after this paragraph. Let me run it and read what it prints: "My name is Stark. Let's have a conversation! If you want to exit any time, just type bye." Let's first check that the exit works: I type bye, and it says "Goodbye, take care" with a small heart; that part of the code is right here, so when I said bye it understood that I want to leave. Let me run it again; this time I say hello, and it replies with *nods*, as in it is nodding and wants to listen. If I say hi, it gives another reply, hey, so it is randomizing the responses: whenever I say hello, hi, greetings, what's up or any of those inputs, it picks one of the outputs for that part of the code. So hi worked, hello worked, all of that works.
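Pulled together, the loop is roughly this; it relies on the greeting() and response() functions sketched above, and the bot's name and phrasing are simply what the demo uses.

flag = True
print("BOT: My name is Stark. Let's have a conversation! If you want to exit any time, just type bye.")

while flag:
    user_input = input().lower()
    if user_input == 'bye':
        flag = False
        print("BOT: Goodbye! Take care <3")
    else:
        reply = greeting(user_input)
        if reply is not None:
            print("BOT: " + reply)       # randomized greeting response
        else:
            print("BOT: " + response(user_input))  # TF-IDF based answer from the corpus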
Since we are talking about data science, let's see whether the bot can find something from the corpus itself. The model is not intelligent enough to understand every question directly, but if I ask for, say, the foundations of data science, let's see what it prints using the corpus. When I type "foundations", it answers: data science is an interdisciplinary field focused on extracting knowledge from data sets, which are typically large, applying that knowledge and deriving actionable insights to solve a wide range of problems and applications. I asked for the foundations of data science and it gave me a one-line answer for what it understands data science to be. Let me ask another one, "impact of data science"; there are some warnings saying the stop words are inconsistent, because our corpus is not that big, but it answers: as big data continues to have a major impact on the world, data science does as well due to the close relationship between the two. So it tells me big data has a huge impact on the world, and data science does too, because the two are tied closely together. Now what happens if I type just "data science"? It picks up a more or less random sentence, "however, data science is different from computer science and information science"; this is where the bot may not be as accurate as you would like, but then this is a very, very simple bot. If I ask about techniques in data science, it finds the part of the corpus about technologies and techniques: there are a variety of different technologies and techniques used for data science, which depend on the application, which is a good answer. So apart from one or two answers, once I quiz it with five or six questions it handles all of them. Let me say bye to the chatbot, and it says goodbye, take care, with the heart. Now let me clear all the outputs and zoom out to show how much code we have written: four lines here, another six there, ten, twenty, forty-six, all of it comes to less than 60, at most 70, lines of very simple code. Scroll through it and you can see we have not done much, and yet all of this is an entire chatbot of its own. Of course it is not as intelligent as Google or Siri, but at the end of the day it is a chatbot that gives you the important answers to your questions, and we never put in a lot of effort to get there. Again, the goal of the demo was not to run through every line of syntax and explain why something is called sent_tokens and not something else; the point was to show that you can build a chatbot in Python without much effort, and I think this demo showed that nicely. So, what exactly is Python? If you are new to the language, let me give you a brief intro. Python is free and open-source software, which means you can download it and immediately start working with it; there are no prerequisites and you do not have to pay for it. It is also cross-platform compatible, so whatever operating system you have, Mac, Linux or Windows, you can download Python onto your system. Python is also an object-oriented programming language, which means you can represent real-world entities in the programming paradigm. Look around you and you will notice you are surrounded by objects: the laptop in front of you is an object, the mobile in your hand is an object, and so are the wallet or the headphones
beside you; they are objects too, and if you want to represent all of these real-world entities in a programming paradigm you need an object-oriented language, and Python is one of the most popular and widely used of them. The best part about Python is its huge standard library and ecosystem: it has libraries such as numpy, pandas, matplotlib and seaborn, so whatever you want to implement, if you want to perform data analysis Python has a library for it, if you want to do data visualization Python has a library for it, if you want to create a website Python has a library for it, and if you want to implement artificial intelligence Python has libraries for that as well. To install Python you can go to python.org/downloads; when the page loads you see "Download Python 3.8.5", and clicking it downloads Python automatically. Since Python is platform independent, you can download it for Windows, Linux or Mac. Once you have Python you need an IDE, which stands for integrated development environment; if you have worked with other languages such as Java, C or C++, you know an IDE is essentially a development kit that helps you write and run your code more smoothly. One such IDE for Python is PyCharm, and to install it you go to jetbrains.com/pycharm and click the download button; there is a professional version and a community version, and since we only need it for general-purpose Python development the community edition is enough. For data science purposes, though, we mostly use a toolkit called Anaconda, the most widely used Python distribution. PyCharm is only an IDE, but Anaconda is a complete distribution: when you download it you get an environment called Jupyter Notebook, and you do not have to explicitly install most Python libraries, because Anaconda ships with almost all of them, pandas, matplotlib, seaborn, numpy and so on. To download Anaconda, go to its site, click Products, then Individual Edition, then the download button, and you can download Anaconda for Windows, Mac or Linux; since I have a Windows system I will take the 64-bit graphical installer. Anaconda gives us Jupyter Notebook, and to open it on Windows you just type "anaconda prompt", select the Anaconda Prompt, type "jupyter notebook" and hit enter; you get the home page, which is the Jupyter Notebook itself, and if you want to create a new Python notebook you click New and choose the Python
3 notebook. I click on it and this is how the new notebook looks; let me rename it to something simple like "python". If you want to print something, just type the print command with some random text, hit Run, and it is printed below the cell. That is how you work with Jupyter Notebook. To save a notebook you have the Save As option, and to download it you use Download As; normally we download it as a .ipynb file, which stands for IPython notebook, and that is its extension. What you see here is a cell, and if you want to add a cell above it you click Insert and choose to add one above; similarly you can add a cell below, and you can see a new cell appears. So that is a basic introduction to PyCharm, Anaconda and Jupyter Notebook. Which one is better, PyCharm or Jupyter? I personally prefer Jupyter Notebook because it is a web-based interpreter, so if I want to share a notebook with someone I can do it easily, whereas PyCharm is a bit cumbersome to work with; also, when I install Anaconda I get most libraries pre-installed in Jupyter, so I do not have to install them explicitly. Can you use Visual Studio? Absolutely, that is another IDE you can use for Python. Someone new to data science asked which parts of Python she should focus on to learn data science in the best way: I would recommend these primary libraries. Numpy is the core library for numerical computation, so you need expertise there. Pandas is the core library for data pre-processing and data manipulation; it gives you two objects, the Series and the DataFrame, and most data sets we work with come as a DataFrame, so pandas is a must-learn for data science problems in Python. Once you have analysed something you need to present it as clear graphs, and that is where visualization libraries come in: matplotlib and seaborn, and I recommend learning both because they are equally important and each has its own strengths. So those are the four libraries to master if you are learning Python for data science: numpy, pandas, matplotlib and seaborn. What degrees are needed to be a Python developer at a big company? There are no specific degrees for Python; a BE or BTech in computer science, or even a BCA or MCA, is more than enough. After that it is all about your skills: how much you have worked on open-source projects, how many Git repositories you have, and where you rank in the different projects on
Kaggle; all of that matters. So once you get your degree, go and work on more open-source projects, that will benefit you the most. What is the best way to practice data science and machine learning? That is a very good question; think of the pillars of data science and machine learning. The first pillar is statistics, because statistics forms the core of data science, machine learning and artificial intelligence, and if your statistics concepts are not clear you will not be able to properly comprehend the ML, data science or AI concepts. You need to know the measures of central tendency, the measures of deviation, and the different probability distributions; there is one distribution, the normal distribution, that is especially crucial, because most machine learning algorithms prefer the data to be normally distributed. Once you are good with statistics you move on to programming. For data science there are two main languages, Python and R, and I will repeat this: it is not a case of either-or, I recommend learning both, because both are equally important and which one you use depends on the project; in big companies you may use Python on one project and R on the very next one. So the first pillar is statistics, the second is a programming language, and the third is the machine learning algorithms themselves: if you are good with the stats and the math, you will easily understand the underlying concepts of linear regression, logistic regression, decision trees, random forests, naive Bayes, SVM and the other algorithms. Those three pillars, stats, programming and ML algorithms, are the proper road map to get started with data science. Is there any similarity between MATLAB and Python? Interesting question; I would say MATLAB leans more towards computer vision, so if you want to do anything related to computer vision you can prefer MATLAB. I remember a project in my final year of engineering built with MATLAB: it was about scriptures written on palm leaves, where much of the text was missing or the page was torn, and we had to find out how much of the text was missing, draw bounding boxes, and predict what the next word in a Sanskrit sentence would be when a couple of words were gone. With the help of MATLAB and computer vision we had to predict that next word. So I would say MATLAB sits more towards the computer vision side of things,
and Python more towards the deep learning side of things, but then computer vision is itself a subset of artificial intelligence, so you can use both tools. Can you do a doctorate in data science? Absolutely; a lot of people do a master's and a doctorate in data science itself, and top schools such as Stanford, Harvard and MIT provide doctorate courses in data science, artificial intelligence or computer vision. Dr. Sarkar himself has his PhD from Stanford, in statistics. Apart from numpy, pandas and matplotlib, how much more Python do you need to learn for the data science, ML or AI domain? I would say those cover the main libraries, and a good understanding of numpy, pandas and matplotlib goes a long way, but they are only the core: if you actually want to implement data science there is another library called scikit-learn, with which you implement the different algorithms such as linear regression, logistic regression and so on, and if you want to implement deep learning or artificial intelligence concepts you need frameworks such as Keras and TensorFlow. What exactly is TensorFlow? TensorFlow is a deep learning framework with which you can implement neural networks; if you want to build artificial neural networks you need a deep learning framework such as TensorFlow or Keras. How do you install numpy and matplotlib in Jupyter? It is very simple: to install a library you type pip install followed by the name of the library, so pip install numpy for numpy, pip install pandas for pandas, and pip install matplotlib for matplotlib. That is how you install different libraries in Anaconda.
My Twitter developer account has not been restored yet, so the code here will be fairly small, but let's quickly go through the code for Twitter sentiment analysis and see what is happening. What you see is the code for Twitter sentiment analysis, and before any of it you need the Twitter developer API. Search for "twitter developer api" on Google, open the first link, click on Apps, and you get the option to create a new app. My own developer account was suspended for some reason, perhaps because I was running a lot of live sessions with it, but once you create a Twitter developer account and a new app you get four things: a consumer key, a consumer secret, an access token and an access token secret, and you need the values of all four. Since my account is suspended I do not have access to them, so I will explain what the code does but, unfortunately, I will not be able to actually extract tweets from Twitter. The code starts by importing the required libraries. The first is tweepy; with the tweepy library we get access to the Twitter developer API, so if you want to reach Twitter through Python you install it the same way, pip install tweepy. Through tweepy you also get the OAuthHandler, with which you authenticate your consumer key, consumer secret, access token and access token secret with Twitter; think of it as the verification check Twitter performs to make sure your keys and tokens are genuine. Then, to understand the sentiment of the tweets you extract, you need the textblob library; again, install it with pip install textblob and import it with from textblob import TextBlob. Now we have all of our required libraries.
Once we create our app we will have the values for the consumer key, consumer secret, access token and access token secret, and we simply plug those values in. Then we have to authenticate the tokens, and we keep that inside an exception-handling block, a try and except, because if authentication fails we want to print something sensible. Here is what happens: we create an object of the OAuthHandler, passing in the consumer key and consumer secret, and store it as auth; then we set the access token with auth.set_access_token, passing the access token and access token secret; and once all four are set we call tweepy.API with the auth object we created. That is how the authentication with Twitter is set up. Once authentication is done we have the Twitter API available, so we call api.search with whatever term we want: if you want to analyse tweets about the IPL you type ipl, if you want whatever term happens to be trending on YouTube at the moment you type that, and if I want tweets about FIFA I type fifa, and the result is stored in an object called tweets, which now holds all the tweets for that search term. Then, in a for loop, I print each tweet, and for each one I apply the TextBlob method; the resulting analysis tells me whether the tweet is positive, negative or neutral. The score lies between minus one and one: minus one means extremely negative, plus one means extremely positive, and zero means neutral. That is what TextBlob gives you, and that is how you perform Twitter sentiment analysis. If you have already registered you will get the code for this along with the session's slides, and if for some reason you cannot access it, just send us a mail and we will forward it to you. A rough sketch of this flow is shown below.
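In the sketch below the credential strings are placeholders; note also that the search call is named api.search in older tweepy releases and api.search_tweets in newer ones, so adjust it to the version you have installed.

import tweepy
from textblob import TextBlob

consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)  # verify key and secret
auth.set_access_token(access_token, access_token_secret)   # verify token and token secret
api = tweepy.API(auth)

tweets = api.search(q="fifa", count=10)   # api.search_tweets(...) on newer tweepy versions
for tweet in tweets:
    polarity = TextBlob(tweet.text).sentiment.polarity     # value between -1 (negative) and +1 (positive)
    print(tweet.text, "->", polarity)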
How many months are required to become an expert in data science? That is an interesting question, and the honest answer is that it is very subjective and differs from person to person, because what I learn in one month may be completely different from what someone else learns in a month, and it depends on how much you practice. What I can tell you is what you need to have expertise in, and how long it takes will depend on you. First comes statistics, where you need to be absolutely solid. Once you are good with statistics you need to be good with both Python and R. After that you need expertise in the different ML algorithms: in supervised learning we have algorithms such as linear regression, logistic regression, naive Bayes, SVM and so on, and in unsupervised learning we have k-means clustering, hierarchical clustering, and techniques such as principal component analysis and LDA. Once you have all of that, combine your knowledge and work on more open-source projects, and that is when you really build expertise; go to Kaggle and keep participating in data science contests there. So yes, you need all of these, but the time it takes is entirely up to you. Someone asked me to run the Twitter code: sure, I can run it, but we will not get a result because I do not have the values for the consumer key, consumer secret, access token and access token secret, which you only get if you have an app. When I run it I first get an "unexpected indent" error, so let me fix the indentation, remove the stray tabs and keep everything aligned. Even then it fails, because these are not real keys and secrets; when I hit run you can see the authentication has failed, precisely because the consumer key, consumer secret, access token and access token secret are not the right values. If you have a proper Twitter developer account, you can put your own values in and the authentication will succeed. At what point of learning can you start taking part in Kaggle contests? Once you are comfortable with the Python libraries and with the ML algorithms, say at least three or four supervised learning algorithms and two or three unsupervised ones, that is when you can start competing. There is a very simple contest on Kaggle built around the Titanic data set, which is what most people start with, so go ahead and work on that Titanic contest; there are a lot of entries, so you can submit yours, see where you rank, and keep improving your model and watching how it improves.
What exactly is Kaggle? It is analogous to websites such as HackerRank or HackerEarth: those are coding sites with contests on data structures and algorithms, and in the same way, if you want data science contests, they happen on a website called Kaggle. You can consider Kaggle the home ground of data science; it hosts a lot of contests, and many companies actually recruit from it, so if you rank in the top ten worldwide in some contests, top companies hire directly from Kaggle. Is R necessary for data science, or is Python enough? As I have said, I recommend learning both R and Python; it is not a question of either-or, and it is not difficult to learn a second language once you know one: if you know Python you can easily learn R, and vice versa. What does TextBlob do? When we pass in text, it gives us an analysis covering positivity, negativity and neutrality, which we store in the object called analysis; the result lies between minus one and one, so a value close to minus one means the tweet is negative, a value close to plus one means the tweet has positive sentiment, and a value close to zero means it is neutral. What is the strategy for practising ML algorithms? The same strategy: make sure your statistics is strong, and not only statistics but probability too. One very important concept in probability is the normal distribution; anyone starting with machine learning algorithms has to learn about probability distributions and the normal distribution, because most algorithms expect your data to be roughly normally distributed. After that there is the central limit theorem, which you also need to understand properly, and then concepts such as hypothesis testing and the p-value, which help you set up a problem statement and arrive at an answer to it. Once you are good with statistics, learn the programming languages, and then you will be able to understand all your machine learning algorithms properly. Can we use Hadoop with Python? Hadoop is something quite different: it mostly deals with big data and it is a storage system, not an analytics system; it helps you store terabytes, petabytes and exabytes of data in a proper place, and that is where Hadoop comes in. Now let's start with the introduction to machine translation, and for this we will take an example. Suppose you work at an MNC and your company wants you to travel from India to France to attend a business conference, and the thing is you do not know any language other than English and your native language. They give you a boarding pass and ask you to
travel to France. You board the flight, and after roughly twelve hours you reach France, and the moment you arrive you realise you are hungry. You spot a pizza outlet, but the moment you walk in you see that everything on the menu is written in French, which is natural since you are in France, and you have no friends with you who know French. What do you do? You scratch your head for a moment and then you get an idea: you can use Google Translate, Bing Translate or any other translation application, so you open your browser, type in the keywords written in French, translate them into English and place your order. That is what machine translation is about: machine translation is a field of AI that gives a machine the ability to translate from one language to another. Some examples of AI-driven machine translation are Google Translate, Google Assistant, Facebook's translations, Grammarly, Siri and Alexa, and there are browsers that let you translate whatever is written on a page from one language to another. Now let's discuss some of the machine translation techniques available. One of the most widely used is statistical machine translation (SMT). Statistical machine translation uses a parallel corpus to train a translation model: say there is a language X and a language Y and we want to translate from X to Y; for every data point in X the training set contains the corresponding data point in Y, and the model is trained on those pairs, learning which values in Y correspond to which values in X. The goal of SMT is to translate a sentence from the source language to the target language. Mathematically, suppose we want to produce a Hindi sentence y given a sentence x spoken or written in Tamil; SMT essentially applies Bayes' theorem and looks for the best y, that is, the argmax over y of P(x|y) times P(y), where P(x|y) is the translation model and P(y) is the language model. Let's try to understand how this works with an example. Suppose the sentence is "Rohan was supposed to study with me; I called him, but he did not ___", and we want to fill in the blank to complete the sentence. Based on probability, we pick the most suitable word: suppose our corpus offers the candidate words answer, well, good, go, study and have, and possibly more. Which word most frequently follows "did not"? Since we know English, we can guess that "have" has the highest frequency, but if we write "did not have", does the sentence make sense? No, it does not. So what should the answer be? If "answer" had the highest probability, then "answer" is the word we would get, and it is the word that actually fits, but there is no guarantee that raw probability gives us the correct next word; we could just as easily have filled the blank with "have" or "go". A toy sketch of this frequency-based choice is shown below.
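The toy sketch below, with made-up counts, shows the frequency-based choice just described: picking the candidate with the highest probability selects "have", even though "answer" is the word that actually fits the sentence.

# made-up counts of how often each candidate word follows "did not" in some corpus
candidate_counts = {"answer": 40, "well": 5, "good": 3, "go": 25, "study": 12, "have": 60}
total = sum(candidate_counts.values())

probabilities = {word: count / total for word, count in candidate_counts.items()}
best_word = max(probabilities, key=probabilities.get)  # the argmax over the candidates

print(best_word, round(probabilities[best_word], 2))   # 'have' wins on raw frequency alone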
It has been seen that although the statistical machine translation model works well, in some cases it does not perform as well as expected, and to tackle this we have another model, the neural machine translation model. SMT was very complex, and we just saw with the example that it sometimes fails, so neural machine translation came along. Neural machine translation uses a sequence-to-sequence architecture involving two RNNs, an encoder and a decoder: the encoder encodes the sentence and feeds that encoding to the decoder, and the decoder produces the target sentence. Think of it this way: there is an input, it goes into a black box, the black box is the neural machine translation system, and for that input it gives you an output. Breaking that down further, the input goes into the sequence-to-sequence model, where we have the encoder and the decoder (E and D for short); the encoder encodes the input, sends it to the decoder, and the decoder decodes it and produces the output. Going one level deeper, the encoder encodes the input into a contextual vector, that contextual vector goes to the decoder, and the decoder decodes it and gives us the output. We will discuss later what the structure of the encoder is, what the contextual vector is, what the decoder is and how it produces the output; for now the aim was just to see what an encoder-decoder is at a broad level. Before we go on to the sequence-to-sequence model, let's recap the natural language processing concepts in deep learning, starting with text vectorization. Any machine learning model, be it a random forest, a logistic regression, a CNN, an RNN, an LSTM or a GRU, works on numerical data, and since NLP deals with textual data, we need to transform that text into numbers. Text vectorization is the set of techniques by which we convert text into numbers, and some of the most important ones are bag of words, TF-IDF, word embeddings and character embeddings. So what is bag of words? We start from the entire text corpus: we have raw data, we do the processing, and the processed text goes into a container that,
In bag of words we take the raw text corpus, pre-process it, and drop every resulting word into a "bag", like a bag of coins, where each word gets its own id, so that whenever we want to convert text to numbers we can look those ids up. Take the sentences "this is a car" and "this is my bat": each distinct word gets an id, say this = 1, is = 2, a = 3, car = 4, my = 5, bat = 6 ("this" and "is" are already present from the first sentence, so they are not added again). Now suppose a new sentence comes in, "this cat is not mine": "this" maps to 1 and "is" maps to 2, but "cat", "not" and "mine" are not in our vocabulary, so they map to 0. Similarly, "this car is mine" is vectorized using the ids we already have for "this", "car" and "is". This is how the transformation works with bag of words.

Now let's look at TF-IDF, which stands for term frequency–inverse document frequency. In very crisp terms: words that occur rarely across the corpus get a higher weight, or more precedence, than words that occur very frequently.
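As a quick illustration of bag of words and TF-IDF, here is a small sketch using scikit-learn (scikit-learn is an assumption here, it has not been introduced in the walkthrough); the sentences mirror the toy example above.

```python
# Bag-of-words counts and TF-IDF weights for a tiny corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["this is a car", "this is my bat", "this car is mine"]

bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)          # token counts per sentence
print(bow.get_feature_names_out())              # the learned vocabulary
print(bow_matrix.toarray())

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)      # rarer words get higher weight
print(tfidf_matrix.toarray().round(2))

# words never seen during fitting ("cat", "not") are simply ignored / zeroed out
print(bow.transform(["this cat is not mine"]).toarray())
```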
Now let's move on to word2vec, which is a word embedding technique. With bag of words, or any scheme that just gives each token a unique id, think about what happens at scale: suppose our corpus C has 10,000 unique words. A sentence like "this is my bat" becomes a vector of size 1 × 10,000 that is almost entirely zeros, and if we have one lakh documents our matrix is 100,000 × 10,000. Such a sparse matrix is very difficult to push through a model and still train something that performs well. Word embeddings tackle this: instead of uniquely identifying each word, we represent each word through a set of features. For instance, "king" carries properties such as power, army, land, money and a male gender; "boy" carries properties like colour, location, who his father is, and again a male gender; so an expression like king + boy − queen (both terms are male, and subtracting queen removes the female component) lands close to "prince". That is roughly how word2vec works. There are two flavours of word2vec: continuous bag of words (CBOW) and skip-gram.

The next topic is character embedding. Character-level embedding uses character-level composition to provide a numerical representation of words, typically via a one-dimensional convolutional neural network.

Now let's understand RNNs. RNN stands for recurrent neural network, and before we discuss the architecture, let's see why RNNs came into the picture. CNN models are very good with image-like data, or any data where the present state is enough and we do not need information from previous or future states. Text is different. Suppose you are watching the Marvel movies: because you know what happened in the previous Avengers film, you can follow what is happening in the current one. A CNN does not hold previous information, it has no memory, so it typically does not perform well on this kind of sequential text data, and RNNs came into the picture to tackle exactly that: an RNN has a memory in the form of a hidden state. Every RNN cell takes two inputs, the input vector for the current step and the hidden state from the previous step. The model starts from an initial hidden state h0; the first input vector and h0 go into the cell, which produces an output and a new hidden state h1, and this continues step after step. If we unfold it, at time t the cell receives the hidden state h(t−1) and the input x(t) and produces the output y(t) and the hidden state h(t), which then feeds the step at t+1. Because the hidden state carries the previous information forward, RNNs outperform CNNs on textual data.
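Here is a minimal sketch of that recurrence in plain numpy, with made-up sizes, just to show how the same weights are reused at every step and how the hidden state carries information forward.

```python
# One recurrent step: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b)
import numpy as np

input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)

W_x = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # the new hidden state mixes the current input with the previous state
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_dim)                          # h_0
for x_t in rng.normal(size=(5, input_dim)):       # a sequence of 5 input vectors
    h = rnn_step(x_t, h)                          # same weights reused at every step
print(h)
```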
But RNNs have a well-known problem during training: the exploding gradient problem. Error gradients are what we use to update the weights in the right direction by the right amount; as the model trains, the gradient reflects how far off the model is, and the weights are tweaked accordingly so it performs better. In an RNN, these error gradients can accumulate across time steps and end up updating the weights with very large values. The result is a gradient explosion that disturbs the model's state: training completes, but the model does not perform as well as expected. To tackle this, LSTM and GRU came into the picture.

Before we discuss the LSTM architecture, let's understand it with an example. Suppose you are watching television, a FIFA match between Portugal and Argentina, and suddenly your phone rings. It is Rohan, calling after a very long time, five years. The moment the phone rings you pause or mute the TV and pick up the call, and you start chatting about how you studied together in college, how Professor Sumantho is doing, and so on. As the conversation goes on, the information Rohan gives you makes you recall old memories, and based on those recollections and the new information you start responding to him.

Now map the same example onto the LSTM architecture. The cell state is the state you were in before talking to Rohan, watching television. The forget gate is the moment you start the conversation and decide what to keep and what not to keep: you decide you will not follow the match anymore, so you cut off the information coming from the live broadcast. Then there is the input gate, a sigmoid-based gate which, given what the forget gate kept, decides what to take and what not to take from the information Rohan is giving you. Based on that input you update yourself: you learn what is happening at your college, how your teacher or mentor, say Anjali ma'am, is doing now, and using all of that updated information you produce a response, which is the output.
To summarise, an LSTM has three gates. The forget gate decides which information to keep and which to throw away; the input gate, using the forget gate's result and the previous state, decides which of the incoming information to take in; and with that updated information the output gate produces the output and the next hidden state, while the cell state is carried along. One LSTM cell is then chained to the next, and the next, for as many units or time steps as you want; that is the basic functionality of how an LSTM works.

Now let's move on to the next slide. A GRU is a similar kind of model, but it has only two gates instead of three: an update gate, which updates the values, and a reset gate, which plays the same "what to keep and what to drop" role, and the cell then gives you ŷ, the output, and the next state. GRU cells are chained together exactly like LSTM cells and the overall architecture looks the same, so GRU and LSTM work similarly, but a GRU has two gates (update and reset) while an LSTM has three (forget, input and output).
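As a rough illustration, this is how an LSTM or GRU layer can look in Keras; the layer sizes are arbitrary, and the clipnorm setting is one common way (an assumption, not something from this walkthrough) of keeping the exploding-gradient problem in check.

```python
# A small Keras model using an LSTM (or, alternatively, a GRU) on token sequences.
from tensorflow.keras import layers, models, optimizers

vocab_size, embed_dim, units, seq_len = 10000, 300, 128, 7

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.LSTM(units),            # three gates: forget, input and output
    # layers.GRU(units),           # drop-in alternative with two gates: update and reset
    layers.Dense(vocab_size, activation="softmax"),
])

# clipping the gradient norm keeps one huge gradient from blowing up the weights
model.compile(optimizer=optimizers.Adam(clipnorm=1.0),
              loss="sparse_categorical_crossentropy")
model.summary()
```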
So what is a sequence-to-sequence model? So far we have covered the deep learning basics we need for natural language processing; now let's put them together. For complex tasks such as language translation, building a chatbot, text summarisation and question-answering systems, sequence-to-sequence models have been seen to work very well, and the encoder-decoder architecture is the most common sequence-to-sequence architecture in use. The encoder encodes the input and produces a context vector, that context vector goes to the decoder, and the decoder gives you the output. Some real-life examples of sequence-to-sequence models we use every day are Google Translate, Google Assistant, Siri, Grammarly, Google Smart Compose, Android smart keyboards and Alexa.

Now, how does the encoder-decoder model work? Both the encoder and the decoder are built from RNN cells, either LSTM or GRU, but we keep the two sides consistent: if the encoder uses LSTM, the decoder uses LSTM; if it uses GRU, the decoder uses GRU as well. Picture one box for the encoder, E, containing a few RNN states, and another for the decoder, D. Suppose the input is "how are you" and the output should be "i am good". First "how" goes into the encoder and hidden state h1 is generated; then "are" goes in and h2 is generated; then "you" goes in and h3 is generated. That final hidden state h3 is the context vector, also called the encoder vector, and it is handed to the decoder, which generates "i", then, passing its state along, "am", and then "good". If you look closely at the diagram, the decoder has an output at every step along with the next hidden state, but the encoder side has no outputs: the encoder RNN is set up so that we do not keep its step-by-step outputs at all, we only pass the hidden states forward. That is the basic structure of how an encoder-decoder model works.

Next, encoder-decoder with teacher forcing. Encoder-decoder models are known for taking a very long time to train, and they sometimes overfit; teacher forcing is used to tackle both. The hidden state h3 becomes the encoder vector, and on the decoder side there is a separate decoder input. Note that we never feed raw words directly on either side: we feed embeddings (word2vec, character embeddings, one-hot encodings, bag-of-words vectors, whichever we choose), so x1, x2, x3 are input vectors, and the decoder input is the vector form of the correct output sentence. Suppose at the first decoding step the model predicts ŷ1; it is checked against the true token y1 from the decoder input, the model is penalised, the optimiser updates the RNN weights, and the next step is fed the correct token y1 rather than the model's own prediction. Applying teacher forcing this way gives better results, helps stop the model from overfitting, and reduces the time taken to train it.
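A minimal sketch of what teacher forcing means for the decoder data, assuming integer token ids and hypothetical START/END ids: the decoder is always fed the correct previous target token and is trained to predict the next one.

```python
# Building the decoder input and target for teacher forcing.
import numpy as np

START, END = 1, 2
french_ids = [37, 512, 90]                       # hypothetical target-sentence token ids

decoder_input  = np.array([START] + french_ids)  # [START, 37, 512, 90]
decoder_target = np.array(french_ids + [END])    # [37, 512, 90, END]

# at step t the decoder receives decoder_input[t] (the true token, not its own
# previous prediction) and is penalised against decoder_target[t]
for t, (x, y) in enumerate(zip(decoder_input, decoder_target)):
    print(f"step {t}: decoder sees {x}, should predict {y}")
```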
Now let's look at the drawbacks of the encoder-decoder model. These models have limited memory: even with LSTMs they cope with somewhat longer text, but they cannot handle an entire paragraph or a sentence spanning a hundred lines, and in practice they also struggle once the string length grows beyond roughly ten words. The BLEU score is the metric with which we judge the performance of an encoder-decoder based machine translation model, and across various studies researchers have found that as the word length increases, the BLEU score starts to decrease. The cut-off of ten is not a thumb rule; it has simply been observed in many situations and can be different when the data is different. Remember that machine learning and deep learning models live on data: if your data is good the model can perform well, and if it is not, no technique will save it. So the exact limit depends on the data and the word frequencies, but it has been seen that models do not perform well on longer strings.

Now let's try to solve a machine translation use case using Python. We will use an encoder-decoder architecture with teacher forcing, and an English-to-French translation dataset that is freely available on kaggle.com; you can use the provided link to download the data. Without wasting any time, let's get started. First we load the packages needed for the pre-processing, pandas and numpy, and then we load the dataset we downloaded from Kaggle. Looking at the data, for every English word or sentence there is a corresponding French one, for example "hi" maps to "salut" and "run" to "cours" (I am not well versed in French, so I will not try to pronounce the rest). This is exactly the parallel corpus idea we discussed for statistical machine translation: for each x there is a corresponding y. Next we find the length of each string: we split each sentence into word tokens and count how many there are (a one-word sentence counts one, and so on), and we do this for both the English and the French columns. Then, because we know the encoder-decoder model does not perform well when strings are long, we strictly keep only rows with a length of at most five words and drop everything longer from training and testing. If we check the row counts, we started with 176,621 rows and after filtering we are left with roughly 54,000.
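A hedged sketch of that loading and filtering step; the file name and column names are assumptions, since they depend on how the Kaggle file is saved.

```python
# Load the parallel corpus and keep only short sentence pairs.
import pandas as pd

data = pd.read_csv("eng_french.csv", names=["english", "french"], header=0)

# count words per sentence by splitting on whitespace
data["english_words"] = data["english"].str.split().str.len()
data["french_words"] = data["french"].str.split().str.len()

# keep only short sentences, since the encoder-decoder struggles with long ones
data = data[(data["english_words"] <= 5) & (data["french_words"] <= 5)]
print(data.shape)
```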
There can also be duplicate rows, and if we leave duplicates in there is a chance the model overfits on them, so we drop them with pandas' drop_duplicates; as it turns out, this dataset had no duplicates. Now let's divide the data into train and test. Before doing that, a quick recap of how a train/test split works. Take the full set of unique data, say 100 data points, and first divide it into train and test, for example 80 and 20. Because train_test_split shuffles the data, both parts get a representative mix, rather than one kind of data ending up only in train and another only in test. The 80 percent training portion is then divided again into train and validation, again roughly 80/20: we train the model on the training part, use the validation part to tune and optimise it, and only once the model is ready do we evaluate it on the 20 percent test set, which is completely unseen by the model. This three-way split, train, validation and test, is the best practice for any machine learning or deep learning model. So we split the data into train and test, check that the maximum number of French words in a sentence is five (as expected, since we dropped longer sentences), and then split the training data once more into train and validation with a test size of 20 percent.
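Continuing the earlier sketch, the three-way split could look like this with scikit-learn's train_test_split; the data variable is the filtered DataFrame from the previous sketch, and the random_state is arbitrary.

```python
# 80/20 train-test split, then a further 80/20 train-validation split.
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)
train_df, valid_df = train_test_split(train_df, test_size=0.2, random_state=42)

print(len(train_df), len(valid_df), len(test_df))
# test_df stays untouched until the very end; valid_df is used while tuning the model
```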
Next we calculate the word frequencies for the English and French terms and write them out to CSV files; this step typically takes around ten minutes, so I have run it already and will simply load the results. In the English file you can see each word with its frequency, sorted in decreasing order so the highest-frequency terms are at the top, and the French keywords are listed the same way.

Now the data pre-processing. Whenever we prepare data for a machine translation or any sequence-to-sequence model, we wrap every sentence with two anchor tokens, a start token before the text and an end token after it, so the model knows from which point a sequence starts and where it stops. We do this for all the data, appending start and end to every raw line: in English every sentence then reads like "start i am just a teacher end" or "start are you getting tired end", we do the same for the French words using the same helper function, and we repeat it for the validation data as well.

Next comes padding. Our sentences have at most five words but can have as few as one, so the token sequences have different lengths: one sentence might produce a single token, another might produce five. To normalise them to a fixed length we pad each sequence, which can be done pre or post; here we use post padding, so if a sentence produces, say, the two tokens 240 and 254, the remaining positions are filled with 0, our padding value, after the real tokens end. We use the Tokenizer from Keras, fit one tokenizer on the French training words and one on the English training words, and create padded sequences for each. Then we work out the number of input tokens, the number of output tokens, and the maximum input and output lengths. Running this, every sequence gets zeros appended at the end, the number of output tokens comes to 17,187, the number of input tokens is around 10,000, and the maximum input and output lengths are both seven (five words plus the start and end tokens).
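A small sketch of the tokenising and padding step with the Keras utilities; the two sentences and the maximum length of seven mirror the walkthrough, everything else is toy data.

```python
# Fit a tokenizer, convert sentences to integer ids, and post-pad to length 7.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_sentences = ["start i am just a teacher end", "start are you getting tired end"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_sentences)
sequences = tokenizer.texts_to_sequences(train_sentences)

# post-padding fills the remaining positions with 0 up to the fixed length
padded = pad_sequences(sequences, maxlen=7, padding="post")
print(len(tokenizer.word_index), padded)
```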
That manual process works fine, but there is a very handy package called ktext (installable with pip install ktext) that does all of this pre-processing for us: cleaning, tokenising and padding each text. We import its processor, configure it to keep all 10,407 keywords with a maximum padding length of seven (we could assign any value, but for now it is seven), and use it to create the training vectors. Fitting it takes a long time because it does all the cleaning and tokenising end to end, roughly half an hour, so I am not going to execute it live; the same thing is done for the French training vectors. Because we will need these pre-processed vectors again, both when we deploy the model and whenever we want to make predictions, we persist them: the fitted processors are dumped to pickle-style files using dill, and the vectors are saved as numpy files (english_pp, french_pp, and the English and French train vectors). Why do this? Suppose this preprocessing takes an hour today, and tomorrow we have twenty crore (200 million) rows and it takes six hours: we do not want to pay those six hours every single time, so we invest them once, store the result in a serialisable file, and simply load it whenever we need it.

When we want the data back, we load the numpy files: the decoder input data and decoder target data (remember, we are using teacher forcing, and while discussing that architecture we covered why the decoder input matters), and the encoder input data along with the document length. To load any of the dill pickle files we use a load_text_processor-style helper, which opens the file and returns the total number of tokens plus the fitted processor, for both the English and the French pre-processors. After importing dill (which I had missed at first) and loading everything, the vocabulary size comes out to 6,280 for English and 9,998 for French. Finally we load the vector files with the load-encoder and load-decoder helpers, giving us the encoder input data, decoder input data, document length and decoder target data, and we can inspect how a French sentence and an English sentence look before and after pre-processing.
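A minimal sketch of the caching idea, with dill and numpy as in the walkthrough; the file names and the placeholder objects below are assumptions standing in for the real fitted processor and vectors.

```python
# Dump the pre-processed artefacts once, then reload them instead of re-processing.
import dill
import numpy as np

english_processor = {"note": "stand-in for the fitted text pre-processor"}
english_train_vectors = np.zeros((3, 7), dtype=int)   # stand-in for the padded vectors

# dump once, after the expensive pre-processing step
with open("english_pp.dpkl", "wb") as f:
    dill.dump(english_processor, f)
np.save("english_train_vectors.npy", english_train_vectors)

# later, load them back instead of re-processing everything
with open("english_pp.dpkl", "rb") as f:
    english_processor = dill.load(f)
english_train_vectors = np.load("english_train_vectors.npy")
print(type(english_processor), english_train_vectors.shape)
```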
For model building and training we use Keras, with GRU as the recurrent unit, so let's import the packages and then create the encoder model and the decoder model. We use a 300-dimensional embedding. The encoder starts with an Input layer whose shape is the document length, seven, named encoder_input; this goes into an Embedding layer sized by the number of tokens and the embedding dimension, then through batch normalisation, and then into a GRU. Because this is the encoder we do not keep its outputs, only the final hidden state. We then build the encoder with the Model API, passing encoder_input as the input and the state h, the last state or context vector, as the output, and we get the sequence-to-sequence encoder output by applying this encoder model to the encoder inputs we created.

The decoder is defined similarly. Its Input layer has shape None and is named decoder_input, because with teacher forcing we feed the target sequence into the decoder. It has its own word embedding sized by the number of decoder tokens and the latent dimension, followed by batch normalisation, and then a GRU whose initial state is the sequence-to-sequence encoder output; that is exactly the hand-off we drew earlier, where the last state of the encoder, the context vector, goes to the decoder. The GRU's output goes through another batch normalisation and then into a Dense layer with a softmax activation, which gives the decoder output. Finally we assemble the full sequence-to-sequence model with encoder_input and decoder_input as the inputs and the decoder output as the output, and compile it with the sparse categorical cross-entropy loss. Running this, the model is ready and we can look at its architecture summary.
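A hedged sketch of that encoder-decoder definition with the Keras functional API; the vocabulary sizes, the 300-dimensional embedding and the sequence length of seven follow the walkthrough, while the exact layer wiring is a reconstruction rather than the author's exact code.

```python
# GRU-based encoder-decoder with teacher forcing, in the Keras functional API.
from tensorflow.keras.layers import Input, Embedding, BatchNormalization, GRU, Dense
from tensorflow.keras.models import Model

num_encoder_tokens, num_decoder_tokens = 6280, 9998
latent_dim, doc_length = 300, 7

# --- encoder: we only keep the final hidden state (the context vector) ---
encoder_inputs = Input(shape=(doc_length,), name="encoder_input")
x = Embedding(num_encoder_tokens, latent_dim, name="encoder_embedding")(encoder_inputs)
x = BatchNormalization()(x)
_, state_h = GRU(latent_dim, return_state=True, name="encoder_gru")(x)
encoder_model = Model(encoder_inputs, state_h, name="encoder_model")

# --- decoder: teacher forcing, so it also receives the true previous tokens ---
decoder_inputs = Input(shape=(None,), name="decoder_input")
y = Embedding(num_decoder_tokens, latent_dim, name="decoder_embedding")(decoder_inputs)
y = BatchNormalization()(y)
decoder_gru_out, _ = GRU(latent_dim, return_sequences=True, return_state=True,
                         name="decoder_gru")(y, initial_state=encoder_model(encoder_inputs))
y = BatchNormalization()(decoder_gru_out)
decoder_outputs = Dense(num_decoder_tokens, activation="softmax", name="decoder_dense")(y)

seq2seq_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
seq2seq_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
seq2seq_model.summary()
```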
Before training we set up two callbacks: a CSVLogger, which writes the training log to a local file named tutorial_seq2seq.log, and a ModelCheckpoint. Why a model checkpoint? It watches every iteration and keeps only the best model seen so far, so instead of ending up with an overfitted or underfitted snapshot we always retain the weights that performed best. For training we use a batch size of 1,200 and 70 epochs, and whatever Keras reports during training is stored in a history variable. Training takes a long time: each epoch here takes about a minute, so 70 epochs would be around 80 to 90 minutes, and since I have already trained the model fully and saved the best weights, I interrupt the run after a short while with a keyboard interrupt. Because we stopped mid-way, we delete the sequence-to-sequence model object, recreate it from the same definition (encoder input, decoder input, decoder output), compile it again, and check the summary, with its total, trainable and non-trainable parameter counts. Then we load the saved model weights, the weights stored during the full training run, back into this architecture, so we now have the trained model without retraining it.
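A hedged sketch of the callbacks and the training call, continuing the model sketch above; the file names, batch size and epoch count follow the walkthrough, while the padded input arrays are assumed to exist from the earlier pre-processing steps.

```python
# Log training, keep only the best weights, train, then restore the best weights.
from tensorflow.keras.callbacks import CSVLogger, ModelCheckpoint

csv_logger = CSVLogger("tutorial_seq2seq.log")
checkpoint = ModelCheckpoint("seq2seq_weights.hdf5",
                             save_best_only=True, save_weights_only=True)

history = seq2seq_model.fit(
    [encoder_input_data, decoder_input_data], decoder_target_data,
    batch_size=1200, epochs=70, validation_split=0.1,
    callbacks=[csv_logger, checkpoint])

# after (or instead of) a full training run, restore the best saved weights
seq2seq_model.load_weights("seq2seq_weights.hdf5")
```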
To see how the model performs we extract the encoder model and the decoder model out of the trained sequence-to-sequence model. For the encoder, an extract-encoder function takes the full model and pulls the encoder out with model.get_layer: if you scroll up you will see that while defining the model we gave each part a name, such as encoder_model, so we can fetch it by that name and return it. The decoder is extracted the same way, by pulling out the decoder word embedding, the decoder input, the decoder batch-normalisation layers and finally the decoder model itself. With both extracted, we build a sequence-to-sequence inference interface: it takes English input from the user and generates French output.

The interface uses all the pre-processed encoder and decoder artefacts. Its init function stores values such as the maximum French length and the maximum English length, and loads the sequence-to-sequence model. Then there is a generate-French function. It takes the raw input text, the English sentence ("hi", "hello, how are you", and so on), and a maximum French length, which defaults to the stored value unless you pass a number such as 10, 20 or 30. The English pre-processor tokenises the raw text (passed in as a list) into a token vector, that vector goes to the encoder model, and the encoder's prediction is stored as the original body encoding. We then set the decoder's state: we take the French tokenizer's id for the start token and reshape it into a 1 × 1 vector, set a stop condition to false, and loop while the stop condition is not met. At each step the decoder model predicts from the current state value plus the body encoding; because we used a softmax when building the model, it does not generate a word directly but a probability for every token, so np.argmax picks the token id with the highest probability, the id-to-token mapping gives us the predicted word, and that word is appended to the decoded sentence. If the predicted word is the end token, or the decoded sentence reaches the maximum French length, the stop condition flips to true and the loop breaks; otherwise it continues. At the end the function returns the original body encoding along with the joined decoded sentence, which is our generated French.

Next we create a print-example function: we pass it the English text and the true French text, and it prints the model's predicted French alongside, so we can understand whether the model is performing well and how. We wrap all of this in a class, create the sequence-to-sequence inference object, and we are ready to run a demo.
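A minimal sketch of that greedy decoding loop; encoder_model and decoder_model are assumed to come from the extraction step, while english_pp and french_pp stand for generic pre-processors with hypothetical transform, token_to_id and id_to_token helpers (not the exact ktext API).

```python
# Greedy decoding: repeatedly pick the most probable next French token.
import numpy as np

def generate_french(raw_english, encoder_model, decoder_model,
                    english_pp, french_pp, max_len_french=7):
    # tokenise the raw English text and encode it into the context vector
    encoded_input = english_pp.transform([raw_english])
    state = encoder_model.predict(encoded_input)

    decoded_tokens = []
    current_id = np.array(french_pp.token_to_id("start")).reshape(1, 1)
    while True:
        # the decoder returns a probability distribution over the French vocabulary
        probs, state = decoder_model.predict([current_id, state])
        next_id = int(np.argmax(probs))
        next_word = french_pp.id_to_token(next_id)
        if next_word == "end" or len(decoded_tokens) >= max_len_french:
            break
        decoded_tokens.append(next_word)
        current_id = np.array(next_id).reshape(1, 1)
    return " ".join(decoded_tokens)
```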
Now the demo prediction. We pass in a dataframe, our test data, which has around 10,843 rows, but rather than running all of them we take 50 examples. It takes a little while to run. The first example, "i'll be the 17th next year", does not come out well: the model essentially produces just the start and end tokens. The second case is translated well, "i don't smoke" also gives a correct result, another example is poor, one more is only partially right, and "they are different" comes back correct. So for some of the data the model performs well and for some it does not. This is not the best possible model, but we can tune it: for instance, a token with a frequency of one occurs only once in the entire corpus, so we could remove those tokens and retrain, and that model would likely outperform this one; we can also make architectural changes, such as adding attention or using beam search. For this use case, though, the intention was simply to build a basic sequence-to-sequence model and see what kind of results a machine translation model gives.

Now let's switch topics: what is customer engagement? If you are in an HR job, or have come across HR activities, you will have heard of something called employee engagement. Being "engaged" means being emotionally connected: employee engagement is the employee being emotionally connected with the job, and customer engagement, likewise, is the customer being emotionally connected to the brand. Why does this matter? When we are emotionally connected with something we do not let it go easily; that is why we stick with friends, spouses and relatives, because engagement is at work there too. Customer engagement, in the same way, is very important for retaining customers: it is about an emotional connection between a customer and a brand. There are several other definitions; I will mention just two more to give you a broader sense of it: the extent to which a customer makes brand-focused interactions, i.e. how engaged and attached you are with the brand, and a measure of how much the customer interacts with the brand across the various stages of his or her life cycle. So that is what customer engagement is. And why is it so important today, of all times?
As you know, the pandemic and the lockdown have had a big impact on several businesses, there is a lot of uncertainty, and the economy is very volatile, so it is critical to stay connected with your customers in order to retain them. To do that, one strategy does not fit all: you need a human-centric, relevant and targeted strategy, and this is where NLP and social media analytics add value. What does this bring to the business? It increases customer retention and reduces churn; with more engaged customers your audience grows through word of mouth, so you acquire more customers; it widens the scope for cross-sell and upsell, because when customers trust your brand there are more opportunities to sell them other products under the same brand and increase revenue; revenue realisation is faster, since people who trust you do not do extra research or go to a competitor before buying; brand loyalty increases; you can track a net promoter score, which we will do as part of the use case; and finally, with NLP and social media analytics your customer service becomes much more effective and efficient.

Let me show you some statistics to confirm how important customer engagement is. Research by Hall & Partners found that up to two-thirds of a brand's profit may rely on effective customer engagement; Gallup research reports that a fully engaged customer generates 24 percent more revenue than the average customer; and research by Bain & Company states that a five percent increase in customer retention will increase profits by 25 percent, essentially because you stop paying the cost of losing customers, and acquiring new ones is very expensive. So you can see how crucial customer engagement is.

Now, the role of social media. With social distancing and very few brick-and-mortar stores available, a lot of business interactions are moving online; I even order my daily vegetables and groceries through Dunzo or Swiggy. The physical touch points between customer and business are shrinking, and many of them are moving online and onto social media, which means there is a huge amount of content out there: customer reviews, customer scores, social media posts, all information we can use effectively to understand what the customer wants and then deliver it.

Coming to the customer engagement life cycle: the first step is to acquire customers, which means identifying who our relevant and profitable customers are. Once a customer starts doing business with us, the next step is to get him or her to do more business, which is revenue growth. And when will I buy more of a brand or make more use of a service? When it is relevant to
me. So if I can run targeted promotion and selling, I increase my revenue. Next comes customer experience: I start purchasing from a brand and I have to feel good about it; when my perception of the brand is good and the brand delivers a good experience, that is customer experience. Then comes customer loyalty, where the brand stays with me, I am emotionally connected to it, and I even start promoting it, telling people it is a good brand, that I have had very good experiences and they might want to try it. That is how the whole customer engagement life cycle works.

Coming to our business use case: a fitness brand is trying to sustain its business amidst challenging times, and we are going to see how to build customer engagement for it. You know the situation: gyms are closed and revenue has dropped, so many gyms are trying online workout sessions, and whatever merchandise they were selling is largely lost. As a first step the brand wants to enhance customer engagement. Why? Because when things get back to normal it will want to slowly increase footfalls again, and if things do not get back to normal it will have to find other ways of sustaining the business; either way, engagement has to improve. The good news for the brand is that it has a huge online presence: apps, a good website, lots of reviews and customer data, and well-maintained social media accounts on Twitter, Instagram and Facebook, so there is a large amount of social media data, mainly text. That is the scenario, and our task is to formulate a customer engagement strategy for this brand, step by step, over the next 30 to 40 minutes, using NLP and social media analytics.

Before we jump into the actual use case, a quick introduction to NLP for those who are not aware of it. NLP is the abbreviation for natural language processing, and it is a combination of human language, a bit of programming and artificial intelligence, used to convert text into a format that can be run through algorithms and used for prediction: we convert unstructured text data into structured data and then gain insights, and that entire process is NLP. Now you might have a question, especially if you are new to NLP: there is a lot of customer data already available as numbers, purchase data, rows and columns of figures, so why bring in text and complicate the whole thing?
The answer is that only about 20 percent of your data is structured: the so-called transactional data, the customer invoices, the numbers laid out in beautiful rows and columns in an Excel table. All the other data is unstructured: text, images, audio, video and whatnot. So if you want to know what customers feel and what they are talking about, you have to rely heavily on text data, and that is where NLP comes into play.

Let's quickly look at some applications of NLP in real life. Sentiment analysis, which we will do in some time, is a very interesting one: it extracts the perception behind a piece of text, because we use certain words when we are angry, other words when we are happy, and others when we are sad, so based on word usage we can pull the sentiment out of a corpus; it is used extensively to find out what customers like and do not like, and also for hate-speech detection. Next, chatbots: on almost any website a chatbot pops up, which is very effective for customer service because a properly trained conversational AI engine can be switched on 24x7, answer a customer's query immediately and give timely service without anyone manning the website round the clock. Then speech recognition, whose common examples are Siri, Alexa and Google Assistant. Then machine translation, another important NLP application, used for language translation in Google Translate. Then keyword search: there is a huge amount of text data available and nobody has the time to read and sift through all of it, so keyword extraction, which we will also use in a while to extract information from websites, pulls out the important terms, for example the top five keywords being discussed in a news item, or across hundreds of COVID-19 publications the top topics, which might be hydroxychloroquine or something else; these are simple but very powerful applications. Finally, pattern matching: I keep posting about fitness, about a workout I did or being on a paleo diet, and after some time ads related to fitness products appear in my Facebook feed, because text from my posts and my Google searches is matched with advertisements. The same idea is used extensively for profile matching on sites like Naukri: a job profile is posted for a particular role, you post your CV with the skills you have, and since nobody has the time to read CV by CV, a pattern-matching algorithm looks for similar words, ranks the CVs by similarity, and somebody picks up the top 10 or 20 and takes them forward.
If you look at all of this, text mining and NLP are essentially about decluttering whatever is out there and extracting the information. We have now seen the customer engagement life cycle and what NLP is, so in our use case we will try to plug these techniques into each stage of that life cycle. The first stage is customer acquisition: we will use social media analytics and NLP to identify potential customers. There are millions of people out there; if I have to run a campaign I cannot run it for everybody, so I would like to select the people who might be interested and run the campaign with them. The other thing is to look at what people post: I might be a fitness buff who is mostly into running and wants to do a marathon, someone else might want to do cycling, and a few people just want a workout, so even within fitness I want to understand customer interests and preferences. Once I have acquired the customer, the next stage is revenue growth. Here we look at what exactly customers are talking about and target promotions based on that, and we also need to know where to post, where to place the promotion and where to advertise the products; if you are on YouTube constantly searching for analytics videos, similar ads come into your feed because the keywords are picked up and relevant ads are shown, and somebody watching a lot of data science videos might click on one, take a course, and convert into a customer. So in revenue growth we do two things: extract text, and target promotions through the most effective channel. For customer experience we first need to understand something called voice of customer (VoC), a term people who have been in Six Sigma or customer service will have heard. With all the reviews we post on Swiggy, Myntra, Amazon and the like, there is a huge amount of customer review data, and we usually post a review either when we are extremely happy or when we are extremely frustrated, so we will look at reviews at both extremes and see how they can enhance customer experience; the other part of customer experience, of course, is chatbots and conversational AI. Finally, customer loyalty. For customer loyalty (or even employee loyalty, for that matter) there is the NPS, the net promoter score: it captures whether you will promote a brand, whether you are neutral, or whether you are a detractor who will not promote it. We are going to use sentiment analysis to derive NPS scores and see how they can be used for agile customer support. That is how we will embed our NLP and social media techniques in the customer engagement journey.

So this is how our life cycle starts. The first stage is to identify potential customers, and the way we will do it is by choosing relevant social media accounts.
so for example if i am somebody who is interested in fitness i will already be either following a fitness brand or maybe a fitness influencer or i will post something on fitness right that's how we are going to find relevant social media accounts and then we are going to extract user information about these people and then maybe use that to reach out to those accounts and then position our ads
now a few questions that have come in topic modeling yes we are going to see topic modeling murali krishna one of our examples has topic modeling will we also be creating data pipelines here we are only talking about text extraction so data pipelines i'll take later offline after the main session is over we will look at click-through rate as well and how this can be used to improve click-through rate and hr analytics as i said it can be used for employee engagement and the other use case as i said is the nps score you can look at comments from employees and exit interviews convert them to scores and then you can extract keywords to look at why people are leaving you can also use it as i said for pattern matching and shortlisting cvs now if you have advertised for a role and then you have thousands of applications you can use nlp to pick the most relevant cvs now this is the first use case that we are going to see now i have done this in r so i have extracted data from twitter the reason why i use twitter is that even for you to try out twitter has especially with r an open api so with just a two-step process you will be able to connect to the twitter api and you will be able to extract tweets with this library called rtweet okay so this is the first use case we are going to take a look at so what am i doing first i initiate the libraries as i mentioned earlier all of this code and all of the data sets will be provided to you at the end of the session okay so i'm going to extract followers i'm just showing you some examples here right now like this there might be several fitness brands there might be several fitness influencers now when you actually do it in real life you will collate all of that information to create your customer database i am just showing you an example of how and where to start i mean this is not exactly exhaustive right i want you to keep that in mind so i am using this get_followers function to extract followers for a particular brand curefit i chose curefit because they have a lot of online presence they have a good twitter account and they have their app so a lot of text data is available i am also getting followers of rujuta diwekar people who follow her know she is very famous for her diet recommendations and she is the fitness coach for a lot of celebrities so now i have follower information from twitter i mean the same thing can be replicated to any social media as i said the reason i use twitter is because twitter has an easy interface to an api to extract tweets okay so after i extract these followers i can use the lookup_users function in rtweet where i get a lot of information about these users i get their screen name i get their latest tweet and what their source is
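The session does this in r with rtweet's get_followers and lookup_users; purely as a rough Python analogue (not what the instructor ran), a sketch with the tweepy library might look like the following, assuming you have your own twitter/x api credentials and an access tier that still permits follower lookups. The handle, page size and field names are illustrative.

```python
import tweepy

# hypothetical bearer token -- substitute your own credentials
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# look up the brand account first (handle shown here is illustrative)
brand = client.get_user(username="curefit", user_fields=["public_metrics"])

# pull one page of its followers with a few useful profile fields
followers = client.get_users_followers(
    id=brand.data.id,
    max_results=1000,
    user_fields=["username", "description", "public_metrics"],
)

for user in followers.data or []:
    print(user.username, user.public_metrics["followers_count"])
```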
okay and a lot of information is available from this lookup_users so once i have this then i'm able to extract their screen names now i've just got it for six people now imagine each one of these right each one of these has 5000 followers at least and then if i do it for several brands imagine the number of relevant user accounts i will have access to right so this is the first one so what have i done here i have selected followers of a fitness brand and a fitness influencer and then for these followers i collect more information about them in terms of their social media accounts so that i can target my promotions towards them okay so now we have a list of relevant customers say that you have to pitch to these customers and then you will have to start getting business from these customers right now how are we going to do that we need to make targeted promotions based on customer interest okay i answer two questions here where to promote which is what is the most effective channel and what to promote right i mean how long are people going to look at an ad maximum 5 seconds so in those five seconds i have to give them a very powerful message which is acceptable or which people like so that is when my brand gets promoted okay so in this part we are going to answer two questions with nlp and social media the first one is where to promote and the second one is what to promote okay we'll go back to r right now if i have to promote about fitness where do i have to promote i have to look at fitness magazines or sites which are for fitness or probably i can use fitness influencers i can pay them some kind of royalty and promote on their account so these are typical sites that i will select if i want to promote on fitness so now i have taken four popular fitness magazines on twitter these are more in the international context i just took this to show you an example but you can of course take relevant ones for your business so these are the four twitter handles for four fitness magazines now like earlier i am extracting the user information right now what am i looking for in the user information again the followers count now there are people on instagram who talk about it saying i have 1 million followers i have 2 million followers why is it important to have followers right when there are more followers then a lot of people look at what you post right so you have increased visibility and then there are greater chances of conversion so out of these four i would want to choose the twitter account which has the maximum number of followers so for each of these sites i am extracting the follower count right so this is what i have now the first ones have around ten thousand sixty three thousand and thirty seven thousand followers and obviously this one is the winner right shape magazine has about 600 000 followers and if i'm somebody who wants to promote obviously i will use this because with a single ad i will be able to reach 600 000 people right so this answers the question of where to promote now as i said this is an ocean okay i am only giving you leads on where you can start i did it for four you might be doing it for 20 or even 30 different sites together to identify the best one okay now coming to the next one on what to promote for our use case we now know that we have to promote in shape magazine
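The "where to promote" decision is just a comparison of follower counts across the candidate handles; a tiny sketch with made-up handle names and placeholder counts (the exact figures read out in the session are garbled in the captions) could be:

```python
# follower counts per candidate handle -- placeholder names and values, not the real figures
follower_counts = {
    "FitnessMag_A": 10_000,
    "FitnessMag_B": 63_000,
    "FitnessMag_C": 37_000,
    "Shape_Magazine": 600_000,
}

best_handle = max(follower_counts, key=follower_counts.get)
print("promote on:", best_handle)  # the handle with the largest reach
```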
now what to promote as i said is the promotional content that we are using okay so for this again i have extracted tweets right and in twitter there is something called retweeting right in facebook if somebody has posted something and you like it you share it you reshare the same content so retweeting or sharing content is called amplification in social media language okay now i tweet something and then a hundred people retweet it it means that they like it so they are tweeting it again so whatever message i say is amplified a hundred times okay so i am going to look at the tweet text and how many times it is retweeted right now if something is retweeted a hundred times then i know that message goes well with people who like fitness so let's look at this count okay so i have my tweets and then for each of them i have the retweet count what i will do now is sort this in descending order okay i sort this in descending order so i have the top 10 tweets that were retweeted okay let us take a look at these now these are the top 10 tweets this is a bit small so let me go here okay so if you look at this these are very catchy right add a kick to your workouts and work out like a world champion would so these are all catchy messages which people seem to like a lot okay so add a kick to your workout is something which syncs with a lot of people who like fitness so you will do your research around what are the themes which people like so this is something which goes well with people okay and look at this one i have a chance to work out with superman so when you post something like this then a lot of people's attention is diverted to this ad so you will take a look at all of this and then you will wrap your branding message in this direction
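The "what to promote" step is a straightforward sort of the extracted tweets by retweet count; the session does this in r, but an equivalent pandas sketch (column names and sample counts assumed) would be:

```python
import pandas as pd

# assume tweets_df holds the extracted tweets and their retweet counts
tweets_df = pd.DataFrame({
    "text": [
        "add a kick to your workouts ...",
        "work out like a world champion would ...",
        "gym re-opening update ...",
    ],
    "retweet_count": [412, 388, 12],  # placeholder counts
})

# most amplified messages first -- these suggest the themes to wrap your branding around
top10 = tweets_df.sort_values("retweet_count", ascending=False).head(10)
print(top10[["retweet_count", "text"]])
```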
okay so this brings us to the second part of our customer engagement life cycle yeah to answer the question basically nlp needs a lot of pre-processing and then there are certain concepts required in terms of nlp all of that will be covered as part of nlp plus python yes you just need a working knowledge of python or r and yes fake news detection is a combination of nlp and unsupervised learning algorithms you can use nlp to detect fake news so to quickly sum up first we looked at acquiring customers so we looked at followers of curefit and rujuta diwekar so these are our potential audience then we looked at okay we have these people how do i increase my business
okay the third step as i said is to look at customer experience the first stage is to understand voice of customer now there are tweets that i make on a particular brand or i provide reviews on amazon or flipkart or on myntra so i can use all of this but there are millions of posts like this right i just want to declutter all of this cut out the redundant information and only extract the most important information right so we will do that using keywords and then we will also use topic modeling to understand what are the topics generally discussed about this brand right so suppose i am looking at a thousand or two thousand tweets i do not have time to read all of them but i can look at what are the top five topics
discussed across these thousand or two thousand tweets to very quickly get an idea of the customer perception right let me do this in python because for some of the nlp tasks python is much more powerful and later if you want to do some text classification or some sequential models then python is handy so i am now going to shift gears to python okay so i have my tweets right i have about 1700 tweets i've taken a very small subset of tweets you probably might have millions of tweets but because i just wanted to quickly run this and show it to you as a demo i've taken a very small subset so now we are going to extract keywords right so python has a lot of very powerful nlp libraries one is nltk then you have spacy so for this we are going to use nltk i am importing the library and then i am extracting the tweet text right so the tweet data that you extract not only has text but it has about 89 to 90 columns of metadata around the tweet including the screen name where the tweet is from how many followers how many times it is retweeted and several other pieces of information so for us here our interest is the actual tweet text itself right so i'm extracting this and then as i said this twitter text or social media text especially has a lot of noise right what do i mean in terms of noise it could have emoticons it can have web urls it can have slang and then it can have punctuation and other special characters now if i want to extract whatever is important i have to get rid of all this redundant information so it is like sieving right if you are trying to sieve something you let all the smaller particles out through the sieve and then retain only the bigger particles right so similarly i only want my most important keywords so i am going to do some text preprocessing okay so one part of text pre-processing is removing stop words now there are general stop words for example words like and is or because these are used all along so they cannot be important keywords and then there could be some context specific stop words now if i am extracting tweets on curefit definitely you will have this word curefit coming again and again so i don't want to look at these words right so i am going to filter out certain stop words or redundant words right and then this is a very small function or a for loop that i have written here for us to pre-process text now fundamentally what does this do it removes punctuation converts to lower case and then lemmatizes or normalizes your text and gives you a clean corpus okay so it has removed all the redundant information and it does this for all the 1700 tweets okay now after this i create a vocabulary of words what is the vocabulary of words now across the 1700 tweets what are the unique important words so for this there are two techniques that you can use count vectorizer and tf-idf vectorizer this is more to do with how these words are extracted okay so i do this to get my top 10 words okay so tweets on curefit have these i can take top 10 top 20 however many i've taken top ten so what is this talking about it's talking about keep going which means keep doing your fitness stay safe this is more related to the lockdown so stay indoors now all of these gyms are coming with indoor workouts so it talks about indoor workout and then live from home
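A minimal sketch of the kind of cleaning and keyword pipeline described above, assuming nltk for stop words and lemmatization and scikit-learn's CountVectorizer for the top-10 vocabulary; the exact preprocessing function from the session is not reproduced here, so the regex, the context-specific stop words and the variable `tweets` (the list of raw tweet texts) are assumptions.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
# general english stop words plus context-specific ones like the brand name itself
stop_words = set(stopwords.words("english")) | {"curefit", "cultfit", "https", "amp"}

def preprocess(tweet: str) -> str:
    tweet = re.sub(r"http\S+|@\w+", " ", tweet.lower())   # drop urls and mentions
    tweet = re.sub(r"[^a-z\s]", " ", tweet)               # drop punctuation, digits, emoticons
    tokens = [
        lemmatizer.lemmatize(word)
        for word in tweet.split()
        if word not in stop_words and len(word) > 2
    ]
    return " ".join(tokens)

clean_corpus = [preprocess(t) for t in tweets]  # tweets: list of ~1700 raw tweet texts

# top 10 most frequent terms across the cleaned corpus
vectorizer = CountVectorizer(max_features=10)
vectorizer.fit_transform(clean_corpus)
print(vectorizer.get_feature_names_out())
```

Swapping CountVectorizer for TfidfVectorizer gives the tf-idf variant mentioned next, and the same cleaned corpus can be fed to the wordcloud package if you want the word-cloud visualization.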
so you have live sessions like our youtube live curefit or cult fit has their live sessions getting better every day be better every day is again one of curefit's taglines so you can see how beautifully from the 1700 tweets you are able to start extracting the top keywords right i can also visualize this using a word cloud if you want to make it jazzy and as i said tf-idf is yet another way of extracting keywords so let's take a look at this right so this is talking about pausing days so in curefit if you use the app you know that if you have a three month membership you can pause a few days if you cannot go to the gym right so when it comes to the lockdown a lot of customers are asking about this about your pause day okay so you know that customers are concerned about this right i am not able to go to the gym will you be able to pause my membership so several terms related to pause day have come out so again a very effective way of extracting what exactly customers are looking for from your keywords right then the next one is topic modeling so the same text preprocessing can be done and like how i extracted just keywords i can also extract topics right what are the most relevant topics so these are some of the topics right so the first one is about be better every day and dance fitness the second topic is about startups right there are probably some tweets on startups like swiggy zomato curefit ola and all that and then ankit nagori i guess is one of the founders of curefit so this is about him and then this is about the hrx brand right so you can see how topics also emerge from the tweets now as i said i have only taken 1700 tweets now if you do topic modeling on more like say twenty thousand or thirty thousand tweets this will be very very effective so with just a few steps we are able to not just extract keywords but also topics from the tweets using nlp now there is another aspect of customer perception which is your sentiment analysis so we looked at topics and keywords now i am going to show you a very beautiful library in r called syuzhet which is used for sentiment analysis so i am basically taking the same tweets on curefit and i am going to perform sentiment analysis to understand the overall perception of the customers about the brand right how do customers feel do they have a positive sentiment do they have a negative sentiment that's the first step right overall do customers like my brand that's what we'll take a look at now syuzhet is beautiful because it not just gives you the positive or negative sentiment which we will take a look at in a minute but it also gives you an emotion score okay so when i talk about an emotion score for curefit based on the 1700 tweets we have we know that the most dominant sentiments are trust joy and anticipation and between positive and negative definitely the positive sentiments are high compared to the negative sentiments right but is this enough for us to retain customers although there is only very little negative sentiment these are people who could probably destroy your brand right so it is very important to find out who these people are and what it is that they are unhappy with right that's the next step okay overall i am happy with the brand perception there are a lot of positive sentiments you have trust joy anticipation which is good but i also have anger and disgust so i want to find out what's really wrong with these people
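The session does not name the exact topic-modeling algorithm used, so as one common choice this sketch applies latent dirichlet allocation from scikit-learn to the same cleaned corpus built above; the number of topics and the number of words shown per topic are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# clean_corpus: the preprocessed tweets from the keyword-extraction step
vectorizer = CountVectorizer(max_df=0.95, min_df=2)
doc_term_matrix = vectorizer.fit_transform(clean_corpus)

lda = LatentDirichletAllocation(n_components=5, random_state=42)  # five topics, assumed
lda.fit(doc_term_matrix)

# print the top words of each topic, analogous to the topics discussed in the session
words = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [words[i] for i in weights.argsort()[-8:][::-1]]
    print(f"topic {topic_idx}: {', '.join(top_words)}")
```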
so i might want to call them and address this right if you've heard of the nps score the process of coming up with an nps score is where you have to send out a survey you will ask people i don't know if you've done a survey which says on a scale of 1 to 10 how likely is it that you will recommend this brand right so running such a survey is very expensive and you may not get enough data but sentiment analysis can be converted to an nps score right that's the last application we will see and with that we will be able to wrap up our use case and give a very beautiful strategy to this fitness brand let me go back to python so here what i've basically done is i've used the same tweets on curefit and i use the vader library where i get a positive a negative and a compound score for each of my tweets okay so for every tweet i can get whether it has a positive score or a negative score right once i have this information what i would probably do is look at what are the kinds of tweets where people are happy okay so let us look at some of the positive tweets okay so these are some of the positive tweets where they are talking about a particular trainer they say she is amazing what amount of energy she has thanks for the great session so these are all positive tweets this is where people are happy right now what is more important is we want to look at people who have given a negative tweet now what are they unhappy about this person is unhappy about being spammed with smses so probably you might want to look at how you are sending out your messages to people and then curefit is an app-based fitness service right so it doesn't have a phone number and certain people find it very difficult to appreciate this so this is what this person is very unhappy about right there is no response to any of our emails so this can be a very important customer right you might directly want to contact this customer to find out what is wrong now imagine if you do not have a mechanism like the sentiment analysis we did now this tweet is lost in the millions of tweets right but if i do a simple dashboard where i am able to flag the most negative comments i will be able to quickly reach out to these customers and then find out what has gone wrong and then try to set it right so this is what i call agile customer service that is one thing the other thing is based on the score so if you look at this you have these scores here right now based on these scores i can bucket customers as promoters detractors and neutrals right so now in this data i have 433 promoters but i also have 272 detractors these are people who can ruin my brand right so i have to look at why they are unhappy and then there are neutrals you would want to push the neutrals into the promoters so that they vouch for your brand okay so for every customer i will be able to get whether they are a promoter or a detractor or a neutral
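A minimal sketch of the vader scoring and the promoter / detractor / neutral bucketing described above, using the vaderSentiment package; the compound-score thresholds and the nps arithmetic are illustrative assumptions, not the exact cut-offs used in the session.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def nps_bucket(text: str) -> str:
    # compound is vader's overall score in [-1, 1]
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.5:        # threshold assumed for illustration
        return "promoter"
    if compound <= -0.05:      # threshold assumed for illustration
        return "detractor"
    return "neutral"

buckets = [nps_bucket(t) for t in tweets]  # tweets: the same raw tweet texts
promoters = buckets.count("promoter")
detractors = buckets.count("detractor")

# net promoter score expressed as a percentage of all scored tweets
nps = 100 * (promoters - detractors) / len(buckets)
print(promoters, detractors, round(nps, 1))
```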
with this we have completed the four steps of customer engagement and we've seen how to use nlp and social media in each of these areas so to sum up how do we enhance customer engagement know your customer for this your social media data and nlp are extremely helpful extract voice of customer so we saw keyword extraction we saw topic modeling and then we saw sentiment extraction with that you can understand how customers perceive your brand and then you can segment your customers right for promoters you will follow a different strategy or you can do some kind of clustering and customer segmentation models and then target one specific strategy for each segment and we saw proactive customer support right i can either use a chat bot or as i showed you if you are able to quickly flag negative tweets or negative reviews you can reach out to that customer before he or she causes damage and with this you can provide world-class customer service okay so thanks
if you haven't subscribed to our channel yet i want to request you to hit the subscribe button and turn on the notification bell so that you don't miss out on any new updates or video releases from great learning if you enjoyed this video show us some love and like this video knowledge increases by sharing so make sure you share this video with your friends and colleagues make sure to comment on the video for any queries or suggestions and i will respond to your comments
Info
Channel: Great Learning
Views: 50,163
Keywords: natural language processing, what is nlp, what is nlp and how does it work, nlp, natural language processing in artificial intelligence, natural language processing tutorial, natural language processing in 5 minutes, nlp tutorial, nlp in 5 minutes, nlp programming tutorial, nlp and deep learning, nlp and machine learning, nlp and its applications, natural language processing tutorial for beginners, nlp explained, nlp tutorial for beginners
Id: igKTO7lQxNo
Length: 483min 27sec (29007 seconds)
Published: Sat May 07 2022