Exploratory Data Analysis (EDA) using Python | Data Analysis | Satyajit Pattnaik & Shivani Shimpi

Captions
Hello guys, we'll start in about seven minutes; I just don't want anyone to miss the initial few minutes, so we'll begin exactly at 8:10. Okay, hello everyone, I hope my voice is audible; Shivani, can you confirm it is audible? Hello everyone, this is Satyajit Pattnaik, and welcome to my channel. In case you are new here, please subscribe and hit the bell icon to get notified about future video sessions and webinars. Today we have with us Shivani. Shivani is an ML researcher focused mainly on novel state-of-the-art models in deep learning and machine learning, and she has also published a novel research approach in ensemble model learning. Apart from that, she is part of our data science community, where she helps fellow members and data science enthusiasts, and this is not the first time she has joined one of my webinars. I recently started doing such live workshops and webinars and got several requests to conduct a dedicated session on exploratory data analysis, which is why we are here. Initially I shall walk you through the concepts of EDA: the various data visualization techniques, which technique to use in which scenario, what univariate analysis is, what bivariate analysis is, and categorical and numerical analysis. Once the basic part is done, which will take around 60 minutes, I will hand the session over to Shivani so she can help you implement EDA on a live use case; you can take the same ideas back as an assignment and work on similar use cases, because the EDA concepts stay the same, you just have to apply them to your own data. If you have any questions during the session, post them as a comment in the live chat and I'll try to get back to you; if your question is missed, you can reach out to me or Shivani directly, links to our LinkedIn profiles are in the description of this video. I think we have enough people, so we can start. Shivani, I'm sorry, please introduce yourself to the audience as well. Yeah, I was just saying that even if you have doubts, I'll be watching the live chat, so post them and I'll answer while Satyajit is doing his explanation, and likewise he can take over when I'm explaining. Yes, please introduce yourself so that people know more about you. Okay, hi guys, I'm Shivani. I have been on this channel before, for the Keras Tuner and the ANN videos; you can go check those out, it was real fun working with him. Apart from that, I'm currently working with a company where I'm leading the data science team; we work on psychology and are trying to merge AI with it to understand how people behave. It's really good to team up again with Satyajit and be here so that I can give back whatever I have learned so far; thank you for having me. Sure, so let's get started guys, we have a lot to cover in the next two hours. Yes, the recording will be available; because it's a YouTube live it will stay on my channel,
but do try to attend it live so that you get the most out of it. Let's get started; just let me know once you are able to see my screen. Shivani, is it visible? Yep, it is. Okay, so today's topic, as I already said, is exploratory data analysis, and I'll talk about why it is important in the field of data science and analytics, because EDA is the core part: without EDA, nothing is possible in data science and analytics. This is the overall agenda for today's session. The first hour is going to be fairly theoretical, we won't show much practical code, but whatever we learn in the first 50-60 minutes we will implement in the second hour. So the agenda is: what is EDA, why is EDA important, what is data visualization and why it matters, what are the various charts available in data visualization (Seaborn charts, Matplotlib charts and so on), what are the various steps involved in EDA, what are data sourcing and data cleaning, what are univariate, bivariate and multivariate analysis with visualization, and what are derived metrics. You will have an understanding of all these concepts once we get started, and of course there is the use case part. This is the data science process as it actually happens. It starts from reality, and reality is nothing but your use case. The first step is always data collection: when you are working for a company that already has a database of the last five, ten or fifteen years, the first task is to collect the data. Once we start collecting the data, we process it, convert it from unstructured to structured, and clean it; there could be multiple things in the data which need cleaning, so we'll also understand what data cleaning is. We also explore the data to see what it is telling us, and once we know that, we have a clear idea of what our next goal should be. (One second... my message box was continuously visible on screen; I hope it's okay now, Shivani? Yeah, it is.) So: raw data is collected, then processed, then cleaned. Cleaning and processing are closely related to the EDA process itself; we need to analyze the data to understand what it is telling us, and only then can we implement the models. Once the EDA part is done, we develop the models and algorithms and then make decisions; that is the data science process. Now let's understand what EDA is. EDA is an approach to analyzing data sets to summarize their main characteristics, often using visual methods; it is nothing but a data exploration technique to understand the various aspects of the data. The main aim of EDA is to obtain confidence in the data to the extent that we are ready to engage a machine learning model. EDA is the first and most basic step in the data science workflow: once you have done the EDA and you know what the data is telling you, you can take those insights and then build your models. EDA is important because it is that first step; it gives a basic idea of the data and helps make
sense of it. This part is very theoretical, so we'll move ahead with some examples so that it's engaging and you understand more; let me go through a few slides and then explain with an example. Next comes data visualization. Once you have done the basic EDA, understanding what the data is telling you, you need to visualize it, because just looking at the statistical output of the data it is often hard to understand what it is saying; so we take the help of various charts to see how the data is distributed. Data visualization plays a vital role: it lets you easily analyze and summarize the data, easily understand its features, get meaningful insights, and find trends or patterns. Once we start visualizing the data, it starts giving us more insight. Now let's talk about the various charts available for visualization, and as we go I'll also point out the basic differences between them and which one to use in which scenario. The first chart is the histogram, but before the histogram let's understand what a bar chart is. A bar graph represents the total observations in the data for a particular category; bar charts are used to compare the frequency, total count, sum or average across different categories using horizontal or vertical bars. Coming to histograms: a histogram represents the frequency distribution of the data. So how are histograms and bar charts different? As I said, a bar chart compares the frequency, count, mean, sum or average across categories, whereas a histogram is a type of bar chart used to represent statistical information by way of bars showing the frequency distribution of continuous data. When we start implementing the code part you will have a better picture of how a histogram looks versus a bar chart. One example: say you have students' English marks and you want to plot the count of students whose marks fall in various ranges; you could use a histogram for that. The next one is the box plot. A box plot displays the five-number summary: minimum, first quartile, median, third quartile and maximum. Let me draw it so that you understand box plots better; box plots are also called box-and-whisker plots. One whisker end is the minimum, the other is the maximum, and the box itself is the IQR, the interquartile range, which is the range between the 25th percentile and the 75th percentile.
The 25th percentile is called Q1, also known as the lower quartile, the 75th percentile is Q3, the upper quartile, and there is one more line inside the box which marks the median. So a box-and-whisker plot shows you the five-number summary of a set of data: minimum, maximum, lower quartile, upper quartile and median. The next type of chart is the pie chart, one of the basic charts everybody knows about. A pie chart represents the percentage of the data in each category; for example, the percentage of runs scored by three batsmen, say Rohit Sharma, Virat Kohli and Rishabh Pant. If I talk about only these three, the total is always 100 percent, so the pie chart will have three sections, one per player. The next one is the violin plot. Violin plots are mostly used to plot numerical data, and if you look closely, in the middle of a violin chart there is also a small box plot. So what is the difference between a violin plot and a box plot? By looks they are almost similar, but they have their differences; I've listed two of the most important ones. Box-and-whisker plots show the median, quartiles, ranges and so on; they allow comparing groups of different sizes, and they are super simple to create and read, so naturally they are all over the place and are heavily used in EDA. But box plots can also be misleading: they are not affected by the shape of the data's distribution, so when the data morphs but manages to maintain the same summary statistics, the box plot remains exactly the same. That is the drawback of box plots, and that is where the violin plot comes into the picture. In the example on the slide, for several differently shaped data sets the box plot stays the same, but you can see how the violin plot varies; so while a box plot can be misleading, a violin plot is visually intuitive and attractive. If I draw a violin plot, internally it contains a box plot with the usual statistical summary: the median, the 95% confidence interval, the Q3 and Q1 values and hence the IQR. The outer shape, the width of the violin, is a density plot: it shows the frequency of the data at each value. Once we start plotting violin plots and box plots on real data you will be able to spot the differences between them.
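To make those chart types concrete, here is a minimal sketch, assuming seaborn and matplotlib are installed; it uses seaborn's bundled "tips" dataset purely as a stand-in for your own data:

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")  # small demo dataset shipped with seaborn

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: frequency distribution of a continuous variable
sns.histplot(tips["total_bill"], bins=20, ax=axes[0, 0])
axes[0, 0].set_title("Histogram")

# Bar chart: compare an aggregate (the mean, by default) across categories
sns.barplot(data=tips, x="day", y="total_bill", ax=axes[0, 1])
axes[0, 1].set_title("Bar chart")

# Box plot: five-number summary (min, Q1, median, Q3, max) per category
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[1, 0])
axes[1, 0].set_title("Box plot")

# Violin plot: the same box plot wrapped inside a density estimate of the data
sns.violinplot(data=tips, x="day", y="total_bill", ax=axes[1, 1])
axes[1, 1].set_title("Violin plot")

plt.tight_layout()
plt.show()
```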
Going back to the charts, the next one is the line chart. Line charts are used to track change over long and short periods of time, and they are used for time series data: in the example, the x axis is the yearly time data and the y axis is the population, so it's something versus time. For time series data, say time series forecasting or any other time series analysis, line charts are the preferable choice. Next we have the strip plot. I haven't used strip plots much; Shivani, would you like to shed some light on how and in which scenarios to use them? Generally you would want to use strip plots when you're plotting several categories. In this graph it is absences versus G3; this was from one of my earlier sessions, where I was looking at how many absences people had based on the grade they received. You can see more concentration at grade 11 and also at grade 8, but even within grade 11 most observations sit around 20 absences, with a few more around 60 and one around 55. So if you want to see how the data is scattered within the categories of a bar chart, you can use strip plots; an easy way to think of it is that you take a bar chart's categories, overlay the individual scattered points, and what you get is a strip plot. So it's basically used for a categorical variable together with a numerical one; maybe you can also show a strip plot example in the use case. Moving ahead to heat maps: a heat map is a data analysis visualization that uses color the way a bar graph uses height and width. When we analyze numerical data and want to find correlations, we can use a heat map; each cell shows the correlation score, so a value of 0.0 means those two variables are not related to each other. We'll talk about what correlation is in a couple of minutes, but this is how you can use heat maps. Now, these are the various steps involved in EDA. The first part is data sourcing: as we already know, the first step is collecting the data, and we should know what the source of the data is. Sometimes we have to use data mining techniques or scrape data from online web portals; in other cases the source is an Oracle database or an Azure data lake. Whatever it is, you have to identify the source and extract the data. Then comes data cleaning, which has its own steps such as handling missing data and checking whether certain variables are important or not. Once the cleaning is done and the data is structured, the analysis starts: univariate and bivariate analysis; the slide doesn't mention multivariate, but once I give you a small example I'll also explain how to do multivariate analysis as well.
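A small sketch of the strip plot and correlation heat map just discussed, again on seaborn's bundled "tips" data rather than the session's own files:

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

# Strip plot: individual observations scattered within each category,
# so you can see how the raw points spread inside every group
sns.stripplot(data=tips, x="day", y="total_bill", jitter=True)
plt.title("Strip plot")
plt.show()

# Heat map of pairwise correlations between the numeric columns;
# values near 1 or -1 mean strongly related variables, near 0 unrelated
corr = tips.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat map")
plt.show()
```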
After that we also have derived metrics. A derived metric, as the name itself suggests, is something derived from other variables. For example, suppose you are analyzing age data: a customer's age is a numerical variable, someone is 20, someone is 21, someone is 60, so it's just a number. If you analyze the raw numbers, you may not be able to identify the specific age groups where you see spikes. In such scenarios we often convert the numerical data to categorical data; one example is converting age to an age group. If I create five bins, 0-10, 11-20, 21-30, 31-40 and 41+, those new columns are your derived metrics. Let's move ahead. Data sourcing, as I already told you, is the process of gathering data from multiple sources, external or internal, and it can be public or private data. Public data is easy to access without taking any permission from the agencies; private data is not available on public platforms, and to access it we have to take the organization's permission, for example banking or telecom data, so domain-specific data is typically private. After collection of the data the next step is data cleaning. These are some of the steps involved in data cleaning: handling missing values, standardization of the data, outlier treatment (a very interesting topic I'll come to), and handling invalid values. Before cleaning, there are some questions we need to ask ourselves about the data, and they are quite obvious: does the data collected so far make sense, does the data you are looking at match the column labels, and after computing the summary statistics for the numerical data, do they make sense? Now, how do we handle missing data? Some people quote a thumb rule: if one of your columns has more than 30 or 40 percent missing values, drop it. Let me take an example to explain, but first, these are the ways you can handle missing data: delete rows or columns, replace the values with the mean, median or mode, use algorithmic imputation, or use a prediction technique to estimate the values and fill them in. Let's talk about the first one, deleting rows or columns. Suppose I have columns like customer name, customer id, gender, geography, is_car (whether the person has a car or not), car_type, and several other variables. I have talked
to a lot of data scientists, and many say that if one of your columns has more than roughly 35 percent missing values, that is, if you have a thousand records and 350 missing values for gender, then you can simply ignore that column: it isn't useful, it isn't important. Some put the cutoff at 40 percent. But there are scenarios where you cannot apply this rule blindly. One example: say I have a column called is_car and a column called car_type, where is_car is simply yes or no. Summarizing it, suppose 600 people don't have a car and 400 people do. Those who don't have a car will obviously have a null car_type, agreed? Those who do have a car will have some car_type, say XUV, SUV, Maruti and so on. So it's quite obvious that the people without a car produce null values in car_type. Now imagine you don't have the domain knowledge and you don't know this about the use case: if you blindly analyze this column you see 60 percent missing values. Will you delete the column from your EDA or data analysis? Some people will, but in a real scenario you shouldn't, because it depends on the is_car column and we already know why those nulls are there. In that case we should instead fill it with some suitable or placeholder values, and there are different techniques for that as well. The next topic is replacing missing values with the mean, median or mode. What are mean, median and mode? Let me explain with a small example; everybody surely knows the mean. Take three values: 4, 1 and 7. The mean is nothing but the average, the statistical term for average: (4 + 1 + 7) / 3 = 12 / 3 = 4, so the mean is 4. What is the median? If I sort the values, the median is the center value, which here is also 4. That isn't a good example for the mode, so let me take another one: say the series is 4, 2, 4, 3, 2, 2, 1, 2 and it has two missing values in between. How do we impute them, what value should we populate? If you use the mode technique the answer is 2, because the mode is the number with the maximum occurrences: 4 occurs twice, 3 once, 1 once, and 2 occurs four times, so 2 has the highest count and the missing values are filled with 2. Same example with the mean technique: 4 + 2 + 4 + 3 + 2 + 2 + 1 + 2 = 20, divided by 8 gives 2.5, so you would fill both missing values with 2.5. Then you go ahead, build the model, and see how it behaves.
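A minimal pandas sketch of those three replacement strategies, using the same toy series as the example above (the NaN positions are arbitrary):

```python
import pandas as pd

# the 4, 2, 4, 3, 2, 2, 1, 2 example with two missing entries
s = pd.Series([4, 2, 4, None, 3, 2, 2, None, 1, 2])

mean_filled   = s.fillna(s.mean())     # mean of observed values = 20 / 8 = 2.5
median_filled = s.fillna(s.median())   # middle of the sorted observed values
mode_filled   = s.fillna(s.mode()[0])  # most frequent value = 2

print(mean_filled.tolist())
print(mode_filled.tolist())
```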
Machine learning and EDA are full of such techniques, and they are largely hit and trial; there is no universal thumb rule. But as you do it more and more, you learn which technique to use in which scenario. Algorithmic imputation means using machine learning algorithms that can handle missing values themselves, for example KNN, Naive Bayes or random forests; let's not go deeper into those right now. The fourth technique is predicting the missing values: for example, I take the known values, run a time series forecast, and if the forecast says 3.1 I fill the gap with 3.1; then, considering all the data points including that 3.1, I forecast the next missing one, and so on. It is a tedious task, but it has been seen to work sometimes. Again, no thumb rule; it's up to your intuition which technique you use. The slide shows the same example we just worked through for computing the mean, median and mode. Our next topic is standardization, or feature scaling; let me check the time, I still have 20 minutes. I normally teach feature scaling in the data science part, but you need to know these concepts when dealing with the EDA process as well. Feature scaling is a method to rescale the values present in the features. Why do we rescale? A small example will make it clear. The ultimate goal is to build a model, and the model is built by a computer, which understands numerical data; if I give it categorical data directly it won't understand. So suppose I have customer id, customer name, gender, geography, height, weight and some other features, and the height values are 173, 171, 169, 165, 180. As human beings we know the different scales in which height can be expressed, centimeters, meters, feet, inches, and we also recognize that these particular values are in centimeters without it being written anywhere; that's how the human neural network works. But if I pass this data to a model, for the computer these are just numbers. Say there is another column, age, with values like 12, 13, 20. When we pass the data to the model, it can be misinterpreted: the model may give more priority to the features with the larger numeric values, so it might prioritize height over age, because the computer doesn't know that one column is height in centimeters and the other is age in years. That is why we are always asked to do feature scaling, which brings the values down to a common scale. One of the easiest feature scaling techniques: take the same height example with three values, 173, 175 and 180.
If I want to scale these down to a range of zero to one, the simplest thing I can do is divide by the largest number: 173 / 180 = 0.96, 175 / 180 = 0.97, and 180 / 180 = 1. If there were also 155 and 160, they would become 0.86 and 0.88. So feature scaling converts the values into a common scale; it could be 0 to 1, it could be -1 to 1, it depends, because there are multiple feature scaling techniques, but the idea is the same. We can do the same with age, and then both variables are treated, or interpreted, equally by the machine. And if tomorrow the height variable comes in meters instead, say 1.73, 1.75, 1.8, 1.55, 1.6, these are again not on the same scale as age, but applying the same technique, 1.73 / 1.8 still gives 0.96 and 1.75 / 1.8 gives 0.97: the raw values are different (1.73 versus 173) but the scaled values are the same. So whenever we deal with numerical variables we do feature scaling. The importance of feature scaling: when the independent variables or features differ from each other in terms of range of values or units, we have to normalize or standardize the data. There are multiple techniques to scale features: standardization, mean normalization, and min-max scaling. I do have slides for these, but I'd rather draw and explain each so that we understand them better. First, standardization. Standardization replaces each value by its z-score (what exactly a z-score is would be a long session on its own, so just take the formula for now). The standardized value is x' = (x - mu) / sigma, where mu is the mean and sigma is the standard deviation; you don't have to memorize the standard deviation formula. Let's take an example: incomes of 15,000, 12,000 and 30,000. The mean mu is (15,000 + 12,000 + 30,000) / 3 = 57,000 / 3 = 19,000. To apply the formula I also need the standard deviation, sigma = sqrt( sum((x_i - mu)^2) / n ), so for these three values it is sqrt( ((15,000 - 19,000)^2 + (12,000 - 19,000)^2 + (30,000 - 19,000)^2) / 3 ).
If you calculate that, the output is about 7,874, so sigma is roughly 7,874. Then the new value for 15,000 is (15,000 - 19,000) / 7,874, approximately -0.5, and you can compute the other values the same way; this technique is standardization. The next technique is mean normalization: x' = (x - mean(x)) / (max(x) - min(x)). This is very easy to compute here: the max is 30,000, the min is 12,000 and the mean is 19,000, so you just substitute the values into the formula. This transformation always produces values in the range -1 to 1. The third technique is min-max scaling: x' = (x - min(x)) / (max(x) - min(x)); the only difference from mean normalization is that the numerator uses the minimum instead of the mean, and this scaling always brings the values down to a scale of 0 to 1. You can use any of these techniques for the feature scaling step; if you ask me which one is better, there is no straightforward answer, you always have to try, build the model, and see how it behaves.
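Here is a minimal sketch of the three scaling techniques on the same income example; scikit-learn's StandardScaler and MinMaxScaler do the first and last of these for whole data frames, but the arithmetic is just this:

```python
import numpy as np

income = np.array([15000.0, 12000.0, 30000.0])

# Standardization (z-score): (x - mean) / std  ->  roughly [-0.51, -0.89, 1.40]
z = (income - income.mean()) / income.std()   # population std, as in the hand calculation

# Mean normalization: (x - mean) / (max - min)  ->  values roughly in [-1, 1]
mean_norm = (income - income.mean()) / (income.max() - income.min())

# Min-max scaling: (x - min) / (max - min)  ->  values in [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

print(z.round(2), mean_norm.round(2), min_max.round(2))
```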
Our next topic is outlier treatment. Outliers are the most extreme values in the data: an outlier is an abnormal observation that deviates from the norm. If I draw a small example, say a time-series-style graph from 2000 to 2020 where most of the values follow a curve and a few data points sit far away from it, those far-away points are the outliers: they deviate from the data distribution, and if you compute the mean and standard deviation of the bulk of the data, these points fall well outside that range. How do we handle outliers? There are various techniques, including anomaly detection methods. We can detect outliers using box plots, histograms, scatter plots and z-scores; once we start the practical part you will get a better feel for this, because the use case already includes outlier treatment. For handling them, some people simply remove the outliers, but removing them sometimes backfires, because you also remove information, and that information can be vital. When you are dealing with use cases like the healthcare domain, say a cancer classification problem with cancerous and non-cancerous patients, obviously very few people will be identified with the disease, so there could be scenarios in your data where some outliers carry very important information. It is not right to remove them immediately; we have to analyze them first and then decide. There are also techniques for replacing outliers with suitable values, such as the quantile method and the interquartile range (IQR) method, which we'll see in code shortly, and there are machine learning models that are by design not very sensitive to outliers, such as KNN, decision trees, SVMs and ensemble methods, which we can also use. The next topic is handling invalid data: missing values were one part, invalid data is another. There can be several reasons for invalid-data issues. One is encoding/Unicode problems: if the data is being read as junk characters, try changing the encoding, for example something other than UTF-8; there are multiple encoding techniques, and sometimes simply switching the encoding lets you read the data properly. Another is incorrect data types: convert them to the correct types for ease of analysis. For example, if numerical values are stored as strings it isn't possible to calculate metrics such as the mean and median; say age is stored as an object rather than an integer, then you can't compute its mean, so always convert such data to integer, float or datetime as appropriate. Date data is also often stored as strings, for example "2013 August", and has to be converted to a date type before use. Also correct values that go beyond a logical range, for example temperatures below -273 degrees Celsius; there can be glitches in how the data is captured, or the captured data may be corrupted, and as a data analyst it's your role to correct that. Finally, correct or remove values with the wrong structure: for example, in a data set containing PIN codes of Indian cities, a 12-digit PIN code is invalid and needs to be removed. All these techniques come under EDA.
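And a quick sketch of the IQR rule mentioned above for flagging and capping outliers (the numbers are made up; the 1.5 × IQR fence is the usual convention):

```python
import pandas as pd

values = pd.Series([22, 24, 25, 26, 27, 28, 29, 30, 95])  # 95 is the obvious outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())            # -> [95]

# One treatment: cap (winsorize) instead of deleting, so no rows are lost
capped = values.clip(lower, upper)
```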
Now I won't go through the remaining slides one by one; instead I'll explain with examples, and I'll try to finish in ten minutes, Shivani, so that you can take it ahead. The next topics are univariate and bivariate analysis, and I'll also talk about derived metrics; let me start on a fresh canvas. To understand univariate and bivariate analysis I'll again use the classic example that is very close to my heart: churn retention, or churn prediction. Churn prediction is a use case where we identify whether a customer is going to leave the company or not. For example, say I'm working for Vodafone, and over the past period of time many customers have been leaving the system; it could be for multiple reasons: they are not happy, the cost is high, Jio is giving them good offers. My manager asks me to create a model based on the current data that can predict whether a customer is likely to churn in the next 6 or 12 months. Say Vodafone has a database of 10,000 customers, and out of those, 1,000 customers have left the system, so they have churned, while 9,000 have not. I will create a classification model so that when a new customer comes in, it can predict whether that customer is likely to churn. Once we know that, there are different strategies we can follow to retain them, for example by giving attractive offers; the main point of the use case is that once we identify the customers who might leave, we can try to retain them. Some people ask why I always use churn prediction: it's because it is a very relatable use case, and it is not specific to telecom; it is used in every industry, banking, telecom, gaming, all of them, because customers leave everywhere. Today you bank with Standard Chartered, tomorrow you switch to ICICI, and three months later you move to RBL because RBL gives you 8 percent interest on recurring deposits; that is also a churn scenario. Back to the use case; we'll use it to understand univariate and bivariate analysis. I have customer-related data: customer id, customer name (any doubts so far? I haven't seen the chat section in a while... okay, thanks), gender, geography, contract type (whether the customer is on a monthly or a yearly contract), and so on, plus a column called churn with values yes or no across the customers. Univariate analysis, as the name says (uni means one), is analyzing one variable. Say you are analyzing gender: what is the distribution of gender? With 10,000 records, say 6,000 male and 4,000 female; as simple as that, that is your univariate analysis. Bivariate analysis means analyzing two variables, say gender with contract type. To keep the example simple, assume there are just two contract types, monthly and yearly: yearly means a 12-month contract, monthly means no contract, you are free to go. So the second step, bivariate analysis: gender alone tells me there are 6,000 males, but what is the further split, how many monthly and how many yearly customers? Say 3,500 monthly and 2,500 yearly among the males, and among the females say 1,000 monthly
and 3,000 yearly customers. This is your bivariate analysis, because you are analyzing two variables at a time. What insights do we get from it? There are more male monthly customers than female monthly customers: 3,500 out of 6,000 males, almost 60 percent of the males, are monthly customers, while only about 25 percent of the females (1,000 out of 4,000) are. These are the kinds of insights you get. Now, when it comes to the churn prediction use case, we want to analyze the churners who have already left the system; only when we analyze them do we learn the characteristics of those customers. So we refine the data: I have 10,000 records and 1,000 churners, so I create one data frame for the churners and another for the non-churners, and then I analyze the churners. Out of those 1,000, what is the gender distribution? Say 600 male and 400 female. In the overall data the male-to-female ratio was 60:40, i.e. 3:2; in the churners' data it is 600:400, still 3:2. The ratios are almost the same, which means gender on its own doesn't have much of a role to play. But that doesn't make gender useless, because gender combined with some other variable could still give you information. So the next step is bivariate analysis on the churners: out of those 600 male churners, how many are monthly customers and how many yearly? Say 450 monthly churners and 150 yearly; and among the female churners, say 350 monthly and the rest yearly.
So what is the ratio for the churners? 450 divided by 600, multiplied by 100, is 75 percent: 75 percent of the male churners are monthly customers. And looking at it the other way, how many male monthly customers were there overall? 3,500; and out of those 3,500, almost 450 have churned. Similarly, for the female churners the corresponding ratio is 350 out of 400. These are the kinds of analyses you need to do: that was univariate, and this is bivariate analysis. When you take it to the next level, say gender versus contract type versus geography, the graph becomes harder to read, but that is a three-variable analysis, and it goes on; collectively this is termed univariate, bivariate and multivariate analysis. You need to do these to get more insight into the data. Then come derived metrics, which, as I already said, are newly created variables. A small example again: say I have age, with values anywhere from 1 up to, let's assume, a maximum of 60. If you plot every individual age you will be confused about where most of the people actually fall, which age group shows a spike. In these kinds of variables we often don't get much information from the raw values, so (not always, but as an example of a derived metric) we bin them. I know the domain I'm dealing with, say finance or telecom, so as a data scientist I should know how to bin them: say 1-18, 19-32, 33-50 and 50+, four categories, and then I look at the distribution again and can say which categories dominate. These are all intuitions you need to build; once we go through the live use case you will understand more, and honestly, in one hour not much more data analysis theory could have been taught. So, what is feature binning? I just explained it: feature binning converts, or transforms, continuous or numerical variables into categorical variables, and it can also be used to identify missing values or outliers.
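A small pandas sketch of this univariate/bivariate churn analysis and of binning age into groups; the data frame here is randomly generated just to illustrate the calls:

```python
import numpy as np
import pandas as pd

# Hypothetical churn data in the spirit of the Vodafone example above
rng = np.random.default_rng(42)
n = 10_000
df = pd.DataFrame({
    "gender":   rng.choice(["Male", "Female"], size=n, p=[0.6, 0.4]),
    "contract": rng.choice(["Monthly", "Yearly"], size=n),
    "churn":    rng.choice(["Yes", "No"], size=n, p=[0.1, 0.9]),
    "age":      rng.integers(18, 70, size=n),
})

# Univariate: distribution of a single variable
print(df["gender"].value_counts(normalize=True))

# Bivariate / multivariate: churn rate by gender and contract type
print(pd.crosstab([df["gender"], df["contract"]], df["churn"], normalize="index"))

# Derived metric: bin the numeric age into categorical age groups
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 32, 50, 120],
                         labels=["1-18", "19-32", "33-50", "50+"])
print(df["age_group"].value_counts())
```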
Next come the feature encoding techniques. Feature encoding helps us transform categorical data into numerical data, and there are multiple techniques: label encoding, one-hot encoding, target encoding, hash encoding. Let me explain encoding briefly. I already told you that the data we ultimately feed to the model has to be numerical; if it is numerical by origin we feature-scale it and pass it on, but what if it is categorical? The model won't accept it as it is. Say I have gender with values male and female: the easiest technique is label encoding, which is nothing but marking the categories as 0 and 1 (or 1 and 2) and passing that in. Another technique is one-hot encoding, which is mostly used when a category has more than two values. A classic example: say I have a geography column, city, with values Mumbai, Kolkata, Hyderabad, Mumbai, Kolkata, Hyderabad and Delhi for seven customers. How do I convert that to numerical variables? There are four distinct values, so we create four new columns: Mumbai, Kolkata, Hyderabad and Delhi. Wherever the city is Mumbai I populate the Mumbai column with 1 and the rest with 0, wherever it is Kolkata I put 1 in the Kolkata column and 0 elsewhere, and so on; now I can easily feed this numerical data to the model. There is also a related concept called the dummy-variable trap: when we have four dummy columns like this, it is normally recommended to use only three of them and drop one, because the fourth is redundant; if Mumbai, Kolkata and Hyderabad are all 0, it is obvious Delhi must be 1, and if any of them is 1, Delhi must be 0. That is why we drop one of the dummy variables (Mumbai, Kolkata, Hyderabad and Delhi are the dummy variables here). One-hot encoding is one of the techniques for this; there is also the pandas get_dummies function, which you can use to convert your categorical data into these dummy variables. I think that's it; next we'll talk about the use case. I don't have to tell you where EDA is used: EDA is mandatory, it is the basic first step whenever you start a data analysis or data science use case, be it cancer data analysis, customer retention, cross-selling, up-selling, anything; we have to do EDA on each and every use case. That's it from my side; if you have any doubts, just let me know. We'll take a quick two-minute water break and then Shivani will take over.
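Before the break, a minimal sketch of the label and one-hot encodings just described (hypothetical city values; drop_first=True in get_dummies drops one dummy column to avoid the dummy-variable trap):

```python
import pandas as pd

city = pd.DataFrame({"city": ["Mumbai", "Kolkata", "Hyderabad", "Mumbai",
                              "Kolkata", "Hyderabad", "Delhi"]})

# Label encoding: each category becomes an integer code
city["city_label"] = city["city"].astype("category").cat.codes

# One-hot encoding via get_dummies, dropping one column (dummy trap)
one_hot = pd.get_dummies(city["city"], prefix="city", drop_first=True)
print(pd.concat([city, one_hot], axis=1))
```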
Yep, sounds good; we'll start at 9:15 then. Okay, I am back; Shivani, you can take it ahead, start sharing, and I'll add you to the stream. Sure, can you confirm my screen is visible? Yes, it's visible. Great. I hope everyone is back, or should we actually wait until 9:15 as we said? Okay, let's wait two more minutes... can we start, Shivani? Yeah, sounds good. Okay, so, for the EDA part Satyajit has already clarified the concepts really well, and now I'm going to walk you through the use case, where we analyze customers the way a credit card company would: you want to see your customers' behavior and predict how they are going to behave over time, or understand how they have behaved in the past. A few things we are going to cover: first, how to deal with missing values; second, how to work with categorical data and visualize it; and then, if you have given somebody a loan, what percentage of your customers, and what kind of customers, actually pay the loan back, how promptly they pay it back, and which class of customers pays back sooner than the others. There's a lot more we'll explore as we walk through it. Currently we have three data files: the first is the application data, the second is the columns description, which contains the description of every column we have, and the third is the previous application data, the applications that were submitted previously. I'm using Google Colab, which I usually prefer because it's very convenient, and unlike the last session I'm not going to code live, because it's almost 500 lines of code and I've already prepared the whole thing; maybe we can do that next time. I've already mounted my drive, and I've copied the location where the files sit; the first code cell mounts your drive for you and lets you pick the account you want to use. We'll put the code and the data up, maybe on our GitHubs, and you can check it out from there after the session. I've copied the path and pasted it here; this changes your directory to whatever path you set. Then there are a few libraries we're going to use: pandas, NumPy, Matplotlib and Seaborn. We need pandas because we are mainly dealing with CSV files, and CSVs are best processed with pandas; NumPy is used for dealing with numeric values; Matplotlib is what you use for plotting graphs; and the reason we also want Seaborn is that it also plots graphs, but the plots come out cleaner and more visually attractive. The first file we read is the columns description, which I load into the desc variable with the Latin encoding, because you can't really use the UTF-8 encoding everywhere. As you can see, it has a table column telling you which file each entry describes (the first few are on the application data), then the row names, then a description of what each row means, which you can go and read whenever you want, and then what is special about each field. Moving forward, we read the next file, the application data; those are the two main files, the application data and the previous applications. So first we start reading the application data and analyzing it.
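The setup she describes looks roughly like the sketch below; the file names and paths are placeholders for wherever your copies of the three CSVs live:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Paths/names are illustrative; in Colab they would point into the mounted drive
desc    = pd.read_csv("columns_description.csv", encoding="latin-1")  # UTF-8 fails on this file
df_app  = pd.read_csv("application_data.csv")
df_prev = pd.read_csv("previous_application.csv")

print(desc.head())
```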
In the application data we need to understand what kinds of columns we have. As you can see, there are different types of contracts, like cash loans and revolving loans; the gender of the person who has taken the loan; how many of them eventually respond; the amount of credit and the total income of that person; and more, such as whether the person is working and what work they do, whether they have completed their education, whether they have a family, and what kind of home they live in. The reason we need this information is so that we can eventually analyze, for example, whether a male or a female customer pays the loan back, and if so, how soon; we'll see how to use all of this afterwards. To get more information about this data frame I call info with verbose set to true; verbose just means you want more detail, because if you set it to false you wouldn't get the data types. So you can see which columns hold integer values, which are floats, and which are objects; these are the data types we'll have to work with. This matters because once you know what kind of data you're dealing with, you can convert it, for instance into categorical data where appropriate, and then plot the graphs further on. The shape of the data is 307511 by 122, that is, the rows and columns of the data. Then we check the target variable. From the target column you want to see whether each value is one or zero, and how much of it is one versus zero; you would like the data to be balanced across the two classes, and if it isn't you eventually need to handle that, but first we need to know whether it is balanced. As you can see, most of it is zeros, easily over 80 percent, but we want the exact value. When I say target variable, what I mean is whether the loan was paid on time: loans that were paid on time have the value zero and loans that were not paid on time have the value one. The unique values are just one and zero, and when we compute the percentages, 91 percent of the loans were paid on time and around 8 percent were not.
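In pandas terms, those checks amount to something like the following sketch (df_app as loaded above; the column name TARGET is an assumption about how the target is labelled in this data):

```python
df_app.info(verbose=True)      # dtypes and non-null counts for every column
print(df_app.shape)            # e.g. (307511, 122)

print(df_app["TARGET"].unique())                              # -> [1, 0]
print(df_app["TARGET"].value_counts(normalize=True) * 100)    # ~91% on time (0), ~8% not (1)
```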
Now the next part is dealing with the missing data. Many columns have missing values and only a few are complete. As Satyajit mentioned, if a column has about 30–35 percent missing values someone might simply drop it because it isn't very useful, and above that you may well want to drop it since it gives little information — but on the other hand, some of those columns could still carry very useful information, so you want to find ways to use them and to fill in or otherwise handle the gaps. For features with few missing values we can impute: use regression, or take the mean of the column and fill with that. If a feature has more than roughly 40 percent missing it may simply be easier to drop it, or you can try to predict the missing values — if you're building an ML model you could treat that feature as the label and predict it from the rest. Again, there is no thumb rule, so we'll do some analysis first and then decide.

First we check which columns have more than 30 or 35 percent missing values, collect them, and then see how to deal with them. I'm using a variable called empty columns to count how many null values each column of the df_app data frame (our application data) has. Essentially, from the null counts we keep every column whose count is more than 0.3 times the length of the data (0.3 signifies 30 percent). We turn that into a data frame and reset the index — instead of the default index we rename the index column to 'row' and the zeroth column to 'num_count' — and print it out. Columns with more than 30 percent missing include AMT_GOODS_PRICE, NAME_TYPE_SUITE, OWN_CAR_AGE, OCCUPATION_TYPE and so on.

Next we pull in the data description for the application-data columns. The description file has a table column, so we filter it down to the application-data rows using loc, store that in a description-of-application-data variable, and merge it with the empty-columns data frame so each row name is paired with its description. The empty-columns data frame is the left side, the description is the right side, it's an inner join, and the join key is the row column on both sides — 'on' just says which columns you want to compare. A sketch of this audit and merge is shown below.
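This is a minimal sketch of that step, assuming the description file exposes 'Table' and 'Row' columns as in the Home Credit description file; the 0.30 threshold and the names 'row' and 'num_count' mirror the session and are otherwise arbitrary.

```python
# Audit the columns with more than 30% missing values and attach their descriptions.
null_counts = df_app.isnull().sum()
empty_cols = null_counts[null_counts > 0.30 * len(df_app)]          # >30% missing

empty_cols_df = (empty_cols.reset_index()
                           .rename(columns={'index': 'row', 0: 'num_count'}))
print(len(empty_cols_df))                                           # 64 such columns in the session

# Keep only the description rows that belong to the application data table
desc_app = desc.loc[desc['Table'] == 'application_data.csv']
empty_cols_desc = empty_cols_df.merge(desc_app, how='inner',
                                      left_on='row', right_on='Row')
empty_cols_desc.head()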
Once that merge is done we have all of the rows together with their descriptions. The reason we looked at the columns with more than 30 percent nulls is this: 64 columns out of all the columns we have cross that threshold, which means not that much of our data is null overall. Since there is no rule of thumb saying we must drop everything above 30 percent, we look at what these columns actually are: most of them are normalized information, and where a column is normalized information with an unclear description I'll simply remove it, or come back and analyze the remaining ones at a later stage. For now we keep three of them to work with: AMT_GOODS_PRICE (the price of the goods the consumer loan is for), OWN_CAR_AGE (the age of the client's car) and OCCUPATION_TYPE (what kind of occupation the client has); the other 61 of those 64 columns we'll return to afterwards.

Next I analyze the missing values by plotting a graph that shows, for each of those columns, the number and the percentage of missing values. A few columns have around 70 percent missing, while for most of the rest it's close to zero — so there are plenty of missing values, just not in every column, and most columns hold the complete data. After that we list the columns whose missing percentage is greater than 30, keep the three columns we need, and drop the rest of those 61 columns for now; a sketch of this step is below.
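A minimal sketch of the missing-value plot and the column drop; the 'keep' list holds the three columns the session decides to retain, and the 30 percent cut-off follows the discussion above.

```python
# Plot the percentage of missing values per column, then drop the >30% ones we don't need.
missing_pct = (df_app.isnull().sum() / len(df_app) * 100).sort_values(ascending=False)

plt.figure(figsize=(14, 4))
missing_pct[missing_pct > 0].plot(kind='bar')
plt.ylabel('% missing')
plt.title('Missing values per column')
plt.show()

keep = ['AMT_GOODS_PRICE', 'OWN_CAR_AGE', 'OCCUPATION_TYPE']     # retained despite >30% missing
to_drop = [c for c in missing_pct[missing_pct > 30].index if c not in keep]
df_app = df_app.drop(columns=to_drop)
print(df_app.shape)
```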
Now we look at AMT_ANNUITY, which has only a few missing values, and impute them. Because this column has outliers and the values are large, filling with the mean isn't the best choice — outliers, as Satyajit explained, sit away from where most of the data is concentrated and drag the mean with them, whereas the median is robust to them — so we take the median of the column and fill the missing values with it. Then we re-check which columns still have null values, and since everything that remains is essentially clean we remove the columns we flagged at the 30 percent threshold; we don't need them anymore, so the unwanted columns are dropped from the data set along axis 1.

The next step is the XNA value. XNA signifies 'not available', so we need to find how many rows and columns contain it and then either fill those values with a suitable technique or drop them. XNA is not 1 or 0, so it is obviously not numeric, it's categorical, and for each categorical column we count how many entries are XNA. For CODE_GENDER (male or female) there are only 4 XNA values, while for ORGANIZATION_TYPE there are about 55,000. To see this we describe the gender column and count the females and males: value_counts gives you the unique values together with their counts — unique() alone would give the distinct values (male, female, XNA in our case) but not how many of each — and there are about 2,02,448 females and roughly one lakh males. Describing ORGANIZATION_TYPE gives the count, unique, top and frequency: out of roughly three lakh rows about 55,000 carry the XNA value, which is not great. That is around 18 percent of the data, and the decision in this session is simply to drop those rows rather than try to impute an organization type. The drop is just a matter of selecting the rows whose ORGANIZATION_TYPE falls in the XNA category, dropping those indices, and checking the shape of the data again; a sketch is below.
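A minimal sketch of the median imputation and the XNA handling just described; the column names again follow the Home Credit schema and are assumptions about this particular dataset.

```python
# AMT_ANNUITY has outliers, so its few missing values are filled with the median.
df_app['AMT_ANNUITY'] = df_app['AMT_ANNUITY'].fillna(df_app['AMT_ANNUITY'].median())

# XNA ('not available') shows up in a couple of categorical columns.
print(df_app['CODE_GENDER'].value_counts())     # F / M / a handful of XNA
print(df_app['ORGANIZATION_TYPE'].describe())   # tens of thousands of rows are 'XNA'

# ORGANIZATION_TYPE == 'XNA' is a large slice; the session simply drops those rows.
df_app = df_app[df_app['ORGANIZATION_TYPE'] != 'XNA']
print(df_app.shape)
```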
Once that is done, we start the categorical analysis, using bar plots, which Satyajit has already explained really well, for CODE_GENDER, NAME_FAMILY_STATUS, NAME_INCOME_TYPE and OCCUPATION_TYPE.

Let's read these a little. For gender, roughly 61 percent of the applicants are female and about 39 percent male. The next column we plot is NAME_FAMILY_STATUS — married, single / not married, civil marriage, separated, widow, or unknown — and married is by far the largest group, with the counts going down through single, civil marriage, separated and widow, and essentially zero for unknown. Then we have NAME_INCOME_TYPE, which says whether the person is working, a commercial associate, a state servant, a student, a businessman and so on; there is almost no data for students and businessmen, because they generally don't take these loans. And then OCCUPATION_TYPE: laborers, sales staff, core staff, managers, drivers and so on down to HR staff and IT staff — the largest share are laborers, then sales staff, core staff, managers and drivers, with the smallest group being the secretaries.

What we understand from this is that univariate analysis on its own won't give us much insight about the defaulters; here we just get an idea of which categories of people are present in abundance. Most of the applicants in this data are laborers, working people employed with a company, and married; there are essentially no students, businessmen, realty agents or HR/IT staff, and the majority are female rather than male. So at this stage we're simply analyzing who our customers are — who we're actually serving. A minimal version of these univariate bar plots is sketched below.
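```python
# Quick univariate look at a few categorical columns, as percentages of applicants.
cat_cols = ['CODE_GENDER', 'NAME_FAMILY_STATUS', 'NAME_INCOME_TYPE', 'OCCUPATION_TYPE']

fig, axes = plt.subplots(2, 2, figsize=(14, 9))
for ax, col in zip(axes.ravel(), cat_cols):
    (df_app[col].value_counts(normalize=True) * 100).plot(kind='bar', ax=ax)
    ax.set_title(col)
    ax.set_ylabel('% of applicants')
plt.tight_layout()
plt.show()
```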
Now we will start plotting the features against the target. By that I mean we take the application data and, for each feature, look at the percentage of each target value within it — for the people in each category, who paid the loan back and who didn't. For that I'm creating a function called plot_features. It takes the feature you want to target, a label rotation flag (False by default, but you can set it to True), and a horizontal layout flag (True by default — horizontal layout just means the bar plots are laid out side by side). Inside, temp is a temporary variable holding the value counts of that feature from the df_app data frame — the unique values and the number of rows corresponding to each — and df1 holds the same feature with those absolute values. We also compute a percentage data frame: the feature against the target, grouped by the feature, taking the mean of the target; because the target is 0/1, that mean is the share of people in each category who didn't pay the loan back, and we sort those values in descending order. Then, depending on the layout flag, we create subplots with either two columns or two rows, with a chosen figure size.

At this point Satyajit suggests that explaining this piece of code line by line would be a bit tough to follow: for now, just take the code and run it — you can reuse it in any of your use cases — and step through it afterwards to understand what's happening in the background; showing the plots produced by plot_features will make it clearer how it works. I agree — you'll understand this much better if you try it yourself, and feel free to modify the code to your needs — so I'll skip the line-by-line explanation for now and maybe come back to it briefly.

To choose what to plot, we select the columns whose data type is object — remember we looked at the data types in the initial code blocks, where we saw integers, floats and objects — so the categorical columns here are the object columns, and we then plot each of them. A condensed sketch of the function is below.
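This is not the session's exact function, only a condensed sketch of the same idea under the assumptions above: for one categorical feature, plot how many applicants fall in each category and the share of TARGET == 1 within each category; the function and argument names are illustrative.

```python
# Sketch of the plot_features idea: counts per category, plus per-category default rate.
def plot_feature(df, feature, label_rotation=False, horizontal_layout=True):
    counts = df[feature].value_counts()
    # Mean of a 0/1 target per category == fraction of defaulters in that category.
    default_rate = (df.groupby(feature)['TARGET'].mean() * 100).sort_values(ascending=False)

    nrows, ncols = (1, 2) if horizontal_layout else (2, 1)
    fig, (ax1, ax2) = plt.subplots(nrows, ncols, figsize=(14, 5))

    sns.barplot(x=counts.index, y=counts.values, ax=ax1)
    ax1.set_title(f'{feature}: number of applicants')

    sns.barplot(x=default_rate.index, y=default_rate.values, ax=ax2)
    ax2.set_title(f'{feature}: % of loans not repaid on time')

    if label_rotation:
        for ax in (ax1, ax2):
            ax.tick_params(axis='x', rotation=90)
    plt.tight_layout()
    plt.show()

plot_feature(df_app, 'NAME_FAMILY_STATUS', label_rotation=True)
```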
Plotting the family status, we again have married, single, civil marriage, separated, widow and unknown, but now what we're looking at is the percentage of applications, within each category, where the loan was not repaid. Most of the unpaid loans come from the single / not-married people, then the civil-marriage people, and the lowest share is among widows — maybe they simply didn't take out loans as often as the others, or there could be some other reason, which we'd need to look into afterwards.

Then we pick NAME_INCOME_TYPE — working, state servant, student, businessman and so on — and check the absolute number of applicants and the share who didn't pay the loan back. For applicants on maternity leave, around 40 percent didn't repay; for working people it's only about 10 percent, and for state servants around 6 percent. So most clients are married; single people and those in civil marriages have a somewhat higher chance of defaulting than the others. The working class applies for most of the loans and has a very low default rate, so they are the most reliable customers the company has. The other insight is that clients who are unemployed or on maternity leave have the highest default rates even though they are a minority compared with the other income types, while commercial associates and state servants are fairly reliable, repaying their loans much more consistently than the maternity-leave group.

Then we target OCCUPATION_TYPE — laborers, core staff, HR staff, IT staff, secretaries and so on. Low-skill occupations such as laborers, waiters and drivers, especially those in the lower income ranges, are the most likely to be loan defaulters, whereas the higher-paid staff have a lower chance of defaulting and are more reliable.

We also plot the education type: lower secondary, secondary / secondary special, incomplete higher, higher education and academic degree. For the academic degree the default share is close to zero. This plot is quite interesting, because the lower-secondary people are the ones most often in debt on their loans, and the academic-degree people are the least.
Most of these loans are for secondary or higher education — that is the first understanding we get here — and the biggest risk for the company comes from the lower-secondary loans, generally followed by the secondary / secondary-special group. Those could eventually be paid back, since they lead on to higher education or an academic degree, but these two groups are much riskier for the company.

Then we plot the features for ORGANIZATION_TYPE — transportation businesses, the security industry, universities, police, military, religion, insurance, banks and so on. The highest absolute counts are for Transport: type 3, then Industry: type 13, then Industry: type 8, restaurants, construction and so on; the graph keeps going down because in the code we sorted in descending order (ascending set to False). So most of the loans are taken by the Transport type 3 organizations, then Industry type 13, then Industry type 8, with the fewest loans from Industry type 5 and type 1, the security industry, universities and the police. But there are also a lot of applicants who haven't paid their loans back: organizations such as Transport type 3 and Industry type 13 are the highest defaulters. You can also set a cut-off for where you consider organizations risky — say everything down to the realtors counts as the higher-risk band and the rest are not that risky for the company; Trade: type 4, for instance, is hardly risky at all and is one the company can rely on most.

Once that is done we have NAME_HOUSING_TYPE — the kind of housing the applicants live in: ordinary apartments, municipal apartments, living with parents, office apartments, co-op apartments (the shared ones) and rented apartments. Ordinary apartments account for by far the most loans in absolute terms; rented apartments are only a small fraction of that, and co-op apartments barely appear at all. Housing type matters a lot, because we also need to compare the share of unpaid loans against those absolute numbers.
Apartments have the highest number of loans, and relative to that the people who haven't repaid are concentrated among those living in rented apartments, then those living with their parents, while the lowest share is for office apartments — the housing provided by employers. Overall the default rate here is around eight percent; people in rented apartments or living with parents are rare in absolute terms but comparatively risky, so they are not the ones the company can rely on, whereas the office-apartment applicants are ones the company can rely on quite easily.

Then we check on which weekday the application process starts. For most of our customers it is Tuesday, and for the fewest it is Sunday; weekdays are more common in general. But this doesn't really give any insight about the defaulters, because the default rates sit in almost the same range for every day, so we can't say one group is more or less reliable. Similarly, most clients are not accompanied by anyone when they apply for a loan, and there's no significant difference in repayment between those groups, so this is a redundant feature — that's the first observation, and the second is that when you eventually build an ML model this is a feature you can simply exclude, since we're not getting any information from it.

Next we check the region flags: whether the client's permanent address matches the contact or work address, and whether a mismatch makes them riskier for the company. The flag marks whether the registered region matches the region the client actually lives in (and likewise for the work region). The counts are very lopsided, but again this isn't very useful, since the default rates look almost the same — it doesn't give us anything we could take to a manager as "these are the riskiest people". At the city level the difference is a bit bigger — roughly two percent versus four percent — so the city-level flags are something
we might be able to use, though even they are not very useful in this case.

Now we check the remaining categorical features against the target. First the contract types: cash loans versus revolving loans. Compared with revolving loans, cash loans have a higher rate of being repaid late or not being repaid at all, while revolving loans are paid back faster — that is genuinely valuable information. Next, gender: females take most of the loans and males take fewer, and in terms of repayment males are less likely to pay the loan back on time, whereas females, in contrast, are more likely to repay faster. Then FLAG_OWN_CAR — whether the client owns a car. Those who do own a car are more inclined to pay the loan back: for "no" there is a higher chance they won't repay soon, and for "yes" they tend to repay sooner. The next categorical column is FLAG_OWN_REALTY; it is categorical because it only holds Y or N (you can convert it to 0/1 when you eventually feed the data to an ML model). People who own real estate are more than double those who don't, but both categories have around an eight percent chance of not repaying, so this one is not a very valuable insight.

For the numeric columns, I convert the variables to numeric — pandas has a to_numeric method you can apply across the data frame and it will convert everything for you. Once that's done we build some derived metrics: bins for the continuous variables. For example, AMT_INCOME_TOTAL and the credit amount people have taken span a wide range rather than a few repeated values, so I want to convert those ranges into bins — slots — and label them: if a value falls in a given range it goes into the corresponding slot. That is how the income range column is derived from the income total; a sketch of this binning step is below.
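A minimal sketch of the binning, assuming pd.cut; the bin edges and labels here are illustrative, not the exact ones used in the session.

```python
# Derived metric: cut a continuous amount into labelled ranges ("slots").
income_bins   = [0, 100_000, 200_000, 300_000, 400_000, 500_000, np.inf]
income_labels = ['0-100k', '100-200k', '200-300k', '300-400k', '400-500k', '500k+']

df_app['AMT_INCOME_RANGE'] = pd.cut(df_app['AMT_INCOME_TOTAL'],
                                    bins=income_bins, labels=income_labels)
# AMT_CREDIT can be binned the same way with its own edges.
print(df_app['AMT_INCOME_RANGE'].value_counts())
```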
Once that is done, we divide the data set by target: the people who are having difficulties paying the loan (target 1) and everyone else — the people who have paid the loan or are not having difficulties (target 0). For target 0 I take the rows of the application data frame where the TARGET feature is 0 and put them into a variable called target0_df_app, and likewise for target1_df. We also calculate the imbalance percentage: most of the rows are target 0 and very few are target 1, so most applicants are not actually having payment difficulties. Then we plot the count of the target variable per category.

Next comes what I've called a uniplot. Why do we need it? Because beyond the plots above we want to see, for example, if the applicant is female, what her occupation type is, and the same for males — maybe females in certain occupations (cooking staff, HR and so on) repay quickly while their male counterparts in private service or secretarial roles do or don't. The function first sets rcParams; if you're familiar with bash or vim, it's like the rc file you configure — here you configure how your whole plot looks. Then we pass the data set, the column, and the gender as the hue, set up the figure and axes (we're again plotting subplots), put the y-scale on a log scale, and use a count plot, because we're counting how many females and males fall under each occupation type — private service staff, cleaning staff, cooking staff and so on. A sketch of the target split and of this helper is below.
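The helper name, styling and column choices below are illustrative; this only sketches the split-by-target and count-plot idea described above.

```python
# Split by target, then count occupation type by gender on a log scale ("uniplot").
target0_df = df_app[df_app['TARGET'] == 0]   # repaid / no payment difficulties
target1_df = df_app[df_app['TARGET'] == 1]   # payment difficulties

def uniplot(df, col, hue, title):
    plt.rcParams['figure.figsize'] = (14, 5)     # configure the plot size globally
    ax = sns.countplot(data=df, x=col, hue=hue)
    ax.set_yscale('log')                         # counts span orders of magnitude
    ax.set_title(title)
    ax.tick_params(axis='x', rotation=90)
    plt.show()

uniplot(target0_df, col='OCCUPATION_TYPE', hue='CODE_GENDER',
        title='Occupation type by gender (loans repaid on time)')
uniplot(target1_df, col='OCCUPATION_TYPE', hue='CODE_GENDER',
        title='Occupation type by gender (payment difficulties)')
```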
Running the uniplot on target 0 — the people who are not having difficulty repaying — with OCCUPATION_TYPE, females in blue and males in orange: among laborers there are fewer females than males, in sales staff and core staff there are more females, among drivers far fewer females, and so on. The HR staff and especially the secretaries show a much bigger gap — most secretaries are female, and they have no difficulty paying the loan back; most of the males in security staff and most male drivers also repay without difficulty, and the same goes for medicine staff, where the women repay faster than the men.

Then we do the same for the target 1 data frame — the people who are having payment difficulties. Here, among medicine staff, females are the largest group having difficulties, but the key thing to watch is the y axis: you need to check the actual numbers involved, and these two plots have to be read side by side. Among drivers, the males have more difficulty repaying than the females; among high-skill tech staff the females have more difficulty than the males; and among HR staff the males have little difficulty while the females have more trouble repaying.

Then we build a pivot table with occupation type and gender side by side and the TARGET column as the values (the zeros and ones), and plot a heat map. As explained earlier, heat maps are really useful for this kind of analysis: gender versus occupation type is analyzed against the target variable — in the other plots the extra dimension went into the hue parameter, but here it goes into the values. From the heat map, females working as accountants or in private services are more likely to repay, while females in cooking or HR roles are among the biggest defaulters — they repay less readily than their male counterparts — and low-skilled male workers also have a higher default rate. By defaulters I just mean the people who don't pay the loan on time, exceed the time limit, or don't pay it back at all. This matches the plots we saw above, and we'll plot a few more heat maps next, starting with the distribution of the car-owner flag. A sketch of the pivot-table heat map is below.
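A minimal sketch of the pivot-table plus heat-map step, assuming the mean of the 0/1 TARGET is used as the cell value (i.e. the default rate per occupation/gender combination).

```python
# Default rate for every occupation-type / gender combination, shown as a heat map.
pivot = df_app.pivot_table(index='OCCUPATION_TYPE', columns='CODE_GENDER',
                           values='TARGET', aggfunc='mean')

plt.figure(figsize=(8, 10))
sns.heatmap(pivot, annot=True, fmt='.2f', cmap='RdYlGn_r')
plt.title('Default rate by occupation type and gender')
plt.show()
```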
Checking the car owners: for target 0, females without a car repay less readily, while males who do own a car are more likely to pay the loan back, and the picture is similar for target 1, the people with payment difficulties. From the heat map, the male candidates without a car are the least likely to repay compared with the other sub-categories, and from the bar plots, people without cars in general are the biggest defaulters; digging further, females without cars default slightly more than males without cars. Overall, people without cars are the more likely defaulters — if you own a car you're more likely to pay the loan back, and that is the key insight here.

We do the same for the contract type — cash loan or revolving loan — again split by gender, with the heat map and the two bar plots. The target 1 graph is the key one: the cash loans don't tell us much, but for the revolving loans females are the ones most likely not to repay, while males tend to be the first to pay revolving loans back. From the heat map, male candidates with cash loans have the highest default rates, and reading the heat map together with the bar plot, males and females with cash loans default at roughly the same ratio, whereas for revolving loans — although the numbers are smaller — the females who have taken them are the most likely to have difficulty repaying.

At this point Satyajit suggests that, since everyone has at least some Python knowledge, we hand over the code so people can go through it themselves and reach out to us if anything is unclear, and that we walk through the EDA use-case PPT instead so the insights are easier to follow. That sounds good to me — we're essentially repeating the same analysis across different features, and it's something you can work through on your own. Satyajit also admits that, in hindsight, an easier use case might have been better: this one is pretty big, with a lot of rows and columns, which is why it's a bit complicated to explain and to follow. So we'll upload the code and send it to all of you,
and you can just go through it; if you have any questions, reach out to us. Satyajit then takes over to present the summary PPT.

So guys, this is the overall use case — my voice is audible and my screen is visible, right? This is a very wide use case, so even for us explaining it fully in one hour is difficult, but the business understanding should be clear: loan-providing companies find it hard to give loans to people with insufficient or non-existent credit history, and some customers use that to their advantage by becoming defaulters. Suppose you work for a consumer finance company that specializes in lending various types of loans to urban customers — you have to use EDA to analyze the patterns in the data so that applicants capable of repaying the loan are not rejected. We do the EDA to get insights, and using those insights the next step is to build the model. If you want an easier use case, I'll also share a session I gave about two months back on telecom churn prediction, which is at a beginner level.

The first thing in any classification problem — fraud versus non-fraud, churned versus not churned, this versus that — is to analyze the target variable. Our target here is whether a customer is a defaulter or not, and the first intuition we get is that the target ratio is 92 to 8 (these plots are taken from the code itself; the PPT is just a summary). The data is highly imbalanced: most loans were paid back on time, which is target 0, and target 1 are the defaulters. Our main task as data analysts is to analyze those defaulter records — we'll also analyze the target 0 records, but we need more information on the defaulters — so we analyze the data against the other features while keeping the target value separate.

From the missing-data analysis we learned a lot: many variables have more than 50 percent missing values, while many others are only in the 0–1 or 0–2 percent range. As I told you, there is no rule saying "more than 50 percent missing, ignore the column" — that is the easiest and most naive approach, but as a data analyst you're expected to know your domain and take that decision based on domain knowledge. Many columns have a lot of missing data (30 to 70 percent), some have a little (13 to 19 percent), and many have none at all — that is the initial finding. For features with few missing values we can either use regression to predict the missing values or fill with the mean; for features with a high number of
missing values it is better to drop those columns, as they give very little insight in the analysis. As I've already mentioned, there is no thumb rule for the exact criterion on which to delete columns with a high number of missing values — we did a small analysis and took decisions. Once you have the code with you, my recommendation is to run it while watching this video in parallel so that you understand more. The initial intuition from the data: the total number of columns with more than 30 percent null values is 64, and there is no thumb rule to drop variables above 30 percent — remember the car and car-type scenario I explained earlier, where you can't blindly remove those columns — so it's up to you how you deal with it; the detailed handling is in the code.

Then comes the categorical analysis. Univariate analysis is the most basic analysis and usually doesn't give much information: just by analyzing single variables we won't find much insight about the defaulters — we only get an idea of which categories of people are present in abundance. Most of the insights come from analyzing multiple features together with the target variable. For example, income type versus target: the working class applies for most of the loans, which is quite obvious since it has the largest numbers, and it has a very low default rate — only about 10 percent of working applicants haven't paid. But look at maternity leave: even though the number of people on maternity leave is small — I don't know the exact number, it could be 500 or a thousand — about 40 percent of them have not repaid, perhaps because they're on leave or for other reasons. These are the insights you start getting once you analyze multiple variables.

Similarly, occupation type versus target: low-skilled laborers and lower-grade staff are the most likely to be loan defaulters, more so than high-skilled staff and accountants, which is understandable — laborers, sales staff and core staff have the highest percentages of unpaid loans, possibly for monetary or other reasons. So the better the occupation, the lower the chance of defaulting; that is the insight we got. Education type versus target: secondary is the largest group, higher education second, then incomplete higher, and the category with the most defaulters is lower secondary — the fourth category — which has the highest percentage of unpaid loans. Most of these loans are for secondary or higher education, and the lower-secondary education loans are the most risky for the company, followed by secondary / secondary special. I don't know the exact intuition behind why that is, but this is
what the data is telling us — I'm not drawing any conclusion beyond that. Next, organization type versus the target variable. It looks messy because there are a lot of values — Transport: type 3, self-employed and "others" are the largest — and the organizations with the highest percentage of loan defaulters are Transport type 3, Industry type 13 and so on; you get the point. Similarly, occupation type versus defaulters. These are the analyses you need to do: if you have more columns your EDA process will be lengthy, and if you have fewer columns you can finish the EDA in a couple of days.

Then occupation by gender. Some good insights: among female applicants, accountants, private service staff, secretaries, realty agents and HR staff are the most defaulted sub-categories — for accountants and cleaning staff you can see the female ratio is very high — and among male candidates, low-skilled laborers and drivers are the highest defaulters against their counterparts. The same kind of analysis applies to income range versus defaulters: for target 1 the male counts are higher than the female counts, and in the lower salary ranges male candidates have the highest default ratio, around 14 percent — obviously, the lower the salary range, the higher the default ratio, which is a fairly obvious insight. All these kinds of analyses have been done across a lot of variables, and the entire code will be available to you for further analysis.

Next comes the numerical analysis — for example the age analysis, where we converted age into ranges and then analyzed them. The 20–30 age group are the biggest defaulters; as clients get older they tend to repay their loans on time more often, perhaps because they are more responsible, while the 20–30 group are younger and probably less reliable than older clients. Even though the correlation is less significant, it does affect the target. This is how the age distribution analysis is done. All of this has to be done across more than 100 features, which is why it's complex — take your time going through the code and reach out to me if you have questions. I'll also send you this PPT; the PPT plus the code should clear most of your doubts.

So these are our final thoughts from the entire EDA. Banks should focus more on income types Student, Pensioner and Businessman, with housing types other than co-op apartment, for successful
payments — those are target areas. Banks should focus less on income type Working, because working applicants are more likely to pay; the slide says they have the most number of unsuccessful payments, which I think is a typo — it should read successful payments. Also, loans with purpose "repair" have a higher number of unsuccessful payments. Laborers, sales staff and drivers seem to be the biggest defaulters, as we concluded earlier, and digging further into the female candidates, most of the waiters, private service staff, realty agents, HR staff, IT staff and secretaries are the defaulters — I'm not targeting the HR people, this is just what the data says — and female applicants without a car are the biggest defaulters. So you get the idea: this is how EDA has to be done — analyze every variable against the others, do the numerical analysis, and that's how you get the outcome of an EDA.

That's it for today's video. I know it was lengthy, but I hope you got it. I'll pass on the code and the entire presentation, and next week we're having similar webinars. I'll keep the call open for a couple of minutes in case you have any doubts. Shubham asked about the sessions — we usually do them at this time, Sunday 8 p.m., most weekends but not every one; if you're part of the community or have subscribed to the channel you'll get the notifications. Next week's class is going to be on Power BI. Thanks for joining me for the session, Shivani — the use-case part was really nice, and I hope the rest of my part was informative too; hopefully we can collaborate on similar sessions in the future.

One more thing for the viewers: my YouTube channel is all about free courses, webinars and workshops, and next week's Power BI webinar is also free, but slightly off topic — I recently posted about an AI, data science and analytics program conducted by BeingDatum. It's an end-to-end, five-month, instructor-led data analytics and data science program with unlimited one-to-one guidance from me, and obviously not free. If you're interested, fill out the Google form in the description below or ping me on WhatsApp or LinkedIn; even if you're not interested in the paid program, you're welcome to join all my webinars and workshops for free. Thank you for joining this session — I have a few more sessions lined up on various topics, so please subscribe to the channel to keep me motivated, share the videos and like them. Shivani, would you like to add something?

It was good — although I know the code was really long, which is why it took a lot of time to explain, and we couldn't really do it justice because it's a long topic, practically a subject in itself. If you have any doubts, I've put the code up on my GitHub, so feel free to check that out; it also has the link to the data set, and
Satyajit will also be sharing the data and the code with you separately, but if you're curious to check it out right now you can go to my GitHub. Apart from that, if you have any doubts feel free to reach out to us — as he mentioned, our LinkedIn profiles are linked below. It was really fun joining for another session; let us know which sessions you'd like next and what you'd like to hear from us, so that we provide what you want. Thank you for having me here again.

We'll wait two more minutes in case anyone has specific topic requests — just leave them in the live chat and we'll finalize accordingly. Abdul asks whether MATLAB can be used as a tool: yes, you can perform EDA with MATLAB as well — it provides a lot of examples and applications, has many algorithms built in, and a lot of ready-made code you can start using straight away, so in that sense MATLAB is easier, but I'd personally still stick with Python because it gives me more freedom to do the same things. There are a lot of analytics tools; it's a debatable topic, so let's not go down that path. Guys, if you want sessions on a specific topic, don't just write "deep learning" or "machine learning" because those are huge — name something specific, say CNNs. For Python, I already have an entire Python playlist on my channel with 17 or 18 videos, so go through that. The immediate next week will be on Power BI, with a guest joining, Sangamesh — I'll share the details over the groups, and if you're not in the groups, subscribe to the channel and press the bell icon to get notified. Bala asked about time series — it was covered in one of the previous sessions, but I'll do another live session on time series soon, probably in two weeks, because it's very important. Support vector machines — we can do a session on classification algorithms as a whole, which will cover SVMs as well. Any other doubts? Okay, that's it — thank you guys, it was nice having you all; I think this was the first webinar where we had the highest number of participants, and at some point I saw 61,
which really shocked me. Computer vision — I'll try to take that up, Jagdish, but before that, are you comfortable with deep learning? If not, I already have videos covering deep learning and CNNs and all those topics in depth, and we can do a basics-of-computer-vision session very soon. Shivani has already left, which is fine — I'll keep talking. Any questions, anything specific to your career? I'm open to that too: we created this community called BeingDatum — have a look at our website, beingdatum.com. It's my own community, started with a vision of free education, which is where all these lectures came from; apart from the free material we also have some paid offerings, but I'm not marketing those here. Someone asked where to get raw examples to practice on: this is one such raw use case — I'll share the code and the data sets with you so you can practice, and if you're not confident in Python you can take the data sets and analyze them in Power BI, Tableau or Excel, it's up to you. Jagdish, could you contact me on LinkedIn? I'll be the right person to talk to there. Any other questions?

Also, for the ones asking about machine learning and computer vision — this is completely unrelated, but a free course is starting from April 1 in collaboration with a well-known foundation, so if you're interested you can reach out; it will be posted on LinkedIn as well, and it covers everything from the basics so you get a better grounding.

Great. Okay guys, two and a half hours — a very lengthy session. Thank you for joining, and let's catch up next week. Please have a look at my channel if you're new; there's quite a lot of good material there and I'm trying to put up more content, though managing a YouTube channel is hectic. Thank you, good night. Thank you, Shivani. And everyone, please go check out his channel — it's amazing, he's doing great — all the best and have a great day. Bye, thank you all, thanks Jefferson, Jeffrey, thanks Ranvija, thank you everyone.
Info
Channel: Satyajit Pattnaik
Views: 3,190
Keywords: eda using python, exploratory data analysis python, eda use case, python, python tutorial, satyajit, live learning, data science, data insights, business insights, data insights analyst, business insights and analytics, data insights solutions, learn data insights, data science insights, data analysis insights, analytics, data, power bi, tableau, ai, artificial intelligence, machine learning, insights, satyajit pattnaik, data analysis, satyajit eda, satyajeet pattnaik
Id: TomrEJdULxo
Length: 151min 32sec (9092 seconds)
Published: Sun Mar 21 2021