Data Analytics For Beginners | Introduction To Data Analytics | Data Analytics Using R | Edureka

Video Statistics and Information

Captions
[Music] Hi everyone, on behalf of edureka I welcome you to this session on data analytics for beginners. This session will help you understand how you can get started with data analytics. Before I begin, let me quickly cover the concepts for today's session. We'll start by understanding the introduction to data analytics, then I'll tell you what statistics is, and after that I'll tell you how you can perform data cleaning, data manipulation, and data visualization. Once you understand those basic skills, that is statistics, data cleaning, and data visualization, I'll tell you about the plus point for a data analyst, which is machine learning; I'll talk about machine learning a little bit and then cover the roles, responsibilities, and salary of a data analyst. Once you understand all the theory part of this session, I'll end with the hands-on part, where we will see how you can perform data analytics on a specific dataset. I hope the agenda is clear to you guys.

Now, as I said, the first topic is the introduction to data analytics, so let me quickly cover why we need data analytics. With the presence of humongous data around us, it's obvious that we need to analyze the data for our benefit, whether for gathering hidden insights or for generating reports. This analysis benefits enterprises by enabling proper market analysis and by improving business requirements, so in today's market this field has gained a lot of popularity: it lets you gather hidden insights, generate reports, perform market analysis, and improve business requirements. With that said, let me tell you what exactly data analytics is. As the phrase suggests, data analytics refers to the techniques used to analyze data to enhance productivity and business gain. Data is extracted from various sources and categorized to analyze different behavioral patterns. The techniques and tools used to perform data analytics vary from organization to organization, or you could even say individual to individual. If I have to define data analytics for you, then data analytics is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. In short, if you have an understanding of your business administration and also have the capability to perform exploratory data analysis to gather the required information, then you're good to start a career in the data analytics field.

Talking about a career in data analytics, once you have the capability of combining business administration with exploratory data analysis, you can become a data analyst. So now let me quickly tell you who exactly a data analyst is. A data analyst is a professional who collects data from various sources, analyzes the data on various aspects, and then finally generates reports. These reports are distributed to the respective teams so they can use the analyzed data and drive improvements in the business. To become a data analyst you need a set of skills, as you can see on the screen. The basic skills a data analyst should possess are the ability to perform statistics, data cleaning, exploratory data analysis, and data visualization. Apart from these skills, if a data analyst also has
knowledge of machine learning, then that would obviously be a bonus point for his or her skill set, as he or she would be able to build a model and then test the model as well. Don't worry guys, I will be talking about each of these skills one by one, starting with statistics.

Statistics is a mathematical science pertaining to data collection, analysis, interpretation, and presentation. It is used to process complex problems in the real world so that analysts can look for meaningful trends and changes. Analysts review the data so that they can reach conclusions, and several statistical functions, principles, and algorithms are implemented to analyze the raw data, build a statistical model, and infer or predict a result. If you have to understand statistics in a single sentence, then statistics is a branch of mathematics dealing with data collection and organization, and then performing analysis, interpretation, and presentation. Statistical analysis basically has two categories: descriptive statistics and inferential statistics. So let's get started by understanding each of them one by one.

Starting with descriptive statistics: descriptive statistics uses the data to provide descriptions of the population, either through numerical calculations or graphs or tables. Descriptive statistics helps organize data and focuses on the characteristics of the data by providing parameters. As you can see in the example on the screen, suppose you wanted to distinguish the objects based on color; then you can see that this type of statistics, that is descriptive statistics, divides the data into two sections based on the colors, which are black and red over here. To make it a bit more general, suppose you want to study the average height of students in a classroom. In descriptive statistics what you would do is record the heights of all the students in the class and then find out the maximum, minimum, and average height of the class. Now this was just a simple example, guys; if you look at the enterprise level, you may have a large dataset with a number of columns, and you can pick one column and find the minimum, maximum, and average of that particular column. Also, in descriptive statistics we try to represent the data in the form of graphs like histograms, line plots, scatter plots, and so on, but the data is represented based on some kind of central tendency. When I say central tendency, I mean that a particular graph represents the distribution of the mean or a measure of spread, depending on what kind of graph you're using and what's on your graph. For that you have to understand a few measures in statistics, which are basically the measures of center and the measures of spread.

Talking about the measures of center, there are mainly three terms that you need to understand: the mean, the median, and the mode. Starting with the mean: the mean is basically the measure of the average of all the values in the sample. If you consider the example on the screen, then to calculate the mean of the sample present on the screen you just have to add all the numbers and divide by the count of numbers. Since we want to find the mean of eight values, we're going to divide the complete sum by eight, and that's how you can calculate the mean of the sample. Moving on to the next term, the median: the median is
basically the measure of the central value of the sample set. If you consider the example on the screen, you can see that there are eight values; now to calculate the median you have to take the fourth value and the fifth value and then divide by two. Since our fourth value over here is 22.8 and the fifth value is 23, I'm just going to add both these values and divide by 2, so the value that you get, that is 22.9, is the median of the sample. Moving on to the next term, the mode: the mode is nothing but the value that occurs most frequently in the sample set. If you consider the example on the screen, then out of all the numbers that you see on your left-hand side, you would see that 25 occurs the maximum number of times, so 25 would be the mode for this particular sample set. As and when the sample changes, the mean, median, and mode values also change. So those were the measures of center, guys.

Now moving on to the next measure, the measure of spread. The measure of spread has basically four terms that you need to understand: the range, the interquartile range, the variance, and the standard deviation. Starting with the range: the range is basically a measure of how spread apart the values in the dataset are. Suppose you have 10 values; then to calculate the range of these 10 values you just have to subtract the minimum value from the maximum value, so the formula is maximum minus minimum. Moving on to the next term, the interquartile range: the interquartile range is basically a measure of variability based on dividing the dataset into quartiles. To understand quartiles, let's just consider a sample set of eight values, so let me quickly open my notepad and show you how you can calculate the interquartile range. Let's say we have eight values: 1, 2, 3, 4, 5, 6, 7, 8. To calculate the interquartile range, first you have to calculate the quartiles. To calculate the quartiles you have to find the average between pairs of numbers: the average between 2 and 3, then the average between 4 and 5, and then the average between 6 and 7. So let me quickly calculate: 2 plus 3 is 5, divided by 2 is 2.5; similarly 4 plus 5 is 9, divided by 2 is 4.5; and finally 6 plus 7 is 13, divided by 2 is 6.5. So basically you have three values, 2.5, 4.5, and 6.5, and these values define your quartiles: the first quartile falls after two, the second quartile after four, and the third quartile after six. So your sample set gets divided like this, and as you can see on the screen, we have four quarters. Now the difference between the first quartile and the third quartile is your interquartile range. So if you want to calculate the interquartile range, you simply have to first calculate the quartiles for your sample set, and then the difference between those two quartiles is your interquartile range. I hope I'm clear with this part.
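As a minimal sketch of these measures in R (the language used in the hands-on part later), the base functions below compute the same quantities; the specific eight values are made up for illustration, and note that base R has no built-in function for the statistical mode, so it is derived from a frequency table here. R's quantile() also uses interpolation by default, so its quartiles can differ slightly from the hand-calculated averages above.

# a small hypothetical sample set of eight values
x <- c(21.5, 22.0, 22.4, 22.8, 23.0, 25.0, 25.0, 26.3)

mean(x)                                 # measure of center: average of all values
median(x)                               # measure of center: middle value of the sorted sample
as.numeric(names(which.max(table(x))))  # statistical mode: most frequent value (25 here)

max(x) - min(x)                         # measure of spread: range = maximum - minimum
quantile(x)                             # 0%, 25% (Q1), 50%, 75% (Q3), 100% quantiles
IQR(x)                                  # interquartile range = Q3 - Q1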
Now moving on to the next term, variance: variance basically describes how much a random value differs from its expected value. Whenever you want to calculate how much any random value differs from its expected value, you're basically calculating the variance, and it basically entails computing the squares of the deviations. With that, let's move on to the final term, the standard deviation. Standard deviation is basically the measure of dispersion of a set of data from its mean, so whenever you calculate the mean and then want to know how far the data is dispersed from that mean, you would calculate the standard deviation. Folks, I'm not going to go into depth on how you calculate each of these terms; if you want to learn more about statistics, I'll leave a video link in the description box and you can refer to that video, which is basically statistics for data science, where you will understand all these terms and how to calculate the values. So guys, that was all about the various measures that you need to go through in descriptive statistics.

Now moving on to inferential statistics, which is the second category of statistics. This is basically used to build a model and then give a probable solution. Inferential statistics generalizes from a large dataset and applies probability theory to draw a conclusion; it allows us to infer population parameters based on a statistical model using a sample dataset. Again, let's take the same example where we have to segregate a few objects based on color. When you implement inferential statistics on these same objects, a statistical model is built and, based on that, a conclusion is given; that's how inferential statistics works. If I have to generalize this for you, you can again take the example of calculating the average height of students in a classroom. Over here what we would do is take a sample set of the class, and then, to apply inferential statistics, we would group the students into tall, average, and short heights, build a statistical model based on this, and expand it to the entire population. So guys, that was all about descriptive statistics and inferential statistics.

Now, inferential statistics has one more term, hypothesis testing, that you need to understand. Hypothesis testing is an inferential technique to determine whether there is enough evidence in a data sample to infer that a certain condition holds true for an entire population or not. What basically happens is that, based on the characteristics of the general population, we take a random sample and analyze the properties of that sample; we test whether or not the identified conclusion represents the population accurately, and finally we interpret the results. Whether or not to accept the hypothesis depends on the percentage value we get from the test. Hypothesis testing is basically conducted in the manner you can see on the screen: it starts with stating the hypotheses, and this stage involves stating the null and the alternative hypothesis; then we formulate an analysis plan, which involves the construction of the analysis plan; and then we move on to analyzing the sample data, a stage which involves the calculation and interpretation of the test
statistic as described in the analysis plan that we formulated before; and then we finally interpret the results, which basically involves applying the decision rule described in the analysis plan. So guys, that's how you conduct hypothesis testing, and that was all about inferential statistics.

Now let me quickly brief you on the differences between descriptive statistics and inferential statistics. Descriptive statistics is basically concerned with describing the properties of the data at hand, whereas inferential statistics makes inferences about a population from a sample. Descriptive statistics presents the data in a meaningful manner, whereas inferential statistics compares data and predicts future outcomes. Descriptive statistics outcomes are shown in the form of charts, tables, and graphs, while inferential statistics outcomes are shown in the form of probability scores. Descriptive statistics describes the known data, but inferential statistics tries to make conclusions beyond the data available. And finally, coming to the last difference: descriptive statistics has the measures of central tendency and spread of data, whereas inferential statistics has hypothesis testing and analysis of variance, which is basically the ANOVA model. So I hope the differences between descriptive and inferential statistics are clear to you guys.

Now let me quickly move on to the next skill that is important for a data analyst, which is data cleaning and data manipulation. Once you get your data, the first step is to remove all the unwanted data. The process of detecting and correcting corrupt or inaccurate records from a database is basically called data cleaning, data cleansing, or data wrangling. You need to make sure that all the null values, corrupted values, or corrupted columns are removed from the data before you start analyzing it. Once you remove such values from the data, the next step to perform is data manipulation, which feeds into exploratory data analysis. If I have to define data manipulation for you, it is the process of changing data to make it more organized and easier to read. Whenever you make your data more organized, make all the tuples clear, and have all the column values in a clear, easy-to-read form, that is basically data manipulation. It's not required that, even if you have five thousand tuples, you view all five thousand tuples together; you can use functions to view just the top ten tuples or the bottom ten tuples, so that you understand what the parameters or columns present in the dataset are (see the sketch below). In this particular step you can completely reorganize the data into various forms based on the column names you choose: you can add a few columns or delete a few columns, remove a tuple, merge a few columns, and so on. Don't worry folks, I'll be showing you a demo where you will be performing data manipulation.
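As a rough, hypothetical sketch of what these cleaning and manipulation steps can look like in R (the data frame df and its Age and Height columns are made up for illustration; the real demo later works on its own dataset):

# hypothetical data frame, just to illustrate cleaning/manipulation steps
df <- data.frame(Age = c(23, NA, 31, 45), Height = c(170, 165, NA, 180))

clean <- na.omit(df)             # data cleaning: drop rows containing NA values
head(df, 10)                     # data manipulation: look at only the top 10 rows
tail(df, 10)                     # ...or the bottom 10 rows
df$HeightM <- df$Height / 100    # add a derived column
df$HeightM <- NULL               # delete a column again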
Now moving on to the next skill for a data analyst, which is data visualization. Data visualization is nothing but the representation of data in the form of charts, diagrams, and so on. You can represent your data either in the form of bar graphs, scatter plots, pie charts, box plots, line graphs, and so on; not only that, you can also visualize the data in the form of more complex plots like histograms, or use two to three different plots together: for example, you can have a scatter plot with a line graph over it, or you can have a histogram and then a line graph over it, and so on. It's completely based on your understanding of what kind of plot you want, how you want to see your data, and how you want to visualize your data.

So guys, these were the basic skills that a data analyst must have. We started with statistics, and I told you the various categories of statistics and how you can calculate the various measures in both of them. After that I told you about data cleaning and data manipulation, which feed into exploratory data analysis, where you clean your data, remove all the null values or corrupted values, and then reorganize your data or your columns. Finally we moved on to data visualization, where you choose the columns between which you want to visualize the data. These were some basic skills that every data analyst should have, but if you also have an understanding of machine learning, then that is obviously a plus point for your skills.

Now let me quickly tell you what machine learning is, in short, and then we'll quickly shift to the demo part. Machine learning is basically a concept which allows the machine to learn from examples and experience, and that too without being explicitly programmed. What basically happens is that whatever data you pass to the machine learning algorithm, the algorithm learns from that data and then predicts the output for you. As you can see on the screen, we have training data, and the data is passed on to the machine learning algorithm; after that the model is created. Now over here, when we add new input data, it is passed to the model which was created, and a prediction is given as the output. Once the prediction is given, you can check whether it's a successful model or not based on the requirement that you have. That's about ML, guys; if you want to learn in depth about ML, I'll leave a video in the description box, and you can understand from there what exactly ML is, where it is used, how it is used, what companies are using it, and so on.

So guys, those were all the skills for a data analyst. Since I'm talking so much about data analysts, let me quickly tell you the roles and responsibilities of a data analyst. He or she should be able to determine organizational goals; he or she should have the capability to mine the data, that is, perform data mining, and also perform data cleaning; once the data is cleaned, he or she should be able to analyze the data properly and pinpoint the trends and patterns, or you can say the behavioral patterns; and finally, create reports with visualizations. Basically, if you become a data analyst, these are the various things you would be responsible for: you should be able to define the organizational goals, mine data, perform data cleaning or wrangling, analyze data and pinpoint the trends and patterns in that data, and finally create reports with visualizations. So I hope the roles and responsibilities are also clear to you now. Looking at this many roles and
responsibilities, you might be wondering what the salary of a data analyst could be. Well, the average salary for a data analyst in the U.S. is around 83,000 dollars, and in India it is around 4 lakhs. These are the starting salaries, but as and when you gain good expertise in data analysis and are able to apply machine learning algorithms, you would be in great demand and have a great market value. So that was all about the skills, roles, responsibilities, and salary of a data analyst.

Now let me quickly move on to the need for R, because the hands-on that I'm going to show you is based on R. R is a programming language that is widely used for data analytics. R is open source and freely available; it is cross-platform compatible; and you can use R with Power BI, which is a visualization tool, so you can create reports using both of these tools together. R is also a powerful scripting language, it is highly flexible and evolving, and let me tell you, it's really simple to code in R, because you just have to install a few packages and then understand how the various functions in those packages take care of your needs.

So guys, that was all about the theory part of the session; I hope you've understood what exactly data analytics is. Now let me quickly move on to the demo part. What I've done is I've chosen a dataset which has various columns like the person's weight, the person's height, body mass index, pulse rate, and so on. Let me quickly open the dataset and show you what it is about. As you can see on the screen, this is the dataset I've chosen: I have a person ID, the gender, the person's age, race, education, marital status, the relationship status, insurance, the poverty level, the number of houses owned or rented, the rooms, the person's body mass index, and various other health parameters. We're going to perform analysis on this particular data. What we're going to do is first import the dataset into R, and then we'll perform the descriptive statistics on it that I was talking about in the theory part; after that we're going to deal with the missing data, and then we'll perform data visualization between two columns, so that you understand how the values in the two columns vary with each other. Then we'll move on to inferential statistics, where I will perform a t-test for you guys, and after that I'm going to apply a machine learning algorithm, that is, linear regression, so that you understand how that's a plus point for a data analyst. I hope the process is clear to you guys, so let me just open my RStudio and quickly show you the code.

All right, as you can see on the screen, this is the code that I've previously written, so I'm just going to walk you through it by running each of these commands one by one. The first step, as I mentioned, is to import the dataset. Before that, let me tell you that we're going to use the dplyr package throughout this code; this package offers various functions that we'll use to work on our dataset. You can install this package if you don't have it installed in your RStudio, but if you do have it installed, you just have to load the library.
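As a minimal sketch of that setup step (assuming the package being described is dplyr, since the functions tbl_df, glimpse, and filter used later in the demo come from it; ggplot2 is added here too because it is used in the visualization part):

install.packages("dplyr")    # only needed once per machine
install.packages("ggplot2")  # plotting package used later in the demo
library(dplyr)               # load the data-manipulation functions (tbl_df, glimpse, filter, ...)
library(ggplot2)             # load the plotting functions (ggplot, geom_histogram, ...)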
Since I have already installed it in my RStudio, I'm not going to do that again. Now what I'll do is import the dataset directly. To import the dataset you have to use the read.csv function, because my sample dataset is in the form of a CSV file; if you had another type of file, you would have to use the corresponding function. If you just type in "read" you can see various options for various kinds of files, but since I have a CSV file, I'm just going to type read.csv. Then I've mentioned the path of the file; since my file is present on the desktop, I've just mentioned that directory, and then I can execute this particular command. But let me tell you one thing: once the file is read, I've assigned it to a variable called example, which will make it easy for me to use this variable again and again in the various steps of the analysis. Now let me run this command: I'm going to press Ctrl+Enter, which is the keyboard shortcut to run this particular line (if you want to run the complete R script you've written, you can press Ctrl+Shift+Enter). I press Ctrl+Enter and you can see that this command has executed.

If you want to view the dataset, you can simply use the command View(example) and click on Run, and you can see that your complete sample dataset, which was present over here, has been imported into RStudio, where you can perform various analysis functions; you can see all the columns and all the values. Now, if you observe this particular dataset, it has 5,000 entries, and you can't really inspect 5,000 entries in your console. To observe just the top entries, you can use the function tbl_df, which basically creates a data frame (a tibble) from your dataset. So I type tbl_df, mention the variable name, example, and assign it back to example. If I run this command you can see it executes, and then if I just type example, you'll see that only 10 rows get printed. Let me drag this out for you: you can see we have ten rows printed to check the values, and we also see the dimensions 5,000 x 32, which says we have 32 columns and five thousand entries. That's how you can use this particular function to see the top ten rows.

Now we're going to use simple commands like head, tail, dim, names, and glimpse to look into the dataset more closely (a consolidated sketch of all of these follows below). The head function shows the first few rows; tail, as the name suggests, shows the last few rows; dim shows the number of rows and the number of columns; if you want to look at the names of the columns you can use names; and glimpse shows the structure of the dataset. I'm going to run head first so that you see a few rows: as you can see, it has printed six rows and around eight columns, and it has also mentioned that there are around 32 columns in total.
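A consolidated sketch of the import and inspection commands just described (the file path and the exact structure of the CSV are assumptions; adjust them to wherever your copy of the sample data lives):

# import the CSV file from disk; this path is just a placeholder
example <- read.csv("C:/Users/you/Desktop/sample_data.csv")

View(example)                 # open the full dataset in RStudio's data viewer
example <- tbl_df(example)    # dplyr: print only the first 10 rows by default
example

head(example)                 # first few rows
tail(example)                 # last few rows
dim(example)                  # number of rows and columns, e.g. 5000 x 32
names(example)                # the 32 column names
glimpse(example)              # dplyr: compact view of each column and its data type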
Similarly, if I run tail, you can see that the last few rows are printed; these are the last few rows of the dataset we have chosen. Moving on to the next command, dim: as I mentioned, dim will give you the number of rows by the number of columns, which is 5,000 by 32. names will give us the column names, so I'll just run names and you can see the 32 column names that we have; this will help you understand what the different column names are and which columns you can use to find relations between the values. We have an ID, sex, age, race, education level, marital status, relationship status, insurance, and so on. These are a few steps, guys, that are not strictly essential, but they will help you understand the dataset better if you're new to data analytics, because you have to start understanding whether there are any null values, what the number of rows is, and what the columns are. If you have a larger number of rows, say around 50,000 tuples, you cannot run every function on all 50,000 tuples, because that would require a lot of memory; for that you can just work with a sample of the dataset, say 5,000 rows, and then apply the model you build to the 50,000-row dataset. Now moving on to glimpse: glimpse just shows the structure of the dataset, as I mentioned before, so let me run that command and show you. As you can see, it prints all the column names and then the values, so you can see that person ID is an integer, sex is a factor with male/female values, age is again an integer, race is again a factor, poverty is in decimal points, and so on. So you know the data types, you know what kind of values are present, and you can also see which columns have different kinds of values. If you observe here, there are a lot of NA values; we're going to deal with them soon, so don't worry. Now let me just minimize this again.

All right, now that you know the basic commands in R and how you can start analyzing the data, let's start with the second step, which is descriptive statistics. If you remember, in the descriptive statistics part I told you there is the measure of center, which has the mean, median, and mode, and the measure of spread, which has the range, interquartile range, and so on. What I'm going to do first is display all the race values, then display the unique values of race, and then the length. If you look at the race column, we have different races: we have Asian, we have Black, we have White, we have Mexican, and so on. To display all these race values, you simply have to use the variable for the data, which is example, and use the dollar sign with the name of the column, so over here it's example$Race. I'll run this particular command and you can see in the output that we've got all the race values. Now if you want to find out the unique values, I'll simply use the unique function and mention example$Race; similarly, if you want to find the unique values of any other column, you just have to mention that particular column instead. So over here I'll mention unique(example$Race) and then I'll
run it, and you can see that we have Asian, Black, White, Mexican, Hispanic, and Other as the various races present in the column. Now if you want to find out the length of all these unique values, you just have to wrap the unique function in the length function and run the query, and you can see that the length is 6. Basically, instead of counting Asian, Black, White, Mexican, Hispanic, and Other on your fingers, you can simply use the length function to find out how many unique values are present in the race column. That's how you can use the length function, guys.

Now moving on to the next part: if you want to calculate the mean, median, and range of any particular column, let's say age over here, you simply have to use the built-in functions in R, that is mean, median, and range, and then mention the variable name, a dollar sign, and the column name. So to calculate the mean, you type mean and then mention the variable name, example, for the dataset, followed by the dollar sign to specifically access the age column. I'll run this particular command and you can see that 36 is the mean age; similarly you can calculate the median, which is also 36, and the range, which goes from 0 to 80. That's how you can use descriptive statistics in R to find out the mean, median, and range.

Now if you want summary statistics for each variable in the data, you can just use the summary function and mention the variable name. I've written summary(example), and if I run this particular command you can clearly see that we've got a summary of all the 32 columns in our dataset. For all the columns with an integer data type you get the minimum value, the first quartile, the median, the mean, the third quartile, and the maximum value; for the columns which have the factor data type, like for example sex, we have the number of females at 2,495 and the number of males at 2,505, and similarly for race we can see that Asians are at 288, Blacks are at 589, and so on. That's how you can get an idea about the different values present in your dataset: you can get the counts, the minimum, the median, and the various kinds of values present in each column with this summary function. This is also one important thing you need to do for any dataset you handle: you have to understand what the various factors are, what types of values are present in your columns, and how many of each there are. If I look at the employment status, you can clearly see that "looking" is at 166, "not working" is at 1,413, "working" is the highest at 2,263, and "not applicable", which means null values, is at 1,158. These are the various features you need to understand.

Now that you've applied descriptive statistics, let's move on to the third step, which is dealing with the missing data.
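The descriptive-statistics commands just walked through look roughly like the following sketch (Race and Age are assumptions about how the CSV labels those fields):

example$Race                  # print every value in the race column
unique(example$Race)          # the distinct race values, e.g. Asian, Black, White, ...
length(unique(example$Race))  # how many distinct races there are (6 here)

mean(example$Age)             # measure of center: mean age
median(example$Age)           # measure of center: median age
range(example$Age)            # minimum and maximum age

summary(example)              # per-column summary: quartiles for numeric columns,
                              # level counts for factor columns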
To illustrate why we have to deal with missing data, let me first calculate the mean of the salary column. When I calculate the mean of the salary, you'll see NA. If I open my dataset and show you the salary column, you'll see that plenty of the values are not NA, but we still get the output NA, which is definitely not what we want: we need the mean of the income. Since there are a number of entries in this column that are NA, that is the reason we're getting NA as the output. So what we can do is find out whether there are missing values or not. Here I could show you the issue because a missing value occurred at the 29th tuple itself, but imagine you have 5,000 tuples and you don't know where the NA values are; you need a way to find that out. That's where the na.rm parameter comes in: na.rm tells the function what to do if there are missing values present. By default this parameter is set to FALSE, but I'm setting it to TRUE so that, if there are any missing values, they are skipped and we still get an output. With this particular command you can see that, even though the salary column has null values, we can still calculate the mean, and the mean value is around 57,000.

Apart from this, you can also use the is.na function, which tells you whether a value is missing or not. I'll just mention the is.na function and inside it the dataset and the column name; since I want to check the salary column, I'll mention that column and run the command. This returns, for every tuple, either FALSE or TRUE: FALSE means there is no NA value there, and TRUE means there is an NA value. That's how you can find out which row or which tuple has a null value. If you want to add up the number of null values, to calculate how many null values are present in total, you can use the sum function, which is an aggregate function, wrapped around the is.na function; it counts the number of TRUE values and prints the total for you. You can see that we have 377 null values in the salary column.

Now what I'm going to do is remove all these null values and replace them with zero. For that you simply have to use the is.na function on the dataset, assign zero to it, and assign the result back to the dataset: I've said that if there is a TRUE (missing) value in the example dataset, replace it with a zero, and then assign it back to the same variable, example. I'll run this particular command, and then let's view the dataset again with View(example). You may have observed one thing over here: not all the columns with NA values got replaced with zero; only the columns with an integer data type had their NA values replaced with zero.
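A sketch of those missing-data steps in R (Salary is an assumed column name; the whole-data-frame replacement at the end only really changes the numeric columns, which matches what the demo observes):

mean(example$Salary)                 # returns NA because the column contains missing values
mean(example$Salary, na.rm = TRUE)   # skip the NAs and compute the mean anyway (~57,000 here)

is.na(example$Salary)                # TRUE/FALSE per row: is this salary missing?
sum(is.na(example$Salary))           # total number of missing salaries (377 in the demo)

example[is.na(example)] <- 0         # replace remaining NAs with 0 (effective for numeric columns)
View(example)                        # inspect the dataset again after the replacement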
For the other factor columns, like education level, marital status, and relationship status, you can handle the null values based on various approaches: you can remove those tuples completely, or you can compare two factors and fill in a value. For example, if I compare education level and marital status, and I see that all the females belonging to the Mexican race go up to an education level of high school and most of them are married, then I can replace the NA values accordingly. It's completely up to you how you want to deal with the missing data; I've just shown you how you can detect the missing values and how you can replace null values with something, and I'll leave the rest for you to explore.

Now moving on to the next part, which is exploratory data analysis. Since we have analyzed the data and understood which columns can be analyzed, which columns have null values, and how some values got replaced with zero, I'm going to perform the next step, which is data visualization, where I'll plot histograms for different column values. To do this I'm going to use the ggplot2 library; just in case you do not have it installed in your RStudio, you can install this package using the command install.packages with the package name, and then you just have to load the library. Since I've already installed the package in my RStudio, I'm just going to run the library command to load it, and then I'll perform the data visualization.

If I have to plot a histogram, you can simply use ggplot, mention the dataset name, which is the variable example, and then mention the column based on which you want to plot the data. Initially I want to plot the data for body mass index, and I'll set the bins to 30. I'll run this particular command and you can see the plot. If I zoom in on this plot, you can see that the body mass index count is highest roughly between 20 and 40, say around 30 or a bit below 30. With this we can see which body mass index values are least common and which are most common, that is, the minimum and maximum counts. Similarly, I can plot a histogram for the person's weight: I'll again use the ggplot function, mention the dataset variable, and then mention the person weight column, which is the one I want to visualize the histogram for. Let me run this particular command and zoom in on the plot. You can see the person weights ranging roughly from fifty to two hundred: most people weigh around 75 to 100, fewer people weigh between 50 and 75, and almost nobody weighs around 200. With this kind of visualization you can get a clear idea of how many people weigh more than, say, a hundred, how many weigh less than a hundred, and what the most and least common weights in the sample dataset are, and so on.
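Here is roughly what those first two histogram calls look like with ggplot2 (BMI and Weight are assumed column names, standing in for whatever the CSV calls body mass index and person weight):

library(ggplot2)

# histogram of body mass index, split into 30 bins
ggplot(example, aes(x = BMI)) + geom_histogram(bins = 30)

# histogram of the person's weight (in kg)
ggplot(example, aes(x = Weight)) + geom_histogram()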
Now, this person weight histogram was plotted based on kilograms. To convert it into lbs, what I've done is multiply the weight by 2.2 and also use a larger number of bins so that we get a clearer visualization. So I've again used the ggplot function, mentioned the example dataset, and in the aesthetics I've mentioned person weight times 2.2, because we want to convert the kilograms to lbs, and then I've set the number of bins. I'll run this particular command and zoom in, and you can see that weights of around 150 to 200 lbs have the maximum count. If you keep zooming in and keep increasing the bins, you'll see the output more and more clearly. So that was this plot, guys.

Similarly, I have also plotted a histogram for age, to know whether there are children present in the age distribution or not. So I'll run this command, where I've used the ggplot function with the example dataset, the age column as the aesthetic, and the number of bins. When I zoom in, you can see that there are a lot of children present in the sample dataset whose age is less than 18. This is one category you can use to analyze the dataset: you could divide your data into a children's dataset and an adults' dataset and perform various analyses on each. That's how you can use this particular visualization to your benefit.

And it's not only histograms: as I mentioned in the visualization part of the theory section, there are various kinds of plots, like bar plots, scatter plots, line plots, box plots, and so on; it's completely based on your understanding and your need which kind of plot you want. Now suppose I want to compare the person's height with the person's weight and I also want the gender parameter to be present; what I can do is plot a scatter plot where I compare the person's height with the person's weight and then color the scatter plot by gender. For that you can again use ggplot, mention example, which is the dataset, then mention person height and person weight, and color by sex, which is the gender. When I run this particular command you can see the output: we've got a plot of the person's weight against the person's height, and from the legend on the right-hand side you can see that the pinkish-red color represents females and the blue color represents males. For example, where the person's height is around 100 most of the points in that section are male, and where the person's height is around 150 and the weight is around 100 you can see a lot of female points. So guys, that's how you can use data visualization to explore and understand various relationships in the data. These were just a few columns and examples that I showed you; you can play around with the data as much as you want. And here's one more trick: whenever you create a visualization, just make a note of what you are getting out of it.
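A sketch of those three plots: the lbs-converted histogram, the age histogram, and the scatter plot coloured by gender (Weight, Age, Height, and Gender are assumed column names):

# weight converted from kg to lbs inside the aesthetic, with more bins for detail
ggplot(example, aes(x = Weight * 2.2)) + geom_histogram(bins = 50)

# age histogram, to see how many children (age < 18) are in the sample
ggplot(example, aes(x = Age)) + geom_histogram(bins = 40)

# scatter plot of height versus weight, coloured by gender
ggplot(example, aes(x = Height, y = Weight, colour = Gender)) + geom_point()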
Out of the visualization that you can see on the screen, what I'm basically getting is the comparison of the person's weight against the person's height, and how much of the female and male population is present in each region. As you can see, the maximum population is present in the range where the person's height is around 150 to 200 and the person's weight is around, let's say, 125 to 250. That's how you can analyze data and gain some hidden insights.

Now moving on to the next part, which is inferential statistics: I'm going to perform a t-test. A t-test is a type of hypothesis testing by which you can compare means; each t-test that you perform on your sample data boils your sample down to a single value, the t value, and as you can see on the screen, this is the formula for the t value. What I'm going to do first is filter the dataset. To filter, I've used the filter function with the dataset name, example, and then you have to mention the condition based on which you want to filter. I want to keep just the adults, so I've mentioned the person's age to be greater than or equal to 18, and the filtered data is assigned to a different variable, exampleA. That's how you can filter the dataset. And if you're wondering why I filtered the dataset: this is just a precautionary measure to prevent us from making mistakes downstream as we keep analyzing the data. Now I'll run this particular command and then view the dataset; when I view exampleA you can see that every person's age is greater than or equal to 18, so there are no children present in this particular dataset.

Now I'm going to perform the t-test. As I mentioned before, a t-test is used to compare means, so I'm going to use the predefined function in R, t.test, to perform a t-test between person age and sex. The person's age is the response and sex is the group, so this is the formula method of performing a t-test in R, and data is your data frame, which is exampleA over here. When I run this particular command, you can see in the output that the data is person age by sex, t is 1.9122, which is the calculated t value, and the p-value is 0.05593. The t-test is basically used to arrive at the p-value, which is the confidence value. This value tells us that if the p-value is less than 5%, that is 0.05, then the alternative hypothesis between these two values is rejected; but if it is more than that, then you can consider the result to be consistent, at around the 95% level, with the sample estimates. With the t-test that I performed on person age and sex, what I'm trying to find out is whether there is any difference between the age of males and females in this particular dataset. The p-value is just above 5%, the mean in the female group is around 47, and the mean in the male group is around 45, so yes, looking at the sample estimates there is a difference in the age of males versus females in this particular dataset.
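A sketch of the filtering step and the first t-test, using dplyr's filter and base R's t.test formula interface (Age and Gender are assumed column names):

# keep only adults, as a precaution before the inferential steps
exampleA <- filter(example, Age >= 18)
View(exampleA)

# two-sample t-test: does mean age differ between the two gender groups?
t.test(Age ~ Gender, data = exampleA)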
Now moving on to the next t-test, which is the t-test between body mass index and diabetes status. I'm performing this test to identify whether the body mass index differs between diabetic and non-diabetic patients. Again, I'm going to use the t.test function, mention body mass index against diabetes status, and then the data frame. I'll run this particular command, and here you can see that the p-value is less than 5%, which means this hypothesis goes the other way, so the answer to the question is no: the body mass index does not differ between diabetic and non-diabetic patients. Moving on to the third t-test that I want to perform, which is alcohol consumption per year versus relationship status: this t-test is trying to answer whether single or married people drink more alcohol or not. Again, I've just mentioned t.test with alcohol per year against relationship status and the data frame name, and when I run this particular command you can see that the p-value is again less than 5%, which means there is no connection between these two values. That's how, folks, you can understand how various columns differ from or relate to each other. So that was all about t-tests, guys; these were the various skills, as I mentioned before, that a data analyst should possess.

Now moving on to the last skill, which is machine learning. As I said before, knowing machine learning is a bonus point. How would you apply machine learning? You have to understand the algorithms and you have to build a model. What I've done over here is use linear regression. As you know, linear models are basically mathematical representations of the process that we think gives rise to our data. The model seeks to explain the relationship between a variable of interest, our Y variable, which is also called the outcome, response, or dependent variable, and one or more X predictors, the independent variables. If you have y = b0 + b1*x, then x is the independent variable, y is the dependent variable, b0 is the intercept, and b1 is the coefficient that describes what one unit of change in x does to the outcome variable y. That was the basics of linear regression, guys. Building a linear model basically means that we propose a linear model and then estimate the coefficients and the variance of the error term. As you can see, if you want to build a linear model in R, you can simply use the built-in function for linear regression, the lm function, and then mention the columns between which you want to estimate the relationship. I've decided to fit a linear regression model of the person's weight on the person's height, and in the data argument I've mentioned my data frame; after that I assign it to fit. Let me run this particular command, and then we'll look at the summary of fit. When we look at the summary of fit, you can clearly see that the p-value is less than 2.2 x 10^-16.
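A sketch of fitting that linear model in R (Weight and Height are again assumed column names; lm and summary are the base-R calls described above):

# model the person's weight as a linear function of the person's height
fit <- lm(Weight ~ Height, data = exampleA)

summary(fit)   # coefficients (intercept and slope), p-values, R-squared, ...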
The intercept term, though, is not very useful most of the time: it basically shows us what the value of weight would be when the height is 0, and that value comes out to around minus 19 kg, which is impossible and could never happen. But if we observe the height coefficient, we can see that it's really meaningful, because each one-unit increase in height results in around a 0.6 increase in the corresponding unit of weight. Now if you want to visualize this particular model, you simply have to use the ggplot function, mention the dataset, which is exampleA over here, mention person height and person weight, add geom_point, and add geom_smooth with method lm. If I run this particular command, you can see that we get a plot. By default, this is only going to show the prediction over the range of the data, which is very important to know: for example, the linear model tells us that the weight would be around minus 19.8 kg when the height is zero. Let me also tell you that you can extend the predicted model's regression line past the lowest value of the data, down to a height of zero, and the bands of the confidence interval basically tell us that the model is apparently confident within the region defined by the grey boundary; but if you think about it, we would never see a height of zero. So predicting past the range of the available training data is not a great idea, I would say, because there's no point in predicting beyond the range of the data when the height would never be zero. So guys, that was linear regression.
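And a sketch of that model-visualization step, overlaying the fitted regression line (with its confidence band) on the scatter plot; same assumed column names as before:

# scatter plot of height vs weight with the fitted linear-regression line on top;
# geom_smooth only draws the line over the range of the observed data
ggplot(exampleA, aes(x = Height, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm")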
What we did in this particular demo was: we started by importing the dataset and learned how to view it; after that we went through a few simple commands, that is head, tail, dim, names, and glimpse; and then we understood how to perform descriptive statistics on a dataset using the unique, length, mean, median, and range functions. After that I also showed you how you can deal with missing data, basically by using the na.rm parameter and the is.na function, and how you can replace all the NA values with zero, but only for those columns with an integer data type. Then we performed exploratory data analysis, where we created visualizations between various columns and understood the relations between them, and finally we performed t-tests to answer a few questions and performed linear regression, which is basically a machine learning model. So I hope you've understood this particular part of the session.

So folks, that was all about data analytics. Now, if you wish to master data analytics, you can enroll yourself in the Data Analyst Master's Program provided by edureka. The program will start by letting you learn the statistical essentials, such as probability, Bayesian inference, and regression modeling; once you get through statistics, you will learn data manipulation, visualization, EDA, mining, and sentiment analysis with R; after that you will get proper SAS training, where you will be taught advanced statistical techniques like PROC SQL, SAS ODS, advanced SAS procedures, and so on; and once you get through all of these, you will learn the visualization tool Tableau to generate proper reports and perform integration with R. Apart from the learning path, we also offer various electives such as QlikView, advanced MS Excel 2010, basics of R, analytics of retail banking, decision tree modeling using R, machine learning with Mahout, and advanced predictive modeling in R. So folks, get ready to master the complete package and become a data analyst with edureka. That's all for today's session; thank you, and have a great day. I hope you have enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to the edureka channel to learn more. Happy learning!
Info
Channel: edureka!
Views: 407,064
Rating: 4.8884578 out of 5
Keywords: yt:cc=on, data analytics for beginners, data analytics tutorial, introduction to data analytics, data analytics tutorial using r, data analytics using r, r tutorial, r for data analytics, data analytics tutorial for beginners, data visualization, machine learning using r, data analysis using r, data analysis tutorial, data analysis tutorial using r, data analysis r, r programming language, programming in r, data manipulation using r, data analytics training, edureka
Id: fWE93St-RaQ
Length: 51min 47sec (3107 seconds)
Published: Thu Jan 10 2019