Learn Statistics for Data Analytics & Data Science from Scratch | Part I | Satyajit Pattnaik

Statistics is one of the most crucial skills needed in the field of data analytics and data science, and even in the field of AI. But do we have enough videos, enough content, on the internet around statistics? Well, if you go and search YouTube and various other platforms, you will definitely find some material on statistics, but those videos are mostly theoretical. They don't really explain how these concepts are practically implemented in the field of data analytics or data science. Everybody talks about sampling techniques, everybody talks about mean, median, mode, normal distribution, skewness or kurtosis, but where do we apply them in practice? Where do we apply them in the real world?

So I have come up with an end-to-end statistics course. None of the videos that are going to be part of this course are published anywhere apart from my official data analytics and data science courses, which means you will not find these videos either on Udemy or on my YouTube channel. This is the first time I'm making all the statistics videos publicly available to you at no cost. I have divided this end-to-end statistics course into three sections. This is part one, which is going to be a two-hour video covering multiple topics, and in the following weeks I will be publishing part two and part three: part two next week, and part three the week after. Combined, it will be around 6 to 7 hours of content comprising multiple topics: sampling techniques, descriptive statistics, inferential statistics, hypothesis testing, types of testing and so on. A bunch of topics with theoretical knowledge and the code part as well; all the code will also be given to you.

That's it, see you in the video. If you are liking this video, please like, share and subscribe to the channel, and do not forget to comment something on this video, because that helps me. Also share this video with your friends on LinkedIn and on various other social media
and spread the word. See you in the video. [Music]

Hi, welcome to this module on statistics. In this video we shall be talking about the introduction to statistics: what exactly statistics is, and how statistics plays a vital role in the field of data analytics and data science. Basically, statistical analysis is meant to collect and study information available in large quantities. Statistics is simply a branch of mathematics where computation is done over a bulk of data using charts, tables, graphs, etc.

For example, you want to analyze two numerical columns. Let's say you have data with multiple columns in it: there could be name, gender, tenure, age, salary, so many things. Imagine you are analyzing two numerical attributes. Numerical attributes are those attributes that hold data in the form of numbers, for example age. Let's say you have the ages of 10 customers; we'll consider four examples: 25, 29, 31 and 35. Similarly, we also have salary: let's say the salary at age 25 is 10,000, at 29 it is around 12,000, at 31 it is around 16,000 and at 35 it is around 20,000. You start analyzing using some charts. There is a chart called a scatter plot: age goes on the x-axis and salary goes on the y-axis. So let's try to plot each one of them roughly. Let's say on the x-axis we have 5, 10, 15, 20, 25, 30, 35, and on the y-axis we have 5,000, 10,000, 15,000, 20,000. Now if I plot this, age 25 and a salary of 10,000 is somewhere around here, age 29 and 12,000 is somewhere around here, age 31 and 16,000 is somewhere around here, and age 35 and 20,000 is somewhere around here. If I plot the graph, this is how it looks, which clearly indicates that there is an increasing trend: if age increases, salary increases as well. This is what the data is telling us using this particular chart.

So statistics, as I told you, is nothing but a branch of mathematics where computation is done over a bulk of data using charts, tables and graphs. The data collected for analysis here is called measurements. Now, if we
have to measure the data for a given scenario, a sample is taken out of a population. What is a population? Imagine I ask you to find the average age of people in Japan or Indonesia or Malaysia, wherever you are coming from. How do you find the average age of people in Japan, or in Hong Kong, or in China? You are definitely not going to ask each and every person in that country, right? Let's take an example. I'm not sure about the population of Japan, but imagine the population of Japan is around 40 million. I'm not going to ask each one of them. This entire thing, the entire 40 million, is called the population, and taking feedback or statistics from a whole population is a completely different scenario; it is not feasible.

What we usually do is some sampling out of it. What if I go to different locations, talk to different kinds of people, take a small chunk, maybe 1,000 or 2,000 people, ask each one of them their age, and then take an average out of it? Just imagine the average age comes out as 39. This is the average age of the sample, not of the population. So this is the difference between population and sample. There are many sampling techniques that we will be covering in the future, based on which you can understand how samples are created in different scenarios, but for the time being I hope you understand the difference between population and sample. The analysis or calculation is then done on those measurements.

I hope you got an initial idea about statistics. Statistics is vital for data analytics, and we are going to learn lots of concepts related to statistics in our future videos.

In this video we shall be talking about types of data and statistical analysis. This is how the agenda looks. We'll get started with the types of statistics,
which are descriptive and inferential. Then we'll go through each one of them: what exactly descriptive statistics is and what exactly inferential statistics is. Lastly, we'll also be talking about the different types of data, which are mainly categorical and numerical, otherwise known as qualitative and quantitative data. See you in the next video.

In this video we shall be focusing on descriptive statistics. In the previous video we learned that statistics is mainly of two types, descriptive and inferential. But what exactly is descriptive statistics? As the name says, descriptive means describing: descriptive statistics summarizes or describes the characteristics of a data set. Imagine you have a data set, and it could be customer-related data. Let's say you have a thousand customers' information, with age, salary, gender and so many other features. The "what" part is the descriptive statistics: what is the data telling you? What is the average age, what is the total population, what share of the population is male, what share is female, what is the average salary? All this information is part of your descriptive statistics, which means describing the characteristics of a data set.

Descriptive statistics consists of three basic categories of measurements: measures of central tendency, measures of variability or spread, and frequency distribution.

Measures of central tendency describe the center of the data set: mean, median and mode. Let me explain each one of them in short using some examples. Let's say I have five people, and the ages of these five people are 25, 30, 15, 25 and 35. What if I ask you what the average age is? We already know what an average is: it is the sum of all these numbers divided by the total number of values. This is the symbol for summation. So here it is 25 + 30 + 15 + 25 + 35 divided by 5, which as a running total is 55, 70, 95,
130, divided by 5, and this becomes 26. So the average age becomes 26. Let me confirm it using a calculator: 25 + 30 + 15 + 25 + 35 is 130, and 130 divided by 5 is 26. This is the average. The mean is nothing but the statistical name for the average, so here the mean is 26. The mean is also denoted by the symbol mu.

What will the median be? The median is simply the middle value. In order to find the middle value, the first thing you need to do is sort the values. If I write these numbers from the smallest to the largest, first comes 15, then 25, 25, 30 and 35, and the middle value is 25, so this becomes your median. What if you have six numbers? Let's say you also have 40. Now 25 alone cannot be the middle value; the middle values are now 25 and 30, so here the median will be the average of these two numbers, which is 27.5.

Mode is something that is usually applicable to categorical data, but in this particular example the mode will also be 25. Why? Because it has the maximum number of occurrences. Mode is mostly used for categorical data. For example, you have gender for five customers: one of them is male, one is female, one is empty, one is female and one is female. Out of these five records, one has a missing value, which means you have to deal with it, and you can deal with it using the mode. Here the mode will be female. Why? Female has three occurrences and male has one. So the mode is the value that has the largest number of occurrences. What if you have two females and two males? In that case your mode could be both of them; this scenario is called bimodal. Anyway, we will be covering measures of central tendency in depth in one of our future videos.

Next is going to be measures of variability, which describe the
dispersion of the data set: variance and standard deviation. What exactly is standard deviation and what is variance? We will definitely be understanding them in depth in some of our future videos, using mathematical examples as well. In simple terms, if you want to remember it, the formula for standard deviation, which is symbolized as sigma, is the square root of the summation of (x_i - mu) squared, divided by N, that is, sigma = sqrt( sum of (x_i - mu)^2 / N ). Again, summation is nothing but addition. Trust me, do not panic on seeing this formula; it will be explained in depth in some of our future videos. The inner part of it, without the square root, is nothing but your variance. Again, theoretically and practically, we will have separate classes for that.

The next part is measures of frequency distribution, which describe the occurrence of data within the data set, which is called the count.

This is a high-level idea about descriptive statistics. We have in-depth classes on each of these topics in the future. That's all for this particular video on descriptive statistics. In the next video we shall be explaining what inferential statistics is, and then we'll jump into the respective videos. See you in the next video.

In this video we shall be talking about inferential statistics. In the previous videos we understood what exactly statistics is and what the different types of statistics are, descriptive and inferential. We got a basic understanding of the descriptive part; however, we still have to go through multiple videos on the subtopics of descriptive statistics. But what exactly is inferential statistics? In simple terms, inferential statistics is a branch of statistics that makes use of various analytical tools to draw inferences about the population data from the sample data. The two main areas of inferential statistics are parameter estimation and hypothesis testing. Well, it is a set of methods used to draw a conclusion
about the characteristics of a population based on a sample of the data. As we discussed in one of the examples, what if I want you to get the average age of an entire country, let it be Japan, Korea or any other country, let's say country X? Is it possible to ask each person their age? No, it's not possible. So what we usually do is infer it from a sample. For example, we talk to 100 people from different locations of that country, different age groups, different types of people, try to find out what each of their ages is, and then we find out that the average age is probably 40. This average age is definitely not from the population but from the sample data. So in short, we are inferring that the average age of the population of country X could be 40. That is what we are inferring, and that is what inferential statistics is all about: a set of methods used to draw a conclusion about the characteristics of a population based on a sample of the data, used to find the population parameter when you have no initial number to start with. We will be covering these topics in depth. That's all about the basics of inferential statistics. See you in the next video.

In this video we shall be talking about the different types of data, one of the most basic concepts to understand before getting into the future topics. We already know the importance of data; we already know that data is vital for solving a data analytics or data science problem. But what exactly is data? Data is mainly of two types: one is categorical data and the other is numerical data.

For categorical data, imagine you are analyzing some data; it could be an Excel sheet or something like that with multiple columns. One of these columns is gender, and gender has values like male, female and so on. This is an example of categorical data, because the values fall into categories. Another example is location. Let's say multiple customers' locations are
captured in the data: somebody stays in Japan, some stay in Korea, some in Indonesia, some in Malaysia. That kind of data is also called categorical data. Now let's say you start analyzing a column called age, which has values like 25 and 29; it basically holds the ages of multiple customers or multiple people. These kinds of columns are your numerical columns. In statistical terminology, we usually term categorical features or categorical variables as qualitative data; similarly, numerical ones are termed quantitative data.

Let's get into qualitative data: what exactly is qualitative data, and what are the further subcategories under it? Qualitative data, also known as categorical data, describes data that fits into categories. Qualitative data is not numerical. Categorical information involves categorical variables that describe features such as a person's gender, hometown, etc. Gender could be male or female; hometown could be any country, Indonesia, Malaysia, Japan, whatever it is. These are all your categorical or qualitative data.

Sometimes categorical data can hold numerical values as well, but those values do not have a mathematical sense. One example of such categorical data is birth date. What if somebody's birth date is written something like this? It is a combination of numerical values, these are all numbers, but it is termed categorical data, not numerical data. We also have favorite sport and school postcode. Here the birth date and school postcode hold quantitative values, they have numbers in them, but they don't carry any numerical meaning. So these kinds of data are considered qualitative data even though they hold quantitative values.

Inside qualitative data there are mainly two types of data. So data, as I told you, is of two types, your qualitative and your
quantitative. Inside qualitative come your nominal and ordinal data.

Nominal data is one of the types of qualitative information, which helps to label the variables without providing a numerical value. Nominal data is also called the nominal scale. It cannot be ordered or measured; this is the most important feature of nominal data, that it cannot be ordered. But sometimes the data can be qualitative and quantitative. Examples of nominal data are letters, symbols, words and gender. For example, consider gender, male and female: we cannot order it. We cannot say males are superior to females or vice versa, so there is no ordering.

Nominal data are examined using the grouping method. In this method the data are grouped into categories, and then the frequency or the percentage of the data can be calculated. These data are visually represented using pie charts. For example, if you start analyzing gender, a good chart will be a pie chart, something like this: female and male, let's say 45% and 55%. They could also be represented in a bar chart with female and male, something like that. Of course, in this example male is 55%, so the bars should reflect that, but I'm just giving examples to show that they can be represented visually using pie charts, bar charts and so on.

Now moving on to ordinal data. As the name suggests, ordinal has something to do with order or ordering. The significant feature of ordinal data is that it can be ordered; that is the first and foremost feature of ordinal data. Sometimes categorical data can hold numerical values, as we already discussed, but those values do not have a mathematical sense. Ordinal data are variables that are mostly used in surveys, finance, economics, questionnaires and so on. What are some examples of ordinal data? One example is economic status: it could be high, it could be medium, it could be low, and we already
know it's in an order, right? It goes from low to high. Or let's say you are analyzing a student grading system: some people have got A, some have got B, some have got C, some have got D, and we already know that it's in an order. These kinds of data are called ordinal data.

How do we represent it visually? Ordinal data is commonly represented using a bar chart. These data are investigated and interpreted through many visualization tools. Let's say you plot the graph for economic status: high, medium and low. These kinds of graphs can give you insights such as that the majority of the people fall in this category and a smaller number of people fall into the low category, and so on. So a bar chart is going to be a go-to visualization technique to plot, visualize and analyze ordinal data. The information may also be expressed using tables in which each row shows a distinct category. That's all about the basics of types of data and qualitative data; we also discussed nominal and ordinal data.

In this video we shall be talking about quantitative data. In the last video we talked about the different types of data, qualitative and quantitative. In this video we shall be focusing more on the quantitative type of data. What do you mean by quantitative? Quantitative data is also known as numerical data, which represents a numerical value, that is, how much, how often, how many. Numerical data gives information about the quantities of a specific thing. Some examples of numerical data are height, length, size, weight and so on.

Quantitative data can be classified into two different categories: the two major classifications of numerical data are discrete data and continuous data. What exactly is discrete data? Discrete data can take only discrete values. Take quantitative data like the number of workers in a company; let's say we are talking
about a company, company X. So, for example, you have quantitative data like the number of workers in a company. Could you divide any one of the workers into two parts? It's not possible, because the number of workers is discrete data. Let's define discrete data in a better way: discrete data is a count that involves integers only, so only a limited number of values is possible. For example, the number of students in a class could be 40, could be 45, could be 50; it cannot be 40.5, it cannot be 39.5. So discrete data can take only certain values, and the data values cannot be divided into smaller parts. How do we plot discrete data? We can plot it using many kinds of graphs; it could be a pie chart, or a stem-and-leaf plot, and so on. Some other examples of discrete data, apart from the number of students in a class, are the number of workers in a company and the number of languages an individual speaks. For example, I speak six different languages; one cannot speak 4.5 languages or 3.5 languages. These types of data are called discrete data.

Now, what is continuous data? Continuous data is data that can be measured. It has an infinite number of possible values within a given range. For example, you can measure your height at very precise scales. Let's say my height in meters is 1.71 m, in centimeters it is 171 cm, and in feet and inches it is about 5 feet 8 inches. You can record continuous data for so many different measurements: width, temperature, time, etc. This is where the key difference from discrete data lies. So this is an example of continuous data, and I hope everybody is clear about the difference between discrete and continuous data. That's all for this video on quantitative data.

Hi, welcome to this module on statistics. In this video we shall be talking about sampling techniques.
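Before the sampling module gets going, the descriptive measures from the earlier videos can be checked with a short sketch. It reuses the five ages from the mean and median example and relies on Python's standard `statistics` module; the choice of Python and the variable names are assumptions, since the transcript does not say which language the course code uses.

```python
import statistics

# Ages of the five people from the central tendency example.
ages = [25, 30, 15, 25, 35]

mean_age = statistics.mean(ages)      # sum of the values divided by how many there are
median_age = statistics.median(ages)  # middle value after sorting
mode_age = statistics.mode(ages)      # value with the most occurrences

# With a sixth value (40) the median is the average of the two middle values.
median_six = statistics.median(ages + [40])

# Population variance and standard deviation (dividing by N, as in the sigma formula).
variance = statistics.pvariance(ages)
std_dev = statistics.pstdev(ages)

# multimode returns every value tied for the most occurrences (the bimodal case).
genders = ["male", "female", "female", "male"]
modes = statistics.multimode(genders)

print(mean_age, median_age, mode_age, median_six)
print(variance, round(std_dev, 3), modes)
```

The population versions (`pvariance`, `pstdev`) are used because the formula in the transcript divides by N; the sample versions (`variance`, `stdev`) divide by N - 1 instead.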
This is how the agenda looks for our next few videos. It all starts with population and samples. What exactly is a population? A population is simply the entire group that you want to draw conclusions about, and a sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population. For example, if somebody asks you to find the average age of the entire population of Japan, it's not possible to talk to each and every person in Japan about their age. Rather, we'll start working on a smaller chunk, capture data from it, and then find that the average age is probably around 40. This is the difference between population and samples, which we will be covering in depth. Next we shall be discussing the types of sampling techniques. There are many types of sampling techniques and we will be discussing them: random sampling, non-probability sampling, population sampling and so on. In short, we will be studying various sampling techniques, population, samples and all these related topics in the next few videos. See you in the next video.

In this video we shall be talking about population and sample: what exactly is a population and what is a sample? The simplest definition is that the population is the entire group that you want to draw conclusions about; the universe of objects that is required to be analyzed is known as the population. A sample is a subset of the population, a part of the population; it is a specific group that you will collect data from. The size of the sample is always less than the total size of the population. In research, a population doesn't always refer to people; it can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries, species, organisms, etc.

Now for some examples of population. As in the example I was giving in the last video, what if somebody asks
you what the average age of a particular country is? It could be any country: Japan, Korea, India, the Philippines. Imagine a country X. If somebody asks whether you can find the average age of a country by asking everyone, no, it is practically impossible. You cannot simply go and talk to each person and ask their age; just going from one individual to another, a lot of time will already be invested in the process. So it is logically not possible to interact with each one of them, get a list of all their ages and then find the mean or average value.

Instead, what you should do is divide your country into specific areas. Let's say you divide the country into particular sections: it could be north, west, east, south and central, or you can divide it as you wish. Then, as per your divisions, how about you go and talk to a small chunk of people, probably 10 or 15 people from each chunk, and then you take around 100 or 200 samples, ask their ages and take the average? Then you probably find out that the average age is around 40. This is the average age from the sample data. So that's how the difference between population and sample looks on paper.

In real-world scenarios we almost never work on population data. Whenever you start working on a data analysis problem, you always start with sample data. Once the sample data is created and we have analyzed it, then we move towards the population data. In practice it's usually not possible to work on the entire population, so we usually work on sample data. I hope you understand the concepts of population and sample.

In this video we shall be talking about why sampling is important. We already know what exactly a sample is and what population data is. The population is your entire group. Let's say you are analyzing the people of Japan
or the people of Korea, which could be in the millions; I'm not sure about the exact population of Japan. What if you have to analyze them, or you want to find something like the average age of a person in Japan? Obviously you cannot go to each one of them and ask about it. So you always have to work on samples for this reason: you cannot simply go ahead and talk to the entire population. What you can do is go and talk to different categories of people in Japan, probably 100 people, start collecting information, and then come to a conclusion that the average age from this sample is 40. So we can also infer that the average age of the entire population of Japan could be 40. That is the advantage of sampling.

Gathering data from an entire population is often not possible, and sampling is applicable in exactly such situations. Using sampling, one can gather information faster. Surveying and measuring everyone is not cost-effective. We can easily analyze the data when using a sample of the data. Imagine you have gigabytes of data, it could be 100 GB, and somebody has asked you to start analyzing it and creating a visualization dashboard or some sort of graphs. Obviously, if you get started with 100 GB of data and your computer is not fast enough, you might take a lot of time analyzing the data and then a lot of time visualizing it, because it could be slow. A better approach here is to create a sample, maybe 500 MB or 200 MB, and get started with your analysis and visualization process. Once this is done, you can go ahead and switch to the population data. So it is usually easier to analyze sample data first.

A smaller set of individuals often results in less data collection error. Again, when
you're working on a small chunk of data, you might have to deal with fewer data collection issues compared to population data. So these are some of the reasons why sampling is important.

Hi, in this video we shall be discussing the different types of sampling techniques. So far, I hope everybody has understood the concepts behind population and sample, and we also know why sampling is important. The next question is: how do we sample? How do we create samples, and what are the different sampling techniques? Sampling techniques are mainly of two types: one is probability sampling and the other is non-probability sampling.

Probability sampling basically means every member of the population has an equal chance of being selected. Imagine you have an entire population and you want to select some samples out of it. Let's say this is my entire population, and maybe this data point, this data point and this data point are part of my sample. Here, the selection of a data point from the population into the sample is a probabilistic approach. There are several types of probability sampling: simple random sampling, stratified sampling, cluster sampling and systematic sampling. What each of them is, we will be covering in the next videos, but the main concept behind probability sampling is that it is based on probability: every member of the population has an equal chance of being selected into the sample.

In non-probability sampling, samples are selected on the basis of judgment or the convenience of accessing data. For example, let's say this is your entire population and you stay in Japan, in Tokyo. What if you only want to access the Tokyo population, and from that you select a smaller subset as your sample? Here you are selecting based on your convenience: as you stay in Tokyo, it is better for you to talk to people
from Tokyo and take their views, and that's how you are creating your sample. You are not concerned about other cities in Japan. That is non-probability sampling. What is the probability of another city's data being part of the sample? It is zero, because the user is giving priority to Tokyo. I hope you're getting my point. So samples are selected on the basis of judgment or convenience of accessing data, and this largely depends on the researcher's sample selection skills. Some of the non-probability sampling techniques are convenience sampling, purposive sampling, voluntary response sampling and snowball sampling. We'll be discussing each one of them in our future videos as well.

We discussed what exactly probability sampling is and what non-probability sampling is. We also talked about the various probability sampling techniques, which are random, systematic, stratified and cluster. Similarly, we talked about the various non-probability sampling techniques, which are convenience, purposive, voluntary response and snowball sampling. This video is dedicated to probability sampling, and we will be talking about each of its techniques. So let's get started.

Talking about random sampling: random sampling is simply a type of sampling where you randomly choose members from the population. Imagine this particular example of a population, where we have persons 1, 2, 3, 4, up to 12. Now you randomly select four records from this population. It could be these four, or it could be these four; it could be any random elements from your population. This is simply called random sampling. Every member, and every set of members, has an equal chance of being selected. The probability of this person being selected in the sample and the probability of that person being selected in the sample are the same, and that is one of the reasons why it is considered a probability sampling technique.

Secondly, we will talk
about systematic sampling. Now again you see this population, 1 to 12, but here what if you select every third person? Imagine the cursor starts on the second person and then you select every third person; that means 2, 5, 8 and 11. What if you select all the persons at the odd locations, maybe 1, 3, 5, 7, 9, 11? That also becomes a sample. This systematic way of selecting samples is called systematic sampling, and it is also called systematic random sampling. It is very similar to simple random sampling, but done in a systematic manner: put the members of the population in some order, choose a starting point at random, and then select every nth member to be in the sample.

After systematic sampling comes stratified sampling. This is one of the most important sampling techniques. What it tells you is that first of all we have to divide the population into groups. I'll give you an example. I have a population of 1,000 records, and I want to create a sample of 100 records. If I talk about simple random sampling, that means I randomly take 100 records. If I talk about systematic random sampling, that means from 1 till 1,000 I pick every 10th element: 1, then 11, then 21, then 31 and so on. Is this the right way of creating a sample? Probably not. Imagine that in my 1,000 records I have a combination of male and female, let's say 600 male and 400 female. If you randomly select a sample of 100 records, I cannot guarantee that it will also have male to female in the same ratio of 600 to 400. I cannot guarantee that in simple random sampling, and I cannot guarantee it in systematic random sampling either. So what stratified sampling tells you is that if you have different groups in your population like this, how about you choose 60 random data points from male and 40 random data
points from female and then create your sample data? Now in your sample data you can see the male to female ratio is the same as in the total population data, which means your group distribution is maintained. So the explanation goes like this: first divide the population into groups, then from each group select members randomly while maintaining the ratio. That is called stratified sampling, and it is one of the most important types of sampling, because you usually want the sample to maintain the original distribution.

After stratified sampling comes cluster random sampling. What do you mean by cluster random sampling? As the name suggests, we have to create some sort of clusters and then randomly pick data from these clusters. You can see on the slide: divide the population into groups, then randomly select groups from all the groups. Imagine again I have 1,000 customers, and let's say you are working on a segmentation problem. You want to segment your customers into different groups. How many groups? You don't know. So probably what you do is use a clustering technique. Clustering is one of the advanced techniques covered under machine learning, and it simply means the way you create clusters. For example, if you divide 1,000 records into multiple clusters, it could have three clusters or it could have five clusters. A cluster basically means all the customers inside that cluster are similar in nature. Then you randomly pick some records, maybe 60 from here, 30 from here, five from here, three from here and two from here, and that becomes your 100 samples. This type of sampling is called cluster random sampling: you create clusters and then you pick samples from the clusters. Keep in mind that in the textbook version, as the slide says, the clusters are usually natural groups such as cities or branches, and you randomly select whole clusters rather than a few records from every cluster. That's all about this particular video
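
The simple random, systematic and stratified techniques described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code from the course: the function names, the seed and the made-up population of 600 male and 400 female records are all hypothetical, and only the standard library `random` module is used.

```python
import random
from collections import Counter


def simple_random_sample(population, k, rng):
    """Pick k members so that every member has the same chance of selection."""
    return rng.sample(population, k)


def systematic_sample(population, step, rng):
    """Pick a random start within the first `step` members, then every step-th member."""
    start = rng.randrange(step)
    return population[start::step]


def stratified_sample(population, key, k, rng):
    """Sample from each group in proportion to its share of the population.

    Rounding means the total can differ slightly from k when the shares
    are not whole numbers; here they are (60 and 40).
    """
    groups = {}
    for member in population:
        groups.setdefault(key(member), []).append(member)
    sample = []
    for members in groups.values():
        share = round(k * len(members) / len(population))
        sample.extend(rng.sample(members, share))
    return sample


rng = random.Random(42)  # fixed seed, chosen arbitrarily for repeatability

# Hypothetical population: 600 male and 400 female records (id, gender)
population = [(i, "male") for i in range(600)]
population += [(i, "female") for i in range(600, 1000)]

random_100 = simple_random_sample(population, 100, rng)
systematic_100 = systematic_sample(population, 10, rng)
stratified_100 = stratified_sample(population, key=lambda rec: rec[1], k=100, rng=rng)

print(Counter(gender for _, gender in random_100))      # ratio is not guaranteed
print(Counter(gender for _, gender in stratified_100))  # 60 male and 40 female
```

The stratified version is the only one of the three that guarantees the 60 to 40 split; the simple random sample will usually land near it but can drift.
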
on probability sampling techniques, I hope you understood each one of them. Practically, we'll try to go through each of them in some of our future videos, where we will show you how to create samples. That's it for this video; in the next video we shall be talking about the non-probability sampling techniques. See you.

In this video we shall be talking about non-probability sampling. Well, in the last video we discussed probability based sampling, where the selection of a sample from a population is based on a probability value. Whether it is this data point being a part of the sample or that data point, the probability remains the same; that kind of sampling is called probability sampling. In case you are not aware of this concept, please go back to the previous video and get your doubts clarified.

In this video we shall be discussing the non-probability sampling techniques. The first technique is convenience sampling. What exactly is convenience sampling? A convenience sample simply includes the individuals who happen to be most accessible to the researcher. Imagine you have your data communities, probably on WhatsApp groups or Telegram groups, and you quickly need to get a survey form filled. What you do is quickly send your survey link to various WhatsApp groups; that means you are collecting responses based on your convenience. That type of sampling is called convenience sampling. For example, you are researching opinions about student support services in your university, so after each of your classes you ask your fellow students to complete a survey on the topic. This is a convenient way to gather data, but as you only surveyed students taking the same classes as you, at the same level, the sample is not representative of all the students in your university. That type of sampling is convenience sampling.

Next is purposive sampling. Purposive sampling means selecting a sample based on the purpose of the research: the researcher selects
the sample by using their expertise and knowledge. For example, you want to know more about the opinions and experiences of disabled students in your university. You have your university, and there is a small group of students who have disabilities. So you purposefully select a number of students with different support needs in order to gather a varied range of data on their experiences with student services. You are purposely selecting this group of people to take some feedback. This type of sampling is purposive sampling.

The next type is voluntary response sampling. Similar to convenience sampling, voluntary response sampling is mainly based on ease of access. Instead of the researcher choosing participants and directly contacting them, people volunteer themselves, for example by responding to a public online survey. One of the easiest examples of voluntary response sampling is LinkedIn polls. If you create a post on LinkedIn with a poll asking whether data analytics or data science is the best domain, and people start voting on it, that's a kind of voluntary response sampling.

Next comes snowball sampling. If the population is hard to access, snowball sampling can be used to recruit participants via other participants. It is used where it is hard to find the potential population for research. For example, you are researching experiences of homelessness in your city, let's say Tokyo or Jakarta or whichever city you are from. Since there is no list of all homeless people in the city, probability sampling isn't possible. So you meet one person who agrees to participate in the research, and she puts you in contact with other homeless people she knows in the area. For example, you meet person X and you tell her: hey, you know what, I want to talk to more
homeless people in order to get some survey data. Now she gets back to you and says: I know some homeless people who can help you out. So this kind of sampling technique is called snowball sampling. That's all about the various non-probability sampling techniques with examples. I'm pretty sure that you now have a basic understanding of these topics. That's all about this video, see you in the next video.

In this video we shall be talking about population sampling. What do you mean by population sampling? Analyzing or testing an entire population is often impossible, and it is also a costly and time-consuming process. To save money and time, we use a subset of the entire population, called a sample. Population sampling is the process of selecting a subset of objects that is representative of the entire population. The sample must be of sufficient size to warrant statistical analysis, and the sampling must be performed correctly, since errors can lead to inaccurate and misleading results.

Now this particular table will give you an initial idea about the mathematics behind it. You can see the first column, which is population or sample: the first category is population and the second category is sample. There are various terms under population: population size, population mean and population variance. Similarly, we have sample size, sample mean and sample variance.

Before talking about variance, let's talk about standard deviation. I will have a detailed class on standard deviation in one of our future modules, but standard deviation is simply a measure of how spread out numbers are. It is usually symbolized with σ, the Greek letter sigma. The formula of standard deviation is very simple: standard deviation is the square root of variance. But what is variance? Variance is defined as the average of the squared differences from the mean. Now, if
you talk about this formula, variance is simply the summation of (xi - μ)² divided by N, where summation simply means the addition of all the values, xi means an original data point, and the symbol μ, which is called mu, is nothing but your mean value, or in simple terms the average. So the variance formula is σ² = Σ(xi - μ)² / N. That's all about variance. Size basically means the number of items or elements in the population, and mean is the average value.

If you closely look at the formulas of population variance and sample variance, there is a minute difference in the denominator: for the population we have N in the denominator, but for the sample we have n - 1. The reason behind this will be explained in one of our future videos, but as of now just remember this: population variance is Σ(xi - μ)² / N, so if you have 100 records in your population, N becomes 100. If there are 10 records in your sample, the sample variance becomes Σ(xi - x̄)² / (10 - 1), where x̄ is the sample mean and the denominator is 9. Okay, so that's all about this particular slide on population sampling.

I'll quickly explain a small example using the mathematical formulas. Let's say I have a few records: 5, 10, 15, 20 and one more, 30. What is the sum of these numbers? 5 + 10 is 15, plus 15 is 30, plus 20 is 50, plus 30 is 80. So the sum is 80. Now 80 divided by 5, what will be the average value? It will be 16. What will be xi minus the mean? For the first record it will be -11, for the second record -6, for the third record -1, for the fourth record 4 and for the fifth record 14. What will be (xi - μ)²? This becomes 121, this becomes 36, this becomes 1, this becomes 16, and 14 squared becomes 196. I hope everybody knows that the square of a negative number is always a positive number, right? So these are my (xi - μ)² values for each of these entries. What will be the sum of these values? The sum of (xi - μ)² will be 121
+ 36 + 1 + 16 + 196, and if you add it up it becomes 370. So if these five records are my population, then the variance becomes the summation of (xi - μ)² divided by 5. If you take the root over this, it becomes the standard deviation; let me write it down. 370 by 5 is simply 74, so 74 is my variance, and my standard deviation is the square root of 74, which is around 8.6.

Now, if these five records are a part of your sample data, not your population data, in that case the standard deviation formula becomes the root over 370 divided by 4. 370 by 4 is 92.5, and the square root of 92.5 is around 9.6. So here the variance becomes 92.5 and the standard deviation becomes 9.6. That's the mathematical way of understanding these concepts. Well, the why part is still not clear, but hold on, we have a separate video for that. That's all about this particular video on population sampling, and trust me, the mathematics and all these calculations are going to get simpler and simpler in our future videos; I'll try to simplify as much as possible. So that's it about this particular video, see you in the next video.

In this video we shall be talking about why we have n - 1 for the sample data and N for the population data. So far we have already understood the formulas of variance and standard deviation. Variance, which is nothing but the standard deviation squared, is Σ(x - μ)² / N. Now this exact formula changes for sample data: in the case of sample data, the variance is Σ(x - x̄)² / (n - 1), where x̄ is the sample mean. Now why do we have this n - 1 for sample data and N for population data? Well, there is a theory behind it, and we have to take a small example to understand it. This is basically
called Bessel's correction. This method corrects the bias in the estimation of the population variance, and it also partially corrects the bias in the estimation of the population standard deviation. The idea is that dividing by n - 1 gives a less biased estimate of the population variance than dividing by n. Let's take a small example and try to understand why n - 1 for the sample and why N for the population.

Imagine you have a bookshelf, and let's say the total thickness of the first six books turns out to be 158 mm. That definitely means the mean will be 158 divided by 6, which is about 26.3 mm; that's my mean thickness. Now you take out and measure the first book's thickness, which is called one degree of freedom, and let's say it is 22 mm. Now you have five books left. What is the total thickness of the five books? It is 158 - 22, which is 136 mm. Now you measure the second book, which is the second degree of freedom, and find it to be 28 mm. What will be the total thickness left? 136 - 28, which is 108 mm, and now you have four books. The same process goes on in this way. By the time you measure the thickness of the fifth book individually, which is the fifth degree of freedom, you automatically know the thickness of the one remaining book.

Let's continue with the example. Here we have four books and 108 mm. Let's say the third book is 30 mm; now the total left will be 78 mm and we have three books. Then the fourth book, let's say, is 25 mm, so what is left will be 53 mm and we have two books. At this point, where you have two books and the total thickness is 53 mm, you are picking the fifth book, and let's say the fifth book, which is the fifth degree of
freedom, is found to be, let's say, 20 mm. When you find the thickness of the fifth book, it is understood that the thickness of the sixth and final book is going to be 53 - 20, which is 33 mm. So you don't have to measure this final book; you get its thickness as soon as you measure the fifth book. That means you automatically know the thickness of the sixth book even though you have measured only five. Extrapolating this concept: in a sample of size n, once the mean is fixed, you know the value of the nth observation even though you have only taken n - 1 measurements. That is, the opportunity to vary has been taken away from the nth observation. The sample variance measures deviations from the sample mean, which is computed from the same data, so only n - 1 of those deviations are free to vary. This is one of the reasons why we do not divide by n for sample data; in sample data we use n - 1. That's all about this particular video on why n - 1 for sample data and why not n. See you in the next video, bye-bye.

Hi, welcome to this module on statistics and this chapter on descriptive analytics and statistics. In this chapter we are going to cover everything related to the descriptive part. There are majorly two concepts studied inside descriptive statistics: one is measures of central tendency and the other is measures of dispersion. These two are going to be the most important topics under descriptive statistics, and we will be doing a deep dive on them. Under central tendency we'll try to understand concepts like mean, median and mode, and under dispersion there will be concepts like range, interquartile range, variance, standard deviation, mean deviation and so on. Each and every topic is going to be covered in detail, and in this video we shall be focusing on measures of central tendency.

So far we already know that statistics is of two types: one is descriptive and the other is inferential. Inside descriptive comes this topic called measures of central tendency. What exactly are measures of
central tendency? As you can see, a measure of central tendency measures the center value of the data set. It gives us an idea about the concentration of the values in the central part of the distribution. In simple terms, a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. As such, measures of central tendency are also called measures of central location, and they are also classed as summary statistics. The mean, often called the average, is the measure of central tendency that you are most likely familiar with; we all know what a mean or an average is, right? Apart from the mean there are others as well, which include the median and the mode. The mean, median and mode are all valid measures of central tendency, but under different conditions some measures become more appropriate to use than others.

Now, what exactly is the usage of these topics? Why are we discussing these things? First of all, let's understand that each one of them is a measure of central tendency, which gives us an idea about the central part of the distribution. Going forward we have separate slides on mean, median and mode, where, using examples, we'll try to understand which concept is going to be used in which type of scenario. So that's all about this particular video on the basics of measures of central tendency. Going forward we have dedicated videos on mean, median and mode. That's about it, see you in the next video.

Mean is nothing but the statistical term that we use instead of average. What exactly is an average? We all know what an average is: it is the sum of all the observations present in the data set divided by the total number of observations. For example, you have three numbers: 1, 2 and 3. What will be the sum of these numbers? The sum will be six. What will be the average of these numbers? The
average will be six divided by 3, which becomes two. Now this is nothing but your mean. So to get the mean or average: add up each number or observation present in the data set, count the total number of observations in the data set, and divide the sum of the observations by the total number of observations. In a mathematical formula, which looks really complicated, this x̄ (x-bar) is nothing but your mean value, and the mean value is nothing but the sum of all x divided by n. The symbol Σ is called summation, which is nothing but the sum of all the numbers. So if you are passing x here, that means 1 + 2 + 3, which is six, so the numerator becomes six, divided by how many entries we have: 1, 2 and 3, three numbers. So here the average becomes two.

The mean or average is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So if we have n values, and the values are x1, x2, x3, ..., xn, the mean will be (x1 + x2 + ... + xn) divided by n, and this is nothing but this particular formula: x̄ = Σx / n.

Now, when do we use the mean, and why? What is the scenario where we start using it? Imagine a data set, and one of the columns is age. As you have captured the data, some of the records are null. Let's say the ages are 25, 29, null, 31, 25, so out of five members one is a null record. In your data analysis, while you prepare your data, handling missing values is one of the critical tasks, because if you don't handle the missing values your analysis could go in a wrong direction. So it is usually
recommended to deal with your missing values, which we will be covering in depth in the future classes on exploratory data analysis. But what is the right way of imputing this value? One of the worst approaches is to impute it with some random number, let's say 30. Is this the right approach? No. A better approach is imputing using the mean or the median. So if you impute using the mean, what will be the mean here? (25 + 29 + 31 + 25) / 4: 25 + 29 is 54, plus 31 plus 25 is 110, and 110 divided by 4 is 27.5. Now, as this column is age, can somebody's age be 27.5? Well, logically it could be 27 years and 6 months, but as everything else in the column is a whole number without any decimals, we will probably use a rounded number here, which is 28. So 28 becomes the value that we impute for the missing value. This is one of the applications of the mean. That's all about this particular video on the mean; I hope you understood what the mean, or average, is. In the next video we shall be talking about the median, the mode and various other topics. See you in the next video.

In this video we shall be talking about the second measure of central tendency, which is the median. In simple terms, the median is nothing but your middle number: the center number found by ordering all data points and picking out the one in the middle. If there are two middle numbers, taking the mean of those two numbers will give you the median value. Mathematically speaking, there are two possibilities: the set can have an odd count of numbers or an even count of numbers. What do I mean by an odd count? Let's say you have five numbers in the list. With five numbers, it is easily understood that the middle value is going to be the third number. But what if you have six numbers? With 1, 2, 3, 4, 5, 6 there are two middle numbers, the third and the fourth. In this case what happens is you have to take the
average of these two. So mathematically speaking, if the number of observations n is odd, the median is the middle value, which is at position (n + 1) / 2. If n is even, the median is the average of the two middle values: find the value at position n / 2, then the value at position n / 2 + 1, and take their average. Well, it sounds a little bit tricky from the theoretical point of view, so let's try to understand it using an example. Let's say again I have age, and I have 25, 31, 27, 23, 33, 32. How do we find the mean of these six numbers? It is simply the summation of all the numbers divided by n, which means (25 + 31 + 27 + 23 + 33 + 32) / 6, which is 28.5. This is your mean value. How do we find the median? First of all, to find the median you have to sort the values. Let's sort them: 23 comes first, then 25, then 27, then 31, then 32, then 33. As there are six numbers, we fall into the even category. First we have to find position n / 2: 6 / 2 is 3, so the third value, which is 27. What will be position n / 2 + 1? It will be the fourth value, which is 31. Now we have to take the average of these two, which is 58 / 2, which is 29. So 29 becomes your median. Imagine you have another number here, let's say 19. Now if you sort it, 19 comes first, and we have seven numbers. Out of seven we have to find position (n + 1) / 2, which is (7 + 1) / 2, the fourth location: 1, 2, 3 and 4, which is 27. So 27 becomes my median when there is an odd count of numbers. This is the concept behind the median.

Now you must be thinking that median and mean are kind of similar concepts, so in which scenario do we use which? I will give you an example of a scenario where we do not use the mean. The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. What if you have multiple staff members, let's say staff number 1,
2, 3, 4, 5, 6 and so on. Now you have their salaries: let's say 15k, 18k, 16k, 14k, 90k and 95k. If you try to calculate the mean salary of these six people, it will be 15 + 18 + 16 + 14 + 90 + 95, which is 248, divided by 6, which is about 41k. So the mean is 41k. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, because if you look at these four people, they are all in the range of 14k to 18k. So the mean is not depicting the right picture, because most of the people are within the range of 14k to 18k. The mean is being skewed by the two large salaries. I used the term skewed; we will be covering that topic in one of our future videos. Therefore, in this situation we would like to have a better measure of central tendency, and taking the median is a better option. What will be the median in this case? If I sort the numbers, I get 14, 15, 16, 18, 90, 95, so the middle values are 16 and 18, and the median will be 17 because the average of those two is 17. Now this makes sense: if somebody asks you for the typical salary and you say 17k, it sounds okay, because most of your employees fall in the range of 14k to 18k. So in scenarios where you have outliers, the median is a better option to use. That's all about this particular session on the median.

In this video we shall be talking about the third measure of central tendency, which is the mode. What exactly is the mode? It is the value which occurs most frequently in the set of observations. A data set can have one mode or more than one, which is called unimodal, bimodal or multimodal. What are the steps to find the mode? It's very easy: take the most frequent value present in the data set. There are special cases: if the maximum frequency is repeated, if the maximum frequency occurs at the beginning or the end of the observations, or if there are
irregularities in the distribution. In all the above cases we find the mode of the observations using the method of grouping.

Let's try to understand, using an example, where we use the mode. The mode is mostly useful for categorical data. Imagine you are analyzing some customer data with multiple columns, and one of the categorical columns is gender. Out of the 10,000 or 1,000 records that you have, we'll take five records in this example, because I cannot simply write down 1,000 records here. Let's say they are male, null, female, female, female. Obviously you have one missing value, so what do you impute here? You cannot blindly impute male or female; that makes no sense. In this case what we usually do is find the mode. Here, as female occurs three times and male occurs just once, female becomes your mode, so a better value to impute here will be female. This is one application of the mode.

But again, there are some drawbacks in the mode technique as well. Try to understand this scenario. Let's say you have 1,000 records and you are analyzing gender. Imagine 200 of the gender records are null, and of the other 800 records, let's say 640 are male and 160 are female. So what is the ratio here? The ratio is 4 to 1. Now how do you impute these 200 records? The mode is very clear, the mode is male, but if you impute male in all the missing values, your final records will be 840 male and 160 female. Now the ratio is changing from 4 to 1 to 84 to 16, which is more than 5 to 1 and not in sync with the original distribution. So in these
kinds of scenarios, what we usually do is try to maintain the ratio in the 200 records that we impute: we impute 160 records as male and 40 records as female. But in scenarios where you have very few missing values, you can also use the mode. I hope everything is clear. These are all theoretical concepts; once we jump into the practical sessions, the EDA sessions, you will have a better understanding, because we will be using some live data and showing you different strategies for using these concepts in real life. That's all about this topic on mode, see you in the next video.

In the last video we talked about measures of central tendency, where we understood concepts like mean, median and mode, and we also went through various examples of where these concepts are used. In this video we'll move to the dispersion part of statistics. What exactly is dispersion? Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to which numerical data is likely to vary about an average value. In other words, dispersion helps us understand the distribution of the data. A measure of dispersion indicates how the data is dispersed around the measure of central tendency. In statistics, the measures of dispersion help to interpret the variability of data, that is, to know how homogeneous or heterogeneous the data is. In simple terms, it shows how squeezed or scattered the variable is.

Now let's try to understand the different types of measures of dispersion. The first one is range. It is simply the difference between the maximum value and the minimum value. For example, with 1, 3, 5, 6 and 7, the range is 7 - 1, which is 6. Second comes variance: deduct the mean from each value in the data set, square each difference, add up the squares and finally
divide by the total number of values. Variance is basically termed as V, or else we can also term it as σ² (sigma squared). σ² is nothing but Σ(x - μ)² / N. We'll go through the mathematics later, but this is how the formula of variance looks, where the symbol Σ is the summation, μ is the mean value (I hope everybody knows about the mean), x is your original value and N is the total number of entries.

We also have the interquartile range, which is called IQR. This is nothing but Q3 minus Q1. What is Q3? Q3 is the 75th percentile, and Q1 is nothing but your 25th percentile. We have this topic covered with examples in the future videos, and we'll understand each one of them mathematically.

Then comes standard deviation. Standard deviation is nothing but the square root of variance. As I have already mentioned, variance is σ², and this σ is your standard deviation. So if I talk about the mathematical formula, σ = √(Σ(x - μ)² / N).

And then comes mean deviation. The average of the numbers is known as the mean, right? Now, how much the values deviate from that mean on average, taking the absolute differences, is basically your mean deviation. Again, we have it covered in our future slides, nothing to worry about. That's all about the basics of measures of dispersion; we'll get our concepts clarified further in the future videos. See you in the next video.

In this video we shall be discussing the first measure of dispersion, which is range. We already understood the meaning of range, which is the difference between your maximum value and your minimum value. This is one of the simplest measures of dispersion: it measures the difference between the highest value and the lowest value present in the data set. It is used to
construct control charts in quality assurance. It is also useful when you want to focus on the extreme values of the data set. Now, what if you have a data set? Obviously you will have multiple columns in it. Let's say one of these columns is age. What if somebody asks you for the range of age? That basically means: what is the lowest value, what is the highest value, and what is the difference between them? If age starts at 18 and goes up to 90, then the range is 72, which is highest minus lowest. So the formula is very clear: range is highest value minus lowest value. Taking an example, let's say I have values like this. What will be the range here? The minimum value is 1 and the maximum value is 7, so the range is 7 - 1, which is 6. That's all about this video on range. In the next videos we shall be talking about the other dispersion techniques.
In this video we shall be talking about the interquartile range, IQR. This is another measure of dispersion, but what exactly is it? It measures the middle 50% of the data. So the interquartile range is a measure of where the middle 50% of the data is, whereas the range is a measure of where the beginning and the end of a set are. We already covered range in the previous video: the minimum value and the maximum value give you the range of the data. IQR gives you the range of your middle 50%. The IQR formula is simply Q3 minus Q1, and we'll try to understand what Q3 and Q1 mean using an example. IQR indicates how the middle of the data is spread around the median. It is the difference between the third quartile and the first quartile of the data set, and it is helpful for detecting outliers present in the data set. The formula is clear, Q3 minus Q1, but let's try to understand how we calculate Q3 and Q1. Let me start with an example. Let's say I have the numbers 1, 2, 5, 6, 7, 12, 9, 15, 18, 19 and 27. Now, simply, if you
want to find the IQR, the first thing we need to do is calculate the median, right? So we have to sort these values first. How do we sort them? 1, 2, 5, 6, 7, 9, 12, 15, 18, 19 and 27. So what will be the median here? Count the values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. Eleven values means the median sits at position (11 + 1) / 2, which is the sixth position. So 1, 2, 3, 4, 5, 6: 9 becomes your median. Now let's place some parentheses here. 9 is your median value, and the median helps you spot your Q1 and Q3 easily. How do we find Q1 and Q3? Think of Q1 as the median of the lower half, and think of Q3 as the median of the upper half. The median of the lower half, 1, 2, 5, 6, 7, is 5, and the median of the upper half, 12, 15, 18, 19, 27, is 18. Now if you subtract, 18 - 5 makes 13, and that is your IQR.
Now, what if you have an even set of numbers? We'll take another example. Let's say we have 3, 5, 7, 8, 9, 11, 15, 16. How many is that? 1, 2, 3, 4, 5, 6, 7, 8. Let's take two more numbers, 20 and 21. They're already sorted. Now make a mark in the center of the data: counting 1 to 10, the center falls between the fifth and sixth values, that is, between 9 and 11. Place parentheses around the numbers on each side of the mark, so one set comes around the lower five values and one around the upper five. Now what will be my Q1 value? It is the median of the lower half, which is 7. The Q3 value, the median of the upper half, will be 16. What will be Q3 - Q1? It becomes 16 - 7, which is 9. So 9 is your IQR. I hope you got a fair amount of understanding about IQR, which is Q3 minus Q1. Be it an odd or an even number of values, identifying Q3 and Q1 is very simple: just subtract Q1 from Q3 and that gives you the IQR. That's all about this particular video. In the next video we shall be talking about some other measures of dispersion. See you in the next video.
In this video we shall be talking about our next measures of dispersion, which are variance and standard deviation. We are going to cover both these topics in
this video because they are related to each other mathematically and conceptually. Now, what exactly is variance? Variance is a dispersion technique that measures the dispersion of the data around its mean. It indicates how the data is dispersed from its mean value. If the value of variance is low, the data points are close to the mean, which is called a low variance scenario. If there is a significant difference between the values and the mean, it is called high variance. Variance is often denoted by V, or by sigma squared, $\sigma^2$, where sigma is otherwise called the standard deviation.
What exactly is standard deviation? Standard deviation is one of the most important and frequently used measures of dispersion. It is simply the square root of variance. It indicates how far the data points are dispersed from the mean, and it is denoted by sigma, $\sigma$.
Now, talking about the mathematical formula, this is how variance is written. If I simplify it, sigma squared is the summation of x i minus mu, squared, divided by n: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$, where $\mu$ is your mean value and n is your number of values. This formula changes from n to n minus 1 if you are dealing with a sample. In short, when you are dealing with a population you divide by n, and when you are dealing with a sample you divide by n minus 1. The same goes for standard deviation, which is the square root of the summation of x i minus mu, squared, by n: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$.
So let's take some examples and work out, using the formulas, what the values will be. Let's say I have the numbers 1, 3, 5, 5, 6, 7, 9 and 10. What will be the mean value here, that is, mu? The mean will be 1 + 3 + 5 + 5 + 6 + 7 + 9 + 10 divided by the count, 1, 2, 3, 4, 5, 6, 7, 8. Simple,
which will be, adding up as we go, 4, 9, 14, 20, 27, 36, 46, so 46 by 8. 46 by 8 will be around 5.75. Let's try to validate it: 46 divided by 8, yes, it is 5.75. This becomes my mean. Now, as we already know the formula, this is my x: 1, 3, 5, 5, 6, 7, 9 and 10. My mean is 5.75, so I'll calculate the numerator first. First, x i minus the mean: 1 minus the mean is -4.75, and then -2.75, -0.75, -0.75, 0.25, 1.25, 3.25 and 4.25. Now I will calculate x i minus the mean, squared, so the square of each individual value. What will the squares be? You can simply take a calculator and work them out. 4.75 multiplied by 4.75 is 22.56. The next one is 7.56. See, irrespective of the number being negative or positive, the square of that number is always positive, right? Similarly we'll have 0.56, and again 0.56. The next will be 0.0625, in short 0.06. The next will be around 1.56, then 10.56, and the last will be 18.06. Now what I will do is the summation of x i minus mu, squared, by n. We are calculating variance first, so my numerator will be the sum of all these values. If you sum them up, let's try to use a calculator, starting from 22.56, it shows 62,
oh, 62.4? One second, let me just do it again. 22.56 + 7.56 + 0.56 + 0.56 + 0.06 + 1.56 + 10.56 + 18.06 gives 61.48, which I have rounded up to 61.5. Now, how many values do we have? 1, 2, 3, 4, 5, 6, 7, 8, so we divide by 8. This becomes 61.5 divided by 8, which is 7.6875, and if you round it off it will be around 7.69. This is your variance. You can see the mean is 5.75 and the variance is 7.69, so the two are fairly close, the difference is less than two, and as a rough rule of thumb it could be treated as low variance. Keep in mind that variance is in squared units, so the standard deviation is the fairer number to compare with the mean. And again, these are very few records. In real scenarios we analyze thousands and thousands of records, and that gives us a better idea. So here my variance is 7.69. What will be my standard deviation? It will be the square root of 7.69, and taking help from the calculator, it is 2.77. This becomes my standard deviation. So that is all about variance and standard deviation, which are very similar to each other. I hope everybody is clear from the definition point of view and from the mathematical point of view.
Now, if these 8 records are not your whole population but a part of your sample data, we do not write the variance as sigma squared. In the case of a sample we write it as s squared. This is just a change in symbol, that's it. It is still the variance, and s is still the standard deviation, but it is the standard deviation of the sample data. So the sample variance is 61.5 divided by 7, which is about 8.79. What will be the standard deviation here? The root of that, which is going to be about 2.96. So try to memorize this: in the case of a population we always divide by n, and in the case of a sample we always divide by n minus 1. That's all about this particular video on standard deviation and variance. In the next video we shall be talking about some other topics. See you in the next video.
In this video we shall be talking about mean deviation. What exactly is mean deviation? We could already guess the meaning. Mean is definitely understood: it is the average of a set of numbers. Deviation gives me an intuition of how far these values deviate from the mean value, right? So the definition of mean deviation is very clear: it is the average of the absolute values of the deviations from some chosen central value, for example the mean, median or mode. It is suggested to calculate it from the median, because the mean deviation takes its least value when measured from the median. Talking about the formula, this is how it looks: the summation of the absolute value of x i minus mu, divided by n, that is, $\frac{\sum |x_i - \mu|}{n}$. The numerator has an absolute value. What do you mean by this absolute value? Let's say we talk about the number minus 5: the absolute value is 5. We do not worry about the minus or plus sign. Be it 5 or minus 5, the absolute value is always going to be 5. It is always a positive number.
Now let's take a small example. Let's say we have four people: Jennifer, me, Amelia and Oscar. And one more person, maybe Mr X, or it could be James. Jennifer has four apples, I have three apples, Amelia has three apples, Oscar has two apples and James has five apples. The question here is: find the mean deviation of the data. Let's first calculate the mean value. The mean value will be 4 + 3 + 3 + 2 + 5 divided by 5. That is 4 + 3 is 7, plus 3 is 10, plus 2 is 12, plus 5 is 17, so 17 by 5, which is 3.4. So the mean value is 3.4. Now what will be the x minus mean values? Let me just write the numbers again: 4, 3, 3,
2 and 5. So what will be the x minus mu values? Imagine the mu value is 3.4. Well, we don't have to imagine, we already know. The first value is going to be 0.6, and then -0.4, -0.4, -1.4 and 1.6. What will be the absolute values? All positive numbers: 0.6, 0.4, 0.4, 1.4 and 1.6. Now I have to add up this numerator part, the sum of all the absolute values. This is going to be 0.6 + 0.4, which is 1, plus 0.4 is 1.4, and 1.4 + 3 (that is, 1.4 + 1.6) is 4.4. So 4.4 is going to be your numerator value. What is my mean deviation? It's going to be 4.4 divided by 5, which is 0.88. So the mean deviation is 0.88. That's how we understand the mean deviation mathematically. That's all about this particular video. In the next video we shall be talking about the next topics. See you in the next video.
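The worked examples from these videos can be checked with a few lines of Python. This is only a minimal sketch, not code from the course: the function names `value_range`, `iqr_median_of_halves` and `mean_deviation` are made up for illustration, and the IQR function assumes the "median of the lower half, median of the upper half" method shown in the video, which can give slightly different quartiles than the percentile methods used by default in libraries such as NumPy or pandas.

```python
import math
from statistics import mean, median, pstdev, pvariance, stdev, variance


def value_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)


def iqr_median_of_halves(values):
    """IQR as Q3 - Q1, with Q1 and Q3 taken as the medians of the
    lower and upper halves (the overall median is left out of both
    halves when the number of values is odd)."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(upper) - median(lower)


def mean_deviation(values):
    """Average of the absolute deviations from the mean."""
    mu = mean(values)
    return sum(abs(x - mu) for x in values) / len(values)


# Range example: 7 - 1
print(value_range([1, 3, 5, 6, 7]))  # 6

# IQR examples, odd and even number of values
print(iqr_median_of_halves([1, 2, 5, 6, 7, 12, 9, 15, 18, 19, 27]))  # 13
print(iqr_median_of_halves([3, 5, 7, 8, 9, 11, 15, 16, 20, 21]))  # 9

# Variance and standard deviation example
data = [1, 3, 5, 5, 6, 7, 9, 10]
print(mean(data))       # 5.75
print(pvariance(data))  # population variance (divide by n): 7.6875
print(pstdev(data))     # population standard deviation, about 2.77
print(variance(data))   # sample variance (divide by n - 1), about 8.79
print(stdev(data))      # sample standard deviation, about 2.96

# Mean deviation example: apples per person
apples = [4, 3, 3, 2, 5]
print(mean_deviation(apples))  # about 0.88
```

The `statistics` module keeps the population and sample versions apart by name: `pvariance` and `pstdev` divide by n, while `variance` and `stdev` divide by n minus 1, which matches the distinction made in the video.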
Info
Channel: Satyajit Pattnaik
Views: 14,154
Keywords: satyajit pattnaik, data science, data analytics, machine learning, data analyst, artificial intelligence, satyajit pattnaik data, learn statistics from scratch, business statistics, statistics for data analytics end to end, end to end statistics, statistics and probability, statistics for data science, statistics satyajit, learn statistics, statistics end to end, statistics for data analytics, stats for data science, satyajit pattnaik statistics, descriptive statistics
Id: IYVEI1EYfPg
Length: 116min 38sec (6998 seconds)
Published: Sat Jun 15 2024