Statistics Interview Questions | Statistics Interview Questions and Answers | Intellipaat

Captions
Hey guys, welcome to this session by Intellipaat. In this session we're going to look at statistics interview questions. Statistics is really important and one of the foundations of data science, so we'll go through interview questions from the very basic to the most advanced, to help you clear an interview. Before moving on, please subscribe to our channel so that you don't miss our upcoming videos, and leave a like if you enjoy our content. Now, without any further ado, let's begin.

Statistics is by far one of the most important concepts in data science and machine learning in general, and when you go for an interview as a data scientist or machine learning engineer you are expected to know a lot about it. So let's take a look at the kinds of questions that come up most often, one by one. If you like this video, click the like button, and in the comment section down below let us know what else you would like us to cover.

The first question: what is the difference between inferential statistics and descriptive statistics? These are two of the most popular kinds of statistical measures. Descriptive statistics, as the name suggests, provides exact and accurate information that describes the data as a whole. Inferential statistics, derived from the word inference, uses statistical information from a sample of data to reach a conclusion about the population. To give you an example, suppose you want to make certain assumptions about a group of 100 million people, say whether they subscribe to a particular ideology or hold certain views. Instead of asking all 100 million people to answer a questionnaire, the better way is to take a small subset of those people, called the sample, out of the 100 million, which is called the population, and perform statistical measures on that sample in order to draw conclusions. Once those conclusions are drawn for the sample, you can think of them as applying to the entire population. That is how inferential and descriptive statistics work.

Next: what is the difference between population and sample in inferential statistics? As we have already seen, the population is the larger group and the sample is the smaller one; from the population we take the sample. Think of the population as an entire body of work and the sample as a smaller piece taken from it.
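As a quick illustration of the sample-versus-population idea, here is a minimal Python sketch; the data, the library use (NumPy) and the numbers are my own additions for illustration, not something given in the session. We treat a large array as the population and estimate its mean from a small random sample.

import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=170, scale=10, size=1_000_000)  # e.g. heights of a million people

sample = rng.choice(population, size=500, replace=False)    # a small random sample

print("population mean:", population.mean())  # descriptive statistic of the whole population
print("sample mean    :", sample.mean())      # inferential estimate based on the sample

The sample mean will sit close to the population mean, which is exactly the inference step described above.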
We often cannot work on the whole population, either because of the computational cost or because data points for the entire population simply aren't available. Another way to think about it: you may have data about a huge number of people, but you can't process all of it, so you randomly select a small subset of that data and draw conclusions from it. That is what inferential statistics does. The sample is the small subset we choose, and the population is the larger set from which we choose it. There are different ways of choosing a sample, and we have to be very careful when doing so, because we don't want to take samples in a way that introduces biases; we'll cover those in later questions. This step forms the foundation of everything we do afterwards. From the sample we calculate the sample statistic, we apply inferential statistical methods, and from the sample statistic we draw conclusions about the population.

Next: what are quantitative and qualitative data? These are the two kinds of data you might have to deal with, and the etymology of the words makes them easy to understand. Quantitative data deals with quantity, so it is numeric data: age, the number of people who subscribed to a particular channel, the number of people who bought a particular phone, the amount of product sold over a year, and so on. It tells you about the quantity of something you wish to know. Qualitative data, on the other hand, is also called categorical data; it is data that is not numeric in nature. Someone's name, identity or gender, or an age group such as toddler or teenager, are qualitative: a teenager could be 12, 13 or 14 years old, but the category itself is not a numeric value. Qualitative data tells you about a quality of a person or a data point.
Then comes the five-number summary. This is one of the most important concepts and one of the most frequently asked questions used to gauge how much you know about statistics, and it is also one of the most basic things you should absolutely know. The five-number summary is a summary of data usually produced in descriptive statistics: when you are trying to describe an entire data set, you extract five values from it. These are the lower extreme, which is the minimum value in the collection; the lower quartile Q1, which is the 25th percentile; the median, which is the 50th percentile; the upper quartile Q3, which is the 75th percentile; and the upper extreme, which is the maximum value. Of course, the five-number summary is not the only summary you can produce; you can add the mean, the standard deviation and other quantities, but it is a good place to start. It tells you the range of the data from minimum to maximum, how the data is spread out, the data point right at the middle (the median), the point at the 25th percentile (halfway between the minimum and the median), and Q3, the value halfway between the median and the maximum. That is how you can show the distribution of the data and how the values are spread over the number line.

So what is the benefit of a box plot? The box plot is one of the most used and most misunderstood tools in statistics. Box plots are used predominantly to give a visual, graphical representation of the five-number summary, and they can also be used to compare groups; since a histogram is built from quantitative data, we can generate a five-number summary from the same data and create a box plot from it. Other plots, such as pie charts and scatter plots, are quite intuitive: a pie chart shows the percentage share of a value out of a hundred percent, and a scatter plot shows the relationship between two values, whether they cluster together or spread out across the whole sheet. A box plot is less intuitive at first glance, which is exactly why interviewers ask about it: since five-number summaries are so important, you should be able to understand visually what a five-number summary represents, draw a box plot so it can be shared with different people, and compare and contrast different box plots. In short, a box plot is a graphical representation of the five-number summary; it presents the same information in a way that is easier to comprehend and easier to explain to other people.
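Here is a small sketch of how you might compute a five-number summary and draw the corresponding box plot in Python; the use of NumPy and Matplotlib, and the sample numbers, are my own assumptions for illustration.

import numpy as np
import matplotlib.pyplot as plt

data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])

minimum = data.min()              # lower extreme
q1 = np.percentile(data, 25)      # lower quartile (25th percentile)
median = np.percentile(data, 50)  # median (50th percentile)
q3 = np.percentile(data, 75)      # upper quartile (75th percentile)
maximum = data.max()              # upper extreme

print("five-number summary:", minimum, q1, median, q3, maximum)

plt.boxplot(data)  # the box plot is just this summary drawn graphically
plt.show()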
The next question is: what is the mean? This might seem like a really simple question, but the mean is actually something people find surprisingly difficult to define; many describe it vaguely as "a summary of the data", which is not very useful. Since the mean is one of the most basic and most important concepts, this question is asked very frequently. The answer is quite simple: the mean is the average of a collection of values, calculated by dividing the sum of all observations by the number of observations. Okay guys, if you're looking for end-to-end training in statistics, Intellipaat provides a complete certification training on statistics for data science, and those details are available in the description box; now let's continue the session. So if you have the five numbers 1, 2, 3, 4, 5, the sum is 15; divide that by the number of observations, 5, and 15 divided by 5 gives 3. The mean of those values is 3. That is what the mean is and how you calculate it.

Now comes standard deviation. Even though the mean is often misunderstood, standard deviation is a little harder for people to explain in succinct terms, and fumbling it gives the impression that you don't know what you're talking about. It is not that difficult to understand: as the name suggests, standard deviation is a measure of the magnitude by which the data points deviate from the mean, or how spread out they are around the mean. It gives you an idea of how spread out the data will look if you plot it as a curve. Suppose your data has a mean of 5 and the minimum and maximum are 0 and 10; the values sit fairly close to the mean, and if you plotted them you would get a simple plot that is not very spread out. On the other hand, if the mean is 15 but the values range from -35 to +157, the mean sits somewhere around the middle but the data is spread out very widely. That is what standard deviation captures: the magnitude of how far the data points are from the mean, or, to be exact, how far the data is spread out from the mean.
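A minimal sketch of this idea in Python; the sample numbers and the use of the standard-library statistics module are my own illustration. Two data sets with the same kind of mean but very different spread give very different standard deviations.

import statistics

tight = [4, 5, 5, 6, 5]            # values close to their mean (5)
spread_out = [-35, 5, 15, 40, 50]  # values far from their mean (15)

for data in (tight, spread_out):
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)   # population standard deviation
    print(f"data={data} mean={mean:.1f} std dev={sd:.1f}")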
Then comes symmetric distribution. This is another important concept that is a little difficult to explain in a simple, succinct manner; in an interview, people are looking for you to describe these basic topics in the best possible way, and if you fumble they assume you either lack confidence or don't know what you're trying to tell them. Symmetric distribution matters because it sits at the heart of many of the statistical techniques we look at. A symmetric distribution is one where the part of the distribution on the left-hand side of the median is the same as the part on the right-hand side of the median. There are many examples: the uniform distribution, the binomial distribution (with p = 0.5), and the normal distribution. For those who are unaware, symmetric means that if you were to cut something down the middle, the left half would be exactly the same as the right half; if you fold the distribution plot over onto itself, as in the image shown here, one side should exactly cover the other, neither spilling over nor leaving space uncovered. That is how a symmetric distribution works.

Then: what is the relationship between the mean and the median in a normal distribution? This question checks that you understand the mean and the median and what makes a normal distribution stand out. In a normal distribution, the mean is equal to the median. To give an example, take the values 1, 2, 3, 4, 5: the mean is (5 + 4 + 3 + 2 + 1) / 5 = 15 / 5 = 3, and since the values are in ascending order the median, the middle value, is also 3. The mean equals the median, which is exactly what you expect from a symmetric, normal-shaped distribution. Of course, real-world data is rarely this easy to interpret; there is a lot of it, it is jumbled up, and you may rely on computer programs and statistical measures. But at the end of the day, if you want a quick check of whether plotting the data will give you something close to a normal distribution, calculate the mean and the median and look at the difference between the two. If the difference is not large, the data is roughly normal; if they are too far apart, you will not get a normal distribution but a skewed one, and we'll look at what a skewed distribution means in a moment.
So what is an outlier? Outliers are one of the most common problems we face when dealing with data, which is why this question is asked in data science and machine learning engineering interviews. Data science makes heavy use of data sets, and if you don't understand what an outlier is and how it affects the process of modelling data into a predictive model, you don't really understand how data science works. An outlier is a value in a collection of values that is either much larger or much smaller than the other values in the collection. To give an example, say your first four values are 1, 2, 3 and 4; they are quite compact and close to each other, with little difference between the extremes. Now suppose the fifth value is 157. That value is far off from everything else, and that is an outlier. Similarly, a value like -251 would also be an outlier; outliers can lie on either end of the number line, at either extreme of the value spectrum.

Outliers negatively impact statistical models. If you are trying to build a model that predicts the next number in 1, 2, 3, 4, the answer is obviously 5, and the pattern is easy to learn. But if one incorrect value, say 157, slips in, the statistical model becomes confused: 80% of the data conforms to one pattern and 20% does not, and the difference in magnitude makes it very difficult for the model to make sense of the data.

Before dealing with outliers, you first need to be able to detect them. When you are working with data at a large scale, finding an outlier by hand is difficult, so you need a formula, and this is where the five-number summary comes back into play; this is why it was so important to know how to calculate it. You can use the five-number summary to identify outliers: any value smaller than Q1 - 1.5 x IQR, or larger than Q3 + 1.5 x IQR, is considered an outlier. To unpack the terms: Q1, the first quartile, is the 25th percentile, the value halfway between the smallest value and the median when the data is in ascending order; Q3 is the 75th percentile; and the IQR, the interquartile range, is the difference between Q3 and Q1. You expand that range by a factor of 1.5, subtract it from Q1 to get a lower fence, and add it to Q3 to get an upper fence. Any value below the lower fence or above the upper fence is a data point that shouldn't really be there; it is considered an outlier.
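A short sketch of this fence rule in Python, assuming NumPy; the numbers reuse the example above and are purely illustrative.

import numpy as np

data = np.array([1, 2, 3, 4, 157])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)  # 157 is flagged as an outlier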
Okay guys, if you're looking for end-to-end training in statistics, Intellipaat provides a complete certification training on statistics for data science, and those details are available in the description box below; now let's continue the session. So that is how you detect an outlier. Dealing with an outlier is quite simple: you generally remove outliers, because they are really detrimental to statistical measurement. Outliers exist because of problems in sampling, because some data points are genuinely exceptional, or because of issues with the measuring devices we used; if you were recording temperatures, maybe the thermometer was broken or contaminated and gave the wrong readings. These wrong inputs affect our statistical models, because the reality we are modelling is based on the bulk of the data that sits close together, not on the outliers.

Now, the relationship between the mean and the median in a normal distribution comes up again: in a normal distribution, the mean is equal to the median, as we saw for symmetric distributions. A normal distribution is simply another name for this kind of symmetric distribution, and this question is often used to throw people off. If you learned that in a symmetric distribution the mean and the median are equal, and the interviewer instead asks about the normal distribution, you should not be confused or thrown off the scent: these are the same thing, and the mean equals the median. Questions like this are quite frequent; they gauge a person's understanding of statistical measures, of symmetric distributions, and of distributions that are the same thing under different names.

So what does a bell curve distribution mean? This term is used quite frequently when describing a distribution. A bell curve distribution, or normal distribution as it may also be called, gets its name from the shape of the distribution, which looks like a bell. If you look at the image at the top right, next to the calculator, you can see why: the curve is symmetrical, just as a bell is, and that is why it is called a bell curve distribution.
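As a quick illustration of where that bell shape comes from, here is a minimal sketch, assuming NumPy and Matplotlib, that histograms normally distributed samples; the parameters are my own choice.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = rng.normal(loc=0, scale=1, size=10_000)  # standard normal samples

plt.hist(samples, bins=50, density=True)  # the histogram traces out the bell shape
plt.title("Bell curve (normal) distribution")
plt.show()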
So how are standard error and margin of error related? What happens to the margin of error as the standard error increases or decreases? As the standard error increases, the margin of error also increases; this is a positive relationship, like a positive correlation, where one value increasing means the other increases as well. Think of the relationship between age and height: as a person's age increases, their height also tends to increase. That relationship doesn't hold forever, of course, since after a certain age height stops increasing, but as a rough analogy it shows how two quantities move together. That is how standard error and margin of error are related to each other.

Now we come to a question that arises in hypothesis testing: what does degrees of freedom mean? As the name suggests, degrees of freedom relates to the number of options available. Suppose a cinema hall introduces three sizes of popcorn bucket, a small one, a medium one and a large one, and we want to study which size people prefer most of the time. The person being studied has three options, three choices. Degrees of freedom is used with t-distributions and z-distributions, the kinds of distributions you use when checking whether the conclusions drawn from your chosen sample are in congruence with the entire population, in other words whether the sampling was done correctly and without error. Whenever you build a statistical model you provide it with certain options; an option could be as simple as a binary choice, where a person either buys something or ignores it, or a broader spectrum, such as which of ten ice cream flavours people chose and how that choice varied by gender. If people were given ten ice cream flavours, or three sizes of popcorn bucket, those category counts feed into the degrees of freedom (in a chi-square goodness-of-fit test, for example, the degrees of freedom would be the number of categories minus one), and this is what is used to determine whether the conclusions drawn prove or disprove the hypothesis.
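To make the standard error / margin of error relationship concrete, here is a small sketch under common textbook assumptions (a 95% confidence interval for a mean, a t critical value with n - 1 degrees of freedom, SciPy assumed available); the sample values are invented and none of this comes from the session itself.

import numpy as np
from scipy import stats

data = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])
n = len(data)

standard_error = data.std(ddof=1) / np.sqrt(n)  # SE of the sample mean
t_crit = stats.t.ppf(0.975, df=n - 1)           # t critical value, n - 1 degrees of freedom
margin_of_error = t_crit * standard_error       # a larger SE gives a larger margin of error

print(f"SE = {standard_error:.3f}, margin of error = {margin_of_error:.3f}")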
Now let's look at covariance. Covariance is a measure of how two variables change together. If you have ever worked with statistical modelling of data, you will know that when two variables have a strong relationship, meaning one variable's values change as the other variable's values change, that can negatively impact the model we are trying to build, so it is important to understand the relationships between variables. Covariance tells you whether an increase in one variable leads to an increase in the other, a decrease in the other, or has no effect; in other words, whether the covariance is positive, negative or neutral. Some examples: a person's date of birth and their age are calculated from each other, so as one changes the other changes with it; similarly, a person's age and their height move together, since as age increases height also tends to increase, which is a positive relationship. A neutral relationship is one where a change in one variable has no effect on the other, for instance a person's choice of food and their income level, where there doesn't seem to be much of a relationship. A negative relationship is one where an increase in one thing leads to a decrease in another, such as a person's age and their bone density, which starts to decrease as they get older. So covariance tells us whether the change in one variable has an effect on another variable.

Let's take a look at the one-sample t-test. In case you're not familiar, the one-sample t-test is a kind of statistical hypothesis test; there are other t-tests as well, such as the two-sample t-test. A one-sample t-test is used to determine whether an unknown population mean is different from a specific value. For instance, if you are drawing statistical inferences from a student data set, you have data about each specific student and you want to figure out whether that data is in line with the population as a whole. You usually use a t-test when you have continuous data, such as a person's age, stock prices, or blood sugar and diabetes readings. What it allows you to do is determine whether an unknown population mean differs from a specific value: given a sample and a reference value, is the difference significant or not?
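Here is a minimal sketch of a one-sample t-test, assuming SciPy; the sample scores and the reference mean of 70 are invented purely for illustration.

import numpy as np
from scipy import stats

scores = np.array([68, 72, 75, 71, 69, 74, 73, 70])  # e.g. test scores for a sample of students
claimed_mean = 70                                     # the specific value we compare against

t_stat, p_value = stats.ttest_1samp(scores, popmean=claimed_mean)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# a small p-value suggests the population mean differs from the claimed value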
Then comes the alternative hypothesis. The alternative hypothesis, denoted H1, is the statement that must be true if the null hypothesis is false; the two are diametrically opposed to each other. In statistical testing we are trying to prove something, and that something is written down as a hypothesis. Okay guys, if you are looking for end-to-end training in statistics, Intellipaat provides a complete certification training on statistics for data science, and those details are available in the description box below; now let's continue the session. So we have two kinds of hypotheses, as already discussed: the null hypothesis and the alternative hypothesis. In inferential statistics, the null hypothesis is the default hypothesis, typically that the quantity being measured is zero, that there is no effect or no difference. For example, if you are trying to draw a conclusion about whether a group of people is representative of a larger population, the null hypothesis is stated in the negative: this sample is no different from the population, there is nothing special going on. The alternative hypothesis states the opposite, the positive claim you are actually trying to establish. You then perform your statistical inference based on the data you have and draw your conclusions. It is called the null, or default, hypothesis because it is framed around zero or "no effect", while the alternative hypothesis tries to establish something in the positive direction.

In the next question we try to understand: when you are creating a statistical model, how do you prevent overfitting? Overfitting is one of the most common issues that comes with statistical modelling. It is the problem you get when you have some training data and some testing data, you build a statistical model using the training data, and on that same training data the predictions are nearly 100% accurate, but when you feed in the testing data the accuracy is very low, around 40 to 50%. The model has essentially analysed and memorised everything about the training data, so when those same values come back it answers perfectly, but when a new value arrives that it hasn't seen, it has no idea and mostly guesses wrong. That is what overfitting means. How do you prevent it? One way is cross-validation: take the data set you want to train on and divide it into, say, ten parts; then build the model using nine parts for training and one part for testing, and rotate through all ten parts one by one so that each part is used for testing exactly once. That is how cross-validation helps you detect and reduce overfitting.
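A minimal sketch of 10-fold cross-validation, assuming scikit-learn and its built-in diabetes data set; the choice of model and data is my own, just to show the mechanics described above.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# split the data into 10 folds: train on 9, test on 1, rotating through all folds
scores = cross_val_score(model, X, y, cv=10)
print("per-fold R^2 scores:", scores.round(3))
print("mean score:", scores.mean().round(3))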
Then comes skewed distributions. In case you are not familiar with distributions in general, a distribution basically describes how the data is plotted and how frequently each value occurs. There are two kinds of skewed distributions: left-skewed and right-skewed. A left-skewed distribution is one where the left tail is longer than the right. It is important to note that in a left-skewed distribution the mean is smaller than the median, and the median is smaller than the mode. For those who don't know what the mode is, it is the value that occurs most frequently: if my five values are 1, 2, 4, 5, 5, then 5 is the mode because it occurs the most times. A left-skewed distribution differs from a symmetric or normal distribution, where the mean equals the median and the plot looks the same on the left-hand side as on the right; in a left-skewed distribution the left tail is longer than the right tail, and mean < median < mode.

Similarly, what is a right-skewed distribution? It is the diametric opposite: you get a right-skewed distribution when the right tail is longer than the left one. If the right tail is longer, the mean is larger than the median and the median is larger than the mode. So, to summarise: in a normal distribution the mean equals the median; in a left-skewed distribution the mean is smaller than the median, with the mode the largest of the three; and in a right-skewed distribution the mean is larger than the median, with the mode the smallest. The relationship between the mean and the median can therefore be used to tell what kind of distribution you would get if you plotted the entire data set: instead of plotting everything, simply compare the mean and the median. You can also calculate the mode, but the mean and the median tend to be the most telling.
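As a sketch of this mean/median relationship, here is a small example assuming NumPy and SciPy; the exponential sample is just one convenient way to manufacture right-skewed data and is my own choice.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
right_skewed = rng.exponential(scale=10, size=10_000)  # long right tail

mean = right_skewed.mean()
median = np.median(right_skewed)
skewness = stats.skew(right_skewed)

print(f"mean={mean:.2f} > median={median:.2f}, skewness={skewness:.2f} (positive => right-skewed)")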
Here is a question that gets asked very frequently; the values might change, but it tests whether you can use these quantities in your reasoning: if a distribution is skewed to the right and the median is 20, will the mean of the data be greater than or less than 20? Since the distribution is right-skewed, the mean should be greater than 20, while the mode will remain less than 20. The mean is the greatest, the median sits below it at 20, and the mode is the smallest value in this comparison; how much greater the mean is doesn't really matter. That is how a right-skewed distribution works.

Then: what kinds of biases can you encounter while sampling? For those who are unaware, sampling is the technique you use to extract a small sample from a large population of data, and different fields, such as medicine and statistics, use different sampling methods. One example: if you are collecting data about medical patients, you might take every third patient who comes in, so your data set is reduced to a third of the original; if 33 patients visit your clinic every day, you keep data on only 11 of them, selected simply by keeping the arrival order and picking every third person. Now, biases are essentially errors in sampling: when you sample, you end up choosing some observations over others, and that can introduce error. The three major kinds of bias are selection bias, survivorship bias and undercoverage bias. These are all important, and we'll look at them one by one, but the key point is that these biases heavily affect the information we get; if we get wrong information, we will draw wrong conclusions about the entire population.

Let's look at selection bias first. Usually interviewers ask about the different kinds of bias and then go through them one by one; if you answer the first one confidently, with good understanding and wording, they may not press you on the rest, but you should still know them. Selection bias is the kind of error that occurs when the researcher decides who is going to be studied: the choice of participants is not made randomly but on the basis of the researcher's preconceived notions, and that can lead to real problems. For example, if you are trying to study the population of an entire city and you simply say "give me the top ten earners in the city", the sample will not represent the population; you only get the most highly paid individuals, and your conclusions may be far off. It is possible the conclusion happens to be correct, but you will never know unless your sampling technique was sound. That is selection bias.
Then comes survivorship bias, which draws its name from the word "survive"; that is the key word here. Survivorship bias is a flaw in the sample-selection technique that occurs when a data set only considers the surviving or existing observations and fails to consider observations that have already ceased to exist. If you are studying certain species of fish and trying to draw conclusions, but the species that have disappeared over the history of living creatures are not considered, your results may be off. Similarly, for nuclear or other man-made disasters in which people died, if the data of those who died is not considered, you may not get an accurate picture of whatever you are trying to measure. That is survivorship bias.

Then comes undercoverage bias. Undercoverage bias occurs when some members of the population are inadequately represented, or under-covered, in the sample. Okay guys, if you're looking for end-to-end training in statistics, Intellipaat provides a complete certification training on statistics for data science, and those details are available in the description box below; now let's continue the session. Say you are trying to make predictions about a population in which 30% of the people are below the age of 18, 60% are between 18 and 40, and the remaining 10% are above 40. When you create a sample, say 10 people out of 100, you need to take this composition into account and mirror the real-world situation as closely as possible. One way to do this is to look at the distribution across the categories and choose proportionally: since 30 of the 100 people in our population are under 18, we take 3 people under 18 for our sample; 60% of the people are between 18 and 40, so we take 6 people from that age group; and finally we take 1 person above the age of 40. That way we have an accurate representation of the entire population.
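A small sketch of that proportional (stratified) allocation in Python; the group percentages come from the example above, while the helper code itself is my own illustration.

population_share = {"under_18": 0.30, "18_to_40": 0.60, "over_40": 0.10}
sample_size = 10

# allocate sample slots in proportion to each group's share of the population
allocation = {group: round(share * sample_size) for group, share in population_share.items()}
print(allocation)  # {'under_18': 3, '18_to_40': 6, 'over_40': 1}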
If we don't do this, we get a bias in our research, and that bias is called undercoverage bias. This is really important to understand, because if undercoverage bias is present, the data set is not representative of the population and errors propagate through everything built on it, so we need to be able to recognise when it has occurred and correct for it.

Then comes skewness. We have already looked at what left-skewed and right-skewed mean; the question here is what skewness itself is, and it is important to be able to describe it. Skewness is a measure of the lack of symmetry in a data distribution: either a distribution is symmetric, or it is skewed to the left or to the right; those are the only three options. That is what skewness is all about, a lack of symmetry in the distribution of the data.

Then comes kurtosis. This is a concept not many people look at; it is a scary-sounding word, but it is quite easy to understand once you have wrapped your head around skewness. Kurtosis is used to describe the extreme values in one tail versus the other; it is effectively a measure of the outliers present in the distribution. If your data has values at both extremes, values far below the first quartile and far above the third quartile, then the distribution has heavy tails and high kurtosis. Kurtosis tells you how many extreme values there are, and what kind, on either side of the number line, in both the positive and the negative direction.

Then comes correlation. Correlation is used to test the relationship between quantitative variables or categorical variables, and it sounds very similar to covariance. The important distinction is that correlation measures the magnitude, or strength, of the relationship: the correlation between two variables ranges from -1 to +1, where -1 is an extreme negative correlation, +1 is an extreme positive correlation, and 0 means there is no correlation. Covariance tells us whether the relationship is proportional or inversely proportional, that is, whether an increase in one variable leads to an increase or a decrease in the other; correlation, on top of that, gives the magnitude: how strongly a change in one variable goes with a change in the other, independent of whether the direction is positive or negative.
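Here is a compact sketch of computing skewness, kurtosis and correlation in Python, assuming NumPy and SciPy; the generated data is purely illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=5_000)
y = 0.8 * x + rng.normal(scale=0.5, size=5_000)  # y is strongly related to x

print("skewness of x:", round(stats.skew(x), 3))      # close to 0 for symmetric data
print("kurtosis of x:", round(stats.kurtosis(x), 3))  # excess kurtosis, close to 0 for normal data
print("correlation of x and y:", round(np.corrcoef(x, y)[0, 1], 3))  # between -1 and +1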
Now, in practice you almost never get a correlation of exactly zero; because of noise and other influences in the data, any two variables will show some small correlation. To decide whether two variables should actually be treated as correlated, look at the magnitude and what might be causing it rather than the small details: if two variables have a correlation of around 0.2 or 0.3, it is not significant and you don't need to worry about it; but if it goes above roughly 0.5, say 0.6, 0.7, 0.8, 0.9 or 1, or below -0.5 on the negative side, then the variables are strongly correlated, and if variables in your model are strongly correlated you typically need to drop one of the correlated variables.

Finally, we have the relationship between standard deviation and variance. The two sound very similar but are different measures: the standard deviation is the square root of the variance. The question can also be asked in the reverse format, how does the variance relate to the standard deviation, and the answer is that the variance is the standard deviation squared; multiply the standard deviation by itself and you get the variance. This question is asked mainly because people get confused between standard deviation and variance, and that can lead to a lot of confusing issues; knowing which value is the square root of which is the deciding factor in whether your answer is right or wrong.

With that, we have come to the end of all the questions. Okay guys, if you're looking for end-to-end training in statistics, Intellipaat provides a complete certification training on statistics for data science, and those details are available in the description box below. That's it for this session; I hope it was helpful and informative. If you have any queries, leave a comment down below, and if you have suggestions for videos you'd like us to make, leave those in the comment section as well. Thank you, and see you in another session.
Info
Channel: Intellipaat
Views: 104,726
Keywords: Statistics Interview Questions, Statistics Interview Questions and Answers, Statistics Interview Questions for Data Analyst, Probability and Statistics Interview Questions and Answers, Statistics Questions and Answers, Statistics Practice Test, Statistics Interview Preparation, How To Prepare Statistics Interview, How To Crack Statistics Interview, Statistics, Statistics Training, Statistics Interview, Statistics Tutorial for Beginners, Statistics Interview Videos, Intellipaat
Id: LMz2ouNcXUQ
Length: 49min 25sec (2965 seconds)
Published: Wed Feb 10 2021