CFA Level 1 Quantitative Methods - Sampling and Estimation

Hi everyone, welcome to this new session of CFA Level 1. Today we will be going over a new reading from the quants topic: Sampling and Estimation. This reading does not carry on exactly where the previous reading left off; rather, it uses some of the concepts we covered in the probability distributions topic and builds a base for your last quants reading, which is hypothesis testing. So this reading is paramount to understanding hypothesis testing well. Hypothesis testing as such is just one reading in your Level 1 quants, but your Level 2 quants starts exactly from hypothesis testing, so a good grasp of this reading will make your entire CFA journey across Levels 1, 2, and 3 much smoother.

Let's start with some basic parts of this reading. There are a few theoretical details to know before we can get into the calculations and the more complicated parts of this chapter. Initially, there is some discussion of the various ways you can collect a sample.

Sampling methods. In a simple random sample, you select a few people from a population at random. There is no basis on which you select those people; that's why it is called simple random sampling. The chance of being picked is equal for everyone in the population: if there is a population of 1,000 people and I want a sample of 50, the probability of being selected is the same for every single person. In systematic sampling, which is a little bit of an extension and not exactly random as such, out of the entire population let's say I select every 10th person. So I have a population of 1,000 people and I want 100 for my sample; instead of selecting 100 at random, I place everyone in a line and take every 10th person. If you remember your school days, sports teachers often made teams this way: everyone stood in a line and they picked every other person. Stratified random sampling adds one step to simple random sampling: out of the entire population, I first create groups, called strata, based on the characteristics of the population, and then I sample at random from each group. For example, in a medical study I might need to analyze the male and female populations separately, so instead of just selecting 100 people from the whole population, I first divide it into male and female groups and then select people at random from each. These are the three highlighted sampling methods.

Sampling error. Sampling error is calculated as the sample mean minus the population mean. Normally, the purpose of creating a sample is that I don't want to perform analysis on the entire, larger population; I take a subset, analyze it, and make decisions on the basis of that analysis. The mean is a sort of representation of the population, and if the representation I get from my sample differs from the one that accurately represents the full population, that difference is the sampling error: it tells me by how much my estimates might be off just because I took a sample instead of analyzing the entire population. So just have a basic understanding that this is the sort of error that can creep into an analysis based on samples rather than the population.

Data types. Your syllabus highlights three kinds of data, because the primary focus of quants is not to make you a statistics expert but to make you comfortable analyzing financial information, and most financial information falls into one of two categories. Time series data represents one particular piece of information over a longer period of time, for example a company's profits for the last 10 years: one piece of information taken over multiple time periods. Cross-sectional data compares different entities at one single point in time: in this year, I want to compare the profits of two companies, so I am comparing profits across companies at the same moment. So in time series you check one entity over multiple time periods; in cross-sectional data you check multiple entities at one time period. You can also combine the two: a data set comparing the profits of different companies over multiple time periods is known as panel data.

That covers the intro basics, just some terminology you need to be comfortable with. Let's move on to the main discussion of this chapter: the central limit theorem. This is probably the most important point of the reading, because if you understand the central limit theorem, trust me, everything in this reading and in hypothesis testing is just a case of slightly different variations of the same calculation. The central limit theorem is based on sampling distributions, so before we look at what the theorem says, we first have to understand what a sampling distribution is. So far throughout quants you have looked at samples and populations and calculated means, medians, modes, and standard deviations. In reality, if I take just one sample for my analysis, there is a chance it could end up being biased and not very representative of the population, and in the real world, when financial decisions are based on samples, that could be a very costly mistake. So often, in practice, we do the following. Say I have a population of 100 people. Doing the analysis on all 100 is tough, so I take a sample of 10 people, call it S1, and calculate its mean and standard deviation. Then I take another sample of 10, S2, and calculate its mean and standard deviation; then a third, and so on. Statistics says the best way to eliminate the errors a single sample can carry is to take multiple samples, calculate the mean of each, and create a new series consisting of just those means. This data series of means of samples taken from the population is known as the sampling distribution. I'll repeat this, because once you understand what a sampling distribution is, the rest of the chapter is very easy: taking just one sample from the population might give statistical errors, so instead you take multiple samples, calculate the mean of each, and form a new, smaller series which is just the means of all the samples previously taken. That distribution is the sampling distribution. The central limit theorem then makes statements about the sampling distribution for samples of size n; in our case n is 10, since we took samples of 10.
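The sampling-distribution idea above is easy to check with a quick simulation. Below is a minimal sketch using only Python's standard library; the population values (heights around 178 cm) and the sample counts are hypothetical numbers chosen for illustration, not figures from the reading.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 1,000 heights (cm), drawn from a
# normal distribution purely for illustration.
population = [random.gauss(178, 10) for _ in range(1000)]
pop_mean = statistics.mean(population)
pop_stdev = statistics.pstdev(population)  # population standard deviation

# Take many samples of size n and keep only each sample's mean.
# This series of means is the sampling distribution.
n = 10
sample_means = [
    statistics.mean(random.sample(population, n)) for _ in range(5000)
]

# Central limit theorem: the mean of the sampling distribution is
# close to the population mean, and its standard deviation (the
# "standard error") is close to pop_stdev / sqrt(n).
print(round(statistics.mean(sample_means), 1))   # close to pop_mean
print(round(statistics.stdev(sample_means), 2))  # close to the line below
print(round(pop_stdev / n ** 0.5, 2))
```

Increasing `n` shrinks the spread of `sample_means`, which is exactly the root-n behavior the theorem describes.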
So the key for the central limit theorem is that all of these samples have to be of the same size; if they are of different sizes, the central limit theorem will not work. For a sample of size n, the sampling distribution follows two rules.

First rule: the mean of the sampling distribution approaches the population mean, which means the sampling error reduces to almost zero. We are not saying it will be exactly zero, but for our exams we will take it as zero in all calculations. The idea is simple: there might be some biases in each individual sample, but by taking the means I have already removed most of their effect. A very simple example: suppose I want to analyze the heights of people in a population where the tallest are six foot five to seven feet, the shortest are five feet, and the average height is around 5 feet 10 inches. If one of my samples happens to contain two people who are 7 feet tall, that sample would be biased, while the others would be fine. But I am not taking the whole sample, I am taking just its mean. If the mean was supposed to be 5 feet 10 inches, that biased sample's mean might come out around 6 feet; it won't necessarily get anywhere near 7, because the other people in the sample are still of normal height. So a lot of the bias, the difference from the 5-feet-10 average up to 7 feet, has already been reduced to the difference from 5 feet 10 to 6 feet, and when I then take a distribution of just the means, I get closer and closer to what the population actually represents. That's why the mean of the sampling distribution will be approximately equal to the population mean, which means the best way to estimate the population mean is to calculate the mean of the sampling distribution.

Second rule: the standard deviation of the sampling distribution equals the population standard deviation divided by the square root of n. Same example: 100 people, and I want the average height. Within one sample of 10 people I can have deviations, some very tall, some very short, but when I take the mean of the sample, a lot of those deviations are averaged away. Think logically: select any four cities, wherever you are. Taking India as an example, suppose I survey heights in New Delhi, Mumbai, Kolkata, and Bangalore. There might be taller people in all of those cities, and in the individual samples their effect is there; but my sampling distribution consists of just four numbers, the average height of each city, and all four are already averages. Averages will not show the distinction between a five-foot person and a seven-foot person; if the true average is 5 feet 10 inches, one city might come out at 5 feet 8 inches and another at 5 feet 11 inches. Because the distribution is already made of means, the deviation is significantly reduced, and its standard deviation can be calculated as the population standard deviation divided by the square root of n. The root of n simply means that the larger your sample size, the less the deviations carry over to the sampling distribution: if I take a sample of five people and one of them is 7 feet tall, the mean of that sample will be pulled high, but if I take a sample of 50 people and only one of them is 7 feet tall, he will not be able to impact my mean or standard deviation as much. As n increases, the deviation of this series decreases.

These are the two rules to remember. There are statistical derivations behind them, but they are not part of this level, and I would recommend not wasting your time on them. Keep these two things in mind: from this point on, throughout the remaining quants, we will be dealing with sampling distributions, the sample mean and the population mean will be treated as essentially identical, and the standard deviation of the sampling distribution is calculated as the population standard deviation divided by the square root of n. This standard deviation of the sampling distribution is given a very unique name: the standard error. So when you see the name change from standard deviation to standard error, don't get confused; it just means the standard deviation of the sampling distribution. I hope the central limit theorem is clear, because if it is, the remaining part is going to be very easy for us to deal with.

Before we move to the next part, a quick example so you are comfortable with the standard error calculation. Say I am looking at return data for stocks on a particular exchange, and the population has a standard deviation of 5 percent. If the sample size is 50, calculate the standard error. Pause the video and give it a try. The standard error comes out as the population standard deviation, 5 percent (you can take it as 0.05 if you want and convert back to percentages at the end), divided by the square root of n, the sample size. Solving, you get 0.7071 percent, or 0.007071 in decimals.

There is one small nuance to be aware of. Think logically: if I already know the population mean and the population standard deviation, do I really need to do sampling? No. The only reason we prefer to sample is that calculations on the full population are not feasible. The central limit theorem solves the mean side, since the sample mean from the sampling distribution approximates the population mean; but if I already knew the population standard deviation, I wouldn't be sampling in the first place. So in the real world that figure will often be missing, and if the population standard deviation is not known, the standard error can instead be calculated as the sample standard deviation divided by the square root of n. That is just an iteration of the same calculation. This completes the discussion of the central limit theorem.

Now that we understand the basics of the central limit theorem, let's see how we apply it in our analysis. The next topic in focus is the confidence interval. Before looking at the formulas, I want a general discussion so you get an idea of why exactly we need a confidence interval. Say a client comes to me wanting to make some investments, and he asks me, as a financial expert, what I think the inflation will be for the next year. Based on my analysis I tell him that, in my opinion, inflation for the next year is going to be 7 percent. The year passes, the actual inflation comes out at 6.9 percent, and he tells me I was wrong: I said 7 and it was 6.9. Think logically: were my estimates really far off? Actually not. If actual inflation had come out at 2 percent or 15 percent, that would be far from 7; but if it stays close to 7, we will still consider the analysis correct, because at the end of the day it was just an estimate, a prediction. In reality we don't work with one rigid point where 7 is right and 6.9 is wrong. What we have instead is an acceptable range around the estimate: while I give a 7, I also say that as long as inflation stays between, say, 5 and 9 percent, my analysis was more or less on point; I was able to predict pretty much where inflation would end up. If it goes beyond 5 and 9, I will say that the analysis by which I got 7 percent was probably not the most accurate. If this example makes sense: the 7 percent is known as the point estimate, normally the mean, one singular point we think represents the entire population; and the 5-to-9 range is known as the confidence interval. The only thing left to discuss is how exactly to calculate that range, because in different data sets it won't just be a plus-two, minus-two kind of situation; for that we have the confidence interval construction.

It follows a very basic equation. In your syllabus material they use multiple distributions and some detailed examples; the reason I won't go into that much detail in this reading is that I'll cover all the distributions, and how to read the distribution tables, in the hypothesis testing chapter itself. Here the focus is on understanding the theory that forms the base. The confidence interval is calculated as the point estimate, which is the mean, plus or minus (remember, I went a little below 7 and a little above 7, so to construct the acceptable range I go both sides, solving once with plus and once with minus) z alpha by 2, which I'll discuss in a moment, multiplied by the standard error. The mean and the standard error we already know; the only new thing in this equation is this z alpha by 2.
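The equation above, mean plus or minus z alpha by 2 times the standard error, can be sketched in a few lines of code. This is a minimal illustration, not exam material: the figures (a 10 percent mean return, 5 percent population standard deviation, sample size 50) are hypothetical, and 1.96 is the standard z-value for a 95 percent confidence level.

```python
import math

def confidence_interval(mean, std_dev, n, z=1.96):
    """Point estimate +/- z_(alpha/2) * standard error.

    std_dev is the population standard deviation (or the sample
    standard deviation when the population figure is unknown).
    z defaults to 1.96, the z-value for a 95% confidence level.
    """
    standard_error = std_dev / math.sqrt(n)  # sigma / sqrt(n)
    margin = z * standard_error
    return mean - margin, mean + margin

# Hypothetical figures: mean return 10%, population std dev 5%, n = 50.
low, high = confidence_interval(10.0, 5.0, 50)
print(round(low, 4), round(high, 4))  # roughly 8.61 to 11.39
```

The same function works for a t-distribution question: you would simply pass the given t-value in place of the default z, and the sample standard deviation in place of the population one.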
now z represents normal distribution which simply means that the value will take here it will be picked from normal distribution now how exactly do we pick that value i will cover that in the next reading in the hypothesis itself so this is normal distribution and sometimes instead of this you can also have another iteration of the same equation which is mean plus minus t alpha by 2 multiplied with standard error now in this case we use t distribution instead of z we will discuss t distribution with an example as we go along in this session itself now the idea here is fairly simple i want to pick some distribution which gives me an idea as to how much volatility my data series has for example if i talk about my stock returns i have a mean return of 10 percent but the standard deviation is 20 which means on average deviations are very high this return can go from minus 10 to 30. in that case i cannot construct a confidence interval as 8 to 12. so effectively what i am doing is in order to see what the acceptable interval is i have to incorporate the deviation that happens so deviation is being multiplied with a factor of distribution if my data follows a normal distribution i will use this normal distribution to see what the ideal confidence interval should be now outright just looking at these symbols these equations these can be slightly technical they might look tough so let's do one thing let's try to solve an example so here i have an example to have better understanding of the entire concept of confidence interval in this example we will be using the z distribution but for the t distribution you can replicate it in exact same manner so after we are done with this example i'll just go over some basic theoretical details that is what exactly is the difference between z and t distribution so we'll just cover those differences for understanding purposes but for the calculation purpose at least for this entire reading z and d work in the exact same manner so let's 
look at the example a sample with a size of 20 so 20 is the sample size n becomes 20. it has a mean of seven percent so maybe i took a sample of some stock exchange uh listed stocks and i had returned data and from that i have a mean of seven percent and the population standard deviation is two percent calculate the confidence interval at 95 percent confidence level now this entire phase at 95 confidence level for now just ignore it let's try to solve the question and then i'll help you interpret and understand what that 95 means so if i need confidence interval the calculation was mean plus minus z alpha by 2 which is z 2.5 for now just take this value as it is multiplied with population standard deviation divided by root of n this is given as 7 this is given as one point nine six this is two percent and n is given as twenty so if i solve this it becomes seven percent plus 1.96 multiplied with 2 percent divided by root of 20 and 7 minus 1.96 multiplied with 2 divided by root of 20. so once you take a plus sign once you take a minus sign give it a try see if you can solve both of these and let's see what values you get now if you solve you have two values seven point eight seven six five percent and six point one two three five percent this is your answer and the calculations will be exactly same the only thing that can change is if it's a question talking about t distribution instead of this z value they'll provide you with t value or they provide you with the t distribution so that you can calculate the value yourself and instead of population standard deviation they'll give you sample standard deviation that's it now once we have this i want to make it more clear as to what these two values mean now if you remember normal distribution from your previous reading we have a value of me mean is 7 [Music] and then we have a distribution something like this this is sort of a normal distribution now in this particular case what i am trying to do is the question will 
tell you you have to construct a confidence interval at some confidence level this confidence level means you need to find out two points such that the area of the graph between those two points this area this is 95 percent of the total area that is what confidence level means which means 95 chance there is that the actual return of the stock will fall between this range because 95 of the data will fall within these two points you can interpret it in two ways firstly this 95 chance the actual return of this particular stock in the future might also fall in this range or i am 95 sure that this range is accurately representing what the expected return of the stock is so both ways very similar you can interpret it in either way that you find comfortable now over here we had a symbol known as alpha alpha is called level of significance level of significance is nothing but 1 minus confidence level so the question could give you this 95 percent and you can use this to interpret that level of significance is five percent or it can be the other way around where the question gives you five percent you can use it as it is so be comfortable with getting information of either level of significance or level of confidence i hope the confidence interval the concept is clear because this reading is more about building the base for what we are going to do at hypothesis so it is very important that you have the basic understanding of what confidence interval means what we calculated and what these two values mean and how the confidence level comes into play so effectively what i am saying now from all of this data is that there is a 95 chance that the return of this stock will fall between six point one two three five percent and seven point eight seven six five percent if the returns fall within this range i'll say that my analysis was good if the actual returns end up beyond which is not impossible i have left some area outside so it is possible in a remote chance that the actual 
return comes out outside this range in that case my analysis by which i got 7 percent probably that was not correct and i need to redo my entire financial model so i hope the basic understanding of how confidence interval works and how we interpret this is clear let's quickly discuss some differences between z and t distribution and some of the biases that can exist in the data and let's wrap up this reading the last discussion of this chapter is going to just be covering some of the miscellaneous topics some of the theoretical discussions that you have throughout this entire chapter so let's quickly wrap all of them up and finish off this reading so the first topic that we need to discuss is t distribution now how to read the t distribution we'll cover it in the next reading when we do hypothesis but for now you just need to know a difference between normal distribution that you covered in the previous reading and what t distribution is now normal distribution outright it has a mean it has a standard deviation and it is uniform on both sides so it is symmetric in terms of its distribution t distribution also has similar properties t distribution is also shaped like a normal distribution symmetric across both sides but t distribution has fatter tails than z distribution which means the distance this distance this is higher in case of t than it is in case of z distribution and this just has one major significance for us from a statistical standpoint whenever we are dealing with samples there is a chance that the sample might not give us the results that we need it might have some biases due to which the results and the analysis could be some could have some sort of errors now in those situations i normally want to do analysis in a way that takes slightly more conservative approach the idea is simple this distance it simply represents what is the probability of having an event so far away from the mean if the mean is the center point what is the chance of having an 
event far from the mean and t distribution has that higher which means t distribution inherently increases the risk slightly of getting an outlier in your possible zones so whenever we have to use sample data instead of population data to calculate standard error or confidence interval let's say we are using sample standard deviation instead of population standard deviation there can be some errors in the sample calculation so for that reason instead of using z distribution with sample standard deviation i use t distribution so that those errors are balanced out with slightly conservative calculations of the t distribution now aside from t distribution you have a discussion about desired characteristics of an estimator now what is an estimator well a mean is an estimator so any central point mean medium those are estimators what we use to estimate some information about the population that is an estimator the confidence interval that we just discussed that is also an estimator so what would make these means these confidence intervals really good what would be the qualities of a good confidence interval there are three desired characteristics highlighted in your syllabus which means the estimator has to be unbiased efficient and consistent unbiased simply means that it has to be so good that it takes away the effect of a lot of my statistical errors or sampling errors that might exist efficiency means that it should be able to get me as close to the population mean and close to its population standard deviation in the smallest amount of sample size so if i am increasing the sample size to get close to sample mean so if i'm increasing the sample size to get close to a population mean i'll have to collect more sample i'll have to analyze more data but if i'm able to do the same thing with smaller samples it means my estimator is very efficient and lastly consistent simply means that it is applicable across multiple runs so it should not be an estimate that i do it 
once it works i do it again it fails so it has to be consistent time over time i should be able to apply the same analysis or multiple time periods in a consistent manner the last discussion in focus for this reading is some of the common biases that can exist in data so this is again a theoretical discussion that you have in your syllabus very important because you could get a direct theory question from it just from the meaning so there are certain situations that could create some sort of bias in my data and if the data is biased the analysis wouldn't be biased and my output or my decision made from that analysis would also be biased so let's look at some of these issues so first one is the data mining bias now data mining by itself is just looking at large piles of data and trying to find patterns within those data data mining bias is a specific situation where we try to find the patterns and we might even identify patterns but as such they don't have any economic significance we just find some sort of statistical relation between variables which should not be economically related now normally because data mining has become such a big down often what happens is if we find something by analyzing data we tend to place more weightage on that rather than economic logic so as such data mining bias occurs when more importance is given to a trend which has statistical significance but not enough economic significance simply because it was found as a result of data mining so as such the best way to correct for this is that yes you should analyze data you should analyze a lot of information but at the same time you should only to you should only take data sets which have some sort of economic relation if the two data sets are not economically related even data mining can often give you statistical results if you run just two random number series in excel and you calculate some sort of relation between them it won't come out as zero you will get some relation but we know 
So that's the data mining bias.

Next up we have sample selection bias, which, as the name suggests, is bias that exists in the data simply because the selection of the sample itself, of who will be part of my sample, was not done in a proper manner, so my sample is not representative of the population to begin with.

Next we have survivorship bias, very common in financial information, mainly when we talk about stocks, companies, or prices of securities. If I look at the market right now, the market only contains securities which are listed. There are companies which have gone bankrupt in the last 10 years, companies which have failed and have been delisted, and all of those are removed. So if I look at the market right now, it is just a collection of companies which have survived, and as such it gives me no information about companies which have failed. If we don't include the failed companies in our data, the data tends to represent only the good companies, the ones which have survived, and so there could be some bias: maybe my mean return comes out higher than it actually should be, and the standard deviation could be lower. It might give the impression that return is higher and risk is lower, that kind of situation.

Then we have look-ahead bias, very common when we deal with financial statements. The bias mainly comes from the fact that when I do analysis, the information I'm using pertains to two different time periods. For example, I'm looking at prices of companies; prices update on the stock market on a real-time basis, so if I want to check the price of any company right now, I can do that. But let's say I want to analyze the financial statements of the company; the latest annual financial statements I have access to are as per the Indian financial year, 31st March 2020. As
per the US financial year, you'll now start getting financials for 31st December 2020, but if I take the Indian financial year, the last annual profit the company has reported pertains to almost 10 months ago. So I have the price right now, and if I compare it with earnings from 10 months ago, I have a look-ahead bias: I'm analyzing two things which are not in the same time frame; they are at different points in time. I'm comparing prices against profits where the price did not exist when the profit was reported, and the profit figure is stale now that the price is in the market. The two elements do not match in terms of their timing.

Lastly, you have time period bias, which is the simple case of taking a sample from a time period which is too long or too short. For example, say I want to analyze some sort of inflation environment and I take economic data for just two years; two years is too short. Conversely, if I take economic data for 50 years, that is also not very accurate. Why? Because modern economics is more dynamic; an economic cycle runs roughly 10 years, so ideally you should take somewhere around 10 or maybe 15 years of data to incorporate one full economic cycle, so that you know what the recent economic cycle represents in terms of its information.

I hope all the biases and the miscellaneous topics that we've covered are clear. That brings us to a close of this reading. If you study from the curriculum or some other study material, you will find a few more examples for this particular reading. The reason I haven't focused on those is, as I've said multiple times throughout the video, that the focus I want to have in this reading is to build a base for hypothesis testing; we'll be spending more time on hypothesis testing than a lot of other materials do. So don't worry: the focus here was to build the base of the concepts, and we will cover all the distributions and how to read and study them as we go into hypothesis testing.
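Before wrapping up, the survivorship bias point from earlier can be made concrete with a tiny numerical sketch. This is my own made-up example, not from the lecture; the company names and return figures are entirely hypothetical.

```python
# Hypothetical annual returns (%) for a universe of 8 companies.
# Companies E, F, and G failed and were delisted; a market index
# built today contains only the survivors.
returns = {
    "A": 12.0, "B": 8.0, "C": 15.0, "D": 6.0,  # survivors
    "E": -40.0, "F": -100.0, "G": -55.0,       # failed / delisted
    "H": 10.0,                                 # survivor
}
failed = {"E", "F", "G"}

survivors_only = [r for name, r in returns.items() if name not in failed]
full_universe = list(returns.values())

mean_survivors = sum(survivors_only) / len(survivors_only)
mean_full = sum(full_universe) / len(full_universe)

print(f"Mean return, survivors only: {mean_survivors:.2f}%")  # 10.20%
print(f"Mean return, full universe:  {mean_full:.2f}%")       # -18.00%
```

Looking only at the survivors, the average return appears to be a healthy 10.2%, while the true universe averaged -18.0%: exactly the "return looks higher, risk looks lower" distortion described above.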
So that's it for this session. I would suggest you start practicing a few questions on confidence intervals and the other theoretical details we covered. If you have any doubts or queries, you can always contact me. That's it for this session; I'll see you in the next one. Goodbye!
Info
Channel: Money Decoding
Views: 723
Rating: 5 out of 5
Id: Gl4q-LaF8bY
Length: 43min 15sec (2595 seconds)
Published: Fri Jan 08 2021