R for Data Science - Statistics Full Course - Statistical Data Analysis

Captions
Hello and welcome. My name is Saya, and I have created this course on statistical data analysis using R. You might wonder why we need data analysis. Well, we live in a data-rich world. Data is revolutionizing businesses in many ways: companies are using data to understand their customers better, and the insights gained from analyzing data are helping companies identify growth areas. Statistical learning refers to a set of tools for modeling and understanding data. Statistics is the study of how to collect, analyze and draw conclusions from data. Statistics involves identifying a problem, collecting relevant data, analyzing the data and forming a conclusion. R is a popular programming language adopted for data science and statistics. The R programming language is used by professionals and data experts around the world for modeling financial data, mapping marketing trends and other analysis, and the best thing is that the R programming language is free. This course on statistical data analysis is the best way to start learning data analysis and getting hands-on with R's programming features.

In this course we will learn about qualitative and quantitative data. Moving forward, you will acquire knowledge about the mean, median, quartiles, variance and standard deviation, and where these descriptive statistics concepts are applied. Eventually we will look at bivariate and multivariate data, and we will learn about probability distributions and hypothesis testing, which are among the most important topics in inferential statistics. Each of these sections includes screencasts on application and implementation.

R is an open-source statistical programming language supported by a vibrant community. R is also called a language for statistical computing. R runs on almost any computing platform and operating system. R provides a rich collection of statistical techniques and functions, and has sophisticated graphical and visualization capabilities for plots and graphs. This language also provides complex
visualization of high-dimensional data. Another very important feature of R is that it is highly extensible. In this course we will also cover the basics of R and programming fundamentals, as well as R data structures such as vectors, matrices, lists and data frames. Finally, we will look at how to import data from various sources and how to handle large data sets. On top of that, in this course we will move deeper into the graphical capabilities of R and learn how to create visualizations that will help in analysis.

In this video we will look at different kinds of data and talk about the ways of obtaining that data; that is, we will speak about populations and samples. The very first step in statistics is obtaining the data. If you are hungry, you have to find the food first; in the same way, if you want to analyze data, you first need to obtain the data. There are different ways of obtaining data, but the most important are observation and experiment. Let us consider a simple example: which kind of pizza is more popular? It would be easy to gather all the information if all the pizzas were sold at one place. However, this is not the situation, since there are many different eating places where you can find or buy pizzas. In this situation the best choice is sampling. The question is: is it possible to extend an estimate from the small piece of sampled information, that is, the sampled pizzas, to the whole big population of pizzas sold at different locations? Statistics can answer this question. Statistics is a form of mathematical analysis that uses quantified models, representations and synopses for a given set of experimental data or real-life studies. A lot of the data used by mathematicians, academicians, statisticians, researchers, marketers, analysts and so on are actually samples, not populations. A sample is a subset of a population: the objects on which we have measurements or observations form a sample. In contrast, a population is the set of all objects that had the same chance to
become part of a sample. In principle there are two different sampling strategies: random sampling, which means that the individual objects for measurement are selected at random from the population (for example climate data, stock prices and so on), and stratified sampling, which requires that the population is divided into subpopulations; these are sampled separately, and the population characteristics are estimated by using mean values.

Let us now talk about different kinds of data. Data can be classified into qualitative and quantitative. For qualitative data, the question to ask is: is it possible to represent the data in zeros and ones? If it can be represented in zeros and ones, we call it binary data; otherwise we call it nominal data. An example of nominal data is gender, which can be represented as male and female, or shoe sizes, which are labels attached to the data. The second question for classifying data is: is the data the result of a measurement? Here the data can be of two types, continuous data or discrete data. This type of data is known as quantitative data.

In this video we will talk about statistics, more specifically about the types of statistical methods, different statistical measures and the categorization of descriptive statistics. Statistics is a form of mathematical analysis that uses quantified models, representations and synopses for a given set of experimental or real-life data. Statistics studies methodologies to gather, review, analyze and draw conclusions from data. There are many different types of statistics, depending on which situation you need to analyze. Statistics are used to make better business decisions. Some statistical measures include the following. The mean is the mathematical average of a group of two or more numerals. The mean for a specified set of numbers can be computed in multiple ways, including the arithmetic mean, which shows how well a specific commodity performs over time, and the geometric mean, which shows the performance results of an
investor's portfolio invested in that same commodity over the same period. Regression analysis determines the extent to which specific factors, such as interest rates and the price of a product, influence the price fluctuation of an asset. When this is depicted in the form of a straight line, it is called linear regression. Skewness describes the degree to which a set of data varies from the standard distribution in a set of statistical data. Most data sets, including commodity returns and stock prices, have either positive skew or negative skew. Variance is a measurement of the span of numbers in a data set. The variance measures the distance of each number in the set from the mean. Variance can help determine the risk an investor might accept when buying an investment.

Statistics is a term used to summarize the process that an analyst uses to characterize a data set. Statistical analysis involves the process of gathering and evaluating data and then summarizing the data in mathematical form. Statistics is used in various disciplines such as business, social science and so on. Two types of statistical methods are used in analyzing data: descriptive statistics and inferential statistics. Descriptive statistics is used to summarize data from a sample, for example using the mean or standard deviation. Inferential statistics is used when the data is viewed as a subclass of a specific population.

Let us now discuss descriptive statistics. Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of it. Descriptive statistics are broken down into measures of central tendency and measures of variability, or spread. The measures of central tendency focus on the average or middle values of the data set, whereas measures of variability focus on the dispersion of the data. These two kinds of measures use graphs, tables and general discussion to help people understand the meaning of the analyzed data. The measure of central tendency includes the mean, median and mode, and
describes the center position of a distribution for a data set. The measure of variability includes the standard deviation, variance, and the minimum and maximum values, as well as skewness. Measures of variability, or measures of spread, aid in analyzing how spread out the distribution is for a set of data. While the measure of central tendency may give a person the average of a data set, it does not describe how the data is distributed within the data set.

In this video we will talk about qualitative data. In the R programming language, data is generally stored as vectors or data frames. Of course, there are several types of data that we might need. Qualitative data, which is also considered categorical data, can be stored in R using factors, and quantitative data, that is continuous or numerical data, can be stored in R using numerics. In this video we will consider qualitative data and learn how to work with it in R. Qualitative data is non-statistical and is typically unstructured or semi-structured in nature. This data isn't necessarily measured; instead it is categorized based on properties, attributes, labels and other identifiers. Data generated from qualitative research is used for interpretation, developing hypotheses and understanding. Qualitative data will almost always be considered unstructured data. Qualitative data cannot be collected and analyzed using conventional methods. Some real-life examples of qualitative data are: categorizing gender based on properties, that is male and female; categorizing shoe sizes based on labels; and ratings for a product based on some attributes.

In this video we will practically look at how to handle qualitative data. Let us take an example of shirt sizes. We will create shirts as an object, and using the assignment operator and the c function we will assign different sizes as the elements of the vector shirts. Let us say we have S for small, M for medium, L for
large, XL for extra large, XXL for extra-extra-large, and so on. Since the labels S, M, L, XL and XXL are all characters, every element of this vector is written in single quotes, and the vector shirts is a character vector. In the next statement we display the elements of the shirts vector. Let us execute these statements. You can see that the output displays the elements of the shirts vector. Let us create another object, shirt_sizes, and with the assignment operator we use the factor function, passing shirts as the argument. This statement creates shirt_sizes as a factor. In the next statement we display the contents of shirt_sizes. When we execute these statements, you can see that shirt_sizes is a factor: the output contains the elements of the factor, followed by the levels, which are the unique values within the factor. We can also use the str function, passing shirt_sizes as the argument. The str function displays the structure of the object passed. When we execute the statement, you can see that shirt_sizes is a factor with five levels, and it also displays the integer codes of those levels. We can also use the summary function, passing shirt_sizes as the argument. The summary function returns a table that comprises the levels and the counts of how often each level occurs within the factor. If you observe, the level L occurs twice in the vector, level M occurs three times, level S occurs twice, XL also occurs twice, and XXL occurs twice. R also provides levels as a function, which takes an object as an argument. In this case let us provide shirts as the argument. When we execute the statement, the output is NULL, because the levels function works on factors, not on plain vectors.
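The factor workflow described so far can be sketched in R as follows. The captions do not show the actual contents of the shirts vector, so the eleven sizes below are illustrative values, chosen only so that the counts match the ones read out from summary (L twice, M three times, S, XL and XXL twice each). The helper names size_counts, vector_levels and factor_levels are also assumptions made for this sketch.

```r
# Illustrative data: the exact vector used in the video is not given in the captions.
shirts <- c("S", "M", "L", "XL", "XXL", "M", "S", "L", "XL", "M", "XXL")
print(shirts)                         # a character vector

# Convert the character vector to a factor (qualitative data).
shirt_sizes <- factor(shirts)
print(shirt_sizes)                    # the elements, followed by the levels

# Structure of the object: a factor with 5 levels plus its integer codes.
str(shirt_sizes)

# Counts of each level within the factor.
size_counts <- summary(shirt_sizes)
print(size_counts)

# levels() returns NULL for a plain character vector,
# and the unique values for a factor.
vector_levels <- levels(shirts)
factor_levels <- levels(shirt_sizes)
print(vector_levels)
print(factor_levels)
```

Keeping the character vector and the factor as two separate objects, as the video does, makes it easy to compare how the same function behaves on each.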
In this case shirts is a vector and not a factor. Let us provide shirt_sizes as the argument to the levels function. When we execute the statement, all the levels of the factor shirt_sizes are displayed. We can also use the table function, passing a factor as the argument; in this case it will be shirt_sizes. When we execute the statement, a table is displayed where the first row contains the levels and the second row displays the count of the elements corresponding to each level within the vector.

In this video we will look at how to visualize qualitative data. We can use bar plots: a bar plot is used to display the relationship between a numeric and a categorical variable. We can also use pie charts: a pie chart is a circular statistical graphic that is divided into slices to illustrate numerical proportions of categorical data. We will use the example of shirt sizes as qualitative data. Let us create an object shirts, and using the assignment operator and the c function we assign values to the shirts vector. In the next statement we create an object shirt_sizes, and using the assignment operator and the factor function we create shirt_sizes as a factor; we pass shirts as the argument to the factor function. We create another object, shirts table, and using the assignment operator we use the table function and pass shirt_sizes as the argument. In the next statement we display the contents of shirts table. Let us execute these statements. You can see that a table is displayed where the first row is the levels and the second row is the count of elements corresponding to each level in the vector. Let us now visualize the qualitative data by creating a bar plot. We use the barplot function and pass shirts table as the argument. When we execute the statement, a bar plot is created where the levels are displayed on the x-axis and the y-axis represents the counts from the table. We can customize this bar
plot by assigning colors to the different bars using the col parameter and assigning the colors using the c function. Let us assign the colors blue, green, red, yellow and black. We will also use the ylab parameter and assign the label as count. When we execute the statement, you can see that we have a customized bar plot with a label on the y-axis and bars drawn in different colors. Suppose we have to find the count of various levels without displaying the graph; for example, I want to find how many shirts are of medium size. Let us use the statement shirt_sizes == "M", where M stands for medium size. When we execute the statement we get the result as a logical vector: the value TRUE is displayed wherever the value M is found in the vector. We now use the sum function and pass shirt_sizes == "M". When we execute the statement we get the result 4. In the same way we can count how many shirts of XXL size are present; we get the result 2. When we use barplot, we need to pass a table as the argument. We can also use the plot function, which takes a factor as an argument. plot is a generic function, and for a factor it displays a graphical summary. When we execute the statement we get a basic bar graph, the same as the one generated by the barplot function. We can also customize the graph using the col and ylab parameters of the plot function. Let us use the col parameter and assign the values blue, green, red, yellow and black, and assign count as the label to the ylab parameter. We can also create a pie chart, which is a circular statistical graphic. We use the pie function and pass the table as the argument, in this case shirts table. When we execute the statement we get a basic pie chart, which illustrates the numerical proportions of the categorical data. We can customize the pie chart by using the col parameter. When we execute the statement we get the pie chart displayed in different colors. Let us consider
another example. We create a vector age, assign various values and convert it into a factor. In the next statement we use the table function and pass it as the argument. When we execute the statement, the table is displayed. In the next statement we use the levels function and pass age as the argument. When we execute the statement, the levels are displayed; in this case there are four levels. Assume that this qualitative data represents age groups. So let us use the levels function with age as the argument, and using the assignment operator we assign a vector of labels. We use the c function, and let us take the first age group as less than 14, the next as 15 to 24, the third as 25 to 34 and the last as greater than 35. We now call the table function with age as the argument. When we execute the statement, you can see that the levels now show the labeled values. We can now use the barplot function to graphically analyze the qualitative or categorical data. We can also customize the bar plot using the col and ylab parameters. When we execute the statement we get the following result. We can also use the pie function to display the pie chart, and customize it using the col parameter. You can observe that the numerical proportions of the categorical data are now displayed with the labels assigned through the levels function, which show the age groups. So this is how we can visualize qualitative data in the R programming language.

In this video we will talk about how to handle quantitative data, that is, how to handle continuous or numerical data. Quantitative data can be counted, measured and expressed using numbers, whereas qualitative data is descriptive and conceptual. Contrary to qualitative data, quantitative data is statistical and is typically structured in nature. This type of data is measured using numbers and values, which makes it more suitable for data analysis. The
qualitative data is open for exploration, whereas quantitative data is much more concise. Quantitative data will almost always be considered structured data. This type of data is formatted so that it can be quickly organized and searched within relational databases. Perhaps the most common example of structured data is the numbers and values found in spreadsheets, and it is generally preferred for data analysis. Quantitative data can be obtained from experiments or measurements; a good example of quantitative data is a company's stock price.

Let us practically look at how to handle quantitative data in the R programming language. Let us create an object songs, and using the assignment operator and the c function let us provide the data values 5.3, 3.6, 5.5, 4.7, 6.7, 4.3, 4.3, 8.9, 5.1, 5.8 and 4.4. Let us say that these data items are the lengths of songs in minutes. In the next statement we display the values stored in the vector songs. Let us execute these statements. You can see that songs is created as a vector with all elements numeric. Let us consider another example of quantitative data. Let us create an object ratings, and using the assignment operator and the c function we provide the ratings as 2, 4, 3, 3, 2, 1, 1, 2, 3, 4 and so on, ending with 2, 4. In the next statement we display the values stored in the vector ratings. When we execute these statements, you can see that ratings is a numeric vector. So these two objects, songs and ratings, now store quantitative data, represented as numerics.

In this video we will look at how to visualize quantitative data. We can use histograms: a histogram is a bar graph of raw data that creates a picture of the distribution. The histogram represents the frequency of occurrence by classes of data. A histogram shows basic information
about the data set. We can also use box plots: a box plot is a standardized way of displaying the distribution of the data based on a five-number summary, that is minimum, first quartile, median, third quartile and maximum. It also displays the outliers. We can also use a strip chart, which produces one-dimensional scatter plots, or dot plots, of the given data. These plots are a good alternative to box plots when the sample size is small.

Let us now practically look at how to visualize quantitative data. We will use the example of ratings as quantitative data. Let us create an object ratings, and using the assignment operator and the c function we assign values to the ratings vector. Let us say that the elements of the ratings vector represent the reviews or ratings of a product given by different users. In the next statement we use the length function and pass the vector, ratings in this case, as the argument. The length function returns the number of items or elements in the ratings vector. In the next statement we use the summary function, which is a generic function producing summary results for the sample, passing the vector ratings as the argument. Let us execute these statements. You can see that the length of the ratings vector is 22, and the summary for the ratings sample shows the minimum, maximum, median, mean, and first and third quartiles. We will now visualize the quantitative data using the hist function, which computes the histogram for the given sample, that is ratings. The argument we pass is the ratings vector. Let us execute the statement. You can see that a histogram is displayed with ratings on the x-axis and frequency on the y-axis. We can also customize the histogram by using the col parameter and assigning the value blue. When we execute the statement, the bars of the histogram are drawn in blue. We can also plot the probability density by using the prob parameter and assigning the value
as TRUE. When we execute the statement, a histogram is displayed with ratings on the x-axis and density, instead of frequency, on the y-axis. We will now plot the histogram for frequency as well as the probability density function on the same graphical layout. To do this, we use the hist function to display the frequency histogram, and in the next statement we use the lines function, passing as the first argument the density function applied to ratings, and as the second argument the col parameter with the value red. When we execute these statements, the histogram is plotted for frequency and the red line displays the values computed for the probability density function. We will now plot the probability density histogram by using the probability parameter and assigning the value TRUE. When we execute these statements, we get the bars of the histogram and the red line displaying the computed density function. We can also visualize the quantitative data using box plots. To do this we use the boxplot function and pass ratings, which is a vector, as the argument. When we execute the statement, the box plot is displayed, showing the minimum, maximum, median, and first and third quartiles for the ratings data. We can also use the plot function, passing the ratings vector as the argument. When we execute the statement we get the plot. We will also display the strip chart, which produces one-dimensional scatter plots. The argument we pass is the ratings vector. When we execute the statement, the strip chart is displayed. We can also customize the strip chart by using the col parameter, assigning the value red, and the pch parameter, which denotes the plotting character; in this case we assign the value 15. When we execute the statement, the points are displayed as red blocks. When we use the stripchart function, the values of the ratings vector are overlaid on one another. Let us use the method
parameter of the stripchart function and assign the value jitter. When we execute the statement, the strip chart is displayed with the data points jittered. We can also assign stack as the value of the method parameter of the stripchart function, and when we execute the statement the data points on the strip chart are stacked. So these are the different ways in which we can visualize quantitative data.

Let us now practically visualize real-world stock prices as quantitative data. We will use the GE stock price CSV data set, where the first column is the date and the second column is the price on that date. We use the library function to load the dplyr package. In the next statement we create an object, GE data, and using the assignment operator and the read.csv function we pass the GE stock .csv file as the argument. In the next statement we create GE price as an object, and using the assignment operator and the select function from the dplyr package we pass the first argument as the data object, in this case GE data, and the second argument as price, which is the name of the column in the GE stock .csv data set. Let us execute these statements. In the next statement we use the summary function, which is a generic function producing summary results for the sample, passing GE price as the argument. When we execute the statement we get the summary statistics of the GE stock data, showing the minimum, the maximum, the median, the mean, the first quartile and the third quartile. We will now visualize the GE stock price quantitative data using the hist function, which computes the histogram for the given sample, in this case the GE stock price. The argument we pass is the price vector, obtained using the as.vector function. Let us execute these statements. You can see that a histogram is displayed with price on the x-axis and frequency on the y-axis. We will also plot the probability density by using the prob parameter and
assigning the value TRUE. When we execute the statement, a histogram is displayed with price on the x-axis and density, instead of frequency, on the y-axis. We can also use the col parameter of the hist function and assign the value blue. When we execute the statement we get the customized histogram for the GE stock price data. We will now plot the density histogram as well as the probability density function on the same graphical layout. To do this we use the hist function to display the density histogram, and in the next statement we use the lines function, passing the density function as the first argument and, as the second argument, the col parameter with the value red. When we execute these statements, the histogram is plotted for density and the red line displays the values computed for the probability density function. If you observe, the line displayed for the probability density function runs out of the plot. We use the y-limit parameter, ylim, which sets the y-axis limits, and assign a vector built with the c function, passing the values 0 and 0.025. When we execute these statements we get the bars of the histogram plotted in blue and the red line displaying the computed density function.

In this video we will look at different R functions that can be applied to quantitative data. Let us consider the same vectors, songs and ratings, which store the quantitative values. R provides a function length, which takes one parameter, the object; in this case it is songs. This function returns the number of elements present in a vector. When we execute the statement, the length of the songs vector is 11; that means 11 data items are stored in the vector songs. Let us apply the same length function to the ratings vector. When we execute the statement, the length of the ratings vector is 22. R provides a max function
through which we can find the largest element in a vector. The max function also takes one parameter, the vector; in this case it is songs. When we execute the statement we get the result 8.9, which is the largest element in the vector songs. In the same way we find the maximum element in the ratings vector; we get the result 4. We can also use the min function, which operates on vectors to find the smallest element; in this case min of songs returns 3.6. R also provides a function sum, which gives the total of the elements in the vector; for example, sum of the vector songs gives the result 58.6. The prod function gives the product of the elements in a vector; in this case prod of songs gives the result shown in the output. The elements of a vector can also be sorted in increasing or decreasing order using the sort function. For example, sort of the vector songs gives us the vector in ascending order. We can also sort the vector in descending order using the decreasing parameter and assigning the value TRUE. Executing the statement, you can see that the songs vector is sorted in descending order.

In this video we will look at how to compute the mean in the R programming language, as well as the different types of means. The mean is one of the measures of central tendency. There are three types of mean: the arithmetic mean, sometimes just called the mean, the geometric mean and the harmonic mean. The arithmetic mean of a sample is the sum of all values divided by the sample size. Here is the mathematical formula to compute the arithmetic mean, mean = (x_1 + x_2 + ... + x_n) / n, where n is the sample size, or the number of items or elements in the sample, and this is the summation symbol. Here x_i is the i-th element of the sample, where i runs from 1 to n. Let us look at a simple example. Here we have a sample x with the following elements; each element is represented as x_1, x_2, up to x_8. Here the sample size is 8, that is, n is 8, the number of elements in the sample. Now the
summation of the elements of the sample from i = 1 to n, or in this case from i = 1 to 8, is 30. So the mean is the sum of all the values, which is 30, divided by the number of elements in the sample, which is 8; so we have 30 divided by 8, which is 3.75. Let us now look at how to practically compute the arithmetic mean in the R programming language. Let us create an object songs, which is a vector; this vector contains some elements. In the next statement we display the values stored in the object songs. Let us execute these statements. We now create an object m, which will store the computed value of the mean. With the assignment operator we use the sum function, passing songs as the argument; in this case the sum function returns the sum of all the elements in the vector songs. We then use the division operator and divide the sum by the length of the vector songs, using the length function. The length function returns the length of the vector, or the number of elements in the vector. In the next statement we display the value stored in the object m. When we execute these statements we get the value of the mean. Let us say, for example, that the vector songs contains the lengths in minutes of different songs. To answer the question of what average time I need to spend to listen to a song, the mean is used, and here is the answer: I need an average of 5.327 minutes to listen to a song. R also provides statistical functions such as mean, which computes the mean of the elements in a vector; this is the same as sum of x divided by length of x. When we execute the statement we get the same result.

Let us now talk about the geometric mean. The geometric mean is defined as the n-th root of the product of all the data values. Here is the mathematical formula to compute the geometric mean, GM = (x_1 * x_2 * ... * x_n)^(1/n), where n is the sample size, or the number of items or elements in the sample, and this is the mathematical product symbol. Here x_i is the i-th element of the sample, where i runs from 1 to n. Let
us look at a simple example. Here we have a sample x with the following elements; each element is represented as x1, x2, up to x8. Here the sample size is 8, that is, n is 8, which is the number of elements in the sample. Now the product of the elements of the sample from i = 1 to n, or in this case i = 1 to 8, is 11520. So the geometric mean is the nth root, in this case the 8th root, of the product of all the values, which is the 8th root of 11520, and the answer is 3.2187.

Let us now look at how to compute the geometric mean in practice in the R programming language. Let us create an object songs, which is a vector containing some elements. In the next statement we display the values stored in the object songs. We will now create an object gm, which will store the value of the geometric mean. With the assignment operator, we use the prod function, passing songs as the argument; the prod function returns the product of all the elements in the vector songs. We then use the exponent operator and raise the product to the power of 1 divided by the length of the vector songs, using the length function, which returns the length of the vector. In the next statement we display the value stored in the object gm. When we execute the statements, we get the value of the geometric mean. Computing the geometric mean with the prod function is not a robust approach, since the product of all the elements of the vector may be a very large value. A better way to compute the geometric mean is therefore to use logarithms. We use the exp function, which computes the exponential; as its argument we call the mean function, and inside the mean function we call the log function, passing the songs vector as the argument. In the next statement we display the value stored in the gm object. When we execute these statements, you can find that the geometric mean computed is the same as above.
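The two ways of computing the geometric mean described above can be sketched in R as follows. The sample values are an assumption: the transcript does not list them, so the vector below is a hypothetical eight-element sample chosen only because its product is 11520, as in the worked example. The object names x, gm_prod and gm_log are illustrative as well.

```r
# Hypothetical sample (assumed values, chosen so that the product is 11520
# as in the worked example; the actual sample is not given in the transcript).
x <- c(1, 2, 3, 3, 4, 4, 5, 8)

# Direct definition: n-th root of the product of all elements.
gm_prod <- prod(x)^(1 / length(x))

# Log-based form: exponential of the mean of the logarithms.
# This avoids building one very large (or very small) intermediate product.
gm_log <- exp(mean(log(x)))

print(gm_prod)
print(gm_log)
```

Both forms assume strictly positive values. For long vectors the direct product can overflow to Inf, which is the practical reason for preferring the log-based form.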
In this video we will look at the applications of the geometric mean. The geometric mean is most commonly applied in business and financial applications, such as calculating growth rates or computing the returns on a portfolio of securities. Let us take an example of growth rates. The growth rate calculation for stocks is sometimes also referred to as the compounded annual growth rate. Let us consider a stock of a company which grows by 10 percent in the first year. Assume that we have a hundred dollars invested, so growing by 10 percent gives 100 plus 10, that is, a stock value of 110. In the second year the same stock declines by 20 percent, so now we have 100 minus 20, which gives 80. And let us assume that in the third year the stock grows by 30 percent, so we have 100 plus 30, which is equal to 130. We can use the geometric mean to compute the compounded annual growth rate, so the growth rate is the cube root of the product of 110, 80 and 130. In this case our vector contains three elements, 110, 80 and 130, so the length of the vector is 3. Computing the geometric mean, we get 104.586. Since we are expressing the annual growth rate as a percentage, the value 104.586 minus 100 gives us 4.586 percent. Now we can say that the annual compounded growth rate of the stock is 4.586 percent.

Let us see in practice how the growth rate can be computed in the R programming language. Let us consider a vector stock. Using the assignment statement and the c function, we give it three elements: the first element is 100 plus 10, the second element is 100 minus 20, and the third element is 100 plus 30. So now the vector is 110, 80 and 130. In the next statement we will create an object gmean to compute the geometric mean, using the assignment statement and the prod
function. We pass stock as the argument to the prod function, followed by the exponent operator and then, within parentheses, 1 divided by the length of the stock vector. In the next statement we display the value stored in the gmean object. When we execute these statements, you can find that we get the result 104.586. Since the annual growth rate is expressed as a percentage, we add a statement gmean minus 100, and when we execute it we get the result 4.586 percent. So we can say that the stock has grown by an average of 4.586 percent per year over the three years. This is how the geometric mean can be used to compute the annual growth rate of stocks.

Let us now talk about the harmonic mean. The harmonic mean is the reciprocal of the mean of the reciprocals of a sample. Here is the mathematical formula to compute the harmonic mean, where n is the sample size, or number of elements in the sample, and this is the mathematical summation symbol. Here x_i is the ith element of the sample, where i runs from 1 to n. The sum of the reciprocals is divided by n, and 1 is then divided by that result to get the harmonic mean. The condition to take care of in this equation is that the value of x_i should always be greater than 0. If x_i is 0, then 1 divided by x_i is undefined, or infinite, and the harmonic mean cannot be computed. Let us look at a simple example. Here we have a sample x with the following elements; each element is represented as x1, x2, up to x8. Here the sample size is 8. Each element of the sample is replaced by 1 divided by its value, and then the summation is computed; we get the result 2.991. This sum, 2.991, is divided by the number of elements in the sample, which is 8, so we have 2.991 divided by 8, which is 0.374. The harmonic mean is 1 divided by this value of 0.374, which gives the result 2.674.

Let us now look at how to compute the harmonic mean in practice in the R programming language. The harmonic mean is 1 divided by the mean of 1 divided by songs. Remember that songs is a vector, so 1 divided by songs is computed as 1 divided by each element, and then the mean is calculated. In the next statement we display the value stored in the object hm. When we execute these statements, we get the value of the harmonic mean, which is 5.0377. Let us compare the values of all the means computed. We use the mean function and pass songs as the argument. Let us execute these statements. You can find that the arithmetic mean of the sample songs is 5.3272, the geometric mean is 5.1716, and the harmonic mean is 5.0377. So this is how we can compute the arithmetic mean, geometric mean and harmonic mean in the R programming language. The harmonic mean helps to find multiplicative or divisor relationships between fractions without worrying about common denominators. The harmonic mean is often used for averaging things like rates, for example the average travel speed over different trips. The harmonic mean is also used in finance to average multiples such as the price-to-earnings ratio, and in other financial calculations.

In this video we will look at other measures of central tendency, the median and the mode. The median is one of the measures of central tendency. The median is the middle number in a sorted list of numbers. To determine the median value in a sequence of numbers, the numbers must be sorted, or arranged in value order from lowest to highest or highest to lowest. The median can be used to determine an approximate average, or mean, but it is not to be confused with the actual mean. If there is an odd number of elements, the median value is the number in the middle, with the same amount of numbers below and above it, or to its right and left. If there are
an even number of elements in the list, the middle pair must be determined, added together, and divided by two to find the median value. The median is sometimes used instead of the mean when there are outliers, because outliers in the sequence might skew the average of the values; the median of a sequence is less affected by outliers than the mean. Let us consider a simple example. Here we have seven elements. We need to sort these elements in ascending or descending order. After sorting the elements, we find the middle value, which in this case is 4, so here the median is 4. Let us consider another example. After sorting the elements we get two middle values, in this case 3 and 4, so the median is 3 plus 4, that is 7, divided by 2, and the result is 3.5.

Let us now look at another measure of central tendency, the mode. The mode is the number that appears most frequently in a set. A set of numbers may have one mode, more than one mode, or no mode at all. The mode can be the same value as the mean or median, but this is not always the case. For a normal distribution, the mode is the same value as the mean and the median. Let us consider a simple example. In this case we have a sample of elements. If you observe, the element 2 occurs the greatest number of times, so the mode of this sample is 2. Let us consider another sample. In this case the element 2 occurs twice and the element 4 also occurs twice, so here we have two modes, element 2 and element 4. Let us consider another example. In this sample none of the elements occurs more than once, so this sample has no mode.

Let us now look at how to compute the median in practice in the R programming language. We will use an example of songs as quantitative data. Let us create an object songs, and using the assignment operator and the c function we assign values to the songs vector. The elements of the songs vector represent the length, or duration, of each song in
minutes. In the next statement we use the mean function and pass the vector songs as the argument; this statement computes the arithmetic mean of the songs vector. In the next statement we use the median function, which computes the median of a vector, passing songs as the argument in this case. Let us execute these statements. You can find that the mean of the vector songs is 5.327 and the median of the vector songs is 5.1. Let us look at another example. We create a vector for the ratings of a product. In the same way, we use the mean function, passing the ratings vector as the argument, and in the next statement we use the median function, passing ratings as the argument. Let us execute these statements. You can find that the mean of the vector ratings is 2.59 and the median of the ratings vector is 3. So this is how the mean and median are computed in the R programming language. There is no built-in function in R to compute the mode as a measure of central tendency, since the mode is rarely used in statistical analysis. Do remember that a mode function is defined in the R programming language, but this mode function returns the type, or storage mode, of an object and does not compute the statistical mode.

In this video we will talk about outliers. Outliers are data values that deviate strongly from the rest of the data, and they can heavily affect the data analysis. Let us now look at outliers in practice. We will use an example of salary as quantitative data. Let us create an object salary, and using the assignment operator and the c function we assign values to the salary vector. Let us consider that the elements of the salary vector represent the yearly salaries, in thousands, of the employees in an organization. In the next statement we display the elements stored in the salary vector. We use the mean function and pass the vector as the argument, which is salary in this case; this mean
function computes the arithmetic mean of the employees' salaries. In the next statement we use the median function, which computes the median of a vector, passing the vector as the argument, in this case salary. Let us execute these statements. You can find that the mean salary is 16.667 and the median salary is 17. If you observe, in this case there is not much difference between the mean and the median. Let us also consider the salary of the president of the company. Let us say that he is highly paid, at ninety thousand, so the value is 90. We execute these statements once more. You can find that now the mean salary is 27.142 and the median is 18. You can observe that there is a very large difference between the mean and the median. This is because the mean is strongly influenced by the outlier, which in this case is the value 90, the salary of the president of the company. Looking at the result, it would be very wrong to estimate that the average salary of the employees is 27.142, since the salaries of all the employees except the president are well below that mean.

Let us analyze the results graphically. We use the boxplot function and pass salary as the argument. When we execute the statement, we can observe that the boxplot displays the median, and we can also observe the outlier, which is displayed as a small circle. Let us remove the outlier from the vector and execute the statement again. You can observe that the boxplot displays the median and there are no outliers. We now include the outlier again, in this case the value 90. In the boxplot function we use the range parameter and assign it the value 0. When we execute these statements, you can find that the whiskers now extend to the extreme data points, so the outlier is included in the box plot rather than drawn separately. Let us again use the mean function, passing salary as the argument. The mean function also has trim as a parameter; we will assign the
value 0.1. When the trim parameter is used, that fraction of the observations is trimmed from each end of the vector; in this case we are trimming a fraction of 0.1. When we execute the statement, you can find that the trimmed mean is the same as above, 27.142 (with only seven values, a fraction of 0.1 amounts to less than one observation per end, so nothing is actually removed). When the trim fraction is 0.5, we get the same result as the median. When we change the trim fraction to 0.2, we get the result 17.6, which is much better than the untrimmed mean: a fraction of 0.2 of the observations is trimmed from each end, and in this case the influence of the outlier becomes negligible. In this video we have seen how outliers affect, or influence, the measures of central tendency, the mean and the median. Do remember that the fraction given to the trim parameter of the mean function lies between 0 and 0.5.

In this video we will talk about quartiles and quantiles, which are another kind of statistical measure. A quartile is a statistical term describing a division of the observations into four intervals based on the values of the data and how they compare to the entire set of observations. To understand quartiles, it is important to understand the median as a measure of central tendency. The median in statistics is the middle value of a set of numbers; it is the point at which exactly half of the data lies below and half above the central value. Quartiles measure the spread of values above and below the median by dividing the distribution into four groups. Three points, the lower quartile, the median and the upper quartile, divide the data set into four groups. Quartiles are used to calculate the interquartile range, which is a measure of variability around the median. The median is a robust estimator of location, but it says nothing about how the data on either side of its value is spread or dispersed. This is where quartiles are used. The quartiles measure the spread of values above and below the median by dividing the distribution
into four groups. Each quartile contains 25% of the total observations. Generally the data is arranged from smallest to largest. The first quartile represents the lowest 25% of the numbers. The second quartile lies between 25.1% and 50%, that is, up to the median. The third quartile lies between 51% and 75%, above the median, and the fourth quartile is the highest 25% of the numbers. Let us look at an example of sample data to compute the quartiles. We have seven elements in a sample data set. This data set is sorted, so the first element is the minimum value and the last element is the maximum value. The middle element is the median. The values below the median, including the minimum value, are used to compute the first quartile, which in this case is the average of two numbers, and the result is 2.5. The values above the median, including the maximum value, are used to compute the third quartile, which in this case is the average of two numbers, which is 5.5.

Let us now look at quartiles in practice. We will use an example of songs as quantitative data. Let us create an object songs, and using the assignment operator and the c function we assign values to the songs vector. Let us consider that the elements of the songs vector represent the duration, or length, of the songs in minutes. In the next statement we use the mean function and pass the vector as the argument, which is songs in this case; this mean function computes the arithmetic mean of the durations of the songs. In the next statement we use the median function, which computes the median of a vector, passing the vector as the argument, in this case songs. We then use the summary function and pass songs as the argument. The summary function is a generic function which displays a summary of the quantitative data. Let us execute these statements. You can find that the mean is 5.327 and the median is 5.1. The summary function displays a brief summary of the data, that is, the minimum value, maximum value, median, mean,
first quartile and third quartile. We also have the quantile function, which produces quantiles corresponding to the given probabilities. The first argument is the vector, songs in this case, and the second argument is the probabilities, in this case 0.25 and 0.75. When we execute the statement, we get the values of the 25% and 75% quantiles. In this video we have seen how quartiles and quantiles are computed. Do remember that the probabilities passed to the quantile function lie between 0 and 1.

In this video we will talk about variance and standard deviation, which are measures of variation and spread. Measures of variation are used to measure variability in the data, or to quantify the accuracy of statistical parameters. The most common measure of variation is the variance, together with its square root, the standard deviation. Variance in statistics is a measurement of the spread between numbers in a data set; that is, it measures how far each number in the data set is from the mean, and therefore from every other number in the set. The standard deviation is a statistic that measures the dispersion of the data set relative to its mean, and is calculated as the square root of the variance. In finance, the standard deviation is a statistical measure that, when applied to the annual rate of return of an investment, sheds light on the historical volatility of that investment. The greater the standard deviation of a security, the greater the variance between each price and the mean. For example, a volatile stock has a high standard deviation, while the deviation of a stable stock is usually rather low. The variance is calculated by first finding the deviation of each element in the data set from the mean, then squaring it, summing the squares, and finally dividing the result by the degrees of freedom, which is the sample size minus one. This is the mathematical equation for computing the variance, where x_i is the ith element in the data set, or sample, and x-bar is the mean of the
data set, or sample, and n minus 1 is the degrees of freedom, which is the sample size minus 1.

Let us now look in practice at how to calculate the variance and standard deviation. We will use an example of songs as quantitative data. Let us create an object songs, and using the assignment operator and the c function we assign values to the songs vector. Let us consider that the elements of the songs vector represent the duration, or length, of the songs in minutes. In the next statement we use the var function and pass the vector as the argument, which is songs in this case; this var function computes the variance of the durations of the songs. In the next statement we use the sd function, which computes the standard deviation of the sample, passing the vector as the argument, in this case songs. Let us execute these statements. You can find that the variance is 2.1301 and the standard deviation is 1.4595. We will take another example; this time we will use ratings as the quantitative data. Let us create an object ratings, and using the assignment operator and the c function we assign values to the ratings vector. Let us consider that the elements of the ratings vector represent the ratings of a product given by different users. In the next statement we use the var function and pass the vector as the argument, which is ratings in this case; this var function computes the variance of the ratings. In the next statement we use the sd function, which computes the standard deviation of the sample, passing the vector as the argument, in this case ratings. Let us execute these statements. You can find that the variance is 1.1103 and the standard deviation is 1.0537. If you compare the variances computed for songs and for ratings, the variance for songs is greater than that for ratings, which means that there is more variation within the songs data sample than within the ratings data.
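The variance formula described above can be checked against R's built-in functions. This is a minimal sketch under stated assumptions: the vector values are made up for illustration, since the transcript does not list the actual songs data, and the name manual_var is hypothetical.

```r
# Hypothetical song durations in minutes (assumed values, not the video's data).
songs <- c(4.2, 5.1, 3.8, 7.5, 6.0, 5.4)

n <- length(songs)

# Sample variance by hand: squared deviations from the mean,
# summed and divided by the degrees of freedom (n - 1).
manual_var <- sum((songs - mean(songs))^2) / (n - 1)

# Built-in equivalents.
v <- var(songs)  # sample variance
s <- sd(songs)   # sample standard deviation, the square root of the variance

print(manual_var)
print(v)
print(s)
```

The hand-computed value and var agree because var also divides by n - 1 rather than n.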
The same difference can be observed in the standard deviations of the two samples.

Let us now look at an example with real stock prices. We will use the dplyr package, loading it with the library function and passing dplyr as the argument. In the next statement we create GE data as an object, and using the assignment operator and the read.csv function we fetch the stock price data of the GE stock by passing the name of the GE stock CSV file. If we view the GE stock CSV file, you can find that it consists of a date and the stock price on that date. In the next statement we create an object GE price, and using the assignment statement and the select function we read the price data by passing GE data as the first argument and price as the second argument, which is the second column in the data set. In the same way, we create an object IBM data, and using the read.csv function we read the data from the IBM stock CSV file. In the next statement we have the IBM price object, and using the select function we pass IBM data as the first argument and the price column as the second argument. Before we execute these statements, let us set the working directory to the directory where we have the above CSV files. Let us execute these statements. We will use the var function, which returns the variance, passing GE price as the argument, and in the next statement we call the var function with IBM price as the argument. Let us execute these statements. You can find that the variance of the GE stock price is 575.6425 and the variance of the IBM stock price is 7712.717. The greater the variance of a stock price, the greater the volatility of the stock, so here the IBM stock price is more volatile than the GE stock price. In the same way, we will compute the standard deviation of the GE and IBM stock prices using the sd function. The argument that needs to be passed to the sd
function is a vector, and in this case GE price and IBM price are data frames, so we need to convert the data frames to vectors using the as.vector function. When we execute these statements, we get the standard deviation of the GE stock price, which is 23.9925, and for the IBM stock price we get the result 87.8220. Here also you can observe that there is a huge difference between the standard deviations of the two stock prices, and the IBM stock price is more volatile than the GE stock price.

In this video we will talk about correlation and covariance. Correlation, in the finance and investment industries, is a statistic that measures the degree to which two securities, or stock prices, move in relation to each other. In finance, correlation can measure the movement of a stock against that of a benchmark. Correlation measures association, but it does not tell us which factors drive the correlation among stock prices. The correlation is computed using this mathematical formula, where r is the correlation coefficient and x-bar and y-bar are the means of the two sets of observations. The correlation coefficient can be positive or negative, but its value will always be between minus 1.0 and plus 1.0. A positive correlation means the stock prices move in the same direction. When the correlation coefficient is 1, it is a perfect positive correlation: as one security, or stock price, moves either up or down, the other moves in the same direction. A negative correlation means the stock prices move in opposite directions. A perfect negative correlation means that the two assets, or stock prices, always move in opposite directions, while zero correlation implies that no linear relationship exists at all between the stock prices.

Let us now look in practice at how to compute correlation and covariance. To get a better understanding of the concept of correlation, we will use sample vectors to perform
correlation. Let us create an object x and assign some elements, let's say 10, 20, 30, 40 and 50. We will also create another object y, and let's say that we assign it the same elements. We will now use the cor function, which computes the correlation between two samples, in this case sample x and sample y, or vector x and vector y. We pass the vectors, in this case x and y, as the arguments to the cor function. When we execute these statements, you can find that the correlation of sample x and sample y is 1. This is because the elements are the same in both samples, so there is a perfect positive correlation. Let us look at another scenario. Let's assume that the elements of sample y are 50, 40, 30, 20 and 10. When we execute these statements, you can find that the correlation is minus 1. This is because the elements in sample x and sample y are in exactly reverse order, so there is a perfect negative correlation. Let us now assign some arbitrary elements to sample x, or in this case vector x, as well as to vector y, or sample y. We use the same cor function, the correlation function, to compute the correlation between the two samples. When we execute these statements, we get the correlation 0.6481. We can observe that there is a positive correlation between the samples. Let us consider another example. We use another set of elements in sample y. When we execute these statements, we get the correlation minus 0.4799. We can observe that there is a negative correlation between these samples.

We will now find the correlation between stock prices. We will use the GE and IBM stock price data sets. The GE data set contains two columns: the first column represents the date and the second column represents the price of the GE stock on that date. Let us import the dplyr package using the library function, passing dplyr as the argument. In the next statement let us create an object my GE data, and using the assignment
operator and the read.csv function we pass the GE stock CSV file as the argument. In the next statement we create GE dates as an object, and using the assignment operator and the select function we pass my GE data as the first argument and date as the second argument, which is the column name in the data set. In the next statement we create another object, GE price, and using the assignment operator and the select function we pass my GE data as the first argument and price as the second argument, which is the column name in the data set. In the next statement we read the IBM stock price data. Let us create an object my IBM data, and using the assignment operator and the read.csv function we pass the IBM stock CSV file as the argument. In the next statement we create IBM dates as an object, and using the assignment operator and the select function we pass my IBM data as the first argument and date as the second argument, which is the column name in the data set. In the next statement we create another object, IBM price, and using the assignment operator and the select function we pass my IBM data as the first argument and price as the second argument, which is the column name in the data set. Before we execute these statements, we set the working directory to the directory where we have the CSV files. Let us execute these statements. We will now use the cor function, which computes the correlation between the two stock prices. In this case the first argument we pass is GE price and the second argument is IBM price. When we execute the statement, you can find that the correlation between the stock prices of GE and IBM is computed, and the value is 0.1098. If the data passed to the correlation function contains null or NA values, the correlation function will not give a usable result. Since the data set may be very large, we can use the use parameter and assign the value as
complete.obs to compute the correlation when there are NA values in the data set. When we execute the statement, you can find that we get the same correlation result, since our data set of stock prices does not contain any null or NA values. There are different ways of computing correlation. We can use the method parameter and assign the value spearman. When we execute the statement, you can find that we get the value 0.1665, which is different from the earlier correlation value. This is because by default the correlation function uses Pearson's correlation.

Using the R programming language we can also carry out a correlation test using different methods. Since correlation tests are performed on vectors, we create a vector, GE price vector, and using the as.vector function we pass the price column of GE price, selected with the dollar operator, as the argument. We also create another vector, IBM price vector, and using the as.vector function we pass the price column of IBM price as the argument. We will now use the cor.test function, which computes the correlation test, passing GE price vector as one argument and IBM price vector as the other. We also use the method parameter and assign the value pearson, which is the method used to compute the correlation. When we execute these statements, you can find that the results of Pearson's correlation test are displayed, where the p-value is 0.01607 and the correlation value is 0.1098. We can also perform the correlation test using the Spearman method. When we execute the statement, you can find that we get the correlation value 0.1665. In the same way, we can perform the correlation test using the Kendall method, and we get the value 0.1215.

We will now talk about covariance. Covariance is a statistical tool that is used to determine the relationship between the movements of two asset prices. When two stocks tend to move together, they are seen as having a positive covariance. When they move
inversely, the covariance is negative. Covariance is a significant tool in modern portfolio theory, used to ascertain which securities to put in a portfolio. Covariance is calculated by analyzing return surprises, that is, deviations from the expected return, or by multiplying the correlation between the two variables, or samples, by the standard deviation of each sample, or variable. In the R programming language, the covariance is computed using the cov function. The first argument we pass is GE price and the second argument is IBM price. When we execute the statement, we get the covariance, which in this case is 231.4354.

In this video we will look at the correlation coefficient and covariance using different stock prices. We will first compare the correlation and covariance between the GE and IBM stock prices. Here we have the same statements which we saw in the previous video. Let us execute these statements. You can find that the correlation between the GE and IBM stocks is 0.1098 and the covariance between the same stocks is 231.435. We will now use the stock price of the Coca-Cola company, whose data set also has date and price as its columns. In the read statement, instead of using the IBM data set, we use the Coca-Cola data set. Let us not worry about the other objects we have used, such as my IBM data, since these are just names and they do not matter. Let us execute these statements, which will now compute the correlation and covariance between the GE and Coca-Cola stock prices. You can observe that now the correlation is 0.1775 and the covariance is 107.2014. The correlation between the GE and IBM stocks is 0.1098, and between GE and Coca-Cola it is 0.1775, which is greater than the previous correlation value. Therefore we can
conclude that GE and Coca-Cola stock prices are comparatively more related than GE and IBM stock prices. In the same way, the covariance between GE and IBM is 231.435, whereas between GE and Coca-Cola it is 107.2014, which is much less than the previous covariance value. Therefore we can conclude that there is a larger variation between GE and IBM stock prices when compared to GE and Coca-Cola stock prices. So this is how we can apply the concepts of correlation and covariance to stock price data or financial data.

In this video we will talk about the basic approaches of dealing with two-variable data, starting with bivariate qualitative data. A great deal of statistical analysis is based on describing the relationship between two data variables. For example, how are the height and weight of humans related? How does heart disease differ between men and women? Or we can ask how the planting of seeds alters crop yield. Bivariate data may consist of two qualitative variables, or one qualitative and one quantitative variable, or two quantitative variables.

Let us now practically look at how to handle bivariate qualitative data. We will use an example of ratings data. Let us create an object, ratings, and using the assignment operator and the c function we will assign values to the ratings vector. Let us consider that the elements of the ratings vector represent the reviews or ratings of a product given by different users. In the next statement we will convert the ratings vector into a ratings factor, using the ratings object with the assignment operator and the factor function, passing ratings as the argument. We will use another example of courses data. Let us create an object, courses, and using the assignment operator and the c function we will assign values to the courses vector. Let us consider that the elements of the courses vector represent courses in terms of zeros and ones. In the next statement we will convert the courses
vector into a courses factor, using the courses object with the assignment operator and the factor function, and passing courses as the argument. Let us execute these statements. Let us now assign the names of the courses to the courses factor using the levels function. We will use the assignment operator, and using the c function we will have the first element as R and the second element as Python. So the value 0 in the courses vector is now represented by the string R, and the value 1 in the courses vector is represented by the string Python.

In the next statement we will use the table function, passing the first argument as ratings and the second argument as courses. The table function builds a contingency table of the counts at each combination of factor levels. In this case the contingency table is built for the counts of each ratings and courses combination. When we execute these statements, you can find that the values of the ratings factor represent the rows of the table, since ratings is passed as the first argument, and the values of the courses factor represent the columns of the table. The values in the table represent the count of each rating for the corresponding course. Now suppose we pass courses as the first argument and ratings as the second argument of the table function and execute this statement. You can find that courses are represented as rows and ratings are represented as columns.

We will now use the barplot function to visualize the bivariate qualitative data. For the barplot function, the argument we will use is the table function, and for the table function we will pass ratings and courses as the arguments. When we execute this statement, we get the bar plot where the x-axis represents the courses and the y-axis represents the count of the ratings. We will customize this bar plot by using the col parameter and assigning four values as a vector, such as blue, yellow, red and green. When we execute this statement, you can find that the bar plot displays each rating with a different color. Suppose we
pass courses as the first argument and ratings as the second argument and execute the statement. We get the bar plot where ratings are represented on the x-axis and the counts on the y-axis are split by course, with each course displayed in a different color. Here, even though we are passing four colors, the bar plot displays only two colors, because the courses factor only contains two levels, that is, zeros and ones, which are represented as the strings R and Python respectively. We will use the legend.text parameter and assign the value as TRUE. When we execute this statement, you can find that a legend is displayed which shows the color of each course. Now suppose we pass ratings and courses as the first and the second argument of the table function and execute this statement. You can find that now the legend displays the colors for the values of the ratings factor. The barplot function also provides the beside parameter. If we assign the value as TRUE and execute this statement, you can find that the bars are plotted side by side instead of as a stacked bar plot.

We will now look at another way of visualizing bivariate qualitative data. We will use the mosaicplot function and pass the argument as a table function, with ratings as the first argument and courses as the second argument. When we execute the statement, you can find that a mosaic plot is displayed, where the x-axis represents the ratings and the y-axis represents the courses. The length or height of each tile in the mosaic plot depends on the count of ratings. We can also customize the mosaic plot by using the col parameter and assigning the value as a vector of colors, such as blue and yellow. When we execute the statement, you can find that the tiles are displayed with yellow for the Python course and blue for the R course of the courses factor. So this is how we can handle bivariate qualitative data in the R programming language.

In this video we will practically look at the approaches of dealing with two
variable quantitative data, or bivariate quantitative data. We will use the GE stock prices data set as the bivariate quantitative data. This data set contains two columns: the first column represents the date and the second column represents the price of GE stock on that date. Let us import the dplyr package using the library function, passing dplyr as the argument. In the next statement, let us create an object, GE data, and using the assignment operator and the read.csv function we will pass the argument as the GE stock .csv file. In the next statement we will create dates as the object, and using the assignment operator and the select function we will pass GE data as the first argument and date as the second argument, which is the column name in the data set. In the next statement we will create another object for price, and using the assignment operator and the select function we will pass GE data as the first argument and price as the second argument, which is the column name in the data set.

We will visualize the bivariate quantitative data using the boxplot function. For the boxplot function we will pass the price object as the argument. When we execute these statements, we can find that a box plot is created which displays the minimum price, the maximum price and the median price, with the first quartile and third quartile as well as the outliers. The boxplot function here only displays the values of univariate data, that is, the data about the price variable only.

We will use the plot function to display a line plot. The first argument we will pass is the price, using GE data dollar price. The next argument we will pass is the label of the x-axis, using the xlab parameter and assigning the value as dates. The next argument we will pass is the label of the y-axis, using the ylab parameter and assigning the value as stock prices. The next argument we will pass is the col parameter, assigning the value as red to display the line in red. The last argument we will use is the
type parameter, assigning the value as l, which represents a line graph. When we execute the statement, you can find that a line graph is plotted where the x-axis represents the dates and the y-axis represents the stock price corresponding to the respective dates.

Although we can approximately find the max as well as the min value of the stock price from the visual plot created, we can also use certain statements in the R programming language to find the same. We will use the max function and pass the argument as GE data dollar price. When we execute the statement, we can find the maximum price at which the GE stock traded. We will also use the which function to get the index of the maximum value of the stock. In the which function we will use GE data dollar price, followed by the double equals sign, which is a relational operator, and call the max function with GE data dollar price as an argument. When we execute the statement, we get the result as 364, which is the index in the GE data object, that is, the data frame. To retrieve the entire row belonging to the index, we will use GE data, which is the data frame, followed by an opening square bracket, then the which expression, and at the end of the statement a comma, to retrieve the entire row, followed by the closing square bracket. When we execute the statement, you can find that we get the max price of the GE stock as well as the date on which the GE stock traded at its maximum price. In the same way, we can use the min function to find the minimum value of GE stock as well as its date. So this is how we can handle bivariate quantitative data in the R programming language.

In this video we will discuss multivariate data. We have discussed data frames in R and have seen how data can be imported. The data from a data set may contain multiple variables that can be imported into a data frame, and this data frame can be used to work with multiple variables. In this video, rather than a full introduction to multivariate methods, we will cover some
of the basic statistical analysis. Let us now practically look at how to perform analysis on multivariate data. We will consider the murders data set, which consists of several variables, the most important of which are population, murders and gun murders. Let us import the dplyr package using the library function, passing dplyr as the argument. In the next statement we will create an object, my data, and using the assignment operator and the read.csv function we will pass the argument as the murders.csv file. Let us display the contents of the my data object. When we execute these statements, you can find that the entire content of the murders.csv data set file is stored in the my data object, which is a data frame.

We will use the str function, which displays the structure of the contents of the my data object. When we execute the statement, you can find that the data set contains 25 observations and 8 variables. The state variable is a factor, the abb variable is also a factor, and the region variable is also a factor, that is, they are qualitative variables. Let us use the summary function, which computes a basic summary of the data. When we execute the statement, you can find that summary results for the state, abb and region variables are displayed. Since these variables are qualitative, their summary statistics are not useful, whereas the summary statistics of population display the computed minimum value, first quartile, median, mean, third quartile and maximum value. In the same way, you can find that the summary statistics of population density, murders, gun murders and gun ownership are displayed.

Let us now use the boxplot function and pass the argument as my data. When we execute the statement, you can find that a box plot for the multivariate data is displayed. Here you can also observe that box plots for qualitative as well as quantitative variables are displayed. In the same way, let us use the plot function to plot scatter plots by passing the my data object as an argument. When we execute the
statement, you can find that a matrix of scatter plots is displayed. Even here you can observe that the scatter plots for state, abb and region are also displayed along with the quantitative variables. The indices of the state, abb and region variables are one, two and three respectively. We will use the pairs function, which also produces a matrix of scatter plots. We will pass the argument as my data, followed by an opening square bracket and then a comma, since we need all the observations or rows of the data frame. Then, for the columns, we will use a negative sign with the c function, and the indices we will pass are 1, 2 and 3. This means include all the columns or variables except the 1st, 2nd and 3rd. When we execute the statement, you can find that a matrix of scatter plots is displayed. You can observe that there is a close correlation between population and murders, as well as a close correlation between murders and gun murders.

Let us create another object, my data sel, and using the assignment operator and the my data object we will select all the quantitative variables from the data set and leave out the qualitative variables such as state, abb and region, which are at index one, two and three respectively. We will even exclude the population variable, which is at index 4. We will use the barplot function and pass the argument as my data sel. Let us execute these statements. You can find that we get an error. This is because the barplot function cannot handle a data frame; the barplot function works with a matrix. Even though a data frame as well as a matrix can be used to store multivariate data, matrices and data frames are fundamentally different in the R programming language. Therefore we will convert the data frame into a matrix by using the data.matrix function and passing the argument as my data sel. Let us display the contents of the matrix, which in this case is my matrix. Let us execute these statements. You can find that the matrix is displayed, but to use the content from this
matrix we need to transpose it. Therefore we will use the t function, which performs the transpose of a matrix, and pass the argument as the my matrix object. Now, instead of passing the data frame as an argument, we will pass my matrix as the argument. Let us execute these statements. You can find that the bar plot is displayed. We will customize this bar plot using the col parameter, assigning four different colors. When we execute this statement, you can find that the bars are displayed with the respective colors. We will even exclude the population density variable, which is at index 5, and use only three colors. We will use the beside parameter and assign the value as TRUE. When we execute the statement, you can find that the bars are plotted side by side. We will use the names.arg parameter and assign the value as my data dollar state, which will display the labels for the x-axis. We will also use the legend.text parameter and assign the value as TRUE. This will display the legend of the colors used in the bar plot. When we execute the statement, you can find that the bars are displayed side by side, with the labels on the x-axis as well as the legend conveying which color is used to display the values of which variable. If you observe, there is a relationship between murders and gun murders in all the states, and there are no bars displayed in green. This is because green is used to display the bars for gun ownership, whose values are very much smaller than the values of the murders and gun murders variables in the data set. So this is how we can perform basic statistical analysis on multivariate data.

In this video we will talk about probability distributions. A probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This range is bounded by the minimum and maximum possible values, but precisely where these possible values fall depends on a number of factors, which include the distribution's mean,
that is, the average, the standard deviation and the skewness. Probability distributions are used in academics and financial analysis, as well as by fund managers and investors, to determine and evaluate the possible expected returns that a stock may yield in the future. There are many different classifications of probability distributions. Some of them include the normal distribution, uniform distribution, chi-square distribution, binomial distribution and the Poisson distribution. Different probability distributions serve different purposes and represent different data generation processes. The binomial distribution, for example, evaluates the probability of an event occurring several times over a given number of trials. A typical example would be to use a fair coin and figure out the probability that the coin comes up heads. A binomial distribution is discrete, as opposed to continuous, since only 1 or 0 is a valid response. The most commonly used distribution is the normal distribution, which is used frequently in finance, investing, science and engineering. The normal distribution is fully characterized by its mean and standard deviation, meaning the distribution is not skewed. This makes the distribution symmetric, and it is depicted as a bell-shaped curve when plotted. Stock returns are often assumed to be normally distributed, but in reality they exhibit kurtosis, with more large positive and negative returns than would be predicted by a normal distribution. Probability distributions are often used in risk management as well, to evaluate the probability and amount of loss that an investment portfolio would incur based on a distribution of historical returns.

Let us look at an example of a probability distribution. We will take a simple example of rolling two six-sided dice. Each die has a one-sixth probability of rolling any single number. So here are the possible outcomes we may get when rolling two six-sided dice. Here the sum of the two dice will form the probability distribution. The outcome
of getting 2 when we roll the dice is (1, 1), so the probability of getting 2 is 1/36. In the same way, the outcome of getting 3 is (1, 2) or (2, 1), so the probability of getting an outcome of 3 is 2/36. The outcome of getting 4 is (1, 3) or (2, 2) or (3, 1), and the probability is 3/36, and so on and so forth. The sum of all the possible outcomes of rolling two dice is depicted here. If you observe, the outcome of getting 7 is the most common outcome, which has a probability of 6/36, and the outcomes 2 and 12, on the other hand, are far less likely, that is, their probability is 1/36. Here we have the bar graph which depicts the probability distribution of rolling two six-sided dice. On the x-axis we have the sum of the two dice, and on the y-axis we have the probability or likelihood of getting the outcome.

Let us look at an example of rolling dice in the R programming language. We will create a vector, dice total, and let us have all the possible results or outcomes we get by rolling two dice. We have the values of the outcomes as 2, 3, 4, 5, 6 and so on up to 12. We will also create a vector, possibilities, which corresponds to the possibilities of getting each outcome or total when two dice are rolled. We will now create a bar plot to understand the distribution. In the barplot function we will pass the possibilities vector and use the col parameter, assigning the value as blue. We will also use the names.arg parameter and assign the vector dice total. Let us execute these statements. You can find that a bar plot is created which displays the probability distribution of the sum of rolling two dice. Probability distributions are one of the core concepts in statistics.

We will now look at random number generation on a computer system to get a feeling for randomness. The R programming language provides the runif function, which works with the uniform distribution. This function, when executed,
returns random numbers. For example, we will use the runif function and pass the argument as, let's say, 5. When we execute the statement, we get 5 uniformly generated random numbers. The r in the function name stands for random and unif stands for uniform distribution. You can observe that the numbers generated are decimal numbers which lie between 0 and 1. The argument we have provided tells the function how many numbers to generate. We can also provide the minimum and maximum values, such that the numbers are generated between the minimum and the maximum. For example, let us pass the arguments as 5, 1, 6, where 5 is the count of numbers to be generated, 1 is the minimum value and 6 is the maximum value. Now suppose we are interested in randomly generating the outcome of rolling a dice. We need to generate whole numbers rather than decimal numbers. To do this, we can convert the output generated by the runif function to integers using the as.integer function. For example, we can have as.integer around the runif function. (Because as.integer truncates the decimals, a maximum of 7 rather than 6 is needed for the value 6 to appear.) When we execute the statement, we get a simulation of rolling the dice five times and their outcomes. If we execute the statement again, you will get a different result or outcome.

In this video we will talk about the uniform distribution. In statistics, a type of probability distribution in which all outcomes are equally likely, that is, each outcome has the same probability of occurring, is known as a uniform distribution. For example, a coin has a uniform distribution, because the probability of getting either heads or tails in a toss is the same: the coin flip returning a head or a tail both have a probability of 0.5. Another example is a deck of cards: drawing a heart, club, diamond or spade is equally likely. There are two types of uniform distribution: the discrete uniform distribution and the continuous uniform distribution. The possible results of rolling a dice provide an example of a discrete uniform distribution.
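The dice simulation and the discrete uniform idea described above can be sketched in R as follows. This is only a minimal sketch under stated assumptions, not the course's own script: the helper name roll_dice, the seed and the number of rolls are made up for illustration. It assumes an upper bound of 7 in runif so that truncation yields the values 1 to 6, and it shows the base R sample function as an alternative.

```r
# Hypothetical helper (not from the course): simulate n rolls of a fair six-sided die.
roll_dice <- function(n) {
  # runif() draws decimals between 1 and 7; as.integer() drops the decimal part,
  # so each whole number from 1 to 6 is equally likely.
  as.integer(runif(n, min = 1, max = 7))
}

set.seed(42)  # assumed seed, only so the sketch gives repeatable output
rolls <- roll_dice(6000)

# For a discrete uniform distribution, each observed proportion should be close to 1/6.
print(round(table(rolls) / length(rolls), 3))

# Alternative: sample() draws whole numbers directly, with replacement.
rolls_alt <- sample(1:6, size = 6000, replace = TRUE)
print(range(rolls_alt))
```

With a large number of rolls, the printed proportions settle near one sixth each, which is what a discrete uniform distribution predicts.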
It is possible to roll any number from 1 to 6, but it is not possible to roll 2.5 or 4.8. Therefore rolling a dice generates a discrete distribution with a probability of one sixth for each outcome. Some uniform distributions are continuous rather than discrete. An ideal random number generator would be considered a continuous uniform distribution. With this type of distribution, every value has an equal opportunity of appearing. There are several other important continuous distributions, such as the normal distribution, chi-square distribution and Student's t-distribution. A uniform distribution with only two possible outcomes is a special case of a binomial distribution. There are several data generating or data analyzing functions associated with distributions to help understand the variables and their variance within a data set. These functions include probability density functions and cumulative distribution functions.

Let us look at how to generate a uniform distribution in the R programming language. Uniformly distributed random numbers have the same probability of occurrence in a given interval, by default between 0 and 1. In the R programming language, random numbers are created with the function runif, where r stands for random and unif stands for uniform. The argument we will pass is 10, which is how many random numbers are to be generated. When we execute the statement, you can find that 10 random numbers are generated. We will now create 200 random numbers and store them in an object x. In the next statement we will use the hist function and pass the argument as the vector x. When we execute these statements, you can find that a histogram is displayed which shows the uniform distribution created with the random numbers.

Let us now look at the ideal case of the theoretical uniform distribution. To find this, we need to compute the cumulative frequency, which can be displayed as a cumulative histogram. In the R programming language, the hist function does not create cumulative plots, so we need to perform some computations. We
will use the hist function and pass the argument as the object x, which contains 200 random numbers. The second argument we will pass is the plot parameter, assigning the value as FALSE. This is because we do not want to display the histogram; we only want to compute it. We will store the computed histogram in an object h. Let us now display the value stored in object h. When we execute these statements, you can find that the structure of the histogram is displayed, which contains several attributes. We will make use of the counts attribute and retrieve the result. We will use the cumsum function, which returns a vector whose elements are cumulative sums. We will store the result returned in an object hcum. In the next statement we will use the barplot function and pass the argument as hcum. When we execute these statements, you can find that a bar plot is created which displays the cumulative counts of the uniformly distributed random numbers generated.

In this video we will talk about the normal distribution. The normal distribution is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution is the most common type of distribution assumed in technical stock market analysis and in other types of statistical analysis. The normal distribution has two parameters: the mean and the standard deviation. The normal distribution is sometimes confused with a symmetrical distribution. A symmetrical distribution is one where a dividing line produces two mirror images, but the data could take other shapes in addition to the bell curve that indicates a normal distribution. In financial analysis, traders may plot price points over time to fit recent price action into a normal distribution. The further price action moves from the mean, in this case, the greater the likelihood that an asset is being over or undervalued. Investors or traders can use the standard deviations to suggest potential trades. This type of trading is generally done on very short time
frames. Similarly, many statistical theories attempt to model asset prices under the assumption that they follow a normal distribution. Please remember that even if an asset goes through a long period where it fits a normal distribution, there is no guarantee that past performance truly informs future prospects.

Let us practically look at a normal distribution. The R programming language provides the rnorm function to generate random numbers from a normal distribution. We will use the rnorm function to generate 100 random numbers from a normal distribution with the mean as 50 and the standard deviation as 10. Here the first argument is the number of random numbers, the second argument, that is 50, is the mean, and 10, the third argument, is the standard deviation. In the next statement we will use the hist function and pass the argument as x, the object which has the 100 random numbers generated using the normal distribution, and the second argument as probability, assigning the value as TRUE. This statement creates a histogram with the values of the probability density function. When we execute these statements, you can find that the histogram is plotted with density on the y-axis. We will compute the mean using the mean function and pass the argument as x. We will also compute the standard deviation using the sd function and pass the argument as x. When we execute these statements, you can find the mean and standard deviation computed for the hundred random numbers generated. We will now draw a bell-shaped curve onto the histogram. We will use the lines function and pass the argument as the density function with x as its argument. The second argument we will use is the col parameter, assigning the value as blue. When we execute the statement, you can find that the bell curve is displayed on the histogram. So this is how random numbers can be generated using the normal distribution.

In this video we will talk about the significance of computing the p-value for a statistical hypothesis. Descriptive statistics can help us identify relationships in data.
Descriptive statistics do not draw conclusions beyond the data samples. Inferential statistics does allow us to make conclusions beyond the data sample, that is, about the population. Inferential statistics is the process of drawing conclusions about the population based on a sample taken from the population. The sample is the data obtained from the population. The p-value is the probability that, if the null hypothesis is true, sampling variation would produce an estimate at least as extreme as the one observed. The p-value tells us how likely it is to get a result like this if the null hypothesis is true.

Let us consider a simple example. Let us say that Domino's sells pizzas at various locations all over the city. Assume that there is a complaint that the cheese on the pizza is not as much as it is supposed to be. Let's say that Domino's has a policy that there should be 100 grams of cheese on every pizza. Obviously the management can't visit all the shops in the city to check this complaint, so we can use a statistical test on a sample of pizzas. In this case the null hypothesis is the thing we are trying to provide evidence against: the average or mean cheese on each pizza is 100 grams. The alternative hypothesis is what we are trying to prove, that is, the customers have complained that the cheese on the pizzas is less than what it should be. So the alternative hypothesis is that the average cheese on the pizzas is less than 100 grams. We will use a significance level of 0.05, which corresponds to a confidence level of 0.95, that is, a 95% confidence level. If the p-value is lower than the significance level, we can reject the null hypothesis.

Let us practically look at how to compute p-values for a statistical hypothesis. We will be using a data set which consists of the amount of cheese on randomly selected pizzas. Let us import the dplyr package using the library function. In the next statement we will create my data as an object, and using the assignment operator and the read.csv function we will read the
data from the Domino's cheese .csv file. In the next statement we will display the contents stored in the my data object. When we execute these statements, you can find that the entire content of the Domino's cheese .csv file is stored in the my data object. In the next statement we will have the t.test function, which performs one or two sample t-tests on vectors of data. The first argument we will pass is the my data object, which contains the data from the Domino's cheese .csv file. The next argument we will pass is the mean, using the mu parameter. We will assign the value as 100. We are using the value 100 because we have seen in our example that the policy is to have 100 grams of cheese on every pizza. Let us execute this statement. You can find that we have the p-value as 0.343 and the mean value as 98.7 for the Domino's cheese data. Our significance level is 0.05, so in this case the p-value is greater than the significance level. Therefore we cannot reject the null hypothesis: the sample does not give enough evidence that there is less than 100 grams of cheese on the pizzas. So this is how computing the p-value and comparing it with the significance level helps in testing a statistical hypothesis.

In this video we will discuss the degree of freedom. The degree of freedom refers to the maximum number of logically independent values, which are the values that have the freedom to vary, in the data sample. The formula for the degree of freedom equals the size of the data sample minus 1. Let us consider the same example of Domino's pizzas. In the first statement we are importing the dplyr library. In the second statement we are reading the data from the Domino's cheese .csv file and storing it in the my data object. In the third statement we display the contents of the my data object. In the fourth statement we use the t.test function to perform a t-test on the data. Let us execute these statements. You can find that when the t-test is performed, we get the degree of freedom as 29. Let us have another
statement. We will use the length function and pass my data dollar grams cheese, that is, the my data object with the dollar sign and the grams cheese column. When we execute the statement, you can find that we get the result as 30. The degree of freedom is defined as the sample size minus 1. Therefore, since we have the sample size as 30, 30 minus 1 gives the result 29.

In this video we will talk about the confidence level and the confidence interval. Confidence intervals measure the degree of uncertainty or certainty in a sampling method. A confidence interval can be computed for any probability, with the most commonly used being a 95% or 99% confidence level. The confidence interval and the confidence level are interrelated, but they are not exactly the same. A confidence interval is a range of values that would likely contain an unknown population parameter, whereas the confidence level refers to the percentage of probability or certainty that the confidence interval would contain the true population parameter when we draw a random sample from the population. A confidence interval is an educated guess about a certain characteristic of a population.

Let us consider the same cheese example. The first statement will import the dplyr package using the library function. The second statement uses my data as the object, and using the read.csv function we will read the contents of the Domino's cheese .csv file. In the third statement we will display the contents of the my data object. In the fourth statement we will be using the t.test function. The first argument will be my data, which consists of the cheese data, and the second argument we will use is mu, to assign the mean, which in this case is 100. When we execute these statements, you can find that for the 95% confidence level, or 0.05 significance level, we have the confidence interval ranging from 95.9417 to 101.4582. So this is how we are able to compute the confidence interval for a confidence level of 95%.

In this
video we will look at hypothesis testing about a population from a sample. Statistical tests are used to test hypotheses about data, for example about specific properties of distributions or their parameters. The basic idea is to estimate probabilities for a hypothesis about the population from a sample. Descriptive statistics can help us identify relationships in data, but they do not draw conclusions beyond the data sample. Inferential statistics does allow us to draw conclusions beyond the data sample, that is, about the population. Inferential statistics is the process of drawing conclusions about a population based on a sample taken from that population. The sample is the data obtained from the population. Hypothesis testing is a procedure in inferential statistics. It is based on the idea that we can say things about the population from a sample. In hypothesis testing we have the null hypothesis, which states that two populations are not different with respect to a certain parameter or property; it assumes that an observed effect is purely random. The other is the alternative hypothesis, which is sometimes also called the experimental hypothesis. The alternative hypothesis claims that a specific effect exists, and it is never completely proven: accepting the alternative hypothesis means only that the null hypothesis is unlikely. Whether an effect is significant or not is determined by comparing the p-value of a test with a predefined significance level. If the p-value is less than the significance level, we reject the null hypothesis; that is, the sample has given evidence against the null hypothesis. To perform inferential statistics we use the t-test. A t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. A t-test is used as a hypothesis testing tool which
allows testing of an assumption applicable to a population. A t-test looks at the t-statistic, the t-distribution values, and the degrees of freedom to determine the probability of a difference between two sets of data. There are three types of t-test: the one-sample test, which compares the sample mean to a specified population mean; the two-sample test, which compares the means of two independent samples; and the paired-samples test, which compares the means of two paired samples. In the R programming language we can use the t.test function to perform all three types of t-test. Let us take an example to understand hypothesis testing. Suppose Domino's sells pizzas and runs a buy one get one free offer on some days to boost sales. We keep track of sales on all days. We take one sample of sales on the days that have the offer and another sample of sales on the days that do not. To perform the analysis, the first step is to decide the hypotheses. The null hypothesis is that there is no difference between sales on offer days and on non-offer days, and the alternative hypothesis is that there is a difference in sales between offer and non-offer days. Sales could go up or down as a result of an offer, so the null hypothesis is that sales on offer days are equal to sales on non-offer days, and the alternative hypothesis is that sales on offer days are not equal to sales on non-offer days. In the alternative hypothesis the difference can be positive or negative. This is called a two-tailed test or exploratory hypothesis. If we rearrange the equation for the null hypothesis, we have sales on offer days minus sales on non-offer days equal to zero. In the same way, for the alternative hypothesis, sales on offer days minus sales on non-offer days is not equal to zero. A statistical test can only test for differences between the samples and not for equality. So if we are assured that sales will not go down, and we are only interested in whether sales went up or stayed the same,
the hypotheses will be as follows. For the null hypothesis, sales on offer days are less than or equal to sales on non-offer days, which can be written as sales on offer days minus sales on non-offer days is less than or equal to zero. In the same way, for the alternative hypothesis, sales on offer days are greater than sales on non-offer days; rearranging this, we get sales on offer days minus sales on non-offer days is greater than zero. Now that we have decided on the hypotheses, let us perform the hypothesis test in practice. We will use the Domino's sales CSV file as the data set; its first column holds sales on offer days and its second column holds sales on non-offer days. In the first statement we import the dplyr package using the library function, passing dplyr as the argument. In the next statement we create an object, my data, and using the assignment operator and the read.csv function we pass the Domino's sales CSV file name as the argument. This statement reads the entire content of the CSV file and stores the data in the my data object. When we execute these statements, the entire content of the Domino's sales data is displayed in the output. We will now perform the two-sample t-test using the t.test function. We pass my data dollar offer days as the first argument and my data dollar non offer days as the second argument. Remember that the my data object is a data frame, so we need to use the dollar sign to refer to its columns. When we execute the statement, the result of the two-sample t-test is displayed. The p-value computed for the two-tailed, two-sample t-test is 0.01814, and the mean sales for the offer days are 307.35, whereas the mean sales for non-offer days are 265.83. The significance level is 0.05 by default, which corresponds to a 95% confidence level. The computed p-value, about 0.018, is less than the level of significance,
that is, 0.05, so we reject the null hypothesis. This means that there is a difference in sales between offer days and non-offer days. The p-value is the probability that, if the null hypothesis is true, sampling variation would produce an estimate at least as extreme as the one we observed. To compute the one-sample t-test, we use the t.test function and pass my data dollar offer days as the first argument and the mean value as the second argument, using the mu parameter and setting it to 265.83, which is the mean sales for the non-offer days. When we execute the statement, the one-sample t-test is computed, and the p-value for this one-sample t-test is 0.002456. This is how we can perform hypothesis testing in the R programming language.

In this video we will look at how to perform the chi-square test. The chi-square test examines whether the rows and columns of a contingency table are statistically significantly associated. There are two kinds of chi-square test: the test of independence, which asks a question about relationships, such as whether there is a relationship between two variables like ratings and courses; and the goodness-of-fit test, which asks, for example, whether a coin tossed 100 times comes up heads 50 times and tails 50 times, as expected for a fair coin. For these tests, degrees of freedom are used to determine whether a certain null hypothesis can be rejected, based on the number of categories of the variables in the experiment. The null hypothesis for the chi-square test states that the row and column variables of the contingency table are independent, whereas under the alternative hypothesis the row and column variables are dependent. For example, when considering students and course choice, a sample of 30 to 40 students is likely not large enough to generate significant results. Getting the same or similar results from a study using a sample of 400 or 500 students is more reliable. If the calculated chi-square statistic
is greater than the critical value, then we must conclude that the row and column variables are not independent of each other. This implies that they are significantly associated. Let us now look at how to perform the chi-square test in practice. We will use an example of ratings data. Let us create an object, ratings, and using the assignment operator and the c function we assign values to the ratings vector. Let us say that the elements of the ratings vector represent the reviews or ratings of a product given by different users. In the next statement we convert the ratings vector into a ratings factor by assigning to the ratings object the result of the factor function with ratings as the argument. We will use another example, courses data. Let us create an object, courses, and using the assignment operator and the c function we assign values to the courses vector. Let us say that the elements of the courses vector represent courses coded as zeros and ones. In the next statement we convert the courses vector into a courses factor by assigning to the courses object the result of the factor function with courses as the argument. Let us execute these statements. Let us now assign the names of the courses to the courses factor using the levels function. We use the assignment operator and the c function with R as the first element and Python as the second element, so the value 0 in the courses vector is now represented by the string R and the value 1 by the string Python. In the next statement we use the table function, passing ratings as the first argument and courses as the second argument. The table function builds a contingency table of the counts at each combination of factor levels; in this case the contingency table holds the counts for each combination of ratings and courses. When we execute these statements, you can see that the values of the ratings factor form the rows of the table, since
ratings is passed as the first argument, and the values of the courses factor form the columns of the table. The values in the table are the counts of each rating for each course. We save this result in an object, my data. In the next statement we use the chisq.test function and pass my data, which stores the contingency table of counts of ratings and courses, as the argument. The chisq.test function computes the chi-square test. When we execute the statement, the result of the chi-square test is displayed. For Pearson's chi-squared test the computed p-value is 0.5704. Since the p-value is greater than the usual significance level of 0.05, we fail to reject the null hypothesis: the data give no evidence that the row and column variables are associated, so we treat them as independent. You can also see that we get a warning that the chi-squared approximation may be incorrect. This is because the data set is small, so some expected cell counts are too low for the approximation to be reliable. We can use the simulate.p.value parameter and set it to TRUE. This computes the p-value by Monte Carlo simulation, replicating tables with the same row and column totals as the contingency table. When we execute this statement, you can see that Pearson's chi-squared test is simulated based on 2,000 replicates, and we get a p-value of 0.72, which again gives no evidence that the variables are dependent. This is how we can perform the chi-square test in the R programming language.
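The chi-square walkthrough above can be sketched in R roughly as follows. This is only a minimal sketch: the captions do not show the actual values of the ratings and courses vectors, so the numbers below are made up for illustration, and the object name mydata is an assumption about how "my data" was spelled on screen. The p-values it prints will therefore differ from the 0.5704 and 0.72 quoted in the video.

```r
# Hypothetical data: these values are illustrative, not the course's data.
ratings <- c(5, 4, 3, 5, 4, 2, 5, 3, 4, 1, 2, 5, 4, 3, 5, 4)
courses <- c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1)

# Convert both vectors to factors and name the course levels (0 -> R, 1 -> Python).
ratings <- factor(ratings)
courses <- factor(courses)
levels(courses) <- c("R", "Python")

# Contingency table: ratings form the rows, courses form the columns.
mydata <- table(ratings, courses)
print(mydata)

# Pearson's chi-squared test of independence. With a sample this small R warns
# that the chi-squared approximation may be incorrect, so the warning is
# suppressed here to keep the output readable.
result <- suppressWarnings(chisq.test(mydata))
print(result)

# Monte Carlo p-value based on 2000 simulated tables (B = 2000 is the default).
set.seed(42)  # only to make the simulated p-value repeatable
sim <- chisq.test(mydata, simulate.p.value = TRUE, B = 2000)
print(sim$p.value)

# Decision rule: reject independence only when the p-value is below alpha.
alpha <- 0.05
if (result$p.value < alpha) {
  message("Reject H0: ratings and courses appear to be associated.")
} else {
  message("Fail to reject H0: no evidence that ratings and courses are associated.")
}
```

The decision rule at the end mirrors the comparison of the p-value with the 0.05 significance level described in the video; the same comparison can be applied to sim$p.value when the approximation warning appears.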
Info
Channel: Softlect
Views: 15,065
Rating: 4.9669423 out of 5
Keywords: Data Science, R for Data Science, R for Statistics, Statistics, Statistical Measures, Mean, Regression analysis, Skewness, Variance, Analysis of variance, Descriptive statistics, Inferential statistics, Measures of central tendency, graphs, median, mode, Standard Deviation, Outliers, Quantiles, Quartiles, P-Value, Statistical Hypothesis, Degrees of Freedom, Confidence Interval, Hypothesis Testing, Chi-square Test
Id: 8Tq73qH8mlg
Length: 151min 31sec (9091 seconds)
Published: Thu Apr 02 2020