Explaining nonparametric statistics, part 1

Captions

Welcome to a Very Normal Therapeutics training video. Today I'm going to introduce the idea of non-parametric tests. If you're new here, my name's Christian, and this is Very Normal, a channel for making you better at statistics. You are a fictional new employee in my fictional pharmaceutical company, and it's my job to give you an actual lesson in statistics. Let's get started.

To use statistics is to view the world through the lens of mathematical models. By definition, these models are approximations of the real world, and they need various assumptions to be accurate. For example, we often assume that the data come from a specific distribution, like the normal distribution. This is called a parametric assumption, and it's required to give the t-test its characteristic distribution. If we can't assume that, we need to rely on the central limit theorem, and all the assumptions associated with that theorem, for the t-test to produce good results. Assumptions are everywhere in statistics.

This is where non-parametric statistics shines. Based on its name, you might think that non-parametric statistics is just a branch of statistics where we avoid the parametric assumption. While that's true, it's not the entire picture. You see, statisticians really suck at naming things, so the phrase "non-parametric" has come to mean a lot of things. Some non-parametric methods avoid the parametric assumption, but others avoid different types of assumptions we might make, like the specific form of a regression. In non-parametric statistics, you're still trying to learn from the data, but you're also trying not to assume too much about it. For the purposes of this video, it's useful to think of non-parametric statistics as a set of methods that try to minimize the number or strength of the assumptions we need to make.

In this first part of a two-part miniseries, we're going to look at our first example of a non-parametric hypothesis test. We're going to focus on how this non-parametric test avoids assumptions that a more
traditional test might make.

Before we look at the test, let me go through a motivating example. At Very Normal Pharmaceuticals, we care a lot about employee productivity, so I've installed surveillance software on all my employees' laptops to track how much time they spend on non-work sites. Here's the data I got. It's not a big company, so there's not much data to use, only 10 observations. But look at these lazy bums, I mean, employees. These people spend way more time on non-work sites. In other words, they're outliers, but unfortunately they're not outliers that I can just cherry-pick out of the data. It's okay for there to be outliers: some research contexts just have natural outliers in them, and it would bias the data to remove them.

I'd like to test the hypothesis that my employees spend about 60 minutes on non-work sites. That's how much time they're allotted for lunch, so that's how much time they should be spending on non-work videos. But here's the problem: the traditional test we use for a one-sample problem like this is the t-test. Like I mentioned earlier, the t-test requires us to assume that either the data is normally distributed or that there's enough data for the central limit theorem to kick in. Neither of those assumptions seems reasonable here. I could still perform the t-test, but I would have a hard time convincing other people that this was a good idea. I need a hypothesis test that allows me to avoid the assumptions of the t-test but still allows me to learn from the data.

The first non-parametric hypothesis test we'll look at is Wilcoxon's signed-rank test, named after the statistician Frank Wilcoxon. I'll be calling it the signed-rank test for short. The signed-rank test is a non-parametric alternative to the one-sample t-test. Since it's a hypothesis test, we're going to view it in terms of the elements of the null hypothesis significance testing framework, or NHST. We're going to see what the parameter of interest is, the null hypothesis, the test statistic, and the
distribution of the statistic under the null hypothesis.

I'll use my original notation for a data set here. I'll say that the sample size is some number n, instead of 10 like in my example. To denote a general distribution, we usually use a capital F to indicate some cumulative distribution function, or CDF. The signed-rank test assumes that the observations come from some continuous, symmetric distribution, and our parameter of interest is the center of this distribution, which we'll denote theta. Even though this is a non-parametric method, it doesn't necessarily mean that we get rid of all the parameters in the model. The center is still a parameter in the sense that it's unknown and we want to learn about it from our data.

Under the signed-rank test, we can represent all the observations as the sum of two terms: the center of the distribution, and some noise epsilon that nudges the value away from this center. This is also known as a location model. The noise itself is assumed to come from some symmetric distribution centered at zero. Lots of different families are symmetric, like the normal distribution, the t distribution, and other exotic distributions like the Laplace distribution. It's worth noting that if you assume the noise is normally distributed with some unknown variance, then you actually get back to the one-sample t-test. This general symmetric distribution is allowed to have heavy tails, which aren't allowed by the normal distribution.

The null hypothesis for the signed-rank test is that the center of the distribution is at some value theta naught. In my example, the null hypothesis is that the center is at 60. Both of these resemble their analogues in the one-sample t-test, but things start to look very different once we start to look at the test statistic, and this is what it looks like. The first thing you should notice is that the notation for the data is slightly different. I wrote the original data as a set of x's, but here there are y's. These y's represent transformed versions of the original
data. More specifically, they're the original data points centered by theta naught. Under the null hypothesis, the center of these y's would be zero, and the center of the original data would be theta naught. This function here is called the sign function. It spits out negative 1 if the transformed observation is less than zero and positive 1 if it's greater than zero. Since the distribution is continuous, the probability that y exactly equals zero is zero, so we don't have to worry about it. This term here has two layers to it. On the inside, we have the absolute value of the transformed observation, or how far it is from zero. The sign function handles the sign of the data, and this term handles the magnitude. This outer function R gives the rank of whatever is inside it, so it's ranking the magnitudes of the transformed data. If there are n observations in the data set, then the rank function ranges from 1 to n. Overall, we would interpret the statistic as being the sum of the signed ranks of the transformed data, and that's what gives the test its name.

This is a lot to take in, so let's look at a visual to build our intuition a bit better. Here's a few of my data points shown on the real line. Let's say that the null hypothesis is true and theta naught is pretty close to the true center. If that's the case, then approximately half the transformed data will be negative and the other half will be positive. After we take the absolute value of the data and rank them, it'll look something like this. Then the rankings will be evenly distributed between the negative and positive points. The ranks of a negative and positive observation will mostly cancel each other out, so under the null hypothesis the test statistic will be pretty low and not provide enough evidence for us to reject the null hypothesis.

Well, let's say that the null is totally off from the true center. If this is the case, then the majority of the data will either be positive or negative. Not only will there be an imbalance in the positive versus negative
values, there will also be an imbalance in the rankings. So when we finally sum up the signed ranks, we won't get that same cancellation we did when the null hypothesis was true. The test statistic will either be a large negative or positive number, which would be evidence for us to reject the null hypothesis.

The last element of the NHST we have to cover is the null distribution. Unfortunately, the null distribution of the signed-rank test doesn't have an easy closed form. Unlike the t distribution, there's no name for this distribution or even an equation to describe it. But if you're conducting this test in R, and you should be, you don't have to worry about it. The function itself calculates it for you and, by extension, can calculate the necessary p-values.

So why is the signed-rank test a non-parametric test? To understand this, it's helpful to look at the statistic for the t-test. It's well known that the null distribution for this statistic is a t distribution with n minus 1 degrees of freedom. The null distribution of the t-test depends on the fact that the distribution of the sample mean is normal. You see, you get a t distribution when you divide a standard normal random variable by this function of a chi-squared distribution, and this is what happens when you need to estimate the population variance. In other words, the null distribution depends on the original distribution of the data.

But in the signed-rank test, this isn't the case. Rather than using the data directly, this test uses a transformed version of the data, and even though we didn't see what it was, the distribution of the statistic doesn't depend on the distribution of the original data. Because of this, the signed-rank statistic is said to be distribution-free. This phrase can be a little confusing, because it still has a distribution, but it's more so a reference to the fact that we don't need to assume a distribution on the original data.

Now that we know what the test is, we can actually perform it, and the corresponding
function in R is the wilcox.test function. It's already in base R, so you don't need to bring in any extra libraries, which is nice. We'll apply it to my data set from earlier. I just need to pass in the data in the form of a vector and specify that the null hypothesis is 60 through the mu argument. According to the test, the p-value is less than 5%, so we can reject the null hypothesis that the typical non-work video watch time for my employees is 60 minutes. Noted.

In this video, we got our first glimpse of non-parametric statistics on this channel. We looked at a non-parametric analog for the one-sample t-test, Wilcoxon's signed-rank test. Non-parametric statistics are a valuable tool because they can prevent us from making unreasonable assumptions. This is helpful because it just makes them more general, but as we saw in the signed-rank test, this comes at the cost of increased complexity. The name "non-parametric" can be really confusing, so I dedicated this video to showing you how it applies in this particular case. In part two of this miniseries, we're going to look at a non-parametric analog to the two-sample t-test and examine when we might prefer a parametric test versus a non-parametric one.

If you like this video and want to see more, then consider subscribing to the channel. I also keep up a weekly newsletter so you can know exactly when a new video comes up and get some extra content as well. That's it for this one. Thanks for watching. I'll see you in the next one. [Music]
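
The statistic described in the captions, W = sum over i of sign(y_i) * R(|y_i|) with y_i = x_i - theta naught, can be sketched in a few lines. The video works in R; this sketch uses plain Python instead. The ten watch times below are made-up numbers, not the data shown in the video, and the function name signed_rank_statistic is only an illustrative choice.

```python
def signed_rank_statistic(xs, theta0):
    """Sum of sign(y_i) * rank(|y_i|), where y_i = x_i - theta0."""
    # Center the data at the hypothesized value theta0.
    ys = [x - theta0 for x in xs]
    magnitudes = [abs(y) for y in ys]

    # The test assumes a continuous distribution, so zeros and ties
    # should not occur; this sketch simply refuses to handle them.
    if 0 in magnitudes or len(set(magnitudes)) != len(magnitudes):
        raise ValueError("zeros or tied magnitudes are not handled in this sketch")

    # Rank the magnitudes from 1 (closest to theta0) to n (farthest).
    order = sorted(range(len(ys)), key=lambda i: magnitudes[i])
    ranks = [0] * len(ys)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank

    # Attach each observation's sign to its rank and add everything up.
    return sum((1 if y > 0 else -1) * r for y, r in zip(ys, ranks))


# Hypothetical minutes spent on non-work sites (NOT the video's data).
minutes = [52, 58, 63, 66, 69, 74, 81, 95, 140, 185]
print(signed_rank_statistic(minutes, 60))  # 45
```

With these made-up numbers, only two small ranks carry a negative sign, so very little cancels and the statistic lands far from zero. Note that R's wilcox.test reports an equivalent quantity V, the sum of the ranks of the positive differences; when there are no zeros the two are related by W = 2V - n(n + 1)/2.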

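One way to see why the signed-rank statistic is distribution-free: if the null hypothesis holds and the data come from a continuous, symmetric distribution, each of the ranks 1 to n is equally likely to carry a plus or a minus sign, independently of the others. That means the null distribution can be built by listing all 2^n sign patterns, whatever the shape of the data's distribution. Below is a rough Python sketch of that idea, not necessarily how any particular software computes it. The function name, the sample size of 10, and the observed value of 45 are hypothetical choices for illustration.

```python
from itertools import product


def exact_two_sided_p_value(w_observed, n):
    """P(|W| >= |w_observed|) when all 2^n sign patterns are equally likely."""
    extreme = 0
    total = 0
    # Every way of attaching a sign to the ranks 1..n is one equally
    # likely outcome under the null hypothesis.
    for signs in product((-1, 1), repeat=n):
        w = sum(s * r for s, r in zip(signs, range(1, n + 1)))
        total += 1
        if abs(w) >= abs(w_observed):
            extreme += 1
    return extreme / total


# Hypothetical case: n = 10 observations and an observed statistic of 45.
p = exact_two_sided_p_value(45, 10)
print(p)  # 0.01953125
```

Brute-force enumeration is only practical for small n, since the number of sign patterns doubles with every added observation. For the sample size in the video's example it is just 1,024 patterns.
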
Info

Channel: Very Normal

Views: 18,844

Keywords: biostatistics, statistics

Id: IPi-a7Z1ofw

Length: 10min 58sec (658 seconds)

Published: Mon May 13 2024