5 tips for getting better at statistics

Video Statistics and Information

Captions
This video is sponsored by Brilliant; more on that later.

Statistics is infamously misunderstood and, in practice, easily abused, but I don't think it has to be that way. When statistics is done correctly, we can get new medicines, new vaccines, and new ways to keep you glued to your computer screen. All right, maybe not that last one. If people knew how to use statistics correctly, I think the world would be a lot better off. It's a major reason why I started this channel: I wanted to make more people better at statistics. Even though I didn't major in math or statistics, I still managed to make my way to the end of a PhD. I'm not particularly gifted in statistics, but I've honed a lot of skills along the way that have made learning it much easier, and I want to share the top five that got me to where I am today. Whether you're a first-year statistics student, a working data scientist, or someone just doing some self-study, I hope that my experiences can help you on your own journey. If you don't know who I am, my name is Christian and this is Very Normal. I already told you what the channel's about, so let's get started with the list.

Tip one: start with concrete terms. I love learning about statistics, but even after six years of doing it full-time, I still don't think it's easy to learn. Statistics is essentially the study of data, so definitions have to be abstract to encompass all the different ways that data can be produced. Once you get it, it's easier to pick up different techniques from different fields, but when you're just starting out, it's so easy to get lost. That's how I felt in the first semester of my master's. I didn't understand how some of my classmates could just absorb the lecture material so fast, but that didn't stop me from being competitive. I wanted to understand it better than them, even if I didn't have the same background. So I tried a lot of things. Flashcards helped with memorizing stuff, but my understanding always felt empty. I went to all the office hours and
checked everything I could online, but no matter what I tried, the recurring struggle was that I had a hard time understanding the abstract mathematical ideas.

For example, one concept I struggled with was the chi-square test for independence. I knew how to do the test by hand and interpret the results, but deep down I knew I didn't really understand it. Specifically, I was having trouble with the null hypothesis of the test. I was taught that the columns are independent of the rows. I had no idea what that had to do with the test statistic or the distribution of the test, and no one could phrase it in a way that stuck with me. One day I just decided to sit down with the test and told myself I wasn't allowed to get up or eat until I came up with terms that I understood.

The example I ended up with involved a parent and a child. I thought of a situation where a parent will ask their kid to clean their room. Some days they'll remember to ask and sometimes they'll forget. This was my independent variable, or my row. The outcome was whether or not the kid actually cleaned their room: my column. If the kid listened to their parent, then the number of times that they cleaned their room when they were asked should be noticeably different than when they weren't. To me there was a clear dependence here between the row and the column.

Then I tried to figure out what a world would look like if they were independent. I thought of a world where this kid was now a teenager who is less inclined to listen. If they didn't listen, then the rows should look the same: the probability of them cleaning their room would be the same no matter what their parents said to them. I knew that this is one of the definitions of independence, so that's how I finally made the connection between the chi-square test and the idea of independence between the rows and the columns. And look, it's not a great example, but it's the actual example I used back then to put the chi-square test into more concrete terms. I understood it
and it made it easier to apply the chi-square test to other situations.

My first piece of advice is to always ground the statistical concepts you learn in terms you understand. If you get the abstract definition right off the bat, then more power to you, but having lots of concrete examples can help out, especially when some of your examples aren't exactly perfect. Like in my example, I thought of a single parent and a kid, but the data that comes from this example probably isn't independent. From my own teaching experience, it can be bad to stay purely in the abstract, especially in a collaborative field like biostatistics. I've worked with plenty of PhD students who have a strong statistical background but lose the script when it comes to applying this knowledge to a specific problem, like clinical trials. Biostatisticians are expected to work with other experts, and if you can't address their problems in terms they'll understand, then, as the saying goes, you're going to have a bad time.

Tip two: look for patterns. In most introductory statistics classes, it's common to learn a battery of concepts and methods, like hypothesis tests. It's tempting to treat these tests as a list of tools that you can use in different situations. This approach is useful for memorizing so that you can ace a test, but it's not very practical in real-world settings. To be an effective analyst, it's really important to build an intuition for statistics so that you can apply your knowledge in a variety of real-life situations and not just a test.

Let me tell you about my first ever midterm as a master's student. My first biostatistics class was split into two halves: the first was for basic hypothesis tests and the second was for linear regression. This midterm focused on the hypothesis tests. Everyone was allowed a one-page cheat sheet for the exam. I don't have a picture of mine anymore, but this is roughly what I put on it. I tried to fit all of my tests onto a decision tree, and everyone who saw it made fun of me. When I saw
what other people had, I learned that other people take their cheat sheets really seriously. I spent so much time developing that, and honestly, I didn't even look at it once during the test. It made me question if these cheat sheets, or my decision tree, were missing the point. In my heart of hearts, I knew that I wouldn't have a cheat sheet working as a full-time employee. If I forget something, I'll have time to look up the details, but I thought I should at least have a well-developed intuition for what tests are appropriate at which times. One way to develop this intuition is to look for patterns in the things you learn. Patterns indicate a unifying concept that you can use to connect multiple ideas together.

Here's a list of the six tests that most people learn in their first statistics class: the one-sample z-test, the one-sample t-test, two variants of the two-sample t-test, and the one- and two-sample proportion tests. On one hand, you can view these six tests as just six things you need to memorize. But check this out: here's the test statistic you need to calculate for each of them. Do you notice any patterns here? They all have a similar form, where the estimator is centered by the population mean and divided by the population standard deviation. This calculation is called standardization, and it typically produces a standard normal with zero mean and unit variance. Knowing that these three tests are distributed like a standard normal and the others have a t-distribution, it must mean that this statistic here also has a normal distribution. If we rearrange the terms, we'll get something like this, and this result comes from the central limit theorem. This is for the z-test, but the same logic essentially applies to the five other tests. The underlying concept behind all of the basic hypothesis tests is the central limit theorem. Instead of having to memorize six distinct hypothesis tests, you can derive each of them starting from the central limit theorem, the single unifying pattern.
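
The standardization pattern described here can be tried out in a few lines of code. What follows is a minimal sketch in Python, not something shown in the video (the speaker mentions working in R). The Exponential(1) population, the sample size of 50, and the 5,000 repetitions are all assumptions chosen purely for illustration. The idea is to standardize many sample means and check that they behave roughly like a standard normal, as the central limit theorem suggests they should.

```python
import math
import random
import statistics


def standardized_means(n, reps, seed=0):
    """Simulate `reps` standardized sample means from an Exponential(1) population."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # true mean and standard deviation of Exponential(1)
    z_values = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        xbar = statistics.fmean(sample)
        # Standardization: center by the population mean, scale by sigma / sqrt(n)
        z_values.append((xbar - mu) / (sigma / math.sqrt(n)))
    return z_values


z = standardized_means(n=50, reps=5000)
z_mean = statistics.fmean(z)
z_sd = statistics.stdev(z)
# Share of standardized means inside +/- 1.96; close to 0.95 for a standard normal
coverage = sum(abs(v) <= 1.96 for v in z) / len(z)

print(f"mean of z: {z_mean:.3f}")
print(f"sd of z:   {z_sd:.3f}")
print(f"share within +/-1.96: {coverage:.3f}")
```

Even though the underlying data are skewed, the standardized means should come out with a mean near 0, a standard deviation near 1, and about 95% of values inside plus or minus 1.96. Swapping in a different population or sample size is an easy way to see when the approximation starts to struggle.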
Patterns like this are common in statistics. Once you know the pattern, you can take advantage of it and use old instances of the pattern to learn about a new one.

Tip three: learn a statistical programming language ASAP. If you're in a statistics class and most of your studying is dealing with proofs or hand calculations, I highly recommend that you incorporate programming into your study routine. Nowadays it's uncommon for statistics to be done by hand. If you need to conduct a hypothesis test or run a regression, you're most likely going to do it with code. My first tip recommended that you make it a habit to translate statistical concepts into terms that you understand. This tip recommends that you translate the concepts into the language of code.

There are lots of benefits to translating statistical procedures into code. The first one is that you can focus more on actually interpreting the results rather than doing rote calculations. Another benefit is that code gives you an alternative, more concrete way to interact with statistical ideas, which I'll get to in the next tip. And third, code gives you a faster, reproducible way to visualize these concepts. Here's an animation of me plotting the histogram of many sample means. I wanted to see how the histogram compared to what I should expect in theory, given by the central limit theorem, and almost like magic, the histogram begins to fill the normal PDF. I used this visualization in my explainer for the normal distribution, but it was originally something I made when I was first trying to understand the central limit theorem. Sure, you can make plots by hand, but with code you can take it to another level and make dashboards or animations like I did. In my opinion, I don't think I'm as technically gifted in math as the other PhD students in my department, but I make up for it by the fact that I know R really well and I can find other ways to fill in the gaps in my understanding. If you're just starting out, it doesn't really matter
what language you choose. Most people will do well with either Python or R. The point is to just pick one and start learning and using it. If you have a different take on that, please let me know in the comments. I'm going to need to get a full-time job next year and I've got to be prepared. The true benefit of programming for the purpose of learning is that it forces you to think about the concepts from yet another perspective. There's the purely statistical perspective, then there's the applied context that you're working in, and now there's a third, code perspective. The best statisticians know how to effortlessly translate between these three different lenses and communicate with anyone on their team.

Tip four: strive to investigate and struggle. Have you ever heard your professor say something that just didn't make sense to you, and before you could ask about it further, they've already moved on? It can be really frustrating when that happens, but these are actually perfect opportunities to take the initiative to fill in the missing details yourself. On one hand, you can just sit down and try to struggle with the theory. Sometimes that's necessary. Other times you can start interacting with the idea through code. The ability to generate synthetic data is extremely powerful. It gives us a way to get data without actually needing to spend time and money to get it. More importantly, we can use these data sets to interact with statistical concepts. If you don't understand something in class, you don't have to wait for office hours to get an answer. You can just pull up RStudio and start figuring it out right away.

Here's an example from my own experience. One time my professor said, "Multicollinearity hurts your hypothesis test." That made no sense to me, so I set aside some time to figure out what "harm" meant. If you're not familiar with the concept, multicollinearity is when two or more regressors in a linear regression are correlated. What wasn't clear to me was how this correlation hurts a hypothesis test and
what form this harm takes. So I ran a simulation study where I generated correlated data and used this data to simulate a linear regression. Here's how multicollinearity influences the estimated regression parameters. This plot is for the first and this one is for the second. This black line indicates the true value for each parameter, while these red and blue lines show the average of 1,000 simulations for a given correlation. You can see that, for a fixed sample size, as the correlation increases the estimates also get worse, indicated by how far they're deviating from the true value. So one form of harm is worse estimates.

Here's a similar plot looking at the standard errors of the estimates. As the correlation increases, the standard errors also increase. Not only that, but it looks like it increases exponentially. This is important because the standard errors control how large the confidence intervals are. Very roughly speaking, if the confidence intervals are larger, it's easier for them to contain the null hypothesis and make it so we fail to reject it. This directly represents a decrease in power. To summarize my point: when my professor said that multicollinearity harms your hypothesis test, she was referring to the fact that it, one, increases the bias in the estimation process and, two, explodes your standard errors, which has a downstream effect on power.

Let's say you knew that already. You knew that the increased correlation between the regressors ultimately gets captured in this expression for the variance of the estimates. While this intuition can give you the answer, it can't tell you everything. The Monte Carlo approach gives you the extra benefit of quantifying how much the standard deviation will change for a given correlation. I debated whether or not to make this a separate tip from tip three, but it happened so much that I thought it merited its own point. Sometimes professors and textbooks will just say things that you don't understand, and from my perspective, the process of
investigating further is a win-win situation. If you're lucky, you'll figure it out and you'll have the code and plots that give you a much more detailed answer than your professor could give. More often than not, you'll get stuck, but by investigating further you can better identify what about the problem you don't understand. Often we don't even know what we don't know, and by extension, we don't know what questions we need to ask to get the answers we need. Learning to ask the right questions is a general skill that everyone needs, but in order to know what those questions are, you need to know the exact limits of your understanding, and you get that the fastest just by struggling with the ideas.

Tip five: respect limits and assumptions. A lot of people are familiar with the phrase "There are three types of lies: lies, damned lies, and statistics." There's truth to the statement, but not in the way most people think. Some people think it's a burn when they comment this on one of my videos. They think that this quote suggests that the entire field of statistics is a lie, something that's dishonest, but that's just not true. What Mark Twain was really referring to is that statistics relies on models: idealized approximations of the real world. Since they're approximations, they can't possibly capture all the complexities of the real world, but they're close enough that we can still learn from them.

More often than not, statistical models will require multiple assumptions. Some assumptions are used to make sure that you can take advantage of important theorems like the law of large numbers or the central limit theorem. Other assumptions are made to simplify hard problems: instead of assuming a general probability distribution, we often assume the data comes from a parametric family. There are also assumptions about the data itself. One of the most common assumptions we make is that the data are independent and identically distributed, and as it turns out, this is a very strong assumption to make.
While some statistical methods let us relax some assumptions, at the end of the day you still have to make some of them. The best thing you can do is to know what assumptions you're making and know why you need to make them in the first place. A lot of the misused statistics I see come from well-meaning, smart people, including MDs and PhDs, who don't know what these assumptions are and don't bother to consult with a statistician before collecting their data. Then they're stuck with an expensive data set that doesn't tell them anything. There's an R. A. Fisher quote that says, "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." The moral of this last tip is to keep track of all the assumptions you use for a statistical model and the data you collect. The reality is most assumptions can't actually be verified, so the next best thing is to make them explicit with your collaborators. It'll save you wasted effort when you need to do the analyses yourselves.

In this video, I delved into some of my best practices for learning new concepts and models in statistics. I'm not going to pretend like anything I've said here is original or new, but I do all five of these on a daily basis when I'm trying to learn something new for my research or when I'm working with students in office hours. Some of you might feel that all these tips are related or are just different sides of the same coin, and that's on purpose. My personal philosophy for developing expertise is that you need to interconnect your knowledge as much as possible and learn to look at the same thing through different lenses. This general strategy is called interleaving. All of the tips I've taught you are concrete realizations of the strategy. These tips work for me, but in their current form they might not work for you. My hope is that I've inspired you to take one of these tips and adapt it to your own personal needs, and if
you got something out of this video, consider subscribing to the channel and my newsletter if you want to stay updated.

Learning statistics is just like learning any other field: if you want to get better at something, you have to put in the time and effort to internalize it in terms you understand. This can be really time-consuming, but the sponsor of this video can help speed up the process. Brilliant is an online platform for learning math, computer science, and data science. They offer courses that are updated every month, and you can solidify your understanding through interactive exercises. Firsthand experience in problem solving is the best way to stress-test your knowledge, and Brilliant makes it the top priority. I've been working through the Math for Quantitative Finance course, since it's a topic in statistics that I don't know much about but want to know more of. To try everything Brilliant has to offer for free for a full 30 days, visit brilliant.org/VeryNormal or click on the link in the description. You also get 20% off an annual premium subscription. Thank you to Brilliant for sponsoring this video. Thanks for watching, everyone. I'll see you in the next one. [Music]
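
The kind of simulation study described under tip four can be sketched roughly as follows. This is not the speaker's code: it is written in Python rather than R, and the sample size (50), the true coefficients (1, 2, and -1), the noise level, the 500 repetitions, and the grid of correlations are all assumptions picked for illustration. It generates two regressors with a chosen correlation, fits a least-squares regression by the closed-form solution for two regressors, and tracks how much the estimate of the first slope varies across repetitions.

```python
import math
import random
import statistics


def simulate_b1(rho, n=50, reps=500, seed=1):
    """Return the mean and SD of the least-squares estimate of b1 over `reps` data sets."""
    rng = random.Random(seed)
    b0, b1, b2 = 1.0, 2.0, -1.0  # assumed true coefficients, for illustration only
    estimates = []
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 is built to have correlation rho with x1
        x2 = [rho * a + math.sqrt(1 - rho**2) * rng.gauss(0, 1) for a in x1]
        y = [b0 + b1 * a + b2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
        s11 = sum((a - m1) ** 2 for a in x1)
        s22 = sum((b - m2) ** 2 for b in x2)
        s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
        s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
        s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
        # Closed-form least-squares slope for the first of two centered regressors
        estimates.append((s22 * s1y - s12 * s2y) / (s11 * s22 - s12**2))
    return statistics.fmean(estimates), statistics.stdev(estimates)


results = {rho: simulate_b1(rho) for rho in (0.0, 0.5, 0.9, 0.99)}
for rho, (avg, sd) in results.items():
    print(f"rho={rho:4.2f}  mean b1 estimate={avg:6.3f}  SD of estimate={sd:6.3f}")
```

The spread of the estimate should grow quickly as the correlation approaches 1, which is the standard-error effect the transcript describes. Changing `n` or the correlation grid is a quick way to quantify how large the effect is for a setting you care about.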
Info
Channel: Very Normal
Views: 20,754
Keywords: biostatistics, statistics
Id: StSAJIZuqws
Length: 17min 15sec (1035 seconds)
Published: Mon Apr 29 2024