How Much Statistics Do You REALLY Need for Data Science?

Video Statistics and Information

Captions
What's up, YouTube, welcome back to my channel, Richard on Data. My name is Richard, and this is the channel where we talk about data.

I talked in my last video about statistics being one of the real core sets of skills that you need in order to be successful in data science. But I also understand there's a lot of you who want to go into data science who are maybe going through school right now. Maybe you have an applied stats major or minor, or you're not even necessarily in stats in the first place; for instance, a lot of people go into data science from a computer science or an engineering background. So I want to talk to you guys today about how much statistics you really need to know, and what the most important things are to know to maximize your chances of, number one, getting a data science job in the first place, and number two, once you have that job, actually being effective at it.

Before we get into this, just a little bit about me. I actually have a master's degree in applied statistics myself. Now, if you're asking me, do I use my degree every single day in my job, my answer would be yes, but sort of. The reason I say that is when you get a degree in statistics, or in a lot of other fields for that matter, they're really gonna hone your critical thinking and your inferential reasoning skills, because you're constantly solving problems that are unclear, that you really have to think through, and you're really flexing those muscles in your brain that are responsible for solving complex problems. So education is very helpful. As far as the actual skills that I learned during the degree, I would say there are some on the heavy theoretical side which I've never once used, and there are other skills which I use almost every single day.

Just as a bit of a disclaimer: I've worked in data science for about five years now, and what I'm about to discuss as what I feel are the most important things to know in statistics are based on my own personal
experience. They're also based on the anecdotal experiences of colleagues and people in my network that I keep in contact with, as well as research that I've done. In that sense, take these with a bit of a grain of salt, because your experience could certainly vary, and it definitely varies across regional lines as well. But having said that, I really do believe if you know these things, you're going to be in a really good position in the data science world.

All right, so speaking broadly about statistics as it pertains to data science: you need to have a good, solid scientific foundation behind the work that you do. You also need to have tools in your arsenal for attacking various problems. And then you need to be able to explain your work and understand the things which influence the world that you're doing data science work on. So I've grouped the things that you need to know in statistics into three different categories: I've got the foundation type stuff, I've got tools, and I've got reasoning, inferential type skill sets.

If we start with the foundational skills, number one I've got here is probability. Knowing probability is really gonna give some integrity behind your work, and the reason for that is all of statistics, at the end of the day, is based on probability theory. So I would understand things like doing probability calculations, the central limit theorem, conditional probability, Bayes' rule. These are things which have wide-reaching implications across a variety of different subject spaces, so understanding those things is gonna go a really long way.

Next down the list I have distributions. I would understand things like the expected value of a distribution, or of a random variable rather, and the variance, and also things like just generating and working with that distribution inside of a programming language: calculating quantiles, using them to perform probability calculations, things like those. You don't necessarily have to know every single distribution, but some key
ones like the normal, exponential, maybe gamma and beta, uniform, those are things which you should probably have in your toolbox.

For the next item on the list we have estimation. Often in data science you're going to be asked to estimate some sort of quantity, but it's not going to be enough to provide just a point estimate for it. You're going to need to provide some kind of interval in order to communicate what the spread or the uncertainty around that quantity is. So when we think about something like a confidence interval specifically, it could be any confidence level, 90%, 95%, 99%, whatever have you. You need to be able to generate that confidence interval inside of a programming language and then interpret it. Say it's a 95% confidence level: what does that mean? And then that interval that you created in the first place, what exactly does that mean?

And now for the final item under foundation, I've got inference. That, to me, is understanding the hypothesis testing framework from beginning to end. Now, in your data science job it's extremely unlikely that you're gonna spend all of your day conducting a bunch of t-tests or a bunch of small-scale hypothesis tests. That's really not going to happen, but that's also not really the point, because the hypothesis testing framework, starting from stating what you're trying to test, then working with a p-value that's getting generated from your data, and then coming up with the conclusion, is a really helpful framework for thinking through what you can demonstrate and what you can't demonstrate through statistics and through data science. So speaking about p-values, I would thoroughly understand what that number indicates and then, honestly more importantly, what it doesn't mean. And then also, as far as hypothesis testing is concerned, multiple comparisons rear their ugly head in data science all the time, so understanding how to tackle those problems, and then things like multiple comparison corrections, for instance a
Bonferroni correction as an example, those are things which you should probably be equipped to work with as well.

And now next up we have the tools which you're actually going to use to attack problems. For a lot of instances, linear models will actually be enough. Suppose it's some kind of classification or regression problem where you're trying to classify or predict some kind of outcome. For a lot of those problems, a linear or logistic regression type of approach will suffice. The biggest reason for that is that a lot of people in all kinds of industries, even if they don't come from mathematical or statistical backgrounds, intuitively understand the concepts and the meaning behind linear models. These things can be explained by pretty straightforward mathematics; they're not black-box type approaches the way a lot of machine learning models are. And being able to interpret and understand why the model is returning the results that it is, that's going to be very important for a large number of stakeholders. So I would understand how to create these models and how to work with things like some basic model selection, backward and forward selection. And then also, you're going to create a model inside of your programming language and it's going to return a lot of output. You need to be able to understand all of that output. Wrapped inside of that output you have things we discussed under the fundamentals, things like a confidence interval for your slope parameters, and you're conducting a hypothesis test in the background. So just understanding how all those things tie together, that's something that you're gonna want to know.

And next up, everybody's favorite: machine learning. Or at least that's some people's favorite; some people hate it, some people love it. I happen to love it. Now, machine learning is gonna vary substantially depending on the job that you're working in. Some jobs may have you working in machine learning every day, some not at all,
and that's really going to depend a lot on how advanced that company's infrastructure and data science processes are. But having some of those tools in your arsenal is going to be helpful, particularly with supervised learning, meaning classification or regression problems, because it's pretty well known that a lot of machine learning algorithms are generally going to outperform more linear type models.

Now, you don't need to be familiar with every single model in the book, but these days things like decision trees and random forests, those terms get tossed around a lot, so those are things you should familiarize yourself with somewhat. Things like k-nearest neighbors, support vector machines, neural networks, those are some of the other really big algorithms out there these days, so certainly be able to implement them inside of a programming language. Now, these days things like scikit-learn in Python or caret in R make the whole streamlined process super easy, so I would be able to work with that entire process.

Now, I must say as a caveat, it's really easy to get good at the actual programming of a machine learning model, but something that's really going to set you apart is knowing how you could actually improve that model. To me, that often comes back to understanding the bias-variance trade-off. Oftentimes, with a lot of different data sets, you're going to run into an overfitting problem. Being able to think through how to address that and make your model perform better, that is really gonna set you apart from a lot of other people.

Lastly, under tools, I've included survival analysis, and my reason for that one is partly personal: I've worked in the healthcare and the pharmaceutical space a lot, so survival analysis comes up all the time. So things like working with a Kaplan-Meier curve, working with a Cox proportional hazards model, basically anything involving some kind of time-to-event outcome or time-to-event endpoint. That could come up in a lot of different
industries, not just in the two that I described, so that's probably going to be a good one to know as well.

Moving on to our third and final category, we have reasoning, or what I like to call elements in the statistical world which can help you to interpret as well as explain the space behind your analysis, the design shortcomings, and things like that. If we start with the first bullet item, we've got assumptions. Any single statistical model or test that you do is going to carry with it some level of conditions or assumptions. Now, you don't want to just blindly throw a test or a model at a problem. You need to know these conditions and assumptions so that you can be sure that your test or model is appropriate. To take that just a little bit of a step further, there are some conditions which are going to be important sometimes, and sometimes the results of your analysis are going to be robust even if that condition isn't necessarily met. Just as an example, suppose you're working with a two-sample t-test and you have two groups, a sample of a hundred in each group. One of the conditions of that test is normality. The thing is, once you have a large enough sample size, you could actually simulate this and figure out that most of the time, even if you don't have normality in your two groups, because you have that large sample size you're not gonna have any real problem from a type 1 error or from a power standpoint. So you can probably go ahead with that test without any problems, even though you're violating that assumption. However, things like independence of your actual data points, if you violate that, you're in a world of problems. So understanding some of these things is going to serve you pretty well.

Next up on the list we have bias. There are all kinds of different biases which can affect your analysis: there's response bias, there's selection bias, there's survivorship bias, there's more
than I can even list off the top of my head. So I would get familiar with these, because they will show up in your work, and you need to understand how they could affect your results.

And then lastly we have confounding. This is the general idea that shows up in causal or general inferential problems where X is thought to influence Y, but really there's some background variable Z that's affecting both X and Y. If you don't know it, definitely look up Simpson's paradox. Understand that concept inward and outward, because that's something that's going to show up all the time, and knowing that is really gonna set you apart as a data scientist and as a consultant.

To understand some of these reasoning concepts a little bit better, I really do recommend a book called How Not to Be Wrong, by Jordan Ellenberg. It's really good at diving really deep into some of these reasoning ideas. It definitely talks about bias, it talks about confounding, it talks about a whole variety of other principles, and it really brings a mathematical and a statistical edge to them for the non-mathematician and the non-statistician. Link will be in the description.

So in conclusion, as I'm sure you all know, the data science field varies very substantially from job to job, so as a result the skill sets that are going to be required are going to vary substantially as well. That is the challenge with lists like these. As a result, the more knowledge that you have, the better. Having said that, obviously data science is an industry where the demand for knowledge is quite high. So for that average person who wants to get into data science, do really well in their job, and isn't necessarily interested in developing the field in either the Silicon Valley world or publishing research, I really do believe that these skills, if you master them and get good at them, are going to serve you incredibly well.

So thanks again for watching this video. If you enjoyed it, smash the like button and hit the subscribe button. If you
agree or disagree with the content, then please leave a comment down below and let me know, and then I will see you in the not-so-distant future. Until then, Richard on Data.
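
The simulation mentioned in the assumptions discussion (a two-sample t-test with a hundred observations per group and non-normal data) can be sketched roughly as follows. This is not from the video; it is a minimal illustration under stated assumptions: both groups are drawn from the same exponential distribution (an arbitrary choice of skewed data), the test is the classic pooled-variance t-test, and the critical value 1.972 is an approximate two-sided 5% cutoff for 198 degrees of freedom. The function names are hypothetical.

```python
import math
import random

N_PER_GROUP = 100
# Approximate two-sided 5% critical value of the t distribution with
# 2 * N_PER_GROUP - 2 = 198 degrees of freedom (assumption: only valid for n = 100).
T_CRIT_DF198 = 1.972


def pooled_t_statistic(a, b):
    """Equal-variance (pooled) two-sample t statistic."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err


def estimate_type1_error(n_sims=2000, seed=0):
    """Share of simulated tests that reject although the null is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Both groups come from the same skewed distribution, so any
        # rejection is a false positive (a type 1 error).
        a = [rng.expovariate(1.0) for _ in range(N_PER_GROUP)]
        b = [rng.expovariate(1.0) for _ in range(N_PER_GROUP)]
        if abs(pooled_t_statistic(a, b)) > T_CRIT_DF198:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print(f"Estimated type 1 error rate: {estimate_type1_error():.3f}")
```

If the estimated rate lands close to the nominal 0.05, that is consistent with the point made in the video that the normality condition matters little at this sample size. The sketch says nothing about power or about violations of independence, which would need their own simulations.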
Info
Channel: RichardOnData
Views: 24,240
Rating: 4.9735098 out of 5
Keywords: data science statistics, statistics for data science, statistics, data science for beginners, statistics and probability tutorial, regression, inference, data science, machine learning, artificial intelligence, data scientist, learn data science, statistics tutorial, how much statistics do you need for data science, what is data science, statistics data science, data science tutorial, deep learning, how to become a data scientist, introduction to data science, data analysis
Id: _tEek2jPn-4
Length: 15min 19sec (919 seconds)
Published: Fri Nov 29 2019