A Study Pathway for Data Science in 2020 (7 Steps)

Video Statistics and Information

Captions
What's up, everybody, welcome back to my YouTube channel, Richard on Data. If you're new here, my name is Richard, and this is the channel where we talk about all things data: data science, statistics, and programming. Subscribe for all kinds of content just like this, and hit the notification bell so YouTube notifies you whenever I upload a video.

I'm gonna say something that pretty much everybody knows: data science is white hot right now, and everybody wants to get into it. They come from across the spectrum. There are people in school right now in programs for statistics, computer science, even business, and they're looking at data science as a potential career field. Same sort of thing with professionals: there are tons of people in other professions who want to make lateral career moves to get into data science.

But here's the problem. The real question is, how do you get into data science? A big part of the problem is that the field is so multidisciplinary. I did a video a while back where I described what data science really is, and I described it as the intersection of four different skill sets: statistics, programming, communication, and domain knowledge. But that's just the high-level overview. Statistics is an extremely broad field in and of itself, there are tons of different programming languages out there, and nobody has time to go to school or get a bunch of different certifications to learn 13 different programming languages. So it's a totally natural question to ask: what are the most important skills to know, and what's the order of priority?

So what I thought would be helpful is to lay out a study pathway for data science. How this will work is: you start with item number one and learn it; after you're done with that, move on to item number two, and so on and so forth. That way you have priorities, you have things you can focus on, and you can see yourself becoming more marketable and more competitive for your first data science job. Now, you might know
item number two, but you don't know item number one or item number three, and that's totally okay. My recommendation then would be to start with item number one, then move on to item number three.

Now, what's my basis for all of this? A lot of it is based on what I think has made me successful in my years as a data scientist so far, but it's also based on my own reflection about what I would have done differently if I could start over today, as well as what the trends are in data science in the year 2020. There are people out there who say that everybody's pathway to becoming a data scientist is different, and I both agree and disagree with that. There's certainly tons of variety in the field, and people come into it from all kinds of different backgrounds and skill sets, so what you end up with is different flavors of data science. But having said all of that, there are some universal things which I think every data scientist must know in order to be successful. So without further ado, let's move on to item number one.

Item number one is statistics. You really can't argue with how important statistics is to data science. To me, it's really what prevents the field of data science from being data pseudoscience. Things like programming languages are truly the how of data science, domain knowledge will give you the why, but statistics will give you the what. I did a video all about how much statistics you really need for data science; if you haven't seen that, check it out, I'll have it linked. But in short, you need to know things like probability distributions, Bayes' rule, confidence intervals, and hypothesis tests. That way you have a good, solid foundation. You also need to reason your way through problems: you need to know concepts like confounding variables, Simpson's paradox, bias and variance, and the assumptions of different statistical tests. And you need to know different techniques, like the statistical tests themselves and things like survival analysis. Machine learning methods
are important too, but they get their own category later; for this step, just gloss over them.

Now, the question you're probably asking is, how do you learn statistics? Well, if you're interested in data science and you're just now looking at master's programs, statistics is about as good an option as you could possibly pick. Some schools do offer full-blown data science programs now, but depending on how much computer science expertise you have, you might even be better off going with the statistics degree over the straight data science degree. It just comes down to the fact that there are a lot more programming and computer science resources out there than there are for statistics. Naturally, having a degree is the best possible option for learning statistics, because you just get immersed in the field for two or four or however many years your program is, and if the program is good, you'll get the opportunity to do and try things, so you'll get a healthy balance of both the theory and the application. But at the end of the day, it's important to actually know your stuff, as opposed to just looking like you know your stuff because you have a fancy degree. And if you just don't have the time or the money to go to school, you need to find some other options, and the next best thing is probably Coursera courses. There are several very popular options out there: there's one from Duke University, one from Johns Hopkins University, and one from the University of Amsterdam. Links to all of these will be in the description.

All right, so you've got a good foundation in statistics. You know the frameworks, you know the tools, maybe you even know some theory. Now, item number two on the study pathway is SQL. Why would I suggest that you learn SQL before you learn the more mainstream programming languages like R or Python? Well, there are three reasons for that. First of all is a psychological reason: SQL is almost universally considered to be easier than R or Python. You don't have
the problem of constantly learning new packages, and even to somebody with no coding expertise whatsoever, SQL reads basically like the English language and it's pretty easy to pick up. Secondly, SQL just prepares you for the real world. Data collected in the real world is really messy; it lives in all kinds of different environments, oftentimes with different primary keys, and SQL just equips you to deal with those kinds of challenges. And third, you could easily be in a situation where the work that you would do in R or Python is downstream. That is, maybe your data live in a SQL Server database environment, and you have to query different data sets from there first in order to create a working data set. After that point, you could do things like models and algorithms and the typical stuff that you would do in R or Python, but if you can't use SQL to get to that functional data set in the first place, then you have all kinds of trouble. So this step should hopefully be a pretty quick and easy thing to learn. You absolutely don't have to be a master database architect or SQL guru in order to be an effective data scientist, but you obviously need to know how to query your data. That means knowing things like joining data sets, CASE WHEN statements, EXISTS statements, window functions, nested queries, stuff like that, and also storing temporary sets. We're going for function over form here, for the most part.

All right, so now you know SQL and you're able to effectively query your data. Moving on to item number three, and you already saw this one coming: it's one of R or Python. I did a video recently on where the R vs. Python battle stands in the year 2020, and I did conclude that Python is a bit more of a hot job market item right now, just because R seems to have declined a little bit in the last two to three years. However, I'm not going to tell you which one of R vs.
Python you should learn first. You need to make that decision yourself, based on your own educational background and your own interests. Here's the deal: you don't want to get yourself into a position where you're trying to learn two things at once if you don't have a background in either of them. It's very difficult to learn two things at once, and you're much better off being an expert at one thing than being mediocre at two. If you're coming from a stats background, you probably learned R along the way, and likewise, if you're coming from a more computer science or programming background, you probably picked up Python. If that's the case, you're probably best off sticking with that language and mastering it. So learn one of these two from beginning to end. You're going to want to know the overall fundamentals of the language: how to tidy and manipulate your data set, how to create visualizations, reports, models, all of that. Both of these languages have key data science related packages. If you're learning R, a good starting point is probably the tidyverse, and if you're learning Python, some good starting points would probably be pandas, NumPy, scikit-learn, seaborn, Matplotlib, and statsmodels.

All right, so at this point you know statistics, you know SQL, and you know one of R or Python. You're in an extremely good position to get started in data science. You're not gonna get every single job listing that you see on LinkedIn, and you still have tons to learn; frankly, you always will. However, you are in a position where you can tackle tons of problems from beginning to end, and you have an extremely strong foundation. So with the rest of these items, I'm putting them in a rough order of importance, but I want to give you the freedom to rearrange them based on what interests you and what kind of data scientist you want to be.

Item number four is the other one of R or Python. You do need to be an expert at one of these two, but if you're an expert at both of
them, or at least you know enough of the other one to be dangerous, all of a sudden you are extremely employable. Let's face it, some companies just have R infrastructure set up and they're not switching to Python anytime soon. The other way around is true too: some companies have been using Python forever, and they just have no use in their ecosystem for R. There is evidence that R is declining a little bit while Python is growing, but since R is still preferred in academia and in research, and since some firms do have RStudio Server infrastructure set up, R and Python are still the two juggernauts of the data science world, and I don't see that changing anytime soon. So if you know both, the fact that some companies strictly use one or the other becomes irrelevant. You'll be able to match exactly what that company needs.

Item number five is linear algebra. Now, this is a pretty interesting one. If you work in any kind of research capacity, or even if you're just working in a firm that's developing brand new products, it might not be enough to rely on existing algorithms and methods. You have to innovatively create your own solutions, and in that case you need to get creative and come up with ideas that are rooted in strong mathematics and sound science. The field of mathematics that's going to support you the most in these endeavors is linear algebra, and just having a strong understanding of it is gonna have massive crossover benefits for your ability to use linear models and machine learning methods. This is another one of those things for which a formal education is ideal, but it's not a total requirement. One book that I do recommend is the No BS Guide to Linear Algebra. Like everything else, the link will be in the description.

Item number six is UX and design principles. Almost everybody would agree that communication is a major part of the data science puzzle. That means you can form a relationship with your client, but also that you can create
understandable visualizations, reports, just things like that, such that real-life humans can understand and absorb the information and data that you present. Everybody knows that there's a giant technical side of data science, but don't overlook the fact that there's a massive human side to it, which is just as important, if not more important. So what I would recommend is to read some books on the principles of user experience. I will also rope in here learning the principles of solid data visualization, so reading some books like The Visual Display of Quantitative Information by Edward Tufte or Show Me the Numbers by Stephen Few will pay enormous dividends for you as well.

Item number seven is everybody's favorite thing, and that is of course machine learning. I'm listing this category last because the importance of it is probably a little bit overstated compared to some other skills, and some companies will even have full-blown machine learning engineers, so data scientists will barely even get to dabble in the fun themselves. But that's probably the exception rather than the rule. You also need to have some background in some of the items I've listed so far before you even think about machine learning. That would include statistics, one of R or Python, and probably some linear algebra as well. But there's no question that more and more companies are beginning to incorporate machine learning into their products and services. So for absolute starters, I have to recommend the absolute classic machine learning course on Coursera by Andrew Ng. You'll implement your own algorithms, you'll learn some of the theory behind the most popular algorithms, and you'll just come away with a solid fundamental understanding of machine learning. Breaking this one down a little bit: you do need to understand what gradient descent is and how it works, and for some of the most popular algorithms, you really need to be able to explain in the simplest English possible what these algorithms are
actually doing. For unsupervised learning methods, that would be things like k-means clustering or hierarchical clustering; for supervised learning, that would be methods like k-nearest neighbors, decision trees, random forests, you get the idea. Now, if you get really good at these things, you might even be able to implement them from scratch. Luckily, that's already done in the packages we use in R and Python, so generally you don't have to do that. Where I personally think the biggest value is going to come from in all of this is being able to explain the intuition behind these methods and to attack any given problem in the smartest possible way, whether your goal is inference or prediction. For example, in the real world you'll always start by setting up your data set. Then you can choose the right machine learning methods based on things like the dimensionality and the class balance of your data. After that point, you can train and tune your models in ways that make sense, evaluate them based on appropriate metrics, and then iterate based on how much time you have to spend on the problem.

So that's your study pathway to become a successful data scientist. Item number one is statistics, item number two is SQL, and item number three is R or Python. These are the core fundamentals that you must know. After that point, you have a little bit of flexibility to figure out what flavor of data scientist you want to be. You can reorder these a little bit based on your own preferences, but what I would recommend next is item number four, learn the other one of R or Python; item number five is linear algebra; item number six is UX and design principles; and item number seven is machine learning. This is by no stretch of the imagination a comprehensive list. There are tons of things which I'm leaving out here, such as SAS or Julia, big data technologies like Spark or MapReduce, or things like cloud computing, and that's not to say that none of those things are
important. Everything is, and it's extremely important to have a continuous learning mindset. But having said that, this field is absolutely saturated with things that you're expected to learn, and nobody can know every single thing. I do think, though, that if you know these things, you are setting yourself up for success.

So thanks for watching this video. Let me know what you think in the comment section, and if you haven't already, smash the like button. I'll see you all next time. Until then, Richard on Data.
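
Item number seven says you need to understand what gradient descent is and how it works. The video does not show any code, so the following is only a minimal sketch of the idea, fitting a straight line by repeatedly stepping against the gradient of the mean squared error. The function name `gradient_descent`, the learning rate, the step count, and the toy data are all assumptions chosen for illustration, not anything from the video or the Coursera course.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = slope * x + intercept by minimizing mean squared error.

    Hypothetical helper for illustration; the defaults are arbitrary choices
    that happen to converge on the toy data below.
    """
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors of the current line on every point.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean squared error for each parameter.
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        # Move each parameter a small step against its gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data generated from y = 2x + 1 with no noise (an assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = gradient_descent(xs, ys)
print(round(slope, 3), round(intercept, 3))
```

On this noise-free data the fitted values should land very close to a slope of 2 and an intercept of 1. In practice you would lean on the packages the video mentions, such as scikit-learn, rather than writing the loop yourself; the point of the sketch is only the intuition of stepping downhill on the error.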
Info
Channel: RichardOnData
Views: 14,212
Rating: 4.9662447 out of 5
Keywords: data science study pathway, how to become a data scientist, statistics, statistics for data science, sql, r vs python, data science, data science with python, machine learning, what is data science, data science roadmap, data science training, introduction to data science, data science tutorial, study pathway for data science, data scientist, data science in 2020, data science sql, step by step action plan, learning data science, how to learn data science, deep learning
Id: ySI5P37Xf_k
Length: 16min 28sec (988 seconds)
Published: Mon Mar 23 2020