How I Would Learn Data Science in 2022

Video Statistics and Information

Captions
Welcome back to another Recall by Dataiku video. In this video, I'm going to walk you through how I would learn data science in 2022. You've probably already seen a couple of other videos on this topic, but what I'm going to focus on here is a very practical guide, because in my experience the hardest part about learning data science is not figuring out what to learn but rather how to learn effectively and how to not give up. Data science is hard: it's an interdisciplinary field that involves coding, math and stats, and business/product sense.

First, I'm going to outline the topics to cover and my step-by-step approach, a kind of framework for how to learn. Then I'll go through the approximate timeline for each topic and some recommended resources, ending with where I see data science heading and how you should adjust your learning plan to suit it. So do stick around to the end of the video, because data science is a rapidly changing field and I think it's important to understand the landscape if you're genuinely interested in getting into the field. Throughout the video, I'll also be pointing out where I very intentionally built in accountability, which is basically how to maximize your chances of not giving up. At least for me, I don't have the strongest willpower and I tend to give up easily, so if you can relate to this, maybe these tips and checkpoints will help you as well.

Okay, topics to cover. There's programming, stats, data visualization, exploratory data analysis (EDA), machine learning, data scraping, APIs, databases, deployment, and specific niches like NLP and computer vision. Don't worry, I'm just listing these here for now; we're going to go through each of them later in the video and talk about why each topic is relevant and why I recommend learning them in this specific order.

But first I want to share with you what is called meta-learning, or how to learn. The general approach I recommend is what
is called a breadth-first approach centered around project-based learning. What I mean by this is: take the topics I listed before. A breadth-first approach means that you cover just the minimum amount of theory for each topic before doing a project around it. Then you learn more about the topics and do a more complex project, and you do this over and over, each time learning more about each topic and expanding your skills. This is as opposed to a depth-first approach, where you would attempt to learn every single thing about a topic, then move on to the next topic and again try to learn every single thing about that, and only after you have learned each topic thoroughly would you try to do a project.

I recommend this breadth-first approach centered around project-based learning for three major reasons. The first reason is that technical subjects like coding, math, and stats are really different in theory and in practice. If you've tried to learn how to code before, you may have experienced something like this: you do a course on coding and you're like, okay, makes sense, I know how to code now. Then when you actually sit down to code something yourself, you're like, uh, where do I even start? The reason is that implementation is really a separate beast, and the whole point of learning to code and learning data science is that you can actually implement things and do cool projects. So you do want to know how to implement.

The second reason is that if you try to deeply learn each subject in turn, you will be there learning until the end of days. Each subject, whether coding, stats, or machine learning, is huge, and you can really go down the rabbit hole and find yourself super overwhelmed, not knowing what is actually relevant and important. At some point you're probably going to give up before you've even started to use the things that you learned. Trust me, I know this
from experience. And finally, another plug for project-based learning: studies have actually shown that project-based learning is the best form of learning, because by doing things and figuring things out yourself, you're more deeply encoding that information into your brain and are more likely to retain it, as opposed to passively absorbing information, for example by just watching someone else code. So yes: breadth-first approach centered around project-based learning. Hopefully I have convinced you.

All right, let's now go through the topics. In my opinion, you should start with coding. The reason I recommend coding first is that it's a lot more motivating, at least for me, to be able to see the results of the things I do, as opposed to starting with more theoretical topics like math and stats. Those are of course extremely important and you'll certainly get to them later, but I find them more abstract and less engaging, aka easier to get bored with and give up on.

For choice of language, I would recommend starting with Python, because it's a general-purpose language that is super simple to understand, has great documentation, and has great libraries for data science, including machine learning. So what should you learn for coding? You should know the basics, including how to declare variables, functions, loops, and if statements. Then you should get familiar with two specific data science modules: pandas and NumPy. pandas is built on top of NumPy and is like the data science module, where you can manipulate your data sets and feed them into other, more specialized libraries, for data visualization and machine learning for example.

After you learn the basics of coding, I next recommend learning or brushing up on your stats. I'm not talking about crazy stuff here; we're talking about high school to first-year university stats: mean, median, mode, standard deviation, distributions, the central
limit theorem, confidence intervals, things like that. This comes in really handy when you're trying to understand the nature of your data set. What's really cool is that because you know how to code now, you can actually implement the stats on your data sets, which again I think is a lot more fun because you can see the results of what you do.

Next up is visualization. There are a lot of different visualization modules out there, but honestly, if you learn one of them, the rest are kind of just variations with different functionality. I personally like seaborn because it's really intuitive to use and the graphs are automatically really pretty as well.

At this point you should know the basics of coding, stats, and visualization, and you're ready for your first project, which is some exploratory data analysis, or EDA. EDA is just a fancy way of saying exploring your data set and familiarizing yourself with it by seeing if there are any trends, patterns, correlations between variables, and so on. With the basics in coding, stats, and visualization, you're now well equipped to do EDA by taking a data set, playing around with it a bit, doing some stats like finding the mean and distribution of variables, and making some visualizations.

Okay, let's talk about timelines to get to this point. I would say coding should take you about one to two weeks at four hours per day. Stats should take you again one to two weeks, maybe a little longer, depending on how much stats you remember. And visualization should take you only about one or two hours to a day to get the hang of. Now you might be thinking, this is probably a lot longer than I thought, and that's okay, because remember: breadth-first approach centered around project-based learning. You don't have to know everything, just enough of the basics that you can start doing a project, which will help you learn even faster.

So what exactly should you do for your first project? Well, let me let you in on a secret: so much of data science
is learning from other data scientists and working on top of what others have built. I find that the best project to start with when you're new to the field is to take someone else's project and work through it. For example, you can start with the famous Titanic data set on Kaggle and pick one of the highly rated notebooks. Then, if you're feeling daring, you can add something onto it and take it a step further. A word of warning here: of course, don't just go and copy code. That clearly will not help you learn. But if you understand what each line of code is doing and the rationale behind it, you'll gain an understanding of how to approach a project, and the next time you do another project you will know how to approach it. Honestly, even now, when I want to learn something that I'm not super familiar with, I find the fastest way to learn is to start by working through a project that someone else has done and then applying it to my own project later.

By now, after working through a Kaggle notebook or two, you'll probably notice that many Kaggle notebooks jump into machine learning after some initial exploration of the data. For example, some exploratory data analysis may show that the likelihood of survival if you're male is far lower than if you're female, and that your class also has to do with survival. Then the question becomes: can you predict survival? The answer is yes, with machine learning.

So now it's time to learn about machine learning. There are around 10 to 15 common machine learning algorithms, and there are a lot of ways of classifying them. One example is dividing them into supervised learning, unsupervised learning, and reinforcement learning. I recommend intuitively understanding how the algorithms work without worrying too much about the exact math behind them. For example, linear regression is the simplest machine learning model, and intuitively how it works is that it tries to draw a straight line that minimizes the distance between each data point and
that line, and the model is the line you drew, which can predict, for example, the probability of survival on the Titanic given an age. The good news is that most machine learning algorithms are actually quite intuitive and not super difficult to understand. To learn the basics of the common machine learning algorithms, I would say it should take you about three to four weeks, again assuming four hours per day. Definitely feel free to go deeper into the math if you are interested; however, depending on your math proficiency, you may need to refresh your calculus and go deeper into statistics.

Okay, cool. Now you can continue working through the notebook of someone else's project and trying out the different machine learning algorithms. It's also super useful here to understand the notebook author's reasons for the data pre-processing that's being done, why certain machine learning algorithms are chosen, their pros and cons, and how to optimize the models. These are super practical things that are extremely important to doing machine learning, so be sure to really understand the reasoning behind the choices that are being made.

Now we've covered things up to machine learning, and next up is data scraping and APIs. This comes into play when you graduate from using pre-built data sets, especially if you want to do your own project. It's actually really rare that you'll find a nice data set laid out for you already. The more likely situation is that you have to scrape the data yourself from websites or get it through APIs, which stands for application programming interfaces. For scraping data, a module I would recommend checking out is Beautiful Soup: very useful, and quite cute and whimsical too, if I do say so myself. It shouldn't take you more than a couple of days to a week to have a good grasp of it.

As for APIs, or application programming interfaces, they are software built by other people that you can use to get access to data, among other functions. But what is
relevant here is that you can get data using APIs. Learning how to use an API may take you some time, because it involves understanding how to use other people's software, and this really depends on how well documented the API is. Reading documentation is in itself a skill, both in understanding how to read documentation and in developing the patience to read it. Again, remember the approach that I guess I've already beaten into you at this point: breadth-first approach, project-based learning. Learn the minimum and do the project.

Next up: databases. What to learn here is the different types of databases, like relational databases, NoSQL databases, cloud databases, and so on. A language that you may especially want to pick up here is SQL. It's a much easier language to learn than Python and shouldn't take you more than a week or two to learn well. Pro tip: if you're interested in getting a job as a data scientist, data analyst, or data engineer, almost all companies will ask you SQL questions as part of the interview process. In my opinion, the minimum to learn here is relational databases and the language behind them, which is SQL, especially if you're primarily learning data science to get a job. The timeline here is two weeks for the basics. For database projects, I recommend downloading some data sets, from Kaggle for example, and then importing that data into your own database. This teaches you how to create a database, create tables inside the database, and manipulate the data.

Okay, we're almost done. The next two topics, deployment and specific niches, I consider more advanced. Deployment comes into play when you want to take the machine learning model you developed and put it into a live environment instead of just having it in a notebook. You can deploy the model across different code environments and also integrate it into other software.

Then, if you're interested in a
specific field of data science, you can also explore niches like natural language processing, or NLP, which has to do with developing algorithms that understand human languages, known as natural languages. It's a very cool interdisciplinary field. There are also niches like computer vision, which has applications in self-driving cars, for example. It's kind of hard for me to give you a timeline for these niches, because theoretically you can easily do a project in NLP, for example, with the skills you've learned so far by employing modules that other people have developed, which abstract away a lot of the underlying concepts, and this will take you a few hours to a few days to learn. But if you're interested in these niche topics, I would assume you'd also want to understand more of the theory behind them, and there are people who have PhDs in these fields. So the timeline really depends on how far you want to go.

Now let's talk about some recommended resources. I personally prefer interactive interfaces for learning to code, like freeCodeCamp for example, because you can see what it is that you're coding. For basic statistics and the theory and math behind machine learning algorithms, the top resources I would recommend are StatQuest by Josh Starmer and Dataiku's own guides, both of which are free. For projects to follow, I already mentioned it before, but Kaggle is great because there are notebooks where you can see how people approach projects from different perspectives. A great free resource for learning SQL is Mode, which is what I personally used to learn SQL from scratch and pass my own data science interviews. To learn more about databases in general, there are also great MOOCs available. Finally, for deployment and more niche topics, I would personally go with highly rated courses from MOOCs and again rely heavily on working on my own projects, because at this point you should already be quite proficient in the basics, so it's more about building on top of them and doing specific
projects that interest you. Honestly, there are so many amazing free and low-cost options for learning data science out there, and I just listed a few that I personally used and liked. My preference is to choose resources that are interactive and already have projects built into them. I get it, though, if you prefer learning from online video courses or books, and that's totally fine. My only recommendation is that you should also intentionally work through projects so you can learn to implement. In summary, if you want the simplest possible guide for how to choose a good resource, and you're willing to spend a little money, you just cannot go wrong with choosing a highly rated course on the topic on a MOOC platform. There are many, many courses covering each of the topics we discussed.

Now, finally, let's talk a little bit about how the landscape of data science is progressing. As the data science field develops and becomes more mature, a lot of repetitive tasks in data science, like data cleaning, pre-processing, exploratory data analysis, machine learning, and even deployment, are becoming automated. In fact, Dataiku does just this. Dataiku is a platform for Everyday AI that systemizes the use of data for business results. By using Dataiku, you're able to create, share, and reuse applications that leverage data and machine learning to extend and automate decision making. Dataiku also allows you to scale AI safely and effectively and deliver advanced analytics using the latest techniques at big data scale. Dataiku is really powerful, and you should check out more about the platform if you're interested; link in the description below.

But wait a second. If you've been paying attention, you are probably thinking right now: why should I learn all the things we just talked about if they're becoming automated? There are actually still very good reasons to do so. First, it's still important for you to understand how things work, so you can understand how to
apply analyses and algorithms to specific use cases and learn how to best leverage the tools available. After all, they are still tools, even if they're automating things, and we need to make sure that they're doing what they're supposed to be doing.

Another implication of so much of the data science and machine learning pipeline being automated is that it's become more and more important for a data scientist to have domain knowledge, which, by the way, is the third pillar of data science that we haven't really discussed until now. Domain knowledge, or business/product sense, is just as important as the coding and the stats. Coding, stats, machine learning algorithms, and all the other technical stuff are only as valuable as the value they can provide to the company. So even if you make the fanciest and best algorithm ever, honestly nobody will care if it doesn't provide value to the company. It's very much the data scientist's job to understand the business reason for doing an analysis or building a model, to make sure that what they're doing has real impact in the organization. It is also the data scientist's job to communicate the value of what they're doing and make sure what they do is actually going to be used.

For those of you who are not in industry, you might think that this is kind of weird, right? It's like, of course it has value and impact. But believe me, in practice it's really crucial and it's not a given, because if decision makers don't understand why your analysis or your model is useful, they won't want to use it. And even if you make the best model ever, one that could have a lot of impact, your effort will be for nothing if it's not being used.

So to summarize this section: since many repetitive data science tasks are being automated, it's important to, one, understand how the algorithms work and make sure that they're functioning properly in your given context, and two, focus on gaining domain knowledge and
learn how to communicate and present your findings and the impact of your work in the business context.

All right, that's all I have for you today. I've linked all the resources I've talked about in the description below. Do also share your thoughts on this guide on how to learn data science. I will see you guys in the next video.
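
Example sketch (not from the video): two of the ideas described in the captions, the basic descriptive statistics (mean, median, mode, standard deviation) and the straight-line intuition behind linear regression, can be tried in a few lines of Python. This is only a minimal illustration under stated assumptions. The numbers are made up and are not the Titanic data. The helper names summarize and fit_line are hypothetical. It uses Python's standard statistics module rather than pandas or NumPy so that it runs without extra installs. The line is fitted by ordinary least squares, which minimizes the sum of squared vertical distances; that is one common reading of "minimizes the distance between each data point and that line".

```python
import statistics

# Hypothetical toy data, made up for illustration (NOT a real data set).
ages = [22, 38, 26, 35, 35, 54, 2, 27]
fares = [7.0, 71.0, 8.0, 53.0, 8.0, 52.0, 21.0, 11.0]


def summarize(values):
    """Return the basic descriptive statistics named in the captions."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


age_summary = summarize(ages)
slope, intercept = fit_line(ages, fares)

print(age_summary)
print(f"fare is roughly {slope:.2f} * age + {intercept:.2f}")
```

With pandas, the same summary numbers are available through Series methods such as mean, median, mode, and std, so writing them by hand as above is mainly useful for seeing what those methods compute.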
Info
Channel: Recall by Dataiku
Views: 176,622
Id: Rt6eb9VOFII
Length: 17min 34sec (1054 seconds)
Published: Wed Feb 09 2022