Full Stack Data Science Roadmap 2023

Video Statistics and Information

Captions
In 2016 I was a data analyst, and everyone wanted to become a data analyst. In 2019 everyone wanted to become a data scientist. In 2020 data engineering was hot, in 2021 machine learning engineering took off, and in 2022 and the upcoming 2023 it will be full stack data science. Recently everyone has been talking about ChatGPT, Stable Diffusion, and DALL·E 2. Together with the advance of powerful machine learning Python libraries and end-to-end data science platforms, data scientist roles are becoming more and more high level and stretching across different parts of the data science pipeline. I think data science roles are becoming more full stack, especially now with the economic downturn and companies wanting to cut costs: more and more companies want someone who can do it all, basically implement a project from start to finish. So in today's video we talk about what full stack data science is, what the required skills for an end-to-end project are, and how to learn these skills in 2023 with a clear roadmap that we'll cover later in this video. This video is sponsored by JetBrains' Datalore, a collaborative data science platform for teams; more on them later.

So what is full stack data science? A full stack data scientist is someone who has a reasonable amount of skill and experience in all of the steps of a data science pipeline: from ideation, data collection and engineering, model development and deployment, to generating business insights and data storytelling. The full stack term was originally used for software developers who can do both front-end and back-end work: they can design the UI and make the app work on the front end, and also work on the gritty back-end architecture and algorithms. In data science it's slightly different, so I think it should actually be called "full pipeline" rather than "full stack". Being full stack doesn't mean that you have to do everything alone by yourself; it just means that you have the flexibility to take on different roles and contribute to different projects in many ways.

This diverse skill set is very valuable for companies because you can understand the bigger picture and contribute to many different projects. You may be a data engineer in one project while analyzing data and building models in another. You can also take on different roles in the same project, because the real-world application of a machine learning model can be quite complicated and involve many different skills; for example, this is a model life cycle within a bank. To guide you through this jungle that we call full stack data science, I'm going to break this down into eight main groups of skills: firstly, math, probability, and statistics; secondly, coding skills, which can be Python, SQL, R, and JavaScript; then databases and data engineering; machine learning and deep learning, which includes computer vision and NLP; computer science fundamentals, including software development and web development skills; cloud platforms, deployment, and APIs; and other skills such as business knowledge and communication. It's worth noting that what sets full stack data scientists apart is the software engineering and data engineering skills, because this is where your job can become more impactful: in most businesses you can build the database and actual applications to demonstrate the concepts and deploy them in the real world. This is what's sorely lacking in many companies who are trying to figure out what to do with their data.

I think before we learn anything, it's important to understand how the skills are actually used in a real-world use case. I don't know about you, but for me it's very hard to learn something without knowing its purpose (perhaps except for learning art, which I think is an end in and of itself). There are many data science and machine learning applications across industries, some of which we might already take for granted nowadays, such as our social media recommendation feeds. But let's take a somewhat high-stakes use case, such as the customer rating system in a financial institution to classify high-risk and low-risk customers. This use case, and many other use cases you encounter in businesses, all have a similar fundamental process.

First, you need to define the business problem, and for this we definitely need some business and domain knowledge. For a customer rating system, the business problem could be that the financial institution, for example a bank, cannot manually evaluate the risk of every customer because that would be extremely expensive and inefficient, so they need to somehow automate this process with machine learning to become more efficient and cut costs. So your job is to identify the problem and do a cost-benefit analysis on different solutions: for example, you might consider a 100% manual process as it is now, versus a 100% automated system, or a hybrid system where human analysts can still manually check some of the suspicious cases flagged by the machine learning model. With some domain knowledge, you might also have some expectation of which customer characteristics are more important for risk classification, and in which cases this automated system might fail.

The second step is to collect the necessary data from different databases, maybe from a transaction database or from a customer demographic database. If certain information is unavailable, then you might need to go out and collect the data yourself, for example through web scraping or using data APIs from a third party. Then you combine the data, engineer some useful features, and store them in a data set that you use for model training. This is also where your knowledge of databases and feature engineering comes in. There are many different types of databases based on the data structures, for example relational or non-relational, and how the data is stored in the database. When it comes to relational databases, the best-known language is SQL, or Structured Query Language, because SQL is the query language used by many popular relational database systems, including MySQL, PostgreSQL, and Microsoft SQL Server. For web scraping and APIs there are many Python packages you can use; I've also covered some of them in my other videos.

The next step in the pipeline is to use this data set to build a proof of concept, that is, the machine learning model that predicts the customer risk. This process will involve exploratory data analysis and coding skills to wrangle the data; you might also need to go back and forth to the previous step to engineer some extra features. Then you use your machine learning and deep learning skills, and some math and statistics knowledge, to select the right features and algorithm, train the final model, and test and evaluate the outcome of the model. Oh, and by the way, it's not likely that you'll work on this project alone, so you need to use Git to version control your code and collaborate with others.

The fourth step is to build an application and deploy the model to the real world. This is where the software engineering part comes in. If you work for a large company like a bank or an insurance company, it's not likely that you'll have to build the whole web application or deploy it from scratch as a standalone thing; rather, the final model would be integrated into the existing systems within your organization. However, if you do have to build and deploy the application from scratch yourself, it's useful to have some software development skills, for example knowing how to code a web app and use a cloud platform to deploy your application. That sounds a bit difficult, but luckily, if you're using Python for your project, there are many libraries available that make it super easy to create a web app; you might already know Python libraries like Django, Dash, Streamlit, and Panel. On the other hand, if you're using JavaScript, which is often the case when you're building a high-scale application that has to respond very fast to user interaction on the front end, like an e-commerce website, then Node.js on the backend with React.js on the front end, for example, is one of the popular stacks for this purpose. Then you might want to package your application with all its dependencies in a self-sufficient, portable container, which is why we might want to use a Docker container to make the app easier to deploy. Finally, you can deploy your app on a cloud platform such as Azure, Heroku, or Google Cloud Platform. This may sound like a lot if you're new to all this, and we'll definitely cover the deployment part in another video, so be sure to subscribe.

However, the good news is that there is a more beginner-friendly option, which is to use an online data science platform to create and publish a project; think of platforms such as Google Colab, Deepnote, or Datalore. And talking about Datalore from JetBrains, who is sponsoring this video today: it is a collaborative data science platform that helps data science and business teams collaborate and share insights. You can actually perform a complete data science pipeline here on Datalore, from querying data, EDA, model prototyping, and model training, to presenting results to stakeholders as reports or even creating data apps. In addition to that, you can collaborate in real time on the code with your friends and colleagues. A unique feature of Datalore compared to other online platforms is that in cases when you can't use cloud tools to work with data, your team can host a private version of Datalore Enterprise on AWS, Google Cloud Platform, or Azure, and also on premises. This ensures that the data doesn't leave the company's environment. For personal projects with non-sensitive data, you can use Datalore online with the Community or Professional plan. You can check out my gift code in the description below for free months of the Professional plan; with this plan you get access to a GPU and also 20 gigabytes of storage for your projects, which is a very solid deal.

Going back to our risk classification use case, the final step is to monitor the model performance and communicate insights. Again, you also need to use some of your business knowledge to generate and communicate insights, for example, in this case, how the different levels of risk appetite might impact the model performance. If the threshold is too low, we get way too many false positives, while if it's too high, we get too many false negatives. When talking to stakeholders, though, you might want to avoid those technical terms (they are only used between you and me); instead, you might want to explain this as how many missed cases and how many false alarms the model will generate given a certain risk threshold.

Okay, now that we know roughly what we need to learn and why, let's talk about the learning roadmap. There are generally two approaches to learning. The first is the breadth-first approach, like going top-down on the letter T: you first learn a little bit of everything before you go deeper into one or two topics. The second is the depth-first approach, meaning going bottom-up: you first learn in depth and master one thing, and expand your skill set later. Which approach applies to you depends on your situation. If you've already had experience in a few different roles in data science, you might already have a diverse skill set without going too deep into anything; then the first approach is probably the more logical one. However, you might already be an expert in some specific tool and know it very well, in which case working your way up and expanding your skills is the way to go.

To help you organize your learning, I have prepared a detailed roadmap here on Datalore to show you exactly what to learn and how to learn each of these topics, assuming that you're starting from scratch. We are trying to cover the horizontal part of the T as quickly as possible, so you can go more in depth and maybe specialize in one thing later. Now let me walk you through this roadmap.

My recommendation is to first learn some basic programming with Python, R, and SQL. SQL is probably the most beginner-friendly programming language because it has a lot of natural words in it, so you might want to start by learning how to query data from a SQL database and how to manipulate and transform the data. You can find some learning resources and examples on this page on Datalore; you have a dynamic notebook to practice with, so it's quite handy. Next, it's time to learn some more advanced data wrangling and EDA, and here is where Python and R come in. On this tab you can find learning resources and books for different levels of Python and R. You're going to start with either Python or R; I started with R earlier in my career and I love it, but I think Python nowadays is a bit more popular and more beginner friendly. The key thing here is to learn the basics of the language and keep practicing on actual data sets to get the hang of it.

The next step is to learn some data visualization. You can create many beautiful data visualizations and even interactive dashboards in Python and R, just like one of the examples here on my channel. In industry, many companies actually use proprietary BI software such as Power BI and Tableau to create dashboards, so I've also included some learning resources here for these tools. After having nailed some important basics, it's time to turn to some underlying topics such as math, probability, and statistics. This gives you the foundation you need to learn machine learning and deep learning. I've also made a few videos on my channel on how to learn math and statistics for data science, so feel free to check them out before you dive in; all the resources I mentioned in those videos are included in this tab. Now, for learning machine learning there are tons of resources online, but I've embedded here the essential resources and tips. For deep learning, which includes computer vision, NLP, and reinforcement learning, it's actually optional whether you want to learn those topics; it totally depends on your interests, the type of projects you're working on, and the data you're working with. For example, in my job I work in the tax department, so we have quite a few NLP projects because we often work with a lot of text data, but computer vision projects are very hard to come by.

For the rest of the skills, I think the best way to learn is to actually get your hands dirty and build an end-to-end project. Through doing this you can practice using data APIs, writing production-ready code, using Git version control for your project, organizing your project, and learning how to deploy your model on the cloud. For end-to-end projects, I'd encourage you not to just download data from Kaggle, but to actually collect the data yourself, for example through web scraping or through data APIs. You can also try to combine different data sets in a clever way: for example, if you have a geospatial data set and another one that is a text data set with some location data, then you can combine those two data sets into an interesting data science use case. I have a few examples on my channel about how I use data APIs and web scraping to collect data. Also, don't forget to take inspiration from articles and online tutorials and use them as a starting point for your projects. And the best tip, I think, is to learn to read documentation, and also to use Stack Overflow or maybe give ChatGPT a try.

Overall, this is a very rough roadmap, and it's obviously not possible to list out and learn all the available tools out there. Also, the exact tools that you use in your job might differ very widely from one job to another and from company to company. So don't sweat too much about the exact tools, libraries, or cloud platforms that you need to learn; rather, learn the basics first and learn one of the tools, and then you can transfer the skill to another tool very easily later on. Again, you can find this roadmap in the description below. If you want to contribute and collaborate with me on building this roadmap, please send me a private email to become an editor.

I believe that learning takes dedication, patience, and time, and most importantly, don't lose sight of your purpose and know that you are not alone. If you are curious about how I self-study anything as a data scientist, feel free to check out this video on my channel. In an upcoming video I'm collaborating with Peter Akkies, a productivity YouTuber, to talk about how to build a goal-setting system for learning and how to stay on track, so don't forget to subscribe if you want to learn more about it. Thank you for watching, bye!
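To make the data collection and database step described above a bit more concrete, here is a minimal sketch of querying and joining customer data with SQL, using Python's built-in sqlite3 module. The table names, columns, and values are invented purely for illustration; a real bank would use a production database system like the ones mentioned in the video.

```python
import sqlite3

# In-memory database standing in for the bank's customer and transaction stores
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, age INTEGER, country TEXT);
    CREATE TABLE transactions (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 34, 'NL'), (2, 51, 'DE');
    INSERT INTO transactions VALUES (1, 120.0), (1, 80.0), (2, 4000.0);
""")

# Join the two tables and engineer a simple feature: total spend per customer,
# which could later feed into the model training data set
rows = conn.execute("""
    SELECT c.id, c.age, SUM(t.amount) AS total_spend
    FROM customers c
    JOIN transactions t ON t.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

print(rows)  # [(1, 34, 200.0), (2, 51, 4000.0)]
```

The same SELECT/JOIN/GROUP BY pattern transfers directly to MySQL, PostgreSQL, or Microsoft SQL Server, which is why learning SQL on any one system carries over to the others.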
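The threshold tradeoff mentioned in the monitoring step can also be sketched in a few lines of plain Python: lowering the risk threshold trades missed cases (false negatives) for false alarms (false positives), and vice versa. The scores and labels below are made up for illustration only.

```python
# Hypothetical model risk scores and true labels (1 = actually high risk)
scores = [0.10, 0.35, 0.40, 0.62, 0.80, 0.95]
labels = [0,    0,    1,    0,    1,    1]

def confusion(threshold):
    """Count false alarms (false positives) and missed cases (false negatives)."""
    false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    missed_cases = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return false_alarms, missed_cases

# A low threshold flags almost everyone: many false alarms, no missed cases
print(confusion(0.2))   # (2, 0)
# A high threshold flags almost no one: no false alarms, more missed cases
print(confusion(0.9))   # (0, 2)
```

Presenting a small table like this, of false alarms versus missed cases at a few candidate thresholds, is exactly the kind of non-technical framing the video suggests for stakeholder communication.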
Info
Channel: Thu Vu data analytics
Views: 133,945
Keywords: data analytics, data science, python, data, tableau, bi, programming, technology, coding, data visualization, python tutorial, data analyst, data scientist, data analysis, power bi, python data anlysis, data nerd, big data, learn to code, business intelligence, how to use r, r data analysis, vscode
Id: QnGotm29cZE
Length: 16min 30sec (990 seconds)
Published: Wed Dec 21 2022