God Tier Data Engineering Roadmap (By a Google Data Engineer)

Video Statistics and Information

Captions
I'm just supposed to make sense of all this data, but it's all trash. Okay, wait, let me do something... wow, who are you? Well, I am... let's do that here. So this is all you do, just transform the data? Obviously we transform the data and publish it so that it is useful for downstream users. We also clean out the trash so that you never have to see it. The data you see on the table is really very small; we usually handle terabytes of data, making sure that our pipeline is highly scalable and performs all the time. Apart from that, we have to select a data model: either a star schema, a snowflake schema, or something in the middle. We also have to orchestrate the pipeline using a tool or a programming language and make sure it is completely configuration driven, with all the monitoring and logging capabilities as well. On top of that, we should be able to handle all of this for real-time as well as streaming data, also capturing the changes that happen in the data over time so that it is useful for data analysts and data scientists recognizing historical patterns and making predictions. And did I mention that we also integrate all the different data sources and make sure the pipeline performs super fast, even on the cloud? Wow, that is impressive. Do you know how I can learn data engineering? Luckily, this video is exactly for someone like you who wants to learn data engineering from scratch.

Hello everyone, this is Jash and I work as a data engineer at Google, and this video is all you need if you want to be a data engineer, learning everything from scratch. This is by far the most requested video I've had on this channel, whether I look at YouTube comments, LinkedIn comments, or LinkedIn direct messages, so let's jump right in. This video is not one of those paid collaborations where I tell you to get a course from some person and promise that after six months or a year you will definitely be a successful data engineer and get placed at FAANG, MAANG, or similar companies. No, this is for people who are genuinely interested in learning data engineering and transitioning into this super cool field. Before I get started, I'd just like to highlight that a lot of work has been put into this video, so if you find it useful, don't forget to leave a like and subscribe to the channel.

Think of this roadmap as a game where you have to pass through seven different levels to become the ultimate data engineer. Each level has multiple tasks you need to accomplish; the order of tasks within a particular level doesn't really matter, but the order of the levels definitely does. So let's begin with the prerequisites.

In the prerequisites section there are some languages and some concepts we need to learn. The first one is DBMS. Databases are one of the key concepts in the world of data engineering, and without them you simply cannot survive. The link here opens a TutorialsPoint article with different sections you can go through, like the overview, the architecture, data models, and data schemas. It covers everything from the ground up: the ER model, an SQL overview, the different types of joins, and how indexing and hashing are done. It also contains some advanced topics and, ultimately, some interview questions, so this link is super useful for learning DBMS concepts.

Moving on, we are talking about SQL now. When you click on the SQL link you are redirected to W3Schools, which has everything you need to learn about SQL, from the basics to more advanced topics like stored procedures and operators. You can see an example of how a right join is done, with a little explanation and a "Try it Yourself" button. When you click on it you can run the SQL directly in the browser, and on the right-hand side you can also see all the tables in the sample database, like Customers and Categories. You can click on a table to just do a SELECT * FROM Customers (I'll just undo that with Ctrl Z), and if I want to run this particular join I just click the Run SQL button and it gives me the output of the join. You can play around with it and understand the concept by changing the SQL query, so this is definitely a very good way to learn SQL.
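To give a feel for the kind of join you can practice there, here is a minimal, self-contained sketch using Python's built-in sqlite3 module. The Customers and Orders tables and their rows are made up for illustration (loosely modelled on the W3Schools sample database), and the direction is flipped to a LEFT JOIN because older SQLite versions do not support RIGHT JOIN.

```python
import sqlite3

# In-memory database with two tiny, made-up tables (stand-ins for the
# W3Schools sample data) so the join can be run without any setup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE Orders    (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);

    INSERT INTO Customers VALUES (1, 'Alfreds Futterkiste'), (2, 'Ana Trujillo');
    INSERT INTO Orders    VALUES (10308, 2), (10309, 2), (10310, 1);
""")

# LEFT JOIN: every customer appears, with their orders if any.
# (W3Schools demonstrates RIGHT JOIN; the direction is simply flipped
# here so it runs on older SQLite versions too.)
query = """
    SELECT c.CustomerName, o.OrderID
    FROM Customers AS c
    LEFT JOIN Orders AS o ON o.CustomerID = c.CustomerID
    ORDER BY c.CustomerName, o.OrderID
"""
for name, order_id in conn.execute(query):
    print(name, order_id)
```

Changing LEFT JOIN to INNER JOIN, or adding a WHERE clause, and rerunning is exactly the kind of experimentation the W3Schools editor encourages.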
After that, we come to the hands-on part. You might be wondering: didn't we already do hands-on work on W3Schools? This hands-on section is completely different: it's HackerRank, and it has the easy, medium, and hard SQL problems that are usually asked in interviews. When you click on it you'll be redirected to a page where you can filter in different ways, for example by intermediate SQL. Click on Solve Challenge and you'll see the question and the example output, and you can write your SQL query.

Most of the data engineering job requirements I have seen list Python over Scala, but don't worry: if you know one language, it's very easy to switch gears a little and learn the other. We will go through a lot of links and resources, and all of them are mentioned in the description, but as you go through them make sure you follow the levels in the order I mention in this video; don't just jump around between links randomly. If you're already good with DBMS, SQL, Python, and Linux, you can skip the prerequisites section.

When you open the Python link you'll be redirected to learnpython.org. It has everything from Hello World to more advanced concepts, and you can also learn the different data structures; for example, if I want to learn about lists, I can do that. Just like on W3Schools you can play around with the code right there to understand more Python concepts, so I would definitely recommend going through this website. Once you're done with that, go through the hands-on link. Similar to what we did for SQL, you'll be redirected to a HackerRank page where you can again sort by difficulty and work through the Python problems.
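As a flavor of the fundamentals those Python exercises drill, here is a short, self-contained snippet covering lists, dictionaries, and a comprehension; the order amounts and customer names are invented purely for illustration.

```python
# Invented order amounts, purely for illustration.
orders = [120.0, 75.5, 310.0, 42.25]

# List basics: indexing, slicing, aggregation.
print(orders[0])        # 120.0
print(orders[-2:])      # [310.0, 42.25]
print(sum(orders))      # 547.75

# A comprehension: keep only the "large" orders.
large_orders = [amount for amount in orders if amount > 100]
print(large_orders)     # [120.0, 310.0]

# Dictionary basics: counting orders per (made-up) customer.
customers = ["ana", "bob", "ana", "ana"]
order_counts = {}
for customer in customers:
    order_counts[customer] = order_counts.get(customer, 0) + 1
print(order_counts)     # {'ana': 3, 'bob': 1}
```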
Moving on, let's talk about Linux. The reason I have added Linux is not because we do a lot of shell scripting (though sometimes that is the case), but because knowing Linux commands is key to being any kind of engineer, not just a data engineer. For example, if you are doing anything on a virtual machine, you need to understand Ubuntu or Linux commands in order to install something, or to move and copy your files. For that you can go through the Linux Basics for Beginners link, which is a Udemy course. I'm not sure what the price will be by the time you see this video, but as you can see I took a snapshot when it was available at 449. If you think any Udemy course is too expensive, just wait for a sale; the prices definitely drop and you can get it then. There is no separate hands-on link for Linux because hands-on work is already included in the course.

We are done with the prerequisites, and we move on to the next level, Everything Data. This covers all the core data concepts. There are many data warehousing concepts, like data lakes, data marts, data modeling, and even slowly changing dimensions; these are just some of the keywords, but it's super important to understand what exactly a data warehouse is. For that you can go through the Everything Data section. If you open this course, you'll see it again redirects to a Udemy link; you can obviously read the description before purchasing and see if it works for you, but I definitely recommend this course for learning all the data warehousing concepts if you are a beginner. Once you are done with that course, you can go through the book, a very famous one by the Kimball Group: The Data Warehouse Toolkit. When you are done with both of these materials, you also have to do some self-study: research concepts like data lake, data mart, data fabric, data mesh, data catalog, and so on. A lot of these will already be covered in the course or the book I mentioned, but if something isn't, I definitely recommend learning at least the basics of it.

All right, we are already two levels down. The third level is distributed systems. The size of the data we generally deal with as data engineers ranges from gigabytes and terabytes up to petabytes, so we need to make sure we are not running our Python code and SQL queries on our local machine, because we would never get the output in time. Think of a distributed system as a cluster of computers working specifically on your job, dividing the data across different nodes, gathering the results, and correlating them back together before presenting them to you, so the time it takes to get the job done is significantly reduced. There are different distributed systems (Hadoop, Spark, Hive, and so on), but the first one you should learn right now is definitely Spark, because things like Hadoop and Hive are somewhat outdated now. Also, when you learn the basics of Spark you will see why it was invented and why Hadoop wasn't the best option, so while learning Spark you will pick up a little about Hadoop and Hive anyway. I took this course three years back, when I found out I would be working with the big data team at ZS and knew nothing about Spark: Taming Big Data with Apache Spark and Python. It has everything, from an introduction and Spark basics to DataFrames, Datasets, RDDs, and running Spark on a cluster, and it also has a lot of hands-on material, so I definitely recommend doing the hands-on along with the instructor. What he suggests is installing Spark on your local machine, and if you have never done that it can be a little frustrating, so instead I recommend going to Databricks and creating a Community Edition account. I have added a link for that too: clicking the hands-on link redirects you to the "Try Databricks for free" page. You can host a Databricks instance on AWS, Azure, or GCP and it is completely free. Obviously there are some limitations on the number of clusters, or the number of nodes in a cluster, that you can have, but it will have everything: Spark is already installed and you'll be able to run your Spark code and even SQL code. Whenever the course does a hands-on exercise, just do the same hands-on on the Databricks Community Edition.
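To show what that Spark code can look like, here is a minimal PySpark sketch, assuming the pyspark package is available; on Databricks Community Edition a SparkSession named spark already exists, so the builder line can be skipped there. The sales rows are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession named `spark` is already provided;
# locally (pip install pyspark) we create one ourselves.
spark = SparkSession.builder.appName("roadmap-demo").getOrCreate()

# Tiny, invented sales dataset.
sales = spark.createDataFrame(
    [("2022-10-01", "books", 120.0),
     ("2022-10-01", "games", 80.0),
     ("2022-10-02", "books", 200.0)],
    ["order_date", "category", "amount"],
)

# A typical distributed aggregation: revenue per category.
revenue = (sales.groupBy("category")
                .agg(F.sum("amount").alias("revenue"))
                .orderBy(F.desc("revenue")))

revenue.show()
spark.stop()
```

The same few lines run unchanged whether the cluster has one node or a hundred, which is exactly the point of moving off your laptop.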
Previously, distributed systems and all of our data used to sit on-premise on a particular server, but nowadays everything runs on the cloud, so it's super important to understand cloud concepts, and that's why we have the next level: pick any one of GCP, AWS, or Azure. For example, if you pick AWS, I have provided a link to the Ultimate AWS Certified Solutions Architect Associate course. Why this course? Even if you don't want to do a certification like Solutions Architect, it is very good for understanding the basics. There are also other courses that I am going to link down below; for example, for GCP we have the Certified Professional Data Engineer course from A Cloud Guru, which is specifically for data engineering professionals, or anyone who wants to learn data engineering on GCP. I would also recommend considering certifications like AWS Solutions Architect, AWS Big Data Specialty, GCP Professional Data Engineer, or Azure cloud fundamentals. Pick the cloud platform you like the most and do its certification; it has tremendous market value and will give you a lot of credibility when you are sitting in a job interview.

Coming to the next level, we have must-learn tools. I have divided the tools into different categories, for example orchestration, compute, and so on. For orchestration there is again a Udemy course you can take; it will give you complete hands-on work as well as an end-to-end understanding of what Airflow is and how it is used. It's basically an orchestrator that allows you to run your pipelines.
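To make the orchestration idea concrete, here is a minimal sketch of an Airflow DAG, assuming apache-airflow 2.x is installed; the DAG id, schedule, and task functions are made up for illustration and are not taken from the course.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task functions; a real pipeline would call out to
# Spark, Snowflake, cloud storage, and so on.
def extract():
    print("pulling raw data from the source system")

def transform():
    print("cleaning and modelling the data")

def load():
    print("publishing tables for downstream users")

# One DAG with three tasks wired extract -> transform -> load,
# scheduled to run once a day.
with DAG(
    dag_id="example_batch_pipeline",      # made-up name
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```

Dropping a file like this into the DAGs folder and running `airflow dags list` should show example_batch_pipeline; the same extract-transform-load shape scales up to the real pipelines discussed later.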
If we talk about compute, I have added two different links here: Databricks and Snowflake. Databricks is generally preferred when you have a Spark workload, and Snowflake when you have a SQL workload. For example, if I open Snowflake you'll be redirected to the Snowflake Zero to Hero masterclass; I took this course before I completed my SnowPro Core certification, so I definitely recommend it. Databricks substitutes can be AWS EMR or GCP Dataproc, and Snowflake substitutes can be AWS Redshift or GCP BigQuery. The reason I have mentioned only Databricks and Snowflake is that they can be hosted on any cloud, but if you want cloud-specific technologies, those are the alternatives you can go and learn as well.

Then let's talk about CI/CD. CI/CD is continuous integration and continuous deployment, which allows you to continuously update your code and deploy it without hampering the production system. Jenkins is a bit like Airflow, almost an orchestrator, but it is generally used for orchestrating CI/CD pipelines. SonarQube is a tool that lets you run checks: it ships with a lot of rules you can test your code against, and it also checks the security of your code for vulnerabilities. If I click on, let's say, Jenkins, you get a tutorial that is completely free; in fact it is their official documentation, so you cannot go wrong with it. If you want to learn SonarQube, that one is also completely free, because SonarQube is a fairly small tool and the documentation itself reads like a tutorial.

Then we come to streaming data pipelines. There are two types of data pipelines, and we are going to cover both of them in the next level, but in order to understand the streaming data pipeline it's very important that you get a basic idea of what Kafka is. For this I have again linked a course by Stephane Maarek. Just a personal preference: Stephane Maarek is my favorite instructor on Udemy, I have taken many courses by him, and he explains everything in the easiest way possible, so I definitely recommend his Apache Kafka Series to learn about Kafka.
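As a tiny preview of what the Kafka material builds toward, here is a minimal producer and consumer sketch, assuming the kafka-python package and a broker reachable at localhost:9092; the topic name and the event are made up.

```python
import json

from kafka import KafkaProducer, KafkaConsumer

# Assumes a broker is running locally; the topic name is made up.
BROKER = "localhost:9092"
TOPIC = "orders"

# Producer: serialize a small, invented event as JSON and send it.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send(TOPIC, {"order_id": 10308, "amount": 120.0})
producer.flush()

# Consumer: read events from the beginning of the topic and print them.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    consumer_timeout_ms=5000,   # stop polling after 5s of silence
)
for message in consumer:
    print(message.value)
```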
Then we have the container section. For those of you who don't know what a container is, it can be a little confusing at first, even to understand what it's for, but that's exactly why I have linked a YouTube video. It's completely free, it's a couple of hours long, and it will give you an idea of what Docker containers are and how to work with them.

At this point you know basically everything you need to know about data engineering, and you can officially go on and build data engineering pipelines; the next level lets you do just that. If you click on the batch processing project, it redirects you to the Data Engineering Project for Beginners, Batch Edition. It has everything: the sample data used, its data model, the architecture of the sample pipeline they have set up, and some sample code to run on AWS, so I definitely recommend it for setting up your batch processing pipeline. The second type of pipeline, which is a little more complex but still very widely used in the real world and often asked about in interviews, is real-time processing, and that's why we learned about Kafka in the previous section. This one is simply a YouTube playlist you can go through, completely free, covering how to use Kafka to ingest data from different sources like MySQL. I would suggest creating a free tier account on whichever cloud platform you prefer, be it AWS, GCP, or Azure, to implement your end-to-end projects. I'm not going to go into the debate of which cloud platform is better; I've worked on all three of them, and each has its pros and cons, its strengths and weaknesses. It really depends on where you see yourself working in the future, and if you are not sure, just pick one, because the good thing is that the cloud platforms are about eighty percent identical; the differences I'm talking about are no more than ten to twenty percent.

If you go through everything I've mentioned in this video, by the time you complete all your projects you will have become a really, really good data engineer. But if you want to become an even better, super advanced one, there are some advanced topics for you in the next level. This is like an extra level: when you complete a game, sometimes there's an epilogue mission that isn't really necessary to finish but is still a lot of fun, and this is exactly like that. This level contains additional, optional topics: you can learn about building, training, and deploying ML models; you can learn about data visualization with Tableau, Power BI, or Looker; or you can learn about different ETL and ELT tools. One thing that is used everywhere nowadays is Kubernetes. We already went through Docker, and if you are interested in Docker I definitely recommend going through Kubernetes as well; it is built on containers itself, but it has a distinctive set of features compared to plain Docker.

Our final goal should be to become an SME, a subject matter expert. Let's say you like real-time processing, or containerized applications, or batch processing: decide what you love the most and become a subject matter expert in that area, and that will give you the edge you need to stand out in the market. In the additional optional topics you would have noticed things like deploying models, and you might wonder whether that is really a data engineer's job; shouldn't a data scientist be doing it? I also mentioned visualizations, and you might say that's a data analyst's job, so why am I doing visualization as a data engineer? Well, it really depends on your role. If you work in a big organization, your role will be limited to pretty much exactly what a data engineer generally does, but if you work at a startup you might have to wear different hats: sometimes you might act as an ML engineer, or even as a BI engineer doing all the visualizations. That's why I've added those sections to the advanced topics, and I definitely recommend going through them, at least to get a basic idea of what these topics are, and you will never face problems in the world of data engineering.

I sincerely hope this video was useful to you. When I created this roadmap, I asked myself: if I wanted to become a data engineer from scratch today, what would I do differently compared to what I actually did, and what would be the best path from zero? That's how this roadmap came together. If you like it, leave a like and subscribe to the channel. Have you already gone through some of these sections? Are you already a pro at distributed systems but don't know much about what a data mesh is? Or have you already covered the prerequisites? Let's talk about it in the comment section. If you have any questions about this video, feel free to drop a comment; I reply to every one of them. And if you have any suggestions for my next video, drop a comment too. Thanks a lot again for watching, see you guys next time.
Info
Channel: Jash Radia
Views: 148,831
Keywords: how to become a data engineer, data engineer roadmap, data engineering roadmap, learn data engineering, data engineer guide, data engineering guide, how to learn data engineering, learn cloud, learn aws, learn gcp, learn snowflake, learn airflow, learn python, data engineering, google data engineer
Id: WgCavqDntlQ
Length: 20min 55sec (1255 seconds)
Published: Sat Oct 15 2022