Data engineering is one of the fastest-growing fields in tech. As more and more data is generated, companies need people who can manage and process all of it at scale. Data engineers are often paid more than software engineers, and everyone wants to get into this field. However, it is confusing; there are so many different tools on the market. Just look at the big data landscape: with so many tools available, it's hard to know where to start, and even if you do get started, you might get lost along the way. This video is your ultimate guide: I will give you a clear and concise roadmap to becoming a data engineer, and on top of that, I will provide you with eight different end-to-end data engineering projects.
So, this is not just a theoretical roadmap; you will gain practical experience if you follow
the steps that I will give you in this video. I have been working in this field for
the last five years. I started in a full-time job, then became a full-time freelancer. I work with a lot of startups and also with big companies like Ware, so I understand how
data engineering practices are being performed in smaller companies as well as in large
companies. Therefore, I will give you a clear understanding of how these different tech stacks
are being used in all these different companies, so that you will have a better understanding
of what to focus on and what not to. Before we talk about different tools and the skill
set to become a data engineer, I want to mainly talk about having the right mindset. It's not just
about data engineering, but even if you decide to learn anything, you need to have a positive
mindset that you can become something — you can become a data engineer, data scientist, or
whatever your goal is. If you're watching this, that means you want to become a data engineer.
So, the first thing that I suggest is to have a positive mindset that you can become a data
engineer. Do not let negative thoughts, such as thinking you are not capable or smart enough, stop you. Have a positive mindset. The second thing is that you need to be fully focused while you execute this roadmap. We live in a world of distractions: social media, your phone, gaming, and everything else. So identify your distractions and, for the next six to eight months, remove all of them and just focus on executing this roadmap.
Believe me, if you stay fully focused and remove all distractions, then no one can stop you from
becoming a data engineer in the next six months. To help you in this process, I will give
you a challenge at the end of this video, where I will provide you with a quick
guide on how you can stay consistent, learn in public, and also grow your network.
So once you acquire all the skill sets, you will also have different
opportunities available to you. I highly suggest you watch this video from start to finish. Get a pen and paper and start taking notes so that you understand and
remember all these things. And before we start, I would appreciate it if you could hit the
like button on this video. That helps this channel to grow and also keeps me motivated
to make more videos. And if you are new here, then don't forget to hit the
subscribe button. Let's get started. So here's the situation: you might be at different
stages of your journey: you might be completely new, or you might know a few tools but want to know where to go next. The first thing I always suggest to people who are just getting started and do not have a
technical background is to clear their computer science fundamentals. This is the core — the bread
and butter of everything we do on the Internet. Understanding and having strong computer science
fundamentals will help you in the long run. I'm not telling you to go to college and get a degree. All I'm saying is to get an understanding of the basics of computer science, such as how code is compiled and executed, the basics of data structures and algorithms, and the building blocks of programming languages: loops, conditional statements, variables, and so on. For this, one of the best resources available, completely free on YouTube, is provided by Harvard
University, called CS50. If you go on YouTube, you will find this playlist. This playlist
has everything you need to clear your basic computer science fundamentals. All I recommend is to watch the first five videos; they will give you enough grounding. If you spend just two to three hours daily, you can finish these five videos within a week. To give you a first taste of those programming building blocks, here is a tiny sketch below.
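This is just my own minimal illustration (not from the CS50 course) of what variables, conditionals, and loops look like in Python:

```python
# Variables hold values
price = 999.99
quantity = 3
total = price * quantity

# A conditional statement branches on a condition
if total > 1000:
    print("big order:", total)
else:
    print("small order:", total)

# A loop repeats work
for i in range(quantity):
    print("processing item", i + 1)
```

If these few lines already make sense to you, you have the building blocks; CS50 will fill in the rest.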
Once you clear your basic computer science fundamentals, you need to take one step forward and
work on your foundational data engineering skill set. There are two skill sets that you
need to focus on, and you might already know them: the first is a programming language, and the second is SQL, Structured Query Language. Now, people used to say you can learn Java,
Scala, or Python, but these days companies mainly prefer Python for data engineers. So you
can just learn Python and get started with your data engineering career, so that you don't get
confused between multiple languages. The reason to learn a programming language is that you will be automating workflows, writing transformation jobs, and deploying data pipelines, so you need a basic understanding of how to
do all of these things programmatically. The same goes for SQL, Structured Query Language. Most data is stored in databases, and SQL is how we communicate with them. So, if you want to insert, retrieve, update, or delete records, you can easily do that using SQL.
SQL has become the universal data language, so no matter which database you use,
you will be writing SQL code there. Learning Python and SQL is non-negotiable.
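To make that concrete, here's a minimal sketch of my own (using Python's built-in sqlite3 module and a hypothetical orders table) showing the kind of SQL you'll be writing:

```python
import sqlite3

# Hypothetical example: a tiny "orders" table, queried with plain SQL
conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")

# Insert records
conn.execute("INSERT INTO orders (product, amount) VALUES (?, ?)", ("laptop", 999.99))
conn.execute("INSERT INTO orders (product, amount) VALUES (?, ?)", ("mouse", 24.50))

# Update a record
conn.execute("UPDATE orders SET amount = 19.99 WHERE product = 'mouse'")

# Retrieve records with an aggregate query
for row in conn.execute("SELECT product, SUM(amount) FROM orders GROUP BY product"):
    print(row)

# Delete a record
conn.execute("DELETE FROM orders WHERE product = 'laptop'")
conn.close()
```

The same SQL statements work, with minor dialect differences, on Postgres, MySQL, Snowflake, and almost every other database you will touch as a data engineer.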
Now here's the good news: when I was learning about all these different things,
I had to refer to multiple blogs, videos, and courses to understand how these things are
performed at the data engineering level. To solve this problem, and if you have been following me you will know this, I have dedicated courses for Python and SQL for data engineering. These courses are specially tailored for data engineers. So, if you want to understand how Python is used from a data engineering point of view, or how SQL is used by data engineers, I have created dedicated courses for exactly that. These courses have been taken by more than 5,000 people, and they love them.
The way I break down the complex topics and make you understand all of these different
things in a simple manner using my real-world examples will make you fall in love with
the process of learning data engineering. These courses are completely hands-on, and
there are some amazing projects like Spotify Data Pipeline and many other projects. So,
the only resource that I will suggest you for the core foundation of data engineering is
my Python and SQL for Data Engineering courses. I will also add some free resources, so if you don't want to take these courses, you can follow the free path instead. Again, I spent around two to three months preparing these courses, so I encourage you to at least check them out: Python and SQL for Data Engineering. I'm also running a special discount on these courses; you can find
all of this information in the description. Doing this much will give you a strong foundation
to start your journey as a data engineer. Now, what you need to do is focus on the tools and skills that are in high demand. There are hundreds of tools on the market; we only want to focus on the in-demand ones that can lead to a job. So now, I'm going to walk you through the skill sets you need to acquire to build your core data engineering skill set. Here's the approach I suggest: one part is learning the tools, but you also need to understand the core foundations of data engineering and why we perform it in the first place.
The learning approach here is a little different, so pay attention now so that you understand the entire process clearly. Here's the thing: when you start watching videos, you will get bored after an hour or two, and once you're bored, you might jump to something else, like random YouTube videos or scrolling through Instagram reels. To avoid that boredom, you need to replace those activities with another kind of learning material. This is what I'm going to suggest: we
will be doing two things at the same time. One, I will recommend you a book so that
you can read that in the background, and also you can do a course to learn
your core data engineering skill set. Let's say you decide to spend two hours daily learning about data engineering. You can spend around one to one and a half hours learning from courses or video material, and the remaining time reading a book. This way, you will learn the highly demanded tools in the market by watching courses while also gaining theoretical knowledge of the core foundations of data engineering. The book that I recommend you read is
"The Fundamentals of Data Engineering". I have the hard copy; you can buy the hard
copy, or you can also get the ebook from the internet. This is one of the best books
available on data engineering foundations, and I highly recommend you read it during this journey of becoming a data engineer, because most people learn the tools and technologies but have very weak fundamentals. Fundamentals don't change for 10 to 15 years; that's why this book will give you strong foundations in data engineering. I know it might sound a little confusing, but the way it works is that you read this book in the background whenever you're free, because you can't watch videos all the time, but you can always read a book. You can read a paragraph, a page, or an entire chapter whenever you get time. This way, you will stay focused, and you
will enjoy the process of learning. Now, while you are reading this book, I also recommend you do a course to learn a highly demanded tool in the market. The next core data engineering skill set I recommend is the data warehouse. Everything you do as a data engineer will eventually get stored in the data warehouse, and this is where businesses generally start extracting value from their data. If you want to find out, say, the last five years of revenue, or how many products you sold this year compared to last year, you can find answers to all of these questions in the data warehouse
because they are built for analytical queries. So again, learning about data warehouses has two parts: one is learning the foundations, and the second is learning a highly demanded tool in the market. The foundations of data warehousing include understanding OLAP and OLTP systems; extract, transform, load (ETL); and ER modeling and dimensional modeling, including fact and dimension tables. A minimal star-schema sketch follows below.
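This is my own hedged illustration (not taken from the Kimball book), again using Python's sqlite3 with hypothetical sales tables, of what a fact table and a dimension table look like:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table: descriptive attributes about products (hypothetical example)
conn.execute("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name TEXT,
        category TEXT
    )
""")

# Fact table: one row per sale, referencing the dimension by key
conn.execute("""
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        sale_date TEXT,
        revenue REAL
    )
""")

conn.execute("INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, '2024-01-15', 999.99)")

# A typical analytical (OLAP-style) query: revenue by category
for row in conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.category
"""):
    print(row)
```

Facts hold the measurable numbers; dimensions hold the context you slice them by. That one idea carries most of dimensional modeling.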
called "The Data Warehouse Toolkit" by Kimball, but you don't really have to read this book,
and I will tell you why in a while. After clearing your data warehouse fundamentals,
then you can learn about one tool where you can practice all of this foundational knowledge.
There are many data warehouse tools available, such as Snowflake, BigQuery, and Amazon Redshift. My recommendation is to learn Snowflake, because many companies are moving from traditional data warehouses to it. Snowflake is a modern cloud data warehouse, so I highly recommend you add it to your skill set. Now, where do you learn all of this? Again, I have
created a detailed course — one of the most in-depth courses you will find in the market,
especially designed for data engineers. I have already drawn on the book I mentioned, "The Data Warehouse Toolkit", to create this course, so you don't have to read it yourself. Just complete the course and you will understand the core fundamentals of data warehousing and how to build a modern data warehouse on Snowflake. It took me two to three months to prepare this course; I referred to multiple blogs, courses, and books to bring everything together in one place, so I highly encourage you to at least check it out. I have put all of my hard work into these courses. Now picture this:
you're reading "Fundamentals of Data Engineering" in the background, so whenever you're free you can read a page or a paragraph, and at the same time you are dedicatedly focusing on
learning the core data engineering toolset from different courses. This will start feeling
like magic as you go forward in this process. Once you finish learning about the data warehouse,
then the next thing you need to focus on is data processing. Data processing is the core of data engineering, because we get data from multiple sources, such as relational databases, web analytics, and sensors, and all of it arrives in multiple formats. What you really need to do is write logic that brings all of this data into one place in a structured format. Because the data arrives in different formats and at different frequencies, you need a proper tool to process it. Data generally gets processed in two ways: one is batch processing, where you take a chunk of data and process it daily or weekly as required; the second is real-time data streaming, as you see on Google Maps or Amazon, where updates arrive in real time, so as soon as data comes in, you need to process it and pass it forward. One tool for batch, or even real-time
data processing, is Apache Spark, one of the most highly demanded tools available
in the market. It is used by big organizations like Google, Microsoft, and many more. So you
can learn it the same way: first learn the foundations of Apache Spark, such as its core architecture, its higher-level APIs, and the different functions available in it, and then learn a platform that runs Apache Spark for you. There are many options in the market, such as Databricks, AWS Glue, Dataproc, and many more.
My recommendation is to learn Apache Spark with Databricks, and the language you will be using
is PySpark, the combination of Python and Spark. A small sketch of a PySpark batch job follows below.
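Here is my own minimal, hedged illustration (a local run with a hypothetical sales.csv file, not from any specific course) of what a PySpark batch transformation looks like:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (on Databricks, a `spark` session is provided for you)
spark = SparkSession.builder.appName("batch-example").getOrCreate()

# Hypothetical input: a CSV of raw sales records
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A typical batch transformation: clean, aggregate, and write out
daily_revenue = (
    df.filter(F.col("amount") > 0)               # drop bad records
      .groupBy("sale_date")
      .agg(F.sum("amount").alias("revenue"))     # revenue per day
)

# Write the result in a columnar format for downstream consumers
daily_revenue.write.mode("overwrite").parquet("daily_revenue/")

spark.stop()
```

The same few lines scale from your laptop to a cluster processing terabytes, which is exactly why Spark is in such high demand.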
I'm working on my own Apache Spark course; it is not launched yet at the time of making this video, but you can definitely
follow me if you want to get more updates on this. I will add some good courses and
resources where you can learn Apache Spark and Databricks in the final documentation
that you will get at the end of this video. Now, for batch processing, you can learn Apache
Spark, but for real-time data streaming, you can learn one highly demanded tool in the market
called Apache Kafka. Apache Kafka is a distributed event store and stream processing platform,
so you can process your data in real time. A tiny sketch of producing and consuming events follows below.
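As a hedged sketch of my own (assuming the third-party kafka-python client, a broker on localhost, and a hypothetical clicks topic), producing and consuming events looks roughly like this:

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Produce an event to a hypothetical "clicks" topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clicks", {"user_id": 42, "page": "/pricing"})
producer.flush()

# Consume events from the same topic as they arrive, in real time
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # process each event as soon as it comes in
```

The producer and consumer are decoupled through the topic, which is what lets Kafka fan the same event stream out to many downstream pipelines.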
Now, understand this: when you process all of this data, you execute the tasks in a sequential manner,
such as extracting data from multiple sources, then doing some aggregation and transformation, and maybe loading the data to a target location. All of these operations or tasks need to happen in sequence; you cannot have the third task executed first or the first task executed last, because it simply won't work. For that, we also need an orchestration tool, or workflow management tool. So, the next skill set to focus on is a workflow management tool, and one of the most important and highly demanded tools in the market is Apache Airflow. It was developed at Airbnb and then open-sourced so that everyone can use it, and many companies use Apache Airflow to build their data pipelines. A minimal DAG sketch follows below.
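Here is my own hedged illustration (assuming Airflow 2.x and hypothetical extract/transform/load functions) of how such a sequence is expressed as an Airflow DAG:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task functions; in a real pipeline these would do actual work
def extract():
    print("pulling data from sources")

def transform():
    print("aggregating and transforming")

def load():
    print("loading into the warehouse")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run this pipeline once a day
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # Enforce the sequence: extract must finish before transform, then load
    t1 >> t2 >> t3
```

The `>>` operator is how Airflow encodes the dependency order, and the scheduler takes care of retries, timing, and monitoring for you.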
"The Fundamentals of Data Engineering" and also learning all of these different
tools. At one point, everything will start making sense. Right now these are individual dots, and as you move forward, the dots will start connecting: you will understand why these tools exist in the first place and which problems they are trying to solve. As data engineers, we process big data, volumes far too large to store and process on a local computer. For that, we have the cloud platforms. There are three main cloud providers: AWS (Amazon Web Services),
Microsoft Azure, and Google Cloud Platform. If you are learning Python, SQL, or data warehousing from my courses, you'll already know that we use AWS to illustrate the concepts, so the confusion about choosing the right cloud platform disappears; the foundation courses introduce you to cloud computing then and there. If you're on the self-learning path instead, I highly recommend starting with either AWS or Microsoft Azure. That is the advice if you have no cloud experience; if you already know one cloud platform, whether GCP, Azure, or AWS, do not jump to another one. Just focus on the one you know and start learning its data engineering side. A lot of people get confused between AWS, GCP, and Azure, so here's a clear answer: AWS has the largest market share, so you can always rely on it, but Azure is growing at a rapid pace, so if you prefer it, you can go with Azure too. Learning either AWS or Azure will keep you in a safe place. As a small taste of the cloud side, a sketch follows below.
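This is my own hedged illustration (assuming the boto3 library, AWS credentials already configured, and a hypothetical bucket name) of how simple it is to touch AWS from Python:

```python
import boto3  # pip install boto3; assumes AWS credentials are configured

# Upload a local file to a hypothetical S3 bucket, a typical first step
# for landing raw data in a cloud data lake
s3 = boto3.client("s3")
s3.upload_file("sales.csv", "my-data-lake-bucket", "raw/sales.csv")

# List what's in the bucket to confirm the upload
response = s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```

Object storage like S3 is usually the first cloud service a data engineer uses, because it is where raw data lands before Spark, Snowflake, or Airflow ever touch it.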
Now again, I do have plans to launch courses
in the future on all of these different topics, so you can follow me if you want to get updated.
Now at this point, you can call yourself a data engineer because you've learned all of these
different things. If you've reached this part of the video, then you can write in the
comments, "I'm going to become a data engineer." Just send some positive vibes in the comment
section so that everyone can feel motivated. Now, these are the core data engineering
skill sets. Now I want to talk about some advanced topics, things that have come up in the last few years, that you also need to pay attention to. One of the trending topics in the market is the open table format. While reading "Fundamentals of Data Engineering," you will learn about the data lake: a centralized repository where you can store all of your data and, as required, query it and select the chunks you need. The problem with a plain data lake is that it lacks many database features; for example, it doesn't support ACID transactions. To solve these problems, we have a newer concept called the open table format, which adds a lot of features on top of the data lake. There are several tools in this space, such as Apache Iceberg and Delta Lake. This has been trending over the last year, so I highly encourage you to keep an eye on it and add this skill set to your portfolio. A small sketch of what this looks like follows below.
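As a hedged sketch of my own (assuming a Spark session with the delta-spark package installed and configured, and a hypothetical table path), here is roughly what an ACID update on a Delta Lake table looks like:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed; these configs enable Delta
spark = (
    SparkSession.builder.appName("delta-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "laptop", 999.99)], ["id", "product", "amount"])

# Write the data as a Delta table: Parquet files plus a transaction log
df.write.format("delta").mode("overwrite").save("/tmp/sales_delta")

# An ACID update on the data lake, which plain Parquet files cannot do
spark.sql("UPDATE delta.`/tmp/sales_delta` SET amount = 899.99 WHERE id = 1")

spark.read.format("delta").load("/tmp/sales_delta").show()
```

The transaction log is what turns a folder of files into something you can safely update, delete from, and time-travel over.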
Then we also have data observability tools.
You might have hundreds of data pipelines running in your company. Now, how do you monitor
them? How do you keep track of the errors, and how do you debug them? Tools such as
Datadog can help you with that; these are modern data engineering tools built for exactly such tasks. You can also learn about the modern data stack; here's a list of those tools. All I recommend you
to do is just explore these tools; do not get attached to them because these tools
come and go in the market. But if you have read "Fundamentals of Data Engineering," your core concepts will be clear, so you will understand why each tool in the modern data stack exists and where it actually fits. And as you go forward in your career, you
also need to know about the DevOps and DataOps side, which is basically deploying and automating the entire workflow; for that, you can also learn Docker and Kubernetes. At this stage, you need to think like a principal engineer, not just a data engineer: as you grow in your career, you need to be in a position where you can make decisions so that companies can solve their problems using technology. You can also start reading engineering blogs from
top companies like Netflix, Zerodha, AWS, GCP, so you will get an understanding of how
data engineering is performed in the real world. So this was everything that you need
to really focus on to become a data engineer. Now, as I promised at the start of this video, here is the guide on how to stay consistent and learn in public. If you have been watching this video from the start all the way to here, that means you are serious about learning and becoming a data engineer; most people will have given up partway through. So, comment below, "I watched this video till the end," so that I know you are really serious about learning data engineering.
If you want to stay consistent and focused, you need to be accountable to someone else, and the best way to be accountable is on social media. Go to a platform like LinkedIn, Instagram, or Twitter and announce to everyone that, starting today or tomorrow, you are going to begin your data engineering journey by following this roadmap. You can link this video or the document I will be giving you, and tell people that you will be executing this complete roadmap for the next six months and sharing your daily learnings there. Then, every day, whenever you
learn something from, let's say, my Python course, or SQL course, or from a book that you
are reading, just summarize it in your own words. Do not copy-paste. You don't have to be a content creator; just share your learning with people. If you do this, you will gain confidence and stay focused, and if a recruiter sees that you have the knowledge and are working to get into this field, they might contact you in the future. A lot of people already do this: they learn something and share it with the world. All I ask is that you create a post, announce it in public, and tag me; I will repost it so that you feel accountable to complete this entire journey. That way, you stay consistent, build your
portfolio, and open doors for new opportunities. Now, this is the complete document that I
was talking about. Everything that we talked about in this video, I have added here. With
that, I've also added some extra resources, such as projects from my YouTube channel and some technologies you can learn from. You can go through this entire roadmap and start your journey in data engineering. I upload quality content on data engineering on this channel, so if you are new here, don't forget to hit the subscribe button.
If you found this video helpful, then please, please, please hit the like button. Thank you
for watching; I'll see you in the next video.