One of the tasks you will do as a Data Engineer is
to build a data pipeline. Basically, you take data from multiple sources, do some transformation
in between, and then load your data onto some target location. Now, you can perform this entire
operation using a simple Python script. All you have to do is read data from some APIs, write
your logic in between, and then store it in some target location. There is also something called a cron job: if you want to run your script at a specific interval, you can schedule it with cron. But here's the thing: a cron job works fine for, let's say, two or three scripts, but what if you have hundreds of data pipelines?
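For a single script, the crontab entry is just one line; something like this (the script path and schedule are hypothetical) would run a pipeline every day at 2 AM:

```
# minute hour day-of-month month day-of-week  command
0 2 * * * python3 /home/user/pipelines/sales_pipeline.py
```

One line like this is easy to manage; hundreds of them, each with its own schedule and ordering requirements, are not.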
It is often said that around 90% of the world's data was generated in just the last two years, and businesses around the world are using this data to improve their products and services. The reason you see the right recommendations on your YouTube home page or the right ads on your Instagram profile is all of this data processing. There are thousands of data pipelines running in these organizations to make these things happen. So today, we will understand how all of
these things happen behind the scenes, and we will look at one of the most widely used data pipeline tools in the market: Apache Airflow. So, are
you ready? Let's get started. At the start of this video, we talked about
the cron job. As the data grows, we will have to create more and more data pipelines to process all of this data. What if something fails? What if you want to run all of these operations in a specific order? So, in a data pipeline, we have multiple different operations.
So, one task might be to extract data from RDBMS, APIs, or some other sources. Then the second
script will aggregate all of this data, and the third script will store this data in some location. Now, all of these operations should happen in a specific sequence, so we would have to schedule our cron jobs in such a way that everything runs in the proper order. Doing all of this with simple Python scripts and managing them by hand is a headache. You might need to put a lot of engineers on each individual task to make sure everything runs smoothly. And this is where, ladies and
gentlemen, Apache Airflow comes into the picture. In 2014, engineers at Airbnb started working on a
project called Airflow. It was open-sourced in 2015 and brought into the Apache Incubator program in 2016, which basically means anyone in the world can use it. It became one of the most widely adopted open-source projects, with over 10 million pip installs a month, tens of thousands of GitHub stars, and a Slack community of over 30,000 users. Airflow became
a part of big organizations around the world. The reason Airflow became so popular was not that it was well funded, had a good user interface, or was easy to install. The reason behind Airflow's popularity was "pipelines as code." Earlier, we talked about how you can easily write your data pipeline as a simple Python script, but how it becomes very difficult to manage. There are other options, such as enterprise tools like Alteryx and Informatica, but this software is very expensive, and if you want to customize it for your use case, you won't be able to do that. This is where Airflow shines: it is open source, so anyone can use it, and on top of that, it provides a lot of different features. So, if you want to build, schedule, and run your data pipelines at scale, you can easily do that using Apache Airflow. So now that we understand why we really need Apache Airflow in the first place, let's look at what Apache Airflow actually is. So,
Apache Airflow is a workflow management tool. A workflow is like a series of tasks that need to
be executed in a specific order. So, talking about the previous example, we have data coming from
multiple sources, we do some transformation in between, and then load that data onto some target
location. So, this entire job of extracting, transforming, and loading is called a workflow.
The same terminology is used in Apache Airflow, but it is called a DAG (Directed Acyclic
Graph). It looks something like this. At the heart of the workflow is a DAG that
basically defines the collection of different tasks and their dependencies. This is a core computer science concept. Think of it as a blueprint for your workflow: the DAG defines the different tasks that should run in a specific order. "Directed" means tasks move in one direction, "acyclic" means there are no loops - tasks do not run in a circle, execution only moves forward - and "graph" is the representation of the tasks and the connections between them. Now, this entire flow is called a DAG, and the individual
boxes that you see are called tasks. So, the DAG defines the blueprint, and the tasks
are your actual logic that needs to be executed. So, in this example, we are reading the data from
external sources and APIs, then we aggregate the data and do some transformation, and finally load this data into some target location. All of these tasks are executed in a specific order: only once the first part is completed will the second part execute, and so on down the line. Now, to create tasks, we have something called
an operator. Think of the operator as a function provided by Airflow. So, you can use all of these
different functions to create the task and do the actual work. There are many different types
of operators available in Apache Airflow. So, if you want to run a Bash command, there is an
operator for that, called the Bash Operator. If you want to call a Python function, you can use a
Python Operator. And if you want to send an email, you can also use the Email Operator. Like this,
there are many different operators available for different types of jobs. So, if you want to read
data from PostgreSQL, or if you want to store your data to Amazon S3, there are different types of
operators that can make your life much easier. So, operators are basically the functions you use to create tasks, and a collection of these tasks is called a DAG.
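Just to make that concrete, here is a rough sketch (not code from the video) of how a couple of tasks could be created with these operators inside a DAG; the names and the Bash command are made up, and the import paths differ slightly between Airflow versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator      # airflow.operators.bash_operator on older versions
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator on older versions


def greet():
    # Placeholder logic for the Python task.
    print("hello from a Python task")


with DAG(
    dag_id="operator_demo",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    # BashOperator runs a shell command as a task.
    bash_task = BashOperator(task_id="run_bash", bash_command="echo 'hello from bash'")

    # PythonOperator calls a Python function as a task.
    python_task = PythonOperator(task_id="run_python", python_callable=greet)
```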
Now, to run this entire DAG, we have something called executors. Executors basically determine how your tasks will run. There are different types of
executors that you can use. So, if you want to run your tasks sequentially, you
can use the Sequential Executor. If you want to run your tasks in parallel on a single machine, you can use the Local Executor. And if you want to distribute your tasks across multiple machines, you can use the Celery Executor.
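For reference, the executor is a configuration choice rather than something you write in the DAG itself; in airflow.cfg (or through the AIRFLOW__CORE__EXECUTOR environment variable) it is set roughly like this, with the Sequential Executor being the default:

```ini
[core]
# SequentialExecutor (default), LocalExecutor, CeleryExecutor, ...
executor = LocalExecutor
```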
This was a good overview of Apache Airflow. We understood why we need Apache Airflow in the first place, how it became popular, and the different components in Apache Airflow that make all of these
things happen. At the end of this video, I will recommend an end-to-end project that you can do using Apache Airflow. But for now, let's walk through a quick exercise in Apache Airflow to
understand the different components in practice. So, we have covered the basics of Airflow and the different components attached to it. Now, let's look at a quick
overview of what the Airflow UI really looks like and how these different components come
together to build the complete data pipeline. Okay, so we already talked about DAGs, right?
So, Directed Acyclic Graph is a core concept in Airflow. Basically, a DAG is the collection
of tasks, as we already discussed. So, it looks something like this: A is a task, B is a task, D is a task, and executing them in sequence makes up the complete DAG. So, let's understand how to declare a DAG. It is pretty simple: you have to
import a few packages. So, from airflow, you import DAG, and there is also the Dummy Operator, which basically does nothing. Then, with DAG, this is the syntax. If you know the basics of Python, you can start with that right away; if you don't have a Python background, I already have courses on Python, so you can check those out if you want to learn Python from scratch. So, this is how you define the DAG: with DAG, you give it a name, you give it a start date (when you want this particular DAG to start running), and then you can provide the schedule, so that it runs on a daily, weekly, or monthly basis. There are many other parameters that the DAG constructor takes, so based on your requirements, you can provide those parameters, and the DAG will run according to whatever you have provided. And if you
go over here, you can use the Dummy Operator, where you basically give the task an ID and provide the DAG that you want to attach this particular task to. So,
as you can see over here, we define the DAG, and then we pass that DAG to the particular task. So, whether you are using the Python Operator, the Bash Operator, or any other operator, all you have to do is call the operator and provide the DAG.
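Putting those two pieces from the documentation together, the declaration looks roughly like this; the dag_id, dates, and task ID are made up, and the DummyOperator import path varies by Airflow version (newer releases call it EmptyOperator):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # airflow.operators.dummy_operator on older versions


# Define the DAG: a name, a start date, and a schedule.
dag = DAG(
    dag_id="my_first_dag",             # hypothetical name
    start_date=datetime(2023, 1, 1),   # when the schedule starts
    schedule_interval="@daily",        # daily, weekly ("@weekly"), cron strings, ...
)

# The DummyOperator does nothing by itself; the point here is that the task
# gets an ID and is attached to the DAG through the dag= argument.
placeholder = DummyOperator(task_id="placeholder_task", dag=dag)
```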
Now, just like this, you can also create dependencies - the thing we talked about earlier, where I want to run all of these tasks in the proper sequence. So, as you can see, I provide the first task, and then you can use something like this: the first task will run, then it will trigger the second and third tasks together, and after the third task completes, the fourth task will be executed.
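As a small sketch of that ordering (the DAG and the four placeholder tasks here are made up for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # path/name varies by Airflow version

with DAG(dag_id="dependency_demo", start_date=datetime(2023, 1, 1), schedule_interval="@daily") as dag:
    first = DummyOperator(task_id="first")
    second = DummyOperator(task_id="second")
    third = DummyOperator(task_id="third")
    fourth = DummyOperator(task_id="fourth")

    # first runs, then second and third run together; fourth waits for third.
    first >> [second, third]
    third >> fourth
```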
So, this is how you create the basic dependencies. Now, this was just the documentation, and
you can always read about it if you want to learn more. So, let's go to our Airflow
console and try to understand this better. Okay, once you install Apache Airflow, it will look
something like this. You will be redirected to this page, and over here, you will see a lot
of things. So, first is your DAGs. These are the example DAGs that are provided by Apache Airflow.
So, if I click over here and go inside, you will see this is a DAG that basically contains one operator, the Bash Operator. Just like this, if you click into the other DAGs, you will see a lot of different examples, which is useful if you want to understand how all of these DAGs are built behind the scenes. Over here, you will get information about the different runs - whether your DAG is currently queued, successful, running, or failed - and the same kind of information about the recent tasks. So, I can go over here and just enable this
particular DAG. Okay, I can go inside this, and I can manually run this from the
top. Okay, so I will trigger the DAG, and it will start running. So, currently,
it is queued. Now it starts running, and if I go to the graph view and keep refreshing it, you can see it is currently running - and now it is successful. So, our DAG ran successfully. There are other statuses as well, such as failed, queued, removed, and restarting, all of which you can track if you want to. So, this is what makes Apache Airflow a very
popular tool because you can do everything in one place. You don't have to worry about
managing these things in different places: from one single browser, you will be able to do everything. So, all of the examples that you see over here are just basic templates. If I go over here and open example_complex,
you will see a graph that is quite complicated. You will see a lot of different things: we have an entry group, and that entry group depends on all of these other tasks. So, the graph is pretty complex, and you can create all of these complex pipelines using Airflow. Now, one of the projects that you will do after
this is to build a Twitter data pipeline. Now, the Twitter API is not freely available anymore, but you can always use one of the other free APIs available and create the same project. So, I'll just explain this code to you so that you have a better understanding. So, I have defined the function
as run_twitter_etl, and the name of the file is twitter_etl.py. This is a basic Python function: what we are really doing is extracting some data from the Twitter API, doing some basic transformation, and then storing our data on Amazon S3.
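Since the original Twitter API calls no longer work, here is just a rough sketch of what such a function can look like; the API endpoint, the fields, and the S3 bucket are placeholders, and writing straight to an s3:// path with pandas assumes the s3fs package is installed and AWS credentials are configured:

```python
# twitter_etl.py - a rough sketch; endpoint, fields, and bucket name are placeholders
import requests
import pandas as pd


def run_twitter_etl():
    # 1. Extract: pull raw records from some REST API.
    response = requests.get("https://api.example.com/v1/posts")  # placeholder endpoint
    records = response.json()

    # 2. Transform: keep only the fields we care about.
    refined = [
        {"user": r.get("user"), "text": r.get("text"), "created_at": r.get("created_at")}
        for r in records
    ]
    df = pd.DataFrame(refined)

    # 3. Load: write the result to Amazon S3 (needs s3fs and AWS credentials configured).
    df.to_csv("s3://my-demo-bucket/refined_posts.csv", index=False)
```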
Now, this is my twitter_dag.py, which is where I define my Airflow DAG. Okay, so as you can see over here, we are using the same things: from airflow, I import DAG, and I import the PythonOperator, because I want to run this particular Python function, run_twitter_etl, through my Airflow DAG. So, I first define the default parameters, such as the owner, the start time, the emails, and a few other things. Then, this is where I define my actual DAG: this is my DAG name, these are my arguments, and this is my description - you can write whatever you want there. Next, I define one task; in this example, I only have one. It is a PythonOperator, where I provide the task ID and, as the Python callable, the function name. This function is imported from twitter_etl, which is the second file: from twitter_etl, I import the run_twitter_etl function and call it inside my PythonOperator, and then I attach the task to the DAG. And then, at the end, I just reference run_etl.
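Put together, twitter_dag.py looks roughly like this; the argument values and IDs below are illustrative rather than the exact code from the project, and the PythonOperator import path depends on your Airflow version:

```python
# twitter_dag.py - a rough sketch of the DAG file described above
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator on older versions

from twitter_etl import run_twitter_etl  # the ETL function from the other file

# Default parameters applied to the DAG's tasks (illustrative values).
default_args = {
    "owner": "airflow",
    "start_date": datetime(2023, 1, 1),
    "email": ["me@example.com"],
    "email_on_failure": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

# The DAG itself: a name, the default arguments, and a description.
dag = DAG(
    dag_id="twitter_dag",
    default_args=default_args,
    description="Simple ETL pipeline that loads data to S3",
    schedule_interval="@daily",
)

# One task: call the run_twitter_etl function through the PythonOperator.
run_etl = PythonOperator(
    task_id="complete_twitter_etl",
    python_callable=run_twitter_etl,
    dag=dag,
)

run_etl  # with only one task, there are no dependencies to declare
```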
Now, if I had multiple tasks - say run_etl1 and run_etl2 - I could also create dependencies between them, something like run_etl1 >> run_etl2. This will execute them in sequence: once run_etl1 completes, run_etl2 will execute, and so on. So, I just wanted to give you a
good overview of Airflow. Now, if you really want to learn Airflow from scratch, including how to install it and everything else, I already have a project available,
and the project name is the Twitter data pipeline using Airflow for beginners. So,
this is a data engineering project that I've created. I highly recommend you do this project so that you get a complete understanding of Airflow
and how it really works in the real world. I hope this video was helpful. The goal of
this video was not to make you a master of Airflow but to give you a clear understanding
of the basics of Airflow. After this, you can always take any of the courses available in the market and master it easily, because most people make technical things really complicated, and the reason I started this YouTube channel is to simplify all of these things. So, if you like this type of content,
then definitely hit the subscribe button, and don't forget to hit the like button. Thank
you for watching. I'll see you in the next video.