Learn Apache Airflow in 10 Minutes | High-Paying Skills for Data Engineers

Video Statistics and Information

Captions
One of the tasks you will do as a data engineer is to build a data pipeline. Basically, you take data from multiple sources, do some transformation in between, and then load the data into some target location. Now, you can perform this entire operation using a simple Python script: all you have to do is read data from some APIs, apply your logic in between, and then store the result in the target location. And if you want to run that script at a specific interval, there is something called a cron job that you can use to schedule it. It looks something like the sketch below.
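To make that concrete, here is a minimal sketch of such a script and a cron entry for it. The API endpoint, column names, and file paths are made-up placeholders for illustration, not anything from the video:

    # etl_pipeline.py - a "plain Python" pipeline: extract, transform, load
    import requests
    import pandas as pd

    def run_etl():
        # Extract: read data from some API (placeholder URL)
        records = requests.get("https://api.example.com/orders").json()

        # Transform: apply some simple logic, e.g. total amount per day
        df = pd.DataFrame(records)
        daily_totals = df.groupby("order_date")["amount"].sum().reset_index()

        # Load: store the result in a target location (here, a local CSV file)
        daily_totals.to_csv("/data/daily_totals.csv", index=False)

    if __name__ == "__main__":
        run_etl()

And a crontab entry that runs this script every day at 6 AM would look something like this (the path is hypothetical):

    0 6 * * * python /home/user/etl_pipeline.py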
But here's the thing: you can use cron for, let's say, two or three scripts, but what if you have hundreds of data pipelines? We are told that 90% of the world's data was generated in just the last two years, and businesses around the world are using this data to improve their products and services. The reason you see the right recommendations on your YouTube homepage or the right ads on your Instagram profile is all of this data processing. There are thousands of data pipelines running inside these organizations to make these things happen.

So today, we will understand how all of this happens behind the scenes, and we will look at one of the most widely used data pipeline tools in the market, called Apache Airflow. So, are you ready? Let's get started.

At the start of this video, we talked about the cron job. As data grows, we have to create more and more data pipelines to process it. What if something fails? What if you want to run all of these operations in a specific order? A data pipeline is made up of several different operations: one task might extract data from an RDBMS, APIs, or some other source; a second script aggregates that data; and a third script stores the result in some target location. All of these operations have to happen in a specific sequence, so we would have to schedule our cron jobs very carefully to keep everything in the proper order.

Doing all of this with plain Python scripts and managing them by hand is a headache. You might need to put dedicated engineers on each individual task just to make sure everything runs smoothly. And this is where, ladies and gentlemen, Apache Airflow comes into the picture.

In 2014, engineers at Airbnb started working on a project called Airflow. It was brought into the Apache Software Incubator program in 2016 and became open source, which basically means anyone in the world can use it. It became one of the most widely adopted open-source projects, with over 10 million pip installs a month, tens of thousands of GitHub stars, and a Slack community of over 30,000 users. Airflow became a part of big organizations around the world.

The reason Airflow got so popular was not that it was well funded, or had a nice user interface, or was easy to install. The reason behind its popularity was "pipeline as code." We talked about how you can easily write a data pipeline as a simple Python script, but that it becomes very difficult to manage. There are other options, such as enterprise tools like Alteryx and Informatica, but that software is very expensive, and if you want to customize it for your use case, you often can't. This is where Airflow shines. It is open source, so anyone can use it, and on top of that it gives you a lot of features. So, if you want to build, schedule, and run your data pipelines at scale, you can easily do that using Apache Airflow.

So now that we understand why we need Apache Airflow in the first place, let's understand what Apache Airflow is. Apache Airflow is a workflow management tool. A workflow is a series of tasks that need to be executed in a specific order. Taking the previous example: we have data coming from multiple sources, we do some transformation in between, and then load that data into some target location. This entire job of extracting, transforming, and loading is called a workflow. The same concept exists in Apache Airflow, where it is called a DAG (Directed Acyclic Graph). It looks something like this.

At the heart of the workflow is the DAG, which defines the collection of tasks and their dependencies. This is a core computer science concept. Think of it as a blueprint for your workflow: the DAG defines the different tasks and the order in which they should run. "Directed" means tasks move in one direction, "acyclic" means there are no loops - tasks never run in a circle - and "graph" means it is a visual representation of the tasks and their connections. The entire flow is called a DAG, and the individual boxes you see are called tasks. So, the DAG defines the blueprint, and the tasks are the actual logic that needs to be executed.

In this example, we read the data from external sources and an API, then we aggregate and transform the data, and finally we load it into some target location. All of these tasks execute in a specific order: only once the first part is completed does the second part execute, and so on down the chain.

Now, to create tasks, we have something called an operator. Think of an operator as a function provided by Airflow: you use these functions to create tasks that do the actual work. There are many different types of operators available in Apache Airflow. If you want to run a Bash command, there is an operator for that, called the BashOperator. If you want to call a Python function, you can use the PythonOperator. And if you want to send an email, there is the EmailOperator. Like this, there are operators for many different kinds of jobs - if you want to read data from PostgreSQL, or store your data in Amazon S3, there are operators that make your life much easier.

So, operators are the functions you use to create tasks, and the collection of tasks is called a DAG. Now, to run this entire DAG, we have something called executors. Executors determine how your tasks will run, and there are different types you can choose from. If you want to run your tasks sequentially, you can use the SequentialExecutor. If you want to run your tasks in parallel on a single machine, you can use the LocalExecutor. And if you want to distribute your tasks across multiple machines, you can use the CeleryExecutor.
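For reference, the executor is not chosen inside the DAG file; it is set in Airflow's configuration. A minimal sketch, assuming an Airflow 2.x-style airflow.cfg (the exact file location depends on your installation):

    # airflow.cfg (excerpt)
    [core]
    # SequentialExecutor is the default; LocalExecutor runs tasks in parallel
    # on one machine; CeleryExecutor distributes them across worker machines
    executor = LocalExecutor

The same value can also be supplied through the AIRFLOW__CORE__EXECUTOR environment variable, and note that the LocalExecutor and CeleryExecutor need a proper metadata database such as PostgreSQL rather than the default SQLite.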
This was a good overview of Apache Airflow. We understood why we need Apache Airflow in the first place, how it became popular, and what the different components are that make all of this happen. I will recommend an end-to-end project that you can build with Apache Airflow at the end of this video, but for now, let's do a quick exercise to understand the different components in practice.

So, we understand the basics of Airflow and the different components attached to it. Now, let's take a quick look at what the Airflow UI really looks like and how these components come together to build a complete data pipeline.

Okay, we already talked about DAGs, right? The Directed Acyclic Graph is a core concept in Airflow. A DAG is the collection of tasks we already discussed, and it looks something like this: A is a task, B is a task, D is a task, and they execute in sequence to make up the complete DAG. So, let's understand how to declare a DAG.

It is pretty simple. You have to import a few packages: from Airflow you import the DAG, and then there is the DummyOperator, which basically does nothing. Then, with DAG, this is the syntax. If you know the basics of Python, you can start with that. If you don't have a Python background, I already have courses on Python, so you can check those out if you want to learn Python from scratch.

This is how you define the DAG: with DAG, you give it a name, you give it a start date - when you want this particular DAG to start running - and then you can provide the schedule, so if you want it to run on a daily, weekly, or monthly basis, you can do that. There are many other parameters that the DAG takes, so based on your requirements you can provide them, and the DAG will run according to all of the parameters you have provided.

Once the DAG is defined, you can use the DummyOperator, where you give the task a name, or ID, and provide the DAG you want to attach this particular task to. So, as you can see, we define the DAG, and then we pass that DAG to the particular task. If you are using the PythonOperator or BashOperator, all you have to do is use that function and provide the DAG in the same way.

Just like this, you can also create dependencies - the thing we talked about, running all of the tasks in the proper sequence. You provide the first task, and then you chain the rest. What will happen is that the first task will run, then the second and third tasks will execute together, and after the third task completes, the fourth task will be executed. This is how you create basic dependencies; a minimal sketch putting all of this together is shown below.
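Here is a minimal sketch of such a DAG, assuming an Airflow 2.x-style API; the dag_id, dates, and schedule are placeholder values, not the exact ones from the documentation page shown in the video:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy import DummyOperator

    with DAG(
        dag_id="my_example_dag",           # the DAG name
        start_date=datetime(2023, 1, 1),   # when the DAG becomes active
        schedule_interval="@daily",        # run once a day
    ) as dag:
        # four do-nothing tasks, just to show the structure
        first_task = DummyOperator(task_id="first_task")
        second_task = DummyOperator(task_id="second_task")
        third_task = DummyOperator(task_id="third_task")
        fourth_task = DummyOperator(task_id="fourth_task")

        # first_task runs, then second_task and third_task run together;
        # fourth_task runs only after third_task completes
        first_task >> [second_task, third_task]
        third_task >> fourth_task

Because the tasks are created inside the with DAG(...) block, they are attached to the DAG automatically; outside a with block you would pass dag=dag to each operator, which is the pattern described above. In newer Airflow releases the DummyOperator has been renamed EmptyOperator, but the idea is the same.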
Now, that was just the documentation, and you can always read it if you want to learn more. So, let's go to our Airflow console and try to understand this better.

Okay, once you install Airflow, it will look something like this. You will land on this page, and over here you will see a lot of things. First are your DAGs - these are the example DAGs provided by Apache Airflow. If I click over here and go inside, you will see a DAG that contains just one operator, a BashOperator. Just like this, if you click through the DAGs, you will see a lot of different examples, which are useful if you want to understand how these DAGs are built behind the scenes. Over here you also get information about the different runs - whether your DAG is currently queued, successful, running, or failed - along with all of the information about the recent tasks.

So, I can go over here and enable this particular DAG, go inside it, and run it manually from the top. I will trigger the DAG, and it will start running. Currently it is queued; now it starts running, and if I go to the graph view, you will see it is currently running. If you keep refreshing, you can see it finish - this one is successful, so our DAG ran successfully.

There are other statuses as well, such as failed, queued, removed, and restarting, and you can track all of them if you want. This is part of what makes Apache Airflow such a popular tool: you can do everything in one place. You don't have to worry about managing these things in different places - from a single browser window, you can do everything.

All of the examples you see over here are just basic templates. If I go over here and open example_complex, you will see a graph that is quite complicated, with a lot of different pieces - we have an entry group, and that entry group depends on all of these other things. The graph is pretty complex, and you can build complex pipelines like this using Airflow.

Now, one of the projects you can do after this is to build a Twitter data pipeline. The Twitter API is no longer freely available, but you can always use one of the other APIs available for free and build the same project. I'll just explain the code so that you have a better understanding.

I have defined a function called run_twitter_etl, and the name of the file is twitter_etl. It is a basic Python function: what it really does is extract some data from the Twitter API, do some basic transformation, and then store the data in Amazon S3.

Then there is my twitter_dag.py, which is where I define the Airflow DAG. As you can see, we are using the same pattern: from Airflow, import DAG. I'm using the PythonOperator because I want to run that particular Python function, run_twitter_etl, from my Airflow DAG. I first define the default parameters - the owner, start time, emails, and so on. Then I define the actual DAG: this is my DAG name, these are my arguments, and this is my description, where you can write whatever you want.

Then I define one task - in this example, there is only one. With the PythonOperator, I provide the task ID and, as the Python callable, the function name. That function is imported from twitter_etl, the second file: from twitter_etl, I import the run_twitter_etl function and call it inside my PythonOperator, and then I attach the task to the DAG. At the end, I just reference the run_etl task. A sketch of what this file might look like is shown below.
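Based on that description, twitter_dag.py would look roughly like this sketch (Airflow 2.x-style imports assumed; the owner, dates, email address, and schedule are placeholders - only the overall shape follows what is described in the video):

    # twitter_dag.py - sketch of the DAG file described above
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    from twitter_etl import run_twitter_etl  # the function from twitter_etl.py

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2023, 1, 1),
        "email": ["you@example.com"],          # placeholder address
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG(
        dag_id="twitter_dag",
        default_args=default_args,
        description="ETL pipeline: Twitter API -> transform -> Amazon S3",
        schedule_interval="@daily",
    )

    # a single task that calls the Python function and is attached to the DAG
    run_etl = PythonOperator(
        task_id="complete_twitter_etl",
        python_callable=run_twitter_etl,
        dag=dag,
    )

    # with only one task there are no dependencies to declare,
    # so the task is simply referenced on its own at the end
    run_etl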
Now, if I had more tasks in this DAG - say run_etl1 and run_etl2 - I could define them the same way and then create dependencies between them as well, for example run_etl1 >> run_etl2. They would then execute in sequence: once the first one finishes, the next one runs, and so on.

So, I just wanted to give you a good overview of Airflow. If you really want to learn Airflow from scratch - including how to install it and everything else - I already have a project available called the Twitter data pipeline using Airflow for beginners. It is a data engineering project I've created, and I highly recommend doing it so that you get a complete understanding of Airflow and how it really works in the real world.

I hope this video was helpful. The goal of this video was not to make you a master of Airflow but to give you a clear understanding of the basics. After this, you can take any of the courses available in the market and master the tool much more easily, because most people make technical things more complicated than they need to be. The reason I started this YouTube channel is to simplify all of these things.

So, if you like this type of content, definitely hit the subscribe button, and don't forget to hit the like button. Thank you for watching. I'll see you in the next video.
Info
Channel: Darshil Parmar
Views: 138,111
Keywords: darshil parmar, what is apache airflow, apache airflow tutorial, learn apache airflow online, apache airflow project, apache airflow in 10 mintes, apache airflow for big data, apache airflow for beginners, project using apache airflow, learn apache airflow, free apache airflow tutorial, learn apache airflow fast, how to learn apache airflow, apache airflow, apache airflow guide, projects using apache airflow, airflow
Id: 5peQThvQmQk
Length: 12min 37sec (757 seconds)
Published: Sat Oct 07 2023