Why Airflow? The Top 5 Reasons To Use It!

Video Statistics and Information

Captions
Why Airflow? That's a common question: why do I need Airflow, why should I use it? In this video I would like to give you five reasons why Airflow might be the tool that you need. My name is Marc Lamberti, and if you enjoy this content, please like and subscribe; that helps me a lot. Without further ado, let me show you the scariest image you will ever see in your life: the data ecosystem in 2021. You can imagine how much bigger it is in 2023. It is overwhelming and scary. If we zoom in a little and focus on the data engineering world, you usually have tools such as Airbyte and Fivetran to extract the data, compute engines such as Databricks or Spark to transform the data, and data lakes and data warehouses such as Snowflake or Redshift where the data is loaded. That is still a lot of tools to deal with, isn't it?

Let's take a typical example. Imagine you have some sources and you want to extract the data with Fivetran or Airbyte, transform it with dbt or Databricks, and load it into a data warehouse like Snowflake. You have two problems to solve here: first, how do you integrate those tools so that they work together, and second, how do you connect the dots between those tools so they are executed in the correct order? You want to extract the data first, then transform it, then load it. Well, this is why you need Airflow. Airflow is an orchestrator, and its goal is to orchestrate and manage those different tools together so that you can build data pipelines where each step is executed in the correct order: you extract the data first with Airbyte, then you transform it with Databricks, and finally you load it into Snowflake. Without Airflow, you don't have this unified way of managing your tools. Indeed, you would have to check in Airbyte whether there is a connection with Databricks, then check in Databricks whether there is a connection with Snowflake, and so on. You can imagine how inconvenient that can be.

But that's not all. In the Airflow UI, you can see what's happening with the tools your data pipelines work with, for example this retail data pipeline, which is, by the way, a project you can build from scratch by following the video linked somewhere here. If you want to know what's happening in dbt or Google Cloud Storage, you just click on the corresponding task and take a look at the logs, and that's it. You don't have to switch between different UIs, and that saves you a lot of time. Obviously, you can monitor your tasks and data pipelines as well, pretty easily, just by looking at the states of your tasks and your DAG runs, as you can see right there.

Something else people wonder about Airflow is scalability: am I able to run as many tasks as I want? That's a good question, and whether you want to run three tasks or thousands of tasks, you can do that with Airflow. You can run as many tasks as you want, as long as you have enough resources and the budget for it. By the way, Airflow can run on top of popular frameworks such as Celery, Dask, or Kubernetes, so you can execute your tasks and data pipelines in parallel and on different machines.

Last but not least, Airflow has three main components: the web server for the user interface, the scheduler to schedule your tasks and data pipelines, and the metadata database to store the metadata related to your Airflow instance. In production, you want to make sure those components are reliable. If the scheduler goes down, you are in big trouble, because you cannot run any more tasks. But with Airflow you are able to replicate those components, so that if one scheduler is down, you still have other schedulers to run your tasks and your data pipelines. The same goes for the metadata database and the web server. So the first reason for Airflow is that you can efficiently manage and monitor your data stack with reliability and scalability.

Moving on to the second reason. Remember that image? That's a lot of tools we need to integrate. But do we have those integrations in Airflow? Let me show you a website: if you go to the following link, you can see that Airflow has over 100 integrations, for example Airbyte, Kafka, Spark, and so on. That means you don't have to reinvent the wheel to interact with those tools. And if the integration you want doesn't exist, you can interact with the tool using Python code, or you can even create your own integration and add it as a plugin to your Airflow instance. At the end of the day, the good thing about Airflow is its modularity. When you install Airflow, you install the Apache Airflow core package, and if you want to add functionalities, operators, and integrations to it, you just install the corresponding provider. For example, if you want to interact with Databricks, you install the Databricks provider; same thing with Snowflake, you install the Snowflake provider, and so on. To sum up, the second reason why Airflow might be great for you is that it has a lot of integrations and customization, as sketched below.
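To make that concrete, here is a rough sketch, not shown in the video, of what the Airbyte → Databricks → Snowflake pipeline described earlier could look like once the corresponding providers are installed. All connection IDs, the job ID, and the SQL statement are hypothetical placeholders.

```python
# Hypothetical sketch: orchestrating Airbyte, Databricks, and Snowflake
# with provider operators. Each provider is installed separately, e.g.:
#   pip install apache-airflow-providers-airbyte
#   pip install apache-airflow-providers-databricks
#   pip install apache-airflow-providers-snowflake
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="retail_pipeline",       # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = AirbyteTriggerSyncOperator(
        task_id="extract",
        airbyte_conn_id="airbyte",        # assumed Airflow connection ID
        connection_id="my-airbyte-sync",  # assumed Airbyte connection UUID
    )
    transform = DatabricksRunNowOperator(
        task_id="transform",
        databricks_conn_id="databricks",  # assumed Airflow connection ID
        job_id=42,                        # assumed pre-existing Databricks job
    )
    load = SnowflakeOperator(
        task_id="load",
        snowflake_conn_id="snowflake",    # assumed Airflow connection ID
        sql="COPY INTO analytics.orders FROM @stage/orders;",  # placeholder SQL
    )

    # The >> operator sets the execution order: extract, then transform, then load.
    extract >> transform >> load
```

Each provider package ships the operators and hooks for its tool, which is the modularity the video refers to: the Airflow core stays small, and if any of these tasks fails, its logs show up directly in the Airflow UI.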
OK, so we know we can efficiently orchestrate our data stack with scalability and reliability, and that we have a lot of integrations for our data pipelines, but we haven't touched those pipelines yet. In Airflow, everything is coded in Python, so as long as you know Python, you are good to go. It's a pretty easy language, and it is extremely flexible and powerful. Let me show you how to create a data pipeline from scratch in Airflow. First things first, you always make the same imports to create a DAG and a task. Then you define the DAG object, the data pipeline: here you are saying that you want to start scheduling your data pipeline on the 1st of January 2023, then run it every day at midnight; you add a description of what the DAG does, and you add the tag "team_a" to indicate in the Airflow UI that this DAG belongs to team A, so you can filter on it. Under the DAG object, you define the tasks extract, transform, and load, and finally the dependencies. Notice that you can share data between your tasks pretty easily, just by returning a value and taking that value back as a parameter of the next task; that's what happens between extract, transform, and load. Just like that, you have successfully created your data pipeline.

But you might argue that this data pipeline is very simple. What if you want to fetch files from an S3 bucket, but you don't know in advance what those files will be, and you still want to create one task per file? In this case, Airflow can create dynamic workflows through a feature called dynamic task mapping. Just by using the partial and expand methods, you can create dynamic workflows: even if you don't know in advance what you will have, you can still create tasks based on it, and that opens up so many possibilities and use cases with Airflow. Both patterns are sketched below.
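Here is a minimal reconstruction of the DAG described above, written with Airflow's TaskFlow API; the task bodies are toy placeholders, and the names are illustrative.

```python
# Minimal sketch of the extract/transform/load DAG described in the video.
from datetime import datetime

from airflow.decorators import dag, task

@dag(
    start_date=datetime(2023, 1, 1),  # start scheduling on January 1, 2023
    schedule="@daily",                # run every day at midnight
    description="my DAG does that",
    tags=["team_a"],                  # filterable tag shown in the Airflow UI
    catchup=False,
)
def my_pipeline():
    @task
    def extract():
        return {"records": [1, 2, 3]}  # toy data

    @task
    def transform(data):
        return [r * 10 for r in data["records"]]

    @task
    def load(rows):
        print(f"loading {rows}")

    # Returning a value and passing it as a parameter is how data
    # is shared between extract, transform, and load.
    load(transform(extract()))

my_pipeline()
```

And a hedged sketch of dynamic task mapping, assuming the Amazon provider is installed and an AWS connection is configured; the bucket name is hypothetical.

```python
# Sketch: one mapped task per file in an S3 bucket, with the file list
# only known at runtime.
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
def process_s3_files():
    @task
    def list_files():
        # Assumes apache-airflow-providers-amazon is installed and the
        # default "aws_default" connection exists.
        from airflow.providers.amazon.aws.hooks.s3 import S3Hook
        return S3Hook().list_keys(bucket_name="my-bucket")

    @task
    def process(bucket, key):
        print(f"processing s3://{bucket}/{key}")

    # partial() fixes the arguments shared by every mapped task;
    # expand() creates one task instance per element of the runtime list.
    process.partial(bucket="my-bucket").expand(key=list_files())

process_s3_files()
```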
So the third reason why Airflow is so great is that you can have dynamic workflows, and those workflows are coded in Python, which is why they are so powerful.

For the next reason, I would like to show you an image, and I'm sure we have all been there at some point: the data pipeline breaks, and we don't even know why. Say you have the following data pipeline and the last task fails. If the issue is within the tasks of the pipeline itself, it's pretty easy to troubleshoot. But in the real world, that's not what happens. In the real world, the issue may be caused by an upstream task that you don't even know exists. For example, you can have different data pipelines belonging to different teams, and maybe the failure was caused by a task far upstream, but you have no idea that this task is causing the failure in yours. So how do you know that? How do you debug it? How do you troubleshoot it? This is where you need data lineage. If you don't know what data lineage is, think of it as a way of tracing the complex relationships between the datasets in your data ecosystem, so that if something goes wrong, or if a dataset has been changed or modified, you know exactly why, how, and when it was modified. Basically, you get a map of your datasets and your ecosystem, and you can see the interactions between them. Airflow has a built-in integration with OpenLineage, so just by adding another tool like Marquez, you are able to get this map from Airflow. And since Airflow is the orchestrator and works with all the tools of your data stack, you can imagine why it is the best place to implement data lineage. So the fourth reason is data lineage and monitoring capabilities.

The last one, and maybe one of the most important, is the community. Airflow has a huge community: over 30,000 users on Slack, almost 3,000 contributors, and you can see how many commits, pull requests, and so on, and how active the Airflow repository is, just by looking at the following metrics. I think that when you decide to adopt a tool as important as an orchestrator, you need to make sure it will be well supported and that you will be able to find the resources you need, not only to learn the tool but also to find the support and help that you will inevitably need at some point in your journey.

So those are the five reasons why Airflow is great and might be the right tool for you. You can efficiently orchestrate your data stack with reliability and scalability. You have a lot of integrations and customizations available, and the community keeps adding new ones, such as Cosmos to run your dbt projects in Airflow. You can create dynamic workflows, so you can create tasks based on something you don't know in advance. You have monitoring and data lineage capabilities, which are truly important in a real-world production environment. And last but not least, there is a huge, active community, so you can be confident that Airflow will still be around in ten years, and that you will have support, documentation, and videos like this one to help you along your journey. I hope you enjoyed this video. Take care, and see you in the next one.
Info
Channel: Data with Marc
Views: 18,849
Keywords: why airflow, why use airflow, why apache airflow, airflow, apache airflow, marc lamberti
Id: vEApEfa8HXk
Length: 10min 1sec (601 seconds)
Published: Tue Aug 29 2023