Getting Started with Airflow for Beginners

Video Statistics and Information

Captions
I'm going to show you the best way to install and set up Airflow on your computer, how to create your first data pipeline in five steps, and how to run and monitor it so you know what to do when things go wrong. Let's get started.

How are you doing? Welcome to another video. My name is Marc, and if this is your first time here and you want to learn about Airflow and stay up to date with it, subscribe and click on the bell so you don't miss anything. By the way, if you wonder why Airflow, or what it is, you can find the links to the videos I made about that in the description below. For now, let's install and set up Airflow locally so you can develop and run your data pipelines.

To set up and run Airflow on your computer you have different options. The first one is pip install apache-airflow. If you have a terminal, Python, and WSL 2 for Windows users, you can install Airflow manually with that command. However, you will have to do everything yourself: create the dags folder, configure Airflow, and much more, so it's a bit inconvenient.

That's why there is another way, which is using the Docker Compose file. If you have Docker, you can run the following command, which downloads Airflow's docker-compose file, then you execute docker compose up, and just like that you have an Airflow instance running on your computer. This time you get the dags directory as well as logs and plugins. But even if it is easier than with pip install, you still have things to configure manually. For example, the tests: where do you put the tests that verify that your data pipelines and tasks work? Or where do you put the Python functions or SQL requests that you want to include in your data pipelines? You don't want to put those in the dags directory, as that is not a best practice. Or how do you configure Airflow, since you don't have access to its configuration file from there? The point is, while you do get an Airflow instance running, you still have to create and configure things manually, and that structure doesn't promote best practices.

That's where the Astro CLI comes in. The Astro CLI is an open source project, anybody can use it, and it helps you create a local development environment with Airflow following best practices. To install the Astro CLI, the only requirement is to have Docker. Then you go to the following page, Install the CLI, choose your operating system, and follow the instructions; it is as straightforward as that. Once you have the Astro CLI installed, you can go back to your terminal and execute astro dev init. That initializes your local development environment, and as you can see on the left, many directories and files have been created for you. The dags directory contains your data pipelines. The include directory contains anything you want to include in your data pipelines that isn't a data pipeline itself, for example Python functions or SQL requests. plugins is where you put your plugins, if you want to customize the Airflow user interface for example. tests is where you put the tests that verify that your tasks and data pipelines work. Then you have the .env file to export environment variables, which is useful to configure your Airflow instance, and airflow_settings.yaml to persist connections, variables and pools even after you destroy your Airflow environment. The Dockerfile specifies the Docker image used to run Airflow; here it is the Astro Runtime Docker image, which is just a wrapper around Airflow. If you want to know the corresponding Airflow version, you just need to take a look at the documentation of the Astro Runtime image.
There is also packages.txt, where you can install any operating system dependency, like wget. Last but not least, requirements.txt lets you install any Python package you want, such as pandas, or providers to add functionalities to your Airflow instance. That's the beauty of the Astro CLI: you run one command and you get this fully functional, structured Airflow local development environment following best practices. The last step to run Airflow is to execute astro dev start, and then you get Airflow up and running on your machine.

The last option that I want to show you is airflowctl. airflowctl is an open source project created by my friend Kaxil. It is a command line tool, like the Astro CLI, to manage Airflow projects: you have a set of commands to initialize, build, start and stop your Airflow projects. Take a look at it if you don't have Docker or if you don't want to use Docker on your machine; it is very helpful. For the rest of this video we will use the Astro CLI, but regardless of the option you use, you can still follow along.

Now it's time to create your first data pipeline. If you don't know what a DAG or an operator is, I recommend you take a look at the video "What is Apache Airflow?" linked in the description below, because if you don't know those concepts it's going to be a little bit harder for you to follow the rest of the video. That being said, let me show you the DAG you will create: find_activity. The purpose of this DAG is to return a random activity to do, fetched from an API. If you take a look at the graph, you can see three tasks: get_activity, write_activity_to_file and read_activity_from_file. If you take a look at the logs of the last task, you can see the activity, which is "learn to greet someone in a new language". Each time the Airflow scheduler runs this data pipeline, you get a new random activity to do.

Okay, now let's move on to the code. Go back to your code editor. In Airflow, to create a new data pipeline, or DAG, you need to create a Python file in the dags directory, so create a new file, find_activity.py, and now you are ready to define the data pipeline. First, add the following imports; you will always make these imports. The first line imports the dag object, which is how Airflow recognizes a file as a DAG, and the second line imports datetime, to create the start date (you will see what that is in a second). Once you have those imports, you can define the DAG object, and for that you use the @dag decorator. A DAG, which is a data pipeline, expects a couple of parameters. The first one is the start date, which defines the date at which your DAG starts being scheduled, here the 1st of January 2023. In addition, you have the schedule parameter, which defines how often the scheduler triggers your DAG and expects a cron expression; here you want to trigger your DAG every day at midnight. There are other ways of defining the schedule, but let's keep it simple here. Next to the schedule parameter you have tags, which allow filtering DAGs on the user interface; if you have many data pipelines, that helps you categorize them by team, project and so on, which is very useful. Finally, there is the catchup parameter, which I love, as it avoids running the non-triggered DAG runs between the last time the DAG was triggered and the current date. Again, if you don't know what I'm talking about, take a look at the video "What is Apache Airflow?", but basically, each time you trigger a DAG that creates a DAG run, and between the last time your DAG was triggered and today you may have missed many days. By default, Airflow will trigger all the non-triggered DAG runs between the last time your DAG was triggered and today. You can disable that with catchup, and I recommend you do so to avoid having a lot of DAG runs running at the same time. Finally, create a function for your tasks and dependencies under the @dag decorator. That function's name is your DAG's unique identifier, so make sure it is unique across all of your DAGs, and you must call the function at the end of the file, otherwise Airflow won't recognize this file as a DAG.
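To make that concrete, here is a minimal sketch of what that DAG definition could look like. It assumes Airflow 2.4 or later (where the schedule parameter is used) and a daily cron schedule at midnight; the tag name is just an illustration, it is not spelled out in the captions.

from airflow.decorators import dag
from datetime import datetime

@dag(
    start_date=datetime(2023, 1, 1),  # the date at which the DAG starts being scheduled
    schedule="0 0 * * *",             # cron expression: every day at midnight
    tags=["activity"],                # assumed tag name, used for filtering DAGs on the UI
    catchup=False,                    # don't run the missed, non-triggered DAG runs
)
def find_activity():
    # tasks and dependencies will go here
    ...

find_activity()  # calling the function is required so Airflow registers the DAG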
Now you have defined the DAG. The next step is to create the first task, which requests the Bored API to fetch a random activity. The Bored API returns a random activity in JSON format that we are going to fetch in our DAG. To do that you can use the PythonOperator. Remember that Airflow brings many operators to interact with different tools and services; you can take a look at the following website to learn more about that. An operator is basically a task, and the PythonOperator executes a Python function or a script. Either you call the PythonOperator as you can see here, or you use the decorated version with @task, which is easier to write and to read. Let's see how to create these tasks. Next to the dag decorator, import the task decorator, which is the PythonOperator under the hood, then import requests, which is an HTTP library for making requests. Then you create a variable API with the link to the Bored API, call the task decorator, and define the Python function get_activity that you want to execute under the decorator. The name of the Python function is the unique identifier of that task within the DAG; this is the task name you see on the user interface. In the Python function, so in your task, implement the logic to request the API and return the activity as JSON. By the way, when you return a value from a Python function, that creates an Airflow XCom, which allows data sharing between tasks. If you don't know what an XCom is, you can take a look at the video at the top right corner, but basically XCom is the mechanism allowing you to share data between your tasks; each time you want to share data, you create an XCom with the value you want to share. Now the first task is ready. You can call it at the end, save the file, and take a look at the Airflow UI to see your data pipeline find_activity with one task, get_activity. Congratulations, you have successfully created your first task that fetches data from an API.
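As a rough sketch, and assuming the Bored API endpoint https://www.boredapi.com/api/activity (the exact URL is not spelled out in the captions), that first task could look like this inside the find_activity function:

from airflow.decorators import dag, task
from datetime import datetime
import requests

# Assumed endpoint; the captions only say "the Bored API".
API = "https://www.boredapi.com/api/activity"

@dag(start_date=datetime(2023, 1, 1), schedule="0 0 * * *", catchup=False)
def find_activity():

    @task
    def get_activity():
        # Request a random activity; returning the JSON dict pushes it as an XCom.
        response = requests.get(API, timeout=10)
        return response.json()

    get_activity()  # call the task so it appears in the DAG

find_activity()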
The second task to implement is write_activity_to_file. It creates a file, activity.txt, in the include directory and writes the activity fetched by the previous task, get_activity. So let's do that. As for the first task, you can use the PythonOperator through the @task decorator. The task name is write_activity_to_file and it takes a parameter, response; this parameter is the activity returned by the previous task get_activity. Something new here is Variable.get. With Airflow you can create variables for values you want to use in different DAG files or tasks. That is useful when you need to change a value: instead of updating it in different places, DAGs, tasks and so on, you do it once in the variable. Basically, whenever you have a value that you want to use in different DAGs or tasks, it's better to create a variable with that value instead of hardcoding the value everywhere. Now let's create this variable from the user interface. Go to Admin, Variables, and add a new record; for the key it's activity_file, and for the value, /tmp/activity.txt, then click on Save. Just like that, you have successfully created a variable. Now, I do not recommend storing any sensitive values using variables, but if you need to, then create another record and make sure you add a prefix like "secret" or "key", something like that, so that at least the value of that variable will be hidden from the user interface.

Back to the task: import Variable at the top of the DAG file, from airflow.models, then we fetch the variable activity_file, so /tmp/activity.txt, then we open that file and write the following sentence. Remember that response is a JSON value with the activity and comes from the first task. Finally, the task returns the file path to share it with the last task to implement, read_activity_from_file. The last task is the easiest one, as you already know how to create tasks. Again we use the task decorator, the task name is read_activity_from_file, and it takes a parameter, file_path, that the task write_activity_to_file returns. Then it opens this file and prints the content on the standard output. That's it, it is as simple as that.

With the three tasks ready, the last step is to define the dependencies between them. In Airflow there are two ways to create dependencies. The first one is by using the right and left bit-shift operators. For example, if you want to execute get_activity first and then write_activity_to_file, you can use the right bit-shift operator: we say that get_activity is upstream of write_activity_to_file, or that write_activity_to_file is downstream of get_activity. The left bit-shift operator is just the opposite: this time write_activity_to_file is upstream of get_activity and so is executed first, before get_activity. The second way to create dependencies is by using the values returned by your tasks. For example, here get_activity returns a value, so let's call it response, and then we pass response to write_activity_to_file; then write_activity_to_file returns a file path, so let's get that file path back and pass it to read_activity_from_file. Just like that, you have created the dependencies between your tasks using the values they share: get_activity is executed first, then write_activity_to_file, and finally read_activity_from_file. You can verify that on the Airflow UI by clicking on your DAG, then Graph, and you can see the three tasks: get_activity, then write_activity_to_file, and finally read_activity_from_file.

So congratulations, because at this point you have successfully created your first data pipeline in Airflow, using variables, XComs to share data between your tasks, and the TaskFlow API, which is the new way of creating your DAGs using decorators; that's exactly what you did, and you are able to define the dependencies using the data shared between your tasks.
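Putting the pieces together, here is a hedged sketch of what the complete DAG could look like, with the dependencies expressed through the values the tasks return. The Bored API URL and the exact sentence written to the file are assumptions; they are not spelled out in the captions.

from airflow.decorators import dag, task
from airflow.models import Variable
from datetime import datetime
import requests

API = "https://www.boredapi.com/api/activity"  # assumed endpoint

@dag(start_date=datetime(2023, 1, 1), schedule="0 0 * * *", tags=["activity"], catchup=False)
def find_activity():

    @task
    def get_activity():
        # Fetch a random activity; the returned dict is shared as an XCom.
        return requests.get(API, timeout=10).json()

    @task
    def write_activity_to_file(response):
        # Fetch the file path from the Airflow variable created in the UI.
        filepath = Variable.get("activity_file")
        with open(filepath, "a") as f:
            f.write(f"Today you will: {response['activity']}\n")  # assumed wording
        return filepath

    @task
    def read_activity_from_file(filepath):
        # Print the file content to the standard output (visible in the task logs).
        with open(filepath, "r") as f:
            print(f.read())

    # Dependencies through the shared values:
    # get_activity -> write_activity_to_file -> read_activity_from_file
    response = get_activity()
    filepath = write_activity_to_file(response)
    read_activity_from_file(filepath)

find_activity()

Because each task's return value feeds the next task's parameter, Airflow infers the order get_activity, then write_activity_to_file, then read_activity_from_file without writing the bit-shift operators explicitly.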
Okay, let's trigger the data pipeline to see if it works. For that, you just need to turn on the toggle right there; this unpauses the DAG, and so the Airflow scheduler starts scheduling your DAG runs. Turn on the toggle, then refresh the page, and you can see one DAG run running. Wait until it is completed, and now it's done. You can click on the last task, then go to the logs and take a look at the activity: today you will volunteer at your local food bank. Again, congratulations, your data pipeline works.

Now I would like to give you a few tips to better manage and monitor your data pipelines. First things first, this is the grid view. You can see the history as well as the current states of your DAG runs and task instances; the task instances are the squares and the DAG runs are the long bars at the top. If you click on a DAG run and go to the details, you have useful information such as the duration of that DAG run, when it started, the data interval start and the data interval end, so the data interval for which that DAG run was triggered. In addition, if you take a look at the graph, you can identify the dependencies between your tasks; this is useful when you have a giant DAG with a lot of tasks. The Gantt view is nice to spot any bottlenecks in your DAG, so if a task is taking longer than expected to complete, it might be worth taking a look at that task. And the code view is useful to make sure that your DAG is using the latest version of your code; you can verify that by looking at the "parsed at" date. So if you make a modification to the code but you don't see that modification yet in the code view, that means Airflow doesn't know about this modification yet, and you need to wait a little bit longer.

If a task is in failure and you want to retry it, select that task, then click on Clear task and click on Clear, and that will rerun the corresponding task. If you want to rerun all the tasks of a DAG run, you can just click on the DAG run and click on Clear, then Clear existing tasks, and just like that you rerun all the tasks of that DAG run. If you have many DAG runs or task instances that you want to rerun, you can use Browse, then DAG Runs, select all of your DAG runs, then click on Actions and Clear the state to rerun all of them. Same thing with task instances: you go to Browse, Task Instances, select all of your task instances, then Actions, Clear, and that will rerun all the task instances.

You can verify the data that your tasks share and return by clicking on a task, for example get_activity, then going to XCom, and you can see that get_activity returns the following XCom, the following data; this is shared with the next task, remember, write_activity_to_file. Again, click on it, go to Details, click on XCom, and you can see that this one returns the following value, which will be used by the last task of the DAG, read_activity_from_file. If you want a complete tour of the Airflow user interface, take a look at the video at the top right corner.

So that's it for this video. I hope you enjoyed it. If you don't want to miss any video, don't forget to subscribe to my channel, that will help me a lot, and I'll see you for another one.
Info
Channel: Data with Marc
Views: 12,485
Keywords: getting started with airflow, airflow tutorial, airflow tutorial for beginners, apache airflow tutorial for beginners, airflow 101, marc lamberti, apache airflow tutorial, airflow for beginners, apache airflow, airflow tutorial python, airflow example, airflow introduction, apache airflow project, apache airflow overview, apache airflow for beginners, apache airflow etl, apache airflow introduction
Id: xUKIL7zsjos
Length: 15min 59sec (959 seconds)
Published: Mon Oct 02 2023