Airflow tutorial 1: Introduction to Apache Airflow

Captions
Hi everyone! In this video I'll be giving an introduction to Airflow; this is the first video in the tutorial series. The goal of this video is to answer two questions: what is Airflow, and why do we need Airflow?

Let's go to the first question: what is Airflow? Airflow is a platform to programmatically author, schedule, and monitor workflows or data pipelines. So what is a workflow? A workflow is a sequence of tasks, started on a schedule or triggered by an event, and frequently used to handle big data processing pipelines. Here is an example of a typical workflow: the first task downloads data from a source, then you send the data somewhere else to be processed. In the middle you monitor how far the processing has gotten. After the processing is complete, you generate a report, and finally you send the report out by email.

Now consider a traditional ETL approach, as in the example here, where you have a database and HDFS, and you need to get the data out of the database and load it into HDFS to process it. A naive approach is to write a script that specifies multiple tasks: the first task connects to the database, the second task pulls data from the database, and the third task sends the data to HDFS for processing. To schedule the script, you set it up as a cron job, daily or hourly depending on your needs.
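To make the naive approach concrete, here is a rough sketch of what such a cron-driven script might look like. The video shows no code for this part, so the database, query, file paths, and the choice of the psycopg2 driver below are all illustrative assumptions:

```python
# naive_etl.py: one script that connects, pulls, and loads, with no
# retries, no per-task status tracking, and no dependency handling.
import subprocess

import psycopg2  # hypothetical choice; any database driver would do


def main():
    # Task 1: connect to the database.
    conn = psycopg2.connect(host="db.example.com", dbname="sales", user="etl")

    # Task 2: pull today's data out of the database into a local file.
    with conn.cursor() as cur, open("/tmp/orders.csv", "w") as out:
        cur.execute("SELECT * FROM orders WHERE order_date = CURRENT_DATE")
        for row in cur:
            out.write(",".join(str(col) for col in row) + "\n")

    # Task 3: push the extract into HDFS for processing.
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", "/tmp/orders.csv", "/data/raw/orders/"],
        check=True,
    )


if __name__ == "__main__":
    main()
```

Scheduled with a crontab entry such as `0 1 * * * python /opt/etl/naive_etl.py`, this works on a good day, but it has no answer for failures, monitoring, or dependencies, which is exactly what the next points are about.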
There are multiple problems with this naive approach. The first one is failure: how do you handle it? It is much better if the process is retried when a failure happens, but then how many times, and how often, should it retry? The second problem is monitoring: how do you keep track of the status of each task, and of how long each task runs? What if one task takes too long to run? The third problem is dependency. One kind is data dependency: if upstream data is missing, you don't want the downstream task to run. The other kind is execution dependency: say you have two cron jobs, and job one takes an hour to run. You schedule it at 1:00 a.m. and expect it to finish at 2:00 a.m., so you schedule job two at 2:30 a.m., giving yourself a 30-minute buffer. But then one day job one takes two hours to run, which means job two starts before job one has finished: that is the execution dependency issue.

Another issue is scalability. You have one machine running multiple cron jobs, and as you need more and more cron jobs you scale the machine up; but at some point you have to scale out, adding machine after machine to schedule even more cron jobs, and then you have no central scheduler across the multiple cron machines. Another issue is deployment: how do you keep track of and maintain all the new changes and deployments that happen constantly? And the final problem is how you process historical data. This is a routine need at any big data company: you need to backfill, or rerun, historical data to regenerate reports, so that you can compare, say, the report from six months ago against today's and spot an upward or downward trend.

All the problems I just described are handled nicely by Airflow. Airflow is a workflow, or data pipeline, management system developed by Airbnb. It is a framework to define tasks and their dependencies in Python, and you can bring all your testing and dependency handling into Python as well. It can execute, schedule, and distribute tasks across worker nodes, which takes care of the scalability issue I mentioned earlier. It gives you a view of present and past runs, meaning all the historical runs, along with a logging feature. It is extensible through plugins, has a very nice UI that you can interact with through a REST interface, and it integrates very well with all the major cloud and database systems. It joined the Apache Software Foundation incubator program in 2016, which means the project has huge community support. It is currently an open-source project used by more than 200 companies, and big companies like Airbnb, Yahoo, PayPal, Intel, Telstra, and Google all use Airflow internally, so it is very exciting.

A workflow in Airflow is described as a directed acyclic graph, or DAG for short, and a DAG is composed of tasks. Here is an example of a DAG, or workflow, where each rectangular box is a task. You start at one point and end at another: when it runs, this first task here runs, then all the tasks in the first section run in parallel, and some tasks in the middle wait for all of them to finish. You can see it as an execution graph, or a dependency graph, and you can monitor it in real time: these tasks run in parallel, and this one waits for all of them to finish before it runs. Think of the data flowing through it like a river: these are all the branches, and all the data flows along the branches before it reaches the ocean.
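Here is what that fan-out/fan-in shape looks like in code: a minimal sketch with made-up task names, written against Airflow 2.x (the video itself predates Airflow 2, where the imports differ slightly):

```python
# A minimal DAG in the shape just described: `start` runs first, the
# three `branch_*` tasks run in parallel, and `join` waits for all of
# them before it runs. Task names are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="river_example",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start = BashOperator(task_id="start", bash_command="echo start")
    branches = [
        BashOperator(task_id=f"branch_{i}", bash_command=f"echo branch {i}")
        for i in range(3)
    ]
    join = BashOperator(task_id="join", bash_command="echo join")

    # Fan out from start to the branches, then fan back in to join.
    start >> branches >> join
```

That single `start >> branches >> join` line declares the whole dependency graph; Airflow works out the parallelism and the waiting for you.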
Now let's go to a quick demo. In this case I have two DAGs: one is called example_twitter_dag and one is called tutorial. If you take a look at the graph view of example_twitter_dag, the example tries to fetch some Twitter data, clean the tweets, analyze them, and then put the results into HDFS. Just by looking at the dependency graph of the DAG here, where each rectangular box is a task, you can see it gives you a very nice UI: all of these tasks run in parallel, and this one waits for them to finish.

Let's run through the tutorial DAG instead, because it is a simple graph that I can explain. You can either run it manually, by clicking here to trigger it, or do what I just did and turn the schedule on, because this is a daily schedule; you can see it runs immediately. If you go to the graph view, you can see this task is a success, this one has been sent to the queue, and this one is now a success too. It runs very quickly because these are just template tasks, but imagine you had longer-running tasks, say one that takes an hour to run. You can see this one is currently running, and this one waits for five seconds (it is a sleep task), but think of it as an hour-long task: you can monitor in real time how the task and all its dependent tasks are flowing. So the graph view gives you a nice visual representation of the execution graph and how things are flowing.

In the tree view you have the historical runs, meaning you can go back in time. You give the DAG a start date and an end date, and it backfills, running for all the historical days, so you have a historical graph for each day and each task run. You can click on each task: for example, this one ran yesterday, for yesterday's data, November 18. I can click on it, go to the log, and see the input and output: the input is the run command, and the output is everything that command printed. It has only run once, so you see there is only one log entry. If we go back to the tree view, you can go to the second day, or the previous day, click on its log, and so on: you have a log for each day, and it runs for as many days as you specified between the start and end dates.

If you want to rerun a task, you click on it and click Clear, and it is sent straight to the queue to rerun for you. This one is a very quick task, so we can see it reran successfully. If you go to the log now, you see the log says attempt 2, meaning it ran a second time: this is the first run, whose log is kept, and this is the second run, with the input and output of the second run. Now think of a failure case: the first run fails at some point, and you have specified a retry mechanism, which is handled entirely by Airflow, so it retries a second time. You can go back to the log and trace it: the first attempt ran and failed, and you see the error right here; the second attempt, the retry, succeeded, and you can see why. So you can go back, monitor everything, and keep track of the logs. The Gantt view tells you how long each task ran; the task details tell you the start and end times and the initialization of your graph; and the Code view gives you the code used to define your DAG.

In this case, for example, you have a DAG called tutorial: you pass in some default arguments, a description of your workflow, and a daily schedule. The first task just runs a Bash command to print the date, along with some documentation for the task; the second task just sleeps for five seconds; and the third task runs a templated Bash command. Then you declare the dependency: task one runs, then task two, which depends on task one, and Airflow creates the dependency, or execution, graph for you.
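For reference, the DAG being described here is essentially the stock tutorial example that ships with Airflow. A condensed version looks roughly like this (again with Airflow 2.x imports; the 1.x version shown in the video imports `BashOperator` from `airflow.operators.bash_operator` instead):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,                          # retry once on failure
    "retry_delay": timedelta(minutes=5),   # wait 5 minutes between attempts
}

with DAG(
    dag_id="tutorial",
    default_args=default_args,
    description="A simple tutorial DAG",
    schedule_interval="@daily",
    start_date=datetime(2018, 11, 1),
) as dag:
    # Task 1: print the execution date.
    t1 = BashOperator(task_id="print_date", bash_command="date")

    # Task 2: sleep for five seconds.
    t2 = BashOperator(task_id="sleep", bash_command="sleep 5")

    # Task 3: a Jinja-templated command; {{ ds }} expands to the
    # date stamp of the run being executed.
    templated_command = """
    {% for i in range(5) %}
        echo "{{ ds }}"
    {% endfor %}
    """
    t3 = BashOperator(task_id="templated", bash_command=templated_command)

    # t2 and t3 both depend on t1.
    t1 >> [t2, t3]
```

The `retries` and `retry_delay` entries in `default_args` are what drive the automatic retry behaviour we just saw in the logs, and `t1 >> [t2, t3]` is the single line that builds the dependency graph.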
So that is a quick demo; let's go back to the video. What makes Airflow great? It can handle upstream and downstream dependencies gracefully: if the upstream data is missing, the downstream tasks don't run; they wait for the upstream task to retry successfully before any downstream task runs. It is easy to reprocess historical jobs by day: as I showed you, you can go back in the historical view and rerun any specific task for any specific day in the past. You can pass parameters from upstream tasks to downstream ones. It handles errors and failures gracefully for you, meaning it automatically retries when a task fails. It is easy to deploy and integrates with a lot of different infrastructure: Hive, Presto, HDFS, MySQL, the major clouds. The community has pushed it to another level and contributed all the operators and libraries ready for use: if you want to connect to a different database or a different cloud, you just import the operator and it is immediately there for you to use. You also don't have to rewrite data sensors: a sensor task keeps checking whether the data is there, and as soon as the data is available it immediately triggers the DAG to run (there is a small sketch of this below; we will talk about sensors and the other DAG concepts properly in the next videos). Other great Airflow features: job testing, accessibility of log files, trigger rules, monitoring, email alerts, and, most important of all, a huge community supporting Airflow.
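As a small taste of the sensor idea before the next video covers it, here is a hedged sketch of a DAG gated on a file landing. The file path is hypothetical, and the `FileSensor` import assumes Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="sensor_example",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Poll every 60 seconds until the day's input file exists.
    wait_for_data = FileSensor(
        task_id="wait_for_data",
        filepath="/data/incoming/{{ ds }}.csv",  # hypothetical path
        poke_interval=60,
    )
    process = BashOperator(
        task_id="process",
        bash_command="echo processing {{ ds }}",
    )

    # Nothing downstream runs until the sensor has seen the file.
    wait_for_data >> process
```

Reprocessing a historical date range can likewise be kicked off from the command line, for example with `airflow dags backfill` in Airflow 2.x (plain `airflow backfill` in the 1.x CLI of the video's era).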
Now, the different Airflow applications. Data warehousing: you need to maintain the quality of your data warehouse. Machine learning: you need to write production machine learning workflows from start to finish. An end-to-end machine learning workflow looks like this: the first task collects all the data, possibly from different sources; you do some manipulation to get quality data; you apply some machine learning algorithm on top to generate a model; and finally you serve it in production. Airflow can help you write that end-to-end pipeline and schedule it to refresh the machine learning workflow daily, weekly, or monthly depending on your use case. Other applications include growth analytics, A/B testing and experimentation, email targeting, search, and data infrastructure maintenance. So Airflow is amazing in the way it helps you write production data pipelines and, ultimately, productionize data products.

Let's talk briefly about the hierarchy of data science needs; this is something I have seen after working in the data science industry for more than five years. This pyramid represents the hierarchy of what industry currently needs, and the Airflow framework puts things into perspective. The bottom layer of the pyramid is the one that is needed the most, the one every industry needs: some framework to collect data, production ETL data pipelines, so that you have clean, quality data before you can apply anything from the top of the pyramid, such as analytics and metrics, A/B testing, or building machine learning. Before a company can optimize its business more efficiently or build intelligence into its product, this foundation layer needs to be built first. Data is the fuel for all data products, and to get data into your company you need data pipelines. What I see currently is that, unfortunately, most college and data science training programs focus only on the top of the pyramid, so this part is missing, and there is a real discrepancy between the needs of industry and what colleges and data science training programs teach. I hope this tutorial helps anyone trying to fill that gap and see the big picture.

This is the end of the video. In the next video I'll talk about how to set up the Airflow environment, and we'll run the first workflow, then go through some examples and use cases once you understand more about the framework. Goodbye guys, and see you in the next video.
Info
Channel: Tuan Vu
Views: 348,038
Keywords: python, datascience, etl, datapipeline, dataengineer, airflow tutorial, airflow tutorial for beginners, apache airflow, airflow tutorial python, airflow for beginners, airflow 101, apache airflow use cases, airflow docker, airflow explained, airflow example, apache airflow tutorial, apache airflow tutorial for beginners, airflow introduction, apache airflow intro, data science, data engineer, data engineer tutorials, data engineer projects
Id: AHMm1wfGuHE
Length: 16min 24sec (984 seconds)
Published: Tue Nov 20 2018