Building (Better) Data Pipelines with Apache Airflow

Captions
My name is Sid Anand, and a little bit about me: I've worked in the valley for many years, at many of the companies listed above. I currently work at PayPal; I joined about ten months ago to work on various data infrastructure challenges. I'm also a co-chair for many of the QCon conferences, including this one, and I'm a maintainer on Apache Airflow. In my spare time I spend time with my family: one is a one-year-old girl and the other is a five-year-old boy, but they both act the exact same age.

So, Apache Airflow. First of all, a show of hands: how many people have used it? OK. How many people have heard of it? OK, more. Good. What is it? In a nutshell, it's a platform to programmatically author, schedule, and monitor workflows. Workflows are, well, things that do things; they're also known as DAGs, or directed acyclic graphs. The first way most people come to know Airflow is through its UI; that's essentially the first interaction most people have, so I thought, why not just walk you through the UI a little bit?

This is the landing page for the Airflow web server, if you have Airflow running at your company, for example. There are multiple columns here. The first column (let's ignore this one) is just the name of the DAG, and next to it there's a toggle switch which tells you whether the DAG is enabled or disabled. If you're doing maintenance on some system, say you have an ETL that's reading data from MySQL and writing it into Redshift and you're doing maintenance on MySQL, you might want to disable that DAG. The next column over is the schedule, and it supports multiple types of schedules: cron scheduling, Python datetime semantics (because this is written in Python, which I'll talk about a bit more), and built-in presets like hourly, daily, and monthly. The next column, of course, is the owner. This is a slightly older UI, and the version in trunk has a few more columns, but this row basically tells you that this DAG had three tasks and all of them completed successfully in its last run. This one just started and has one task in progress, so it's light green. Down here, this one says the DAG had a few tasks that completed successfully but then hit an issue: the reds are all failed tasks, and within that directed acyclic graph, everything downstream of them was never run. Those two orange circles just tell you that the bottom part of your DAG never actually executed because some tasks upstream of it failed. Finally, there are some quick links to hop around the UI.

Let's say you pick one of these DAGs and click into it. You'll come to this view, which is the pictorial view of the graph you're executing. This is a graph of computation: each of these is a task, and the tasks are chained together in a dependency tree. This is the first one to run; if it's successful, this one will run; if that's successful, it'll go down these two paths. How many of you have used something else, like a DAG scheduler of any sort, just to get a sense? A few of you, OK, cool, not too many.

So that's the pictorial view, and another view is the code view: this DAG is actually represented in simple Python code. What differentiates Airflow from, say, Oozie or Azkaban, which are also workflow schedulers, is that with those you have to deal with XML definitions, or in the case of Azkaban you actually have to zip up a directory of files whose layout mimics the layout of your DAG. It's very cumbersome; if you start trying to make structural changes to your DAG, you're cd-ing into multiple directories to change config and so on. Airflow says your DAG is just going to be code, and as a software engineer you'll apply all the best practices of engineering to that code. You get all the benefits of good software hygiene just by being a good software engineer; you don't have to manage another DSL or XML or anything like that. This is exactly what the code looks like.
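As a rough illustration of what that code view contains, here is a minimal sketch of a DAG definition. The DAG id, task ids, owner, and commands are hypothetical, the import paths assume the Airflow 1.x layout that was current around the time of this talk, and the schedule line shows the cron / timedelta / preset options mentioned above.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# Defaults applied to every task in the DAG.
default_args = {
    "owner": "data-eng",            # hypothetical owner shown in the DAG list
}

# schedule_interval accepts a cron string ("0 * * * *"), a Python
# timedelta (e.g. timedelta(hours=1)), or a built-in preset such as
# "@hourly", "@daily", or "@monthly".
dag = DAG(
    dag_id="example_etl",           # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@hourly",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# The >> operator declares the dependency chain rendered in the graph view:
# extract -> transform -> load
extract >> transform >> load
```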
Say you've run your DAG multiple times now and you want to see historically how it's done. This is called the tree view, and it shows you a few things. This is your DAG represented as a tree. The round circles at the top are the statuses of DAG runs: if your DAG were to run every day over 30 days, you'd have 30 green circles if they were all successful, and the light green one tells you which run is in progress. All the squares are task statuses: whereas the round circles up here are DAG statuses, these are all task statuses. They can be pink, which means they were skipped because they didn't meet a condition. For example, a conditional task that only runs at 3 a.m. every day: in addition to what your DAG normally does, it also takes a backup, so you end up skipping it for the other 23 hours of the day. At this edge is the current DAG run; you can see this green one is the task in progress, and downstream of it are two that will need to run afterwards. So that's the tree view of your DAG, and you can see it over the course of days; these are hourly DAGs, so I think this one has 25 runs.
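A conditional task like that nightly backup can be expressed in several ways; the sketch below uses Airflow's ShortCircuitOperator, which marks everything downstream of it as skipped (the pink squares in the tree view) whenever its callable returns False. The DAG id, task ids, and the 3 a.m. check are hypothetical, and the import paths again assume the Airflow 1.x API.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import ShortCircuitOperator  # Airflow 1.x path

dag = DAG(
    dag_id="hourly_with_nightly_backup",   # hypothetical DAG name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@hourly",
)

# Runs every hour, unconditionally.
hourly_work = BashOperator(task_id="hourly_work", bash_command="echo work", dag=dag)

def is_3am_run(execution_date, **_):
    # True only for the 03:00 run; for any other hour the downstream
    # backup task is short-circuited and shows up as "skipped" (pink).
    return execution_date.hour == 3

check_3am = ShortCircuitOperator(
    task_id="only_at_3am",
    python_callable=is_3am_run,
    provide_context=True,   # pass execution_date and friends to the callable (1.x)
    dag=dag,
)

backup = BashOperator(task_id="take_backup", bash_command="echo backup", dag=dag)

hourly_work >> check_3am >> backup
```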
How about looking at the actual performance of your DAG? You can look at a specific DAG run and get a Gantt view. The Gantt view shows you that, for this specific run, it spent five minutes waiting for some data to land (it says "waiting for collector ingest"), then about ten minutes doing a Spark job (the "aggregate data" Spark job), and then it had to import that data into a database, which is called "wait for empty queue" here; essentially that took two minutes. So the most expensive part of this DAG seems to be the Spark job, which makes a ton of sense, since it's the aggregation job.

Now, how is it performing over time? That's the task duration view. Whereas in the previous view each task in the DAG was an item, here we take each task in a specific DAG and make it a series. This is the "aggregate data" Spark job series, and these are all the other ones, which are fast, so they take almost no time. The x-axis is the run time of the DAG, when it was scheduled to run; if it's running every hour, there will be a dot for every hour of the day. The y-axis is the duration of that task. This DAG has been running over multiple months, in fact, and this is how long it takes, in fractions of an hour. What we found out was that in January of this year our Spark job was just getting slower and slower. Of course, these dips are the weekends, when things are a bit better, but it kept getting slower, and if we didn't have this sort of monitoring view built into Airflow, we wouldn't have noticed it. Our data scientists optimized the Spark job and brought the runtime back down to about 0.3 hours, 18 minutes or so, instead of over 30 minutes, and then they did another optimization back here to bring the time down further. Of course, they added more features and things got worse again, but this is one thing that Airflow gives you.

Why use it? Well, it's useful for automating things like ETL pipelines, machine learning pipelines, and also just simple cron jobs that run on one machine, especially if you're running in the cloud and that instance can go away; you need some reliable, distributed cron that's going to run even if a machine crashes.

What should a workflow scheduler do for you? What are the table stakes? First of all, it should schedule a graph of dependencies, so if upstream dependencies succeed, then downstream tasks should run. It should retry if things intermittently fail, so it handles task failures by retrying. It should report and alert on any type of failure; something should come bundled with it, but all of our companies use a different hodgepodge of monitoring and alerting, so it should also integrate easily with those. It should give you built-in performance over time, which we just saw. And it should support some sort of SLA alerting, because if you're the person in charge of workflows, you probably care about correctness and timeliness: if an hourly job is taking three hours to run, you're probably going to get paged.

OK, we're quickly running out of time, so what does Airflow add on top of this? Its configuration is code; it's usable, with a stunning UI; it has centralized configuration and resource pooling; and it's very easy to extend because it's essentially a Python code base.

I'm going to run through this quickly just to give you a sense of it. This is an actual use case, powered by the Airflow DAG you saw earlier, from the previous company I worked at. We were collecting email metadata live from customers (customers being other enterprises), and that data would land in an S3 bucket. Every hour a Spark job would run to aggregate some statistics over that massive data set, and we would post the aggregates back into a different S3 location, which would cause notifications to be sent over SNS and SQS. Then we would launch a fleet of importer nodes, running in an auto scaling group, to read all of that data and transfer it into a database in a structured format, where a UI could be used by some sort of NOC or operations center person to analyze the results of all the email they got. All of this was orchestrated by Airflow and handled by the same DAG we just looked at. That's the sort of thing you can do, and we do a lot more than this, like model building and model deployment; everything can be done through Airflow primitives.
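Purely as an illustration of how those table-stakes features (retries, failure alerting, SLAs) and that pipeline shape map onto a DAG definition, here is a hedged sketch. All names, addresses, and callables are hypothetical stubs; a real version of this pipeline would more likely use Airflow's S3 sensor, a Spark-submit operator, and AWS hooks rather than plain Python functions, and the import path again assumes Airflow 1.x.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path

# "Table stakes" settings: retries for intermittent failures, alerting on
# failure, and an SLA so a late run gets flagged.
default_args = {
    "owner": "data-eng",                   # hypothetical owner
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email": ["oncall@example.com"],       # hypothetical alert address
    "email_on_failure": True,
    "sla": timedelta(hours=1),             # flag an SLA miss if a task runs too long
}

dag = DAG(
    dag_id="email_metadata_aggregation",   # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@hourly",
)

# Stub callables standing in for the real work described in the talk.
def wait_for_collector_ingest(**_):
    """Poll S3 until the hour's raw email metadata has landed."""

def run_aggregation_spark_job(**_):
    """Kick off the Spark job that aggregates the raw data."""

def publish_and_import(**_):
    """Write aggregates to S3, notify via SNS/SQS, import into the database."""

wait = PythonOperator(task_id="wait_for_collector_ingest",
                      python_callable=wait_for_collector_ingest, dag=dag)
aggregate = PythonOperator(task_id="aggregate_data_spark_job",
                           python_callable=run_aggregation_spark_job, dag=dag)
load = PythonOperator(task_id="import_into_database",
                      python_callable=publish_and_import, dag=dag)

wait >> aggregate >> load
```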
Airflow has been around for about two and a half years. It was created at Airbnb in 2015, and Max launched it at Hadoop Summit that same year; in 2016 we got it into the Apache Incubator. Today it has 2,400-plus forks, 7,600 stars, 430 contributors, and 150 companies officially using it, and we're at about 14 committers and maintainers right now, always growing. That was about it. Thank you. If you have questions, you can come up. [Applause]
Info
Channel: InfoQ
Views: 59,154
Keywords: InfoQ, QCon, QCon.ai, Apache Airflow, Data Science, Artificial Intelligence, Machine Learning, ETL Engineering, Data Engineering, DevOps, Directed Acyclic Graphs, Python
Id: 6eNiCLanXJY
Length: 11min 40sec (700 seconds)
Published: Tue Jun 12 2018