Introduction to Apache Airflow [Webinar]

Captions
I guess I'll start by saying thank you to everybody for joining; we're really happy to have you here for this webinar on intro to Airflow. A few housekeeping things before we get started: this is being recorded and we'll send it out afterwards, and feel free to throw questions in the chat. We'll have time for Q&A after the presentation I run through, but Viraj and maybe others can also help answer as we go along.

I should have actually introduced myself before the housekeeping: I'm Kenten Danas, a field engineer at Astronomer. I've had a lot of experience in data engineering and helping companies adopt Airflow, and one of my favorite things to do is talk about Airflow with people who don't know it and haven't used it before, so I'm really excited to do this webinar.

And I'm Viraj Parekh, the field CTO here. I'll be helping out policing the questions and making sure Kenten has everything she needs over Zoom to crush the webinar, because this is her show entirely. Super thankful to everyone for joining. We've been doing these about every two weeks, and something we've decided is that roughly every month or month and a half we'll do an intro session, to make sure that as we cover advanced topics we're also helping people who are just getting started on their Airflow journey. I believe this is the first intro-to-Airflow one we're doing, and we should do another in a month or month and a half depending on what you all ask for.

Awesome, with that I'll jump right in. I'm going to start the webinar with a couple of slides that cover the concepts of Airflow, talk about what it is and all the terms, and get everybody oriented for the discussion. From there we'll hop into a demo and actually show some example pipelines, how Airflow is used, and how you can interact with it. So here's the agenda: we'll cover what Airflow is in case you're not aware, the core components and concepts, the benefits of running your pipelines as code, then how to get Airflow up and running (or at least one method of doing so), then like I said we'll go through some demo DAGs, and we'll leave time at the end for Q&A.

To start with a little background on Airflow: it was originally born inside Airbnb, it has since been open sourced, and it's now a top-level Apache Software Foundation project. It has a really robust and active community, which is one of the things that makes Airflow such a great open source tool, and at Astronomer we're obviously heavily invested in open source Airflow; we truly believe it's the de facto standard for data orchestration.

So, to get started with a definition: what is Airflow? It is simply a way to programmatically author, schedule, and monitor your data pipelines. You'll hear me come back to the term "programmatically" a lot, because one of the core tenets of Airflow is that everything is written in code. There are lots of benefits to using Airflow, but a couple we like to highlight: first, it's dynamic. Everything in Airflow is written in Python code.
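As a concrete illustration of "pipelines as code" (not code from the webinar), here is a minimal sketch of what a DAG file looks like in Airflow 2.x; the DAG ID, schedule, and function are illustrative placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    # Any Python logic can live here: calling an API, kicking off a job, etc.
    print("Hello from Airflow!")


# The DAG object is what the scheduler picks up from the dags/ folder.
with DAG(
    dag_id="example_hello_dag",          # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",          # run once per day
    catchup=False,                       # don't backfill missed runs
) as dag:
    hello = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )
```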
Because everything is in Python, you have the power of Python behind you, so it's really dynamic; there's essentially no limit to what you can do with it. Second, it's extensible. We'll talk a lot here about the ways Airflow interacts with other systems: there are a lot of connections available out of the box for Airflow to talk to any number of different tools, and if the one you need isn't already pre-built, it's easy to create your own. Finally, it's highly scalable. You have a lot of flexibility in how you set up your Airflow infrastructure; at Astronomer we have customers running thousands of tasks every day on Airflow and managing that easily. So whether you have a very small environment with just a couple of pipelines or thousands and thousands of tasks running every day, Airflow can handle it.

Next, some of the core components of Airflow. I won't go into them in too much technical depth for this conversation, but they're really good things to know: if you're running Airflow yourself you will certainly run into these terms and have to deal with them to some extent. The three core components needed to run Airflow are the web server, the scheduler, and the metastore. The web server is a Flask server that provides your UI, which we'll look at later. The scheduler is the daemon responsible for scheduling your jobs. And the metastore is your backend database where all of the metadata is stored. You absolutely have to have those three for Airflow to work and run. A couple of other components that aren't in that core three but still fall into this category and are helpful to touch on: the executor, which defines how your tasks get executed, and the workers, the processes that execute your tasks as defined by the executor. There are many different ways to set this up in your own infrastructure, but these are the core components you're dealing with.

Hey Kenten, a question we get all the time is around the metastore: what would you recommend using as the metastore for Airflow if you're getting started?

Yeah, so you have a couple of options. The three database engines supported by Airflow are Postgres, MySQL, and SQLite. By far the most commonly used, as far as I'm aware, is Postgres, and it's probably the easiest to implement, especially if you're using a cloud database. So if you aren't coming in with an existing opinion for your team or organization, I would recommend Postgres.

Perfect, yeah, that's a great question. Diving deeper into one of those core components, I want to talk briefly about executors, because if you're going to run Airflow you'll have to choose which executor to use. You have a couple of options available to you, and which one you use depends on what you're trying to do with Airflow. We could spend a whole webinar talking about just executors, so I won't go into too much depth here.
In general, the three main executors available to you are the Local, Celery, and Kubernetes executors. Local means everything runs with your scheduler, so it's great for local development and development environments; it does offer some degree of parallelism, but it's obviously not as scalable. The other two, Celery and Kubernetes, are both meant to scale, so they require their own infrastructure, and they're each good in different scenarios: Celery is really good if you have a very high volume of short-running tasks, while Kubernetes gives you a lot of power in terms of autoscaling and task-level configuration. Again, we could talk about this all day, but these are the considerations when you're deciding how to run Airflow and which executor will work best for your use case. I added a little note that there is technically a fourth executor, the Sequential executor. I would not recommend using it; it doesn't offer any parallelism, so we typically don't see it used in practice.

Now, diving into the core concepts of Airflow. These are more related to your pipelines and the data engineering side, as opposed to the infrastructure we just talked about. The highest-level concept in Airflow is a DAG, which stands for directed acyclic graph. You can think of each node in that graph as a task and each edge as a dependency. A DAG is simply your data pipeline. DAGs are defined as Python code, and the only rules are that your tasks flow in one direction and don't have any loops; that's the "directed acyclic" part. So this is an example of a super simple, valid DAG with a couple of dependencies between tasks. Something like this, on the other hand, would be invalid, because you have that loop between t1 and t4; doing something like that can create an infinite loop in your code, which is obviously bad, so it's not allowed. But past that, you can define your DAGs in whatever way you need to.

Diving a level deeper, you have operators. Operators are like the building blocks of DAGs; you can think of them as a wrapper around each task that defines how that task is going to run and abstracts away a lot of the code you would otherwise have to write yourself. There are a couple of main types of operators in Airflow. First are action operators, which complete an action of some sort, like the PythonOperator, which executes a Python function, or the BashOperator, which executes a bash script. There are also transfer operators, which move data from one place to another; for example, there's an S3-to-Snowflake transfer operator that is fully built out to make that transfer automatically, so you just supply some configuration as opposed to writing all the Python code to do it yourself. Finally, sensor operators wait for something to happen. They give you the ability to make your DAGs a little more event-driven as opposed to purely scheduled; they might wait for a file to be dropped in a certain location, or for another DAG to be triggered, something like that.
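To make the operator categories concrete, here is a minimal sketch (not from the webinar) combining a sensor with two action operators; the file path and task names are hypothetical, and a transfer operator from a provider package would slot into the same pattern.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def process_data():
    print("processing...")  # placeholder for real work


with DAG(
    dag_id="operator_types_example",         # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Sensor operator: waits for a file to land before anything else runs.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/tmp/incoming/data.csv",   # hypothetical path
        poke_interval=60,
    )

    # Action operators: run a bash command and a Python function.
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data'",
    )
    transform = PythonOperator(
        task_id="transform",
        python_callable=process_data,
    )

    # Transfer operators (e.g. S3 -> Snowflake) come from provider packages
    # and follow the same pattern: instantiate, configure, add to the DAG.
    wait_for_file >> extract >> transform
```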
The next concept, which we've touched on already, is the task. A task is just an instance of an operator: every operator you define is a wrapper around a task, and a task instance is a specific run of that task. So you have a DAG, you have a task within that DAG, and then for one run at a point in time you have a task instance.

Putting that all together, a full workflow looks like this: the DAG is your highest-level concept, your pipeline; within it you define your operators as a wrapper around each task; and then you define your dependencies. This is all super flexible, so you can define those operators, tasks, and dependencies to be whatever you need for your use case.

To highlight that a little, I like to show this slide, which illustrates the flexibility and power of writing your pipelines as code. It allows you to take something like this, a pretty complex network of interacting systems with dependencies, and implement all of it while writing very little code using Airflow. When a new requirement comes up for your business, you can integrate it easily, again just by adding another operator. We'll talk a little about providers as we go, which are the packages for each of these external systems, whether that's Snowflake or AWS or Databricks. Note that all of the tools shown here have built-in Airflow integrations already; that's part of the benefit of having a community as robust and active as Airflow's: lots of people have done this work already, so when you implement it, you don't have to.

From there, I'll hop over into a demo and look at some of these concepts in practice. To start off, I'm going to pop over to a terminal and talk a little about getting Airflow up and running. There are many ways to do this; Airflow is open source and there are lots of tools out there for running it, either locally or in production. Probably one of the most common ways of running Airflow is with Docker; that's how we run Airflow at Astronomer, and there are lots of open source packages available to help you run Airflow with Docker. We also have the Astronomer CLI, which is how I'm going to get started with Airflow; in my obviously non-biased opinion it's one of the easiest ways. Our CLI is open source and available for anybody to use.

Quick request from the chat: could you make the text on your screen a little bigger so it's easier to read for some folks? Sorry, I didn't mean to cut you off. No, thanks for letting me know, it's a good reminder.

So, the Astronomer CLI is open source, available for anybody to use, and a really easy way to get up and running with Airflow without too much hassle, so you can just start playing around with it. Once you've installed the CLI, you can initialize an Airflow project by running the astro dev init command; I've already run that in this directory. What that does is create a Dockerfile and a set of files and folders needed to run Airflow locally in a Dockerized setup. Once I've done that, I can run astro dev start, which I've also already done here. So if I run docker ps,
you can see I have three containers up and running that were spun up by that command using docker compose: one for my scheduler, one for my web server (it's a little hard to read here because the output wraps), and one for my Postgres database. Those are the three components we talked about before, and I now have them up and running.

Before I hop over and show the Airflow UI, I'll highlight the file structure for this particular project. The biggest thing to note is the dags folder: that's where all of my DAGs, my data pipelines, are going to go. The typical way of doing this is defining each one in a Python file, and I currently have two here. For a DAG to be defined in Airflow, you have to have a Python file in the dags folder that defines your DAG; you can do this in a static or a dynamic way. We'll go through these DAGs in just a minute, but that's how they get picked up by Airflow: by having code that defines the DAG in this folder.

Hopping back over, you can see that my web server container is mapped to port 8080, so if I pop over to my browser and navigate to localhost:8080, you can see I have Airflow up and running, and it takes me to the Airflow home page. I'll walk through the UI a little before we hop into these two DAGs specifically. The home page shows me a list of all my DAGs; I have various filters I can use, and I can also trigger or even delete DAGs directly from here. If I click into one of these DAGs, there are other useful views. A common one is the tree view, which shows a tree of all of my tasks and dependencies. I haven't run this one, but if I hop into the other DAG, it also shows a list of recent runs and the statuses of all of the tasks in the DAG, color coded by status: whether they were successful or failed, or got skipped because of an upstream failure, things like that. That's a really useful way of seeing what's going on with your DAGs in recent runs.

Going back to the other DAG, another useful view is the graph view, which shows a graphical layout of how my tasks are arranged; this one has a bunch of tasks, some with dependencies, as you can see there. If I turn this DAG on, and you do have to have the DAG turned on in order for it to run (a common mistake I make all the time is triggering my DAG and wondering why it isn't running when it's still paused), turn on auto refresh, and then trigger the DAG, I can see the status here and also watch it in the graph view. Airflow 2.0 is really fast, so for this particular example it's not that exciting to watch; it just flashed through, but you can see it finished. So this is another way you can watch your DAGs as they're running.

Another useful view is the Gantt view, which provides an overview of the duration of all of the tasks in my DAG. This can be really helpful if you're debugging and trying to figure out what is taking longest.
For any sort of optimization you're doing, that's a helpful place to start. Finally, I'll hop into the code view for this DAG. I showed you the two Python files in my dags folder in my code editor; this is one of them, and I can also look at the code here in the UI, which is helpful especially for people who might not have access to the source code but want to understand what's going on in the DAG.

I'll walk through what the code looks like for this example, and this one's pretty straightforward; it's just the dummy example using those action operators I talked about before, the BashOperator and the PythonOperator. I define the function that's going to get called by the PythonOperator, then I set the default arguments for the DAG. With Airflow you can set configuration like this at the task level, for each operator you instantiate, or you can define default arguments that are applied to every task in your DAG, which saves you from having to specify them every time if there are things you're setting globally at the overall DAG level. Some of the things in here are for email alerting: whether I want to send an email if the task fails or retries. There's also error handling: do I want the task to retry, how many retries, and how long to wait before it does. This is just scratching the surface of how you can configure these things within Airflow; you have a lot of flexibility over what you do if something goes wrong, but this is a simple starting place people can use.

From there I go ahead and define the DAG itself. This is the part Airflow picks up to know that I'm creating a new DAG. I give it a name and some more parameters: the start date; the schedule interval, which is how often I want it to run; the default arguments; and whether or not I want it to catch up. If I had set catchup to True, turned the DAG on, and the start date was in the past, Airflow would fill in all of the missed runs, if there are any.

From there I define all of my tasks. I have my DummyOperator, which was the little start node you saw all the way to the left, and then a couple of BashOperators; for those I'm just giving a task ID and passing in a bash command. Down here I'm creating the PythonOperators, and this is a good example of how you can dynamically create your tasks in Airflow: they don't need to be static, with a code block for every single one. In fact, if you have a lot of similar tasks, maybe with a different Python function for each one, we'd recommend setting it up this way, because your code is a lot cleaner and easier to update. This loops through and creates multiple PythonOperators with different task IDs, calling the function we looked at above.

Finally, down here I set the dependencies. There are multiple ways to set task dependencies in Airflow; this bitshift syntax is probably my preferred way, but you can also use the set_upstream and set_downstream functions, so there are multiple options available to you.
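A minimal sketch along the lines of what's described above (not the exact webinar code; the function, task IDs, and schedule are illustrative) might look like this:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator


def print_greeting(task_number):
    print(f"hello from task {task_number}")


# Applied to every task in the DAG unless overridden at the task level.
default_args = {
    "email_on_failure": True,        # requires SMTP to be configured
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_dynamic_tasks",          # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")

    run_script = BashOperator(
        task_id="run_script",
        bash_command="echo 'running a bash command'",
    )

    # Dynamically generate similar tasks in a loop instead of writing one
    # code block per task.
    python_tasks = []
    for i in range(3):
        python_tasks.append(
            PythonOperator(
                task_id=f"print_greeting_{i}",
                python_callable=print_greeting,
                op_kwargs={"task_number": i},
            )
        )

    # Bitshift syntax for dependencies; set_upstream/set_downstream also work.
    start >> run_script >> python_tasks
```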
The core concept here, again, is that my entire pipeline is defined in Python code, and like I said, it results in a DAG that looks like this.

That was a fairly simple example, so from there I want to look at one that's a bit more interesting than dummy PythonOperators and BashOperators. For that I have this other DAG called adf-great-expectations, and what I want to highlight with it goes back to that slide with all of the external systems like Databricks and AWS: Airflow has built-in hooks and operators available to interact with all of these systems, ready to go. So if you're a data engineer or data scientist who hasn't been using Airflow but you're already working with a set of tools, it's very likely that Airflow can easily integrate with them. At Astronomer we like to talk about how it's typically "Airflow and" some other tools; this is what Airflow is built to do: be an orchestrator that talks to other systems. As far as I'm concerned, it's rare to see people writing all of their own custom Python code for their DAGs rather than leaning on these integrations. So I want to show an example of that in practice. Viraj, were you going to say something? No, I'm just excited to see the code view.

OK, well, I'm going to make you wait one more minute before I pop into the code view for this DAG. To talk about that point a little more, I do want to make a plug for the Astronomer Registry as a great place to go if you're getting started with Airflow and want to discover all of these integrations. It's a way to look at the provider packages, the packages used to integrate Airflow with external systems. For Azure, for example, I can come here and see the operators and other modules that are available to me. So if you're just getting started and want to know what's out there, this is a great place to start.

Now I'll hop over into the code for this one. What I want to show with this DAG is a use case of Airflow interacting with a couple of common tools. ADF in this case is short for Azure Data Factory, which is a pipeline tool in its own right; it's one we often see used in conjunction with Airflow, because you can get the best of both worlds that way. Great Expectations is another open source tool for running checks, quality control so to speak, on your data in a Pythonic way; it's something we often see analysts, data engineers, and data scientists using. For both of these, what I'm going to show is that there's a really easy way of integrating them with Airflow, which is super nice.

Actually, I should show you the graph view first so we know what we're looking at; I'm just pleasing Viraj here. What this DAG does, again using built-in, ready-to-go hooks and operators, is trigger a remote Azure Data Factory pipeline; that pipeline generates some data; then the DAG downloads that data, runs a Great Expectations checkpoint on it to check quality, and at the end, if everything's good, sends an email saying the data is all good. Pretty straightforward.
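For reference, a structural sketch of a DAG along those lines might look like the following. This is not the webinar's actual code: the connection ID, container, paths, and email address are hypothetical, and the Azure Data Factory and Great Expectations steps are stubbed out with placeholders because their provider APIs vary by version.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator
from airflow.providers.microsoft.azure.hooks.wasb import WasbHook

# Grab "yesterday's" date from Airflow's templating rather than computing it,
# so reruns of a given logical date stay idempotent (see the note below).
YESTERDAY = "{{ yesterday_ds_nodash }}"


def run_adf_pipeline(**context):
    # The webinar uses the Azure Data Factory hook from the Microsoft Azure
    # provider here; its module path and method arguments depend on the
    # provider version, so treat this function as a placeholder.
    pass


def download_data(date, **context):
    # WasbHook talks to Azure Blob Storage via an Airflow connection;
    # "azure_blob_conn_id" is an illustrative connection ID.
    hook = WasbHook(wasb_conn_id="azure_blob_conn_id")
    hook.get_file(
        file_path=f"/tmp/data_{date}.csv",        # hypothetical local path
        container_name="data",                    # hypothetical container
        blob_name=f"generated/data_{date}.csv",   # hypothetical blob
    )


with DAG(
    dag_id="adf_great_expectations",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    trigger_adf = PythonOperator(
        task_id="run_adf_pipeline",
        python_callable=run_adf_pipeline,
    )

    get_data = PythonOperator(
        task_id="download_data",
        python_callable=download_data,
        op_kwargs={"date": YESTERDAY},
    )

    # In the webinar this step is the GreatExpectationsOperator from the
    # Great Expectations provider package; its arguments (expectation suite,
    # data context root, etc.) vary by version, so a plain PythonOperator
    # stands in for it here.
    quality_check = PythonOperator(
        task_id="great_expectations_check",
        python_callable=lambda: print("run the Great Expectations checkpoint here"),
    )

    notify = EmailOperator(
        task_id="send_email",
        to="data-team@example.com",               # hypothetical recipient
        subject="Data quality check passed",
        html_content="The Great Expectations check succeeded.",
    )

    trigger_adf >> get_data >> quality_check >> notify
```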
If I look at the actual code, the important parts are the imports: the Azure Data Factory hook; the WasbHook, which is used to interact with Azure Blob Storage, where my Azure Data Factory pipeline stores the data it generates; and the Great Expectations operator, which again comes from a different provider package. At the top I define the date I'm going to grab the data for; I have it set to grab yesterday's data. A note on general Airflow best practice: where you can, it's really good to use Airflow's built-in variables or macros like this one. It helps make my DAG idempotent by pulling yesterday's date based on my DAG's execution time, as opposed to hard coding the date or calculating it up here. I define some file paths that are used by Great Expectations, then I have a Python function that runs my Azure Data Factory pipeline; you can see there's not much code here, because I'm making use of the built-in Azure Data Factory hook. I have another function that makes use of the WasbHook to download the data, just two simple lines. Then I define my default arguments, and down here I define my tasks. My first two PythonOperators run the pipeline and download the data. Then I run my Great Expectations check using the GreatExpectationsOperator; I pass it the data I want to run the check on, as well as a Great Expectations suite, which is a supporting file that tells Great Expectations what checks I want to run on the data. Finally, I use the EmailOperator to send an email saying the Great Expectations check passed, and I define my task dependencies.

If I go back to the graph view and trigger this DAG, this one actually takes a second, so it's a little more interesting to watch as it goes through. A fun feature in Airflow 2.0 is the auto refresh: we can see it ran the pipeline, downloaded the data, just finished the Great Expectations checkpoint, which passed, and then it sent me an email. This is really useful for these data checks and for defining dependencies: if the Great Expectations checkpoint had failed, meaning something was wrong with the data and it didn't pass the test, it would not have sent me the email. If I look at the tree view of some past runs, you can see a string of runs where the Great Expectations check was failing, shown in red, and the send-email tasks are all orange, which means upstream failed; they were not run. So I can build in controls along the lines of "if this sort of thing happens, don't move on to anything downstream."

A question from the chat: can I alert on those failures? We have the email sent on success, but can I send one if it fails, or if the next task doesn't run? Yes, absolutely. Probably the lowest-hanging-fruit way of doing that is to turn on email_on_failure and email_on_retry. You do have to configure SMTP in your Airflow environment, but as long as it's configured and these are turned on, Airflow will send you an email saying the task failed, including a link to the logs and things like that. You can set these at the DAG level, and you can also set them for specific tasks. You can get much fancier if you want to; for example, you can set up Slack alerting, which is something I've seen folks do a lot.
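As a hedged sketch of that "fancier" approach, here is one way a task-level failure callback can sit alongside the built-in email options. The default_args keys shown are standard Airflow; the notification function itself is just a placeholder where a Slack webhook or similar call would go.

```python
from datetime import timedelta


def notify_on_failure(context):
    # Airflow calls this with the task instance's context when a task fails.
    # Replace the print with a Slack webhook call, PagerDuty event, etc.
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed; log URL: {ti.log_url}")


default_args = {
    # Simple built-in email alerting (requires SMTP to be configured):
    "email": ["data-team@example.com"],     # hypothetical recipient
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    # Custom alerting: run this function whenever a task fails.
    "on_failure_callback": notify_on_failure,
}
```

The same keys can be set on individual operators instead of in default_args when only certain tasks need alerting.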
So you can define functions that get called when you have a failure, and within that function do whatever you want. There are lots of built-in alerting possibilities.

The other thing I want to highlight, along with the tree view and some of these failures, is the task logs in Airflow. Say I got an email that one of my tasks failed and I want to figure out what's going on: I can come into the UI, click on that task, and there's a log available here. It gives me the log from the task, so I can come in and see what went wrong; here's my error down at the bottom for this one, which I've since fixed. You have all of that at your fingertips in the UI.

One other key piece of Airflow I'll highlight, if I pop back to the code for this pipeline: in this case I'm interacting with a couple of external systems, Azure Data Factory and Azure Blob Storage, so a natural question is how I'm actually getting Airflow to talk to those systems. You can see in the code that I'm passing in an Azure Data Factory connection parameter, and for the WasbHook I'm passing in an Azure Blob connection ID. Anytime you use an Airflow hook or operator that talks to some external system, you have to define a connection that Airflow will use to talk to that system. One easy way of doing that is to come up to Admin and then Connections in the UI; here you can see the two I just mentioned, where I've given the credentials to connect to my Azure Data Factory instance. All connections work a little differently, so what you put in them depends on which system you're talking to. A nice thing is that passwords are encrypted: there actually is a password here, you just can't see it, because Airflow blanks it out for extra security. You don't necessarily have to create connections manually in the UI like this, but if you're just getting started it's a good place to begin: find the connection type you're looking to use and define how you're going to talk to that system and your credentials.

Where else could you define those credentials, if not in the UI? You can set up a secrets backend, which is one common way of doing this. Lots of people like that approach because if you're running Airflow at scale, you might have rotating passwords and access tokens, and you don't want to have to come into Airflow to change something that already changed in another place. So if you're using something like Vault or AWS Secrets Manager, you can connect it to Airflow so that Airflow looks in that location for your secrets. There are other ways you can do it too, depending on the infrastructure you have set up, but those are probably the two most common.
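To tie the connection idea back to code: hooks and operators only reference a connection by ID, and Airflow resolves that ID against the metastore, an environment variable, or a configured secrets backend at runtime. A small illustration, with made-up connection IDs:

```python
from airflow.hooks.base import BaseHook
from airflow.providers.microsoft.azure.hooks.wasb import WasbHook

# The hook only needs the connection ID; credentials live in the Airflow
# connection (UI, environment variable, or secrets backend), not in DAG code.
blob_hook = WasbHook(wasb_conn_id="azure_blob_conn_id")   # hypothetical ID

# A connection can also be looked up directly, e.g. to inspect its fields.
conn = BaseHook.get_connection("azure_data_factory_conn_id")
print(conn.host, conn.login)
```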
Awesome. So, Kenten, we have about 21 questions in the chat for you, so that's great. Perfect timing, because that was everything I wanted to show; let's open it up for questions for the last 15 minutes and try to get to as many as we can. It's like the lightning round or something.

Alrighty, I'll start with them in the order they came in to make sure we get to everyone. One question we get a lot: you had a couple of dummy operators in your DAG; when is a good time to use the DummyOperator? Usually when you're trying to organize things in a certain way, which is the use case here: it's just easiest to have a "start" task rather than having tasks floating on their own, especially if you're defining them in a loop or something like that. The other place I've occasionally used it is branching within my DAGs, where the DAG can take different paths based on some criteria. The branch operators available to you in Python require something to be downstream on both sides of the branch, so if your use case is "move forward if this condition is met, but do nothing if it isn't," one way of handling that is to put a DummyOperator on the other side of the branch so that side just doesn't do anything. Basically, any time you need a task for some functional reason but don't want it to actually do anything, that's a good use for it.

Yeah, almost like an organizer. Can you do dependencies between DAGs? Yes, to some extent. They're not native in the sense that you can't just define them the way you define task dependencies in your code, but there are ways to make different DAGs talk to each other. One is the TriggerDagRunOperator, which is an operator that triggers another DAG; that's a good way of making two DAGs dependent. Another way I've seen is using the API: especially with Airflow 2.0, where there's a fully stable REST API, you can trigger a DAG via an API call, so you could do that with an HTTP operator within another DAG if that's what you need. I think Marc even linked to a pull request coming to Airflow soon that will let you visualize cross-DAG dependencies in the UI a little better, so I'm pretty excited about that.
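A minimal sketch of the TriggerDagRunOperator approach (the DAG IDs here are illustrative, and the downstream DAG is assumed to exist separately):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="upstream_dag",                     # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kicks off a run of another DAG when this task executes; "conf" is
    # optional and shows up in the downstream run's dag_run.conf.
    trigger_downstream = TriggerDagRunOperator(
        task_id="trigger_downstream_dag",
        trigger_dag_id="downstream_dag",       # must match the other DAG's ID
        conf={"triggered_by": "upstream_dag"},
    )
```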
These next questions came up a few times: if I'm just looking to get started with Airflow and want an introductory, 101-level resource, where's the best place to go? If you just want to get working with it and play around, the Astronomer CLI is an easy way to get up and running locally. If you're looking for more in-depth instruction on how to use Airflow, I'm going to shamelessly plug Marc's course, because it's really in-depth and really easy to understand; it's actually how I learned Airflow a couple of years ago when I was first getting started as a data engineer. It covers all the core concepts and goes through really great examples. There are obviously lots of great resources out there, but that's the one I'm personally familiar with and really like. We'll send out a link to it afterwards; it's also available on academy.astronomer.io, where there's lots of great stuff and even a certification for you.

Another question on alerting: if you want a notification when a task never starts, say my first task runs and I want to know if my second task never starts, are there ways to accomplish that inside Airflow? That's an interesting question, and I'm trying to think of the best way to do it. You can definitely build in alerting in general for things like a DAG not finishing or certain timeouts occurring. Depending on your infrastructure, folks will also sometimes build alerts based on scheduler health and things like that, which isn't necessarily "a specific task didn't run," but it would let you know if there's a problem with your scheduler, which is probably what's causing your task not to run. So that's another approach I see frequently. Yeah, a lot of ways to approach that.

Another one: how do you handle transferring data from one operator to another? That's a great question, and we'll point to a webinar we've done in the past on the TaskFlow API that you can find the link for. The short answer is that the built-in, native way of doing it within Airflow is XComs, which stands for cross-communication. It's essentially a messaging service that lets you pass information between your tasks. I will note that it is not designed to pass data, or at least large chunks of data, between tasks, because it makes use of the Airflow metadata database on the backend: you're writing information to the database and then pulling it from the next task. Depending on what data you're looking to pass, there are multiple ways of handling it; we have a guide on the Astronomer guides website on passing data between tasks that covers different methods based on scale and things like that. Yeah, that's such a great question, there are so many ways to think about it; we did a whole webinar on that topic.

Another question, from Purna: is there an API to get the status of my DAG runs and my task instances? Yes. Again, with Airflow 2.0 there's a fully stable REST API, and there are absolutely endpoints for getting task instance status and things like that. There's great documentation for the Airflow API on the Airflow website. Yeah, the new Airflow API is great.

A question around triggering DAGs: when you were triggering DAGs from the UI earlier, you could pass in context for the trigger. Is it possible to trigger a DAG for a date in the past using that functionality? Yeah, there are a couple of ways you can do that, and this other DAG is actually a pretty good example, given that it's date-parameterized. In this case I'm using a built-in Airflow variable for yesterday's date, but I don't have to; I can define my own variable here and pass it in at runtime, so that would be one way of doing something like that. Airflow also has a lot of functionality around backfilling, so if you have DAGs that are date-parameterized and you want to run them for past dates, there are options for doing that which don't require you to manually pass in a date. But if you just had one specific date you wanted to run for, passing it in could also be an option.
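One hedged way to combine the built-in date macro with an optional, manually passed date at trigger time (the "run_date" key is made up for this example):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="date_parameterized_example",     # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # For scheduled runs this renders to the run's "yesterday" date; for a
    # manual trigger you can override it by passing {"run_date": "2021-04-01"}
    # in the trigger configuration (dag_run.conf).
    process_for_date = BashOperator(
        task_id="process_for_date",
        bash_command=(
            "echo processing data for "
            "{{ (dag_run.conf or {}).get('run_date', yesterday_ds) }}"
        ),
    )
```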
Next question: I think what we've shown so far is using Airflow as a data orchestrator, coordinating your pipelines; could you also use it to host your heavy, scalable machine learning pipelines? So, yes and no, depending on how I understand the question. You can absolutely use Airflow to orchestrate machine learning use cases; we see that all the time. What you don't want to do is use Airflow itself for heavy-duty data processing. If you have really large data, you'll want to offload that somewhere else, because that's not what Airflow is built for. So if you have machine learning models where training is super computationally intensive because you have a lot of data, you're probably still going to want to do that in Spark or some other processing framework, and Airflow comes in to orchestrate all of that and manage failures, dependencies, and things like that. And as I said, there are a lot of built-in hooks and operators for the tools that are often used in machine learning use cases. There are a couple of standard architectures we've implemented with our customers in the past around that, so if you're interested, please reach out to us and we'll be happy to talk you through it.

Let's see, what else do we have here. Is there an easy way to run Airflow on a Windows computer, and is the CLI also Windows-compatible? So the CLI is Windows-compatible. In general, and somebody correct me if I'm wrong because I'm saying this from my knowledge of trying it two years ago, Airflow itself cannot be run locally on Windows by itself the way it can on Mac or Linux. However, like we talked about before, a very frequent way of running Airflow is with Docker, and that you absolutely can do on Windows. I think Windows support has gotten a lot better over time, and I'm not sure what the most up-to-date situation is, but like you said, Airflow runs really well on Docker, and that's all you need.

I think you might have answered this one, but just to make sure: Jose is asking, can you have conditional branching within Airflow? Yes, absolutely. There are a couple of built-in operators to help manage that for you. One is the BranchPythonOperator, where you basically pass it a function that evaluates some logic, and you tell it which direction to take based on what that function returns. There's also the ShortCircuitOperator, which will basically stop your pipeline at a particular point, so you can determine whether or not to move forward based on a condition. Those are two really easy built-in ways to do it, but I've also seen other, more creative ways, because everything is written in Python and you have a lot of flexibility there.

A question I know you've thought about recently: can you have DAGs that create new DAGs in Airflow? That is a very timely question based on what I've spent the past couple of weeks doing. The answer is yes, you can dynamically create your DAGs. The core piece is that Airflow will parse any code in the dags folder that has the word "dag" or the word "airflow" in it, so if you have a script in that folder that generates other DAGs, as opposed to defining a DAG itself like I've shown here, Airflow will execute that code and generate those DAGs. We have a guide on dynamically generating DAGs, and it's about to get a pretty big update with some other methods; it's something we think about a lot, and we get that question a lot. I'd definitely recommend reading it, because there are some gotchas depending on your scale and how you're doing it. But that was a long way of saying yes, you can have a DAG that creates other DAGs.
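One common pattern for this (a hedged sketch; the config list and naming are illustrative) is a single file in the dags folder that builds several DAG objects in a loop and registers each one in the module's global namespace so the parser finds them:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# In practice this might come from a YAML file, a database, or an API call.
CUSTOMERS = ["alpha", "beta", "gamma"]      # illustrative config


def build_dag(customer):
    with DAG(
        dag_id=f"process_{customer}",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="process",
            python_callable=lambda: print(f"processing data for {customer}"),
        )
    return dag


# Register each generated DAG at module level so the Airflow parser picks it up.
for customer in CUSTOMERS:
    globals()[f"process_{customer}"] = build_dag(customer)
```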
Yeah, I'm looking forward to that updated guide, because I think it's one of our most highly viewed guides as well, so it'll be good to give it a refresh. A question around the Great Expectations operator you were using: is that something specific to Astronomer, or can I use it with open source Airflow as well? No, you can absolutely use it with open source Airflow. Astronomer does not have any operators that are only available on Astronomer; the Great Expectations operator is just a Python package you can install, and if you check it out on the Registry, there's a great page for it that talks about how to use it.

We still have a ton of questions left, but unfortunately we're just about at time, because I do want to take the last two minutes for a plug, so I'm going to grab the screen from you for a second. If this was interesting to you and you love Airflow, we encourage you to sign up for the Airflow Summit. It's 65 days away, there are going to be all sorts of talks across all sorts of use cases, and we are very excited for it; if you're interested in speaking, please reach out to us, and yeah, we very much want you to attend. We'll have some more information on it coming in the next couple of weeks, but shameless plug for the Airflow Summit on July 8th; please sign up, because we know there's some great stuff coming.

With that, I think we're ready to call this one done. Sorry to the folks whose questions we didn't answer; you'll receive some contact info afterwards, so please email us and we'd be happy to answer them offline. Thanks to everyone for joining; we really appreciate you giving us some time out of your day. Yeah, thank you everybody, and thanks for the great questions; we love to see this type of engagement out in the community, so we will definitely do more of these. Look out for an announcement for the next topic, which we should be getting to you pretty soon. Alrighty.
Info
Channel: Astronomer
Views: 1,572
Keywords: A Complete Introduction to Apache Airflow, airflow tutorials, apache airflow tutorials, airflow for beginners, apache airflow for beginners, Introduction to Apache Airflow, Intro to Airflow, what is airflow, what is apache airflow, apache airflow tips, apache airflow guide, apache airflow learn, apache airflow step by step, apache airflow use cases, apache airflow, how to deploy airflow, deploy apache airflow
Id: GIztRAHc3as
Length: 53min 25sec (3205 seconds)
Published: Mon May 03 2021