Intro to Airflow for ETL With Snowflake

Captions
So with that said, I'd love to introduce Kenton, one of our field engineers, based in burning-hot Seattle right now, and she's going to take you through it.

Thanks, Barrage. Hi everybody, and thanks for joining — we're so happy to have you here. I'm Kenton, a field engineer at Astronomer. As Barrage said, today I'll be going through an introduction to Airflow, with a focus on using Airflow for ETL, which is a really common use case. We'll get a little more granular than that and go into ETL with Snowflake, since it's such a popular tool. Feel free to throw any questions into the chat as we go, and with that I'll dive right in.

I imagine some people here are already at least a little familiar with Airflow and maybe more interested in the Snowflake side, but we're going to start with an introduction, so I'll run through some slides on the core components and concepts within Airflow for anybody who's less familiar, or in case you need a refresher if you haven't been working with Airflow lately. From there we'll dive into a demo of actually using Airflow for ETL with Snowflake, and we'll leave plenty of time at the end for general Q&A on Airflow, Astronomer, or topics like that, in case there isn't time for those questions during the presentation.

First, a bit of an overview, in case anybody here is very new to Airflow. Airflow was originally developed inside Airbnb. It's open source and is now a top-level Apache Software Foundation project, currently with over a million downloads every month, so it's really popular, and it has a robust, active community, which is part of what makes it such a great open source tool. Obviously, at Astronomer we're heavily invested in open source Airflow, and we really believe it's the standard for code-based data orchestration.

So what is Airflow? Simply put, it's a tool for programmatically authoring, scheduling, and monitoring your data pipelines. One of the core tenets of Airflow is pipelines as code: everything is written in Python, which makes your pipelines super dynamic — anything you can do in Python, you can do in Airflow, and that's really powerful. It's also super extensible, which is very relevant for the Snowflake use case: Airflow plays well with lots of other services and tools, which is exactly what it's designed to do as an orchestrator. Finally, Airflow is super scalable — at Astronomer we have customers running thousands of tasks every day, and Airflow handles all of that easily.

We'll start with the Airflow basics and some of the core components. Note that these are more on the Airflow infrastructure side: if you're writing data pipelines in Airflow you don't need to know too much about how they actually work, and I won't go through them in much depth. But they're useful to understand, because any time you're running Airflow — even just locally, like I'll show today — these are the components that get spun up, so you'll see these terms and it's useful to know what they are. The three main components you need to run Airflow are the web server,
which is the Flask server that serves the UI; the scheduler, the daemon responsible for scheduling your jobs — really important for an orchestration and scheduling tool; and finally the metastore, the backend database where all of your metadata are stored. That's typically Postgres, though there are a couple of different options with Airflow.

Then there are a couple of other pieces that aren't core components but are important for running Airflow and also live more on the infrastructure side. The first is the executor, which defines how your tasks get executed; I'll talk in a minute about the different ones available. And finally there are workers: the processes executing the tasks, as defined by the executor. What your workers look like depends on which executor you choose and on your Airflow infrastructure, so you may or may not have separate workers, but it's another moving piece you might have to keep track of.

In terms of the executors available to you, there are three main ones we'll cover here: Local, Celery, and Kubernetes. They're good in different situations, and we see people using all three — and even combinations of them — across different Airflow deployments. The Local executor, which I'll be using today to run on my local machine, is good for local and development environments. In that case all of your tasks run along with your scheduler — in the same container as your scheduler if you're running a Dockerized setup, which is how we do things at Astronomer. That's great for local work or development environments where you want lower overhead in terms of cost and infrastructure, but it's obviously not as scalable.

If you're in production and looking to scale out, you'd look to either Celery or Kubernetes; both are great in different situations. Celery tends to be better if you have a high volume of shorter-running tasks, because with Celery you have workers sitting there, ready to go when your tasks spin up. The Kubernetes executor spins up a Kubernetes pod for each task, so it's great for autoscaling and task-level configuration, and there's no downtime with the Kubernetes executor when you make deployments or changes to your Airflow environment. So if you have long-running tasks that require a lot of CPU or memory — maybe you're training a machine learning model, something like that — it can be a good choice.

Next I'm going to dive into the core concepts of Airflow. These are more related to actually writing your DAGs — your data pipelines — within Airflow. The three I'll go through are the DAG, the highest-level concept within Airflow, which stands for directed acyclic graph; tasks, the nodes within that DAG that define a unit of work; and operators, which are task templates. I'll go through each of these in detail.

First is the DAG, which represents your data pipeline. It's the highest-level unit within Airflow, and it's a collection of one or many tasks. You can think of each task as a node in that graph, and your DAG defines how those tasks will be run: the order, any dependencies, and any rules for running them. It also defines when tasks get run — the start date and the schedule interval.
The only rules for a DAG are that your tasks flow in one direction and have no loops — hence directed and acyclic. The example on the top here is valid, with a couple of simple dependencies between tasks. Something like the example on the bottom would be invalid, because t1 depends on a task that comes after it, t4, which could create an infinite loop in your code — obviously bad, we don't want that. You can't do something like that, but otherwise you're free to define your DAG in whatever way makes sense for your use case.

Drilling down one level deeper, the next concept is a task. Again, this is a unit of work within your DAG — Airflow's basic unit of execution. You can have dependencies between your tasks, defining one task as upstream or downstream of another, as in this example. When you actually run your DAG you get a task instance for each task: a specific run of that task — a DAG, for a task, for a point in time. That's another term you'll probably see come up.

The last core concept is the operator, which determines what gets done by each of those tasks. You can think of it as a wrapper around each task, as shown in this diagram, that defines how the task runs or what it does and abstracts away a lot of the code you would otherwise have to write yourself. There are a couple of main types. Action operators run once with a defined unit of action — that might be running a Python function with the PythonOperator, running a bash script with the BashOperator, or doing something more involved like the things we'll talk about today, such as transferring data from S3 to Snowflake using the S3-to-Snowflake transfer operator. The other type I'll mention here is sensors. Rather than doing a unit of work, sensors wait for some external condition to be satisfied — that might be a particular status of another task within Airflow, or a file landing in S3, something like that. Sensors are a great way to make your DAGs more event-driven: Airflow is a batch scheduling tool, but it can be made more dynamic, and if your pipelines don't run on a particular schedule, sensors can be a great way to implement that.
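To make the DAG, task, and operator concepts concrete, here is a minimal sketch of a valid DAG like the one described above. This is not code from the webinar — the DAG ID and task names are made up — it just shows tasks flowing in one direction with simple dependencies:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def say_hello():
    # A simple unit of work, wrapped by the PythonOperators below
    print("hello from airflow")


with DAG(
    dag_id="example_simple_dag",       # hypothetical DAG name
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = BashOperator(task_id="t1", bash_command="echo 'starting'")
    t2 = PythonOperator(task_id="t2", python_callable=say_hello)
    t3 = PythonOperator(task_id="t3", python_callable=say_hello)
    t4 = BashOperator(task_id="t4", bash_command="echo 'done'")

    # Directed and acyclic: t1 runs first, t2 and t3 run in parallel, t4 runs last.
    t1 >> [t2, t3]
    [t2, t3] >> t4
```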
The next thing I'll cover, related to operators, is Airflow providers. Providers are separate packages, distinct from the core Airflow distribution as of Airflow 2.0, and they're often maintained by the community and by the maintainers of the actual tools they support. These packages typically contain hooks, operators, or sensors for interacting with some external system. We'll talk in depth today about the Snowflake provider, which is the provider package used to have Airflow interact with Snowflake, and there are many others for most of the common data-related services out there. Our registry, which is linked here and which I'll show a little later, is a great place to discover the providers that exist and how to use them. Again, this is one of the great things about Airflow having such a robust open source community, and a benefit of Airflow 2.0: because provider packages are separate from the core distribution, the community can develop, maintain, and update all of these great packages that make your life a lot easier — you don't have to figure out how to write the code to do these things, because somebody else has already done it. And because they're separate from the core distribution, they can be maintained much more regularly, and you can update them at any point without having to update your core Airflow instance, which is really useful.

Combining all of those concepts, this slide highlights the flexibility of having your data pipelines in code with Airflow and using provider packages to interact with external services. You can have a pipeline that's really complicated, as you can see here, that interacts with all sorts of different services, and you can implement it without a lot of code. When a new business requirement comes up, it's pretty easy to add it in: you just add a new operator to your DAG, set your dependencies, and you're off and running.

With that, I'll hop into the demo so we can see some of this in practice. I'm not going to spend too much time talking about actually running Airflow in this demo, but if anybody has questions about how to do that, we're happy to address them during the Q&A — we're always happy to talk about how to run Airflow; that's Astronomer's bread and butter, obviously. For this example, I'm going to run Airflow locally using the Astronomer CLI. Our CLI is free, it's open source, and it's available for anybody to use — it's the easiest way to get Airflow up and running locally. It does so using Docker, which is the only prerequisite.

To do that, I can run astro dev init to initialize a project — I've already done that in this directory. That creates a Dockerfile used to run Airflow in a Dockerized setup, as well as a supporting folder structure and files for me to add my DAGs and any dependencies. We'll look in depth at what one of those projects looks like in just a minute. From there I can run astro dev start, which spins up a couple of containers. I've actually already done that in this case — you may have heard me mention that Seattle is really hot right now, and Docker requires a lot of CPU, so I wanted to make sure everything was up and running before we started. If I run docker ps, you can see I have three containers running here, corresponding to the core components I talked about earlier: the web server, the scheduler, and the Postgres database for the metastore.

With all of those up and running, I can pop back over to my browser, navigate to localhost, and you can see I have Airflow up and running — I'm running Airflow 2.0 here. I'll walk through the UI a little for anybody who's less familiar with Airflow so you can get a sense of what's going on, but note that if you're using an older version of Airflow you might not have all of this functionality, because 2.0 had quite a few really nice UI updates. The home page shows a list of all of my DAGs — again, each one of these is a data pipeline, and each corresponds to a Python file, which I'll look at in just a second. I'm going to start with this example DAG
before I dive into the Snowflake-specific use case — I'll get there in just a minute, I promise. Before I do, I want to highlight some of the functionality here in the UI. For each of my DAGs I have a couple of views that are useful. The one I'll start with is the graph view, which is exactly what it sounds like: a graphical overview of my DAG. This is really useful for learning what's going on in any given DAG — what tasks you have and what the dependencies are. When I'm developing DAGs I use this view to make sure I'm setting the dependencies correctly in the code. In this case they're pretty simple, but you can imagine they could get really complex for a real-world use case, and this is a great way, especially when working locally, to figure out whether things are laid out the way you think they should be.

Another interesting view here is the tree view. This also shows a layout of my DAG and all of its tasks, but importantly it also shows an overview of all of my recent DAG runs. Each of these boxes represents a task instance, and they're color-coded to show the status. In this particular case they were all successful, but if there were any failures or skipped tasks, I'd see those show up here.

I'll also point out the code view and walk through it a little for the example DAG. Within Airflow you can't actually edit the code from the UI, but it's a really useful way to look at the code — either when you don't have access to the underlying DAG code and want to know what's going on in the DAG, or when you're developing and want to make sure your updates are being picked up, so you can check that the code Airflow thinks it has for that DAG is what you expected. For this particular DAG — again, just a simple example — I have all of my imports at the top, where I'm importing my various operators. I have a Python function that's going to get called by my PythonOperator. I then have some default arguments that are applied to all of my tasks; Airflow has a lot of different configuration options and is super flexible in how your tasks and DAGs get run — I won't go through all of them in detail today, but we're happy to answer questions about anything specific. Then I actually instantiate the DAG itself: I pass in my default arguments, give it a name, set a schedule interval, and any other information I want to set at the DAG level for how the DAG behaves can go here. Finally, I define all of my tasks. Each task is wrapped by an operator, so you tell Airflow you want to create a new task by instantiating an operator and providing it with any information that operator might need. In the case of a DummyOperator there's no information, since it doesn't actually do anything; for a BashOperator you pass in a bash command, and for a PythonOperator you pass in your Python callable. The final thing I'll note in this example DAG is this loop: we're generating these PythonOperators dynamically, so again you have the power of Python behind you when you're working with Airflow. You don't have to follow an exact script where you define each operator one by one — if you need to define your tasks dynamically, this is one way you can do so.
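As a rough sketch of the dynamic task generation being described — the DAG name, function, and source list below are illustrative, not the demo's actual code — a loop can create one PythonOperator per item:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator


def process(source: str):
    # Placeholder for whatever work each dynamically generated task does
    print(f"processing {source}")


default_args = {
    "owner": "airflow",
    "retries": 1,
}

with DAG(
    dag_id="example_dynamic_tasks",     # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")

    # One PythonOperator per item in the list -- adding a new source only
    # requires adding a string here, not hand-writing another task.
    for source in ["source_a", "source_b", "source_c"]:
        task = PythonOperator(
            task_id=f"process_{source}",
            python_callable=process,
            op_kwargs={"source": source},
        )
        start >> task
```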
The last thing I'll show here, a new perk with Airflow 2.1 that's really useful, is the calendar view. This shows all of my DAG runs and their statuses, laid out on a calendar. It's not all that interesting for this particular DAG, because I've only run it manually, but you can imagine that for more real-world use cases, where you're running things on a certain schedule — maybe a business schedule where things only run on weekdays, things like that — you can see that easily here.

From there, I'm going to dive into the Snowflake-specific ETL case. For that I'll be using this COVID-data S3-to-Snowflake DAG, but before I talk about it in the UI, I'm going to hop over to my code editor and show it here. In the directory where I initialized my Astronomer project I have a dags folder, and you can see it has five different DAGs in it, corresponding to the ones we were looking at in the UI. The COVID-to-Snowflake DAG is the one I'm going to go through.

I'll note that in order to use this, I need the Snowflake provider package, and I have to install that package separately — again, it's separate from the core Airflow distribution. In this particular case I've added it to my requirements.txt file, so it gets installed for me. You can do this in different ways depending on your Airflow infrastructure setup, but at the end of the day these are just Python packages that you would pip install.

Before I go through that in too much depth, I'll also highlight the Astronomer Registry that I mentioned earlier. If I look up the Snowflake provider, it gives me information on the current version, when it was last updated, and how to install it — again, that's what I'm doing in my requirements.txt, but you could also pip install it — and it shows me the available modules within the provider: operators like the SnowflakeOperator, any transfer operators, things like that. If you're just getting started, and especially if you're particularly interested in Snowflake, this is a great place to look to see how these things work.

Popping back to my code editor: I've installed the provider package, which gives me access to the operators and transfer operators within the Snowflake provider that I can then use in this DAG — you can see at the top that I'm importing both of them. The goal of this DAG is to grab some data from a COVID API endpoint, save it to S3, and then transfer it from S3 into my Snowflake instance. I'm then going to do some transformations on that data within Snowflake. One thing I'll note is that what I'm describing is actually more of an ELT framework, where I do the transformations at the end and offload them to run within Snowflake. This is an ideal way of doing an ETL or ELT framework with Airflow: while you do have the option of doing transformations within Airflow itself using Python,
Airflow is meant to be an orchestrator, not necessarily a processing framework. That will work if your data are small, but if you have larger datasets we definitely recommend offloading the processing to something that was designed for it. Snowflake is a great option because you have a data warehouse with built-in compute, so you can use that — and that's what I'm doing in this example.

After I import my operators, I define a Python function that gets my data: it grabs COVID data from an API endpoint and then writes it to a flat file on S3. To do that I'm using the S3 hook — you can also see up here that I'm importing from the Amazon AWS provider package, so that's another provider package we're making use of. From there I instantiate my DAG and define some basic parameters for how I want this DAG to behave, and then I get into the operators themselves.

I'm actually going to start down here, where I'm generating these tasks dynamically. In this case each one of my endpoints grabs a file for a different U.S. state, and I'm defining those here — if I wanted to add another one, I could do that easily by just adding it to the list. Again, this saves me time if I want to make changes, so I generate those tasks dynamically.

My first operator is that set of PythonOperators. This calls the upload-to-S3 function, parameterized for the different endpoints and the date I want to pull. I'm defining that date using a built-in Airflow template variable — making use of Airflow variables and macros like this is a great way to make your DAGs idempotent, so that if you have to rerun any past DAG runs, they still run for the right date.

I'm then making use of the S3-to-Snowflake transfer operator. This operator is ready to go out of the box, which is super nice — you don't have to write any code beyond this to bring data in from S3 to Snowflake, one benefit of community development. I just define my task ID and the S3 keys — the file or files it's going to look for. I define my stage: this is something you set up on the Snowflake side, and Snowflake has great documentation for how to do it; you need one any time you're uploading files from an external system, and in this case I set up an S3 stage, which you'd have to do within your Snowflake instance. The rest of this defines where the data is going to go — the table and the schema, the role I want to use, and the file format — and then I have my Snowflake connection ID. I'll go over the connection in the UI in just a second, but this is how Airflow connects to Snowflake.

Finally, I mentioned that after that I perform some transformations on the data, so I have this pivot-data task, which just uses the SnowflakeOperator. The way I implemented it in this case was to define a stored procedure within Snowflake that does the transformations, and then pass in a SQL command that calls that stored procedure. There are obviously lots of ways you could do this — I could have defined a SQL script that gets called here and stored that script within my Airflow project, and there are probably lots of other options — but the main point is that I'm trying to offload the processing onto the Snowflake side. Airflow is just going to orchestrate: it tells Snowflake to run my stored procedure and do those transformations, and none of that data processing actually happens within Airflow.
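Here is a hedged sketch of the kind of ELT DAG being walked through — extract to S3 with the S3Hook, load with the S3-to-Snowflake transfer operator, transform by calling a stored procedure. It is not the webinar's exact code: the bucket, stage, table, connection IDs, API URL, and stored procedure name are placeholders, and the import paths and parameters assume the provider versions current around Airflow 2.x:

```python
from datetime import datetime

import requests

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator

S3_BUCKET = "my-covid-bucket"            # placeholder bucket name
ENDPOINTS = ["ca", "wa", "ny", "co"]     # one file per U.S. state


def upload_to_s3(endpoint: str, date: str) -> None:
    """Pull one state's data from the API and write it to S3 as a flat file."""
    # Hypothetical endpoint URL -- the demo used a public COVID tracking API.
    response = requests.get(
        f"https://covidtracking.com/api/v1/states/{endpoint}/{date}.csv"
    )
    s3_hook = S3Hook(aws_conn_id="aws_default")
    s3_hook.load_string(
        string_data=response.text,
        key=f"{endpoint}_daily.csv",
        bucket_name=S3_BUCKET,
        replace=True,
    )


with DAG(
    dag_id="covid_data_s3_to_snowflake",   # name as described in the demo
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    pivot_data = SnowflakeOperator(
        task_id="call_pivot_sproc",
        snowflake_conn_id="snowflake",
        # Offload the transformation to Snowflake by calling a stored
        # procedure (the procedure name here is a placeholder).
        sql="call pivot_state_data();",
        role="TRANSFORMER",
        schema="SANDBOX",
    )

    for endpoint in ENDPOINTS:
        generate_file = PythonOperator(
            task_id=f"generate_file_{endpoint}",
            python_callable=upload_to_s3,
            # op_kwargs is a templated field, so the built-in {{ ds_nodash }}
            # macro resolves to the logical date of the DAG run at runtime,
            # which keeps reruns of past dates idempotent.
            op_kwargs={"endpoint": endpoint, "date": "{{ ds_nodash }}"},
        )

        load_to_snowflake = S3ToSnowflakeOperator(
            task_id=f"load_{endpoint}_to_snowflake",
            s3_keys=[f"{endpoint}_daily.csv"],
            stage="covid_stage",             # external S3 stage created in Snowflake
            table="STATE_DATA",
            schema="SANDBOX",
            file_format="(type = 'CSV', skip_header = 1)",
            snowflake_conn_id="snowflake",
        )

        generate_file >> load_to_snowflake >> pivot_data
```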
The rest of this is similar: I define the role I want to use, the schema, and then the Snowflake connection ID. If I go back to the Airflow UI, what this DAG looks like in the graph view is: I have a dummy start task; I then have my PythonOperators, which pull my files from the API endpoint and save them into S3; I then have my S3-to-Snowflake transfer operators, which transfer those files into my table in Snowflake; and then I have my transformation, using the SnowflakeOperator to call the stored procedure — in this case it just does some pivots on the data. If I go to the tree view, you can see I have some successful runs of this DAG, and you can also see some earlier failures. I'll note that the UI can be a great way to see what's happening in your DAGs: for any task, if I click on the task instance, it gives me a link to the logs, where I can go and see what's going on — especially if there are errors you're trying to debug, that's a great place to start.

Finally, I mentioned in the code earlier that an important piece of this is how Airflow actually connects to Snowflake. For that, I'm setting the Snowflake connection ID on both of these tasks, and what I've done is define an Airflow connection. You can view those in the Airflow UI by going to Admin and then Connections, and you can see I have the Snowflake one here. This provides all of the information Airflow needs to connect to my Snowflake instance. For Snowflake specifically, if you're running Airflow 2.0 or greater and using the most recent version of the Snowflake provider package, the connection fields will actually update to show you what you need to input, which is super nice — in historic versions of Airflow the connection fields were just standard, and you'd have to figure out what to put into the extras. The main pieces are your host (you'll have a host domain for your Snowflake instance), your schema, credentials (a login and password), and your Snowflake account, plus other information you can set, like the database you want to connect to, the warehouse, and the role you're going to use. You probably noticed in the code that you can also pass those into the operators, so there are multiple ways your operators can make use of those parameters — but having this set up is important so that Airflow can actually connect to your Snowflake instance.

From there, I can go ahead and run this, just to show what it looks like in the UI. If anybody has not upgraded to Airflow 2.0 but is already using Airflow, I highly recommend doing so for lots of reasons, one of them being that the auto-refresh feature is really nice — you can watch things as they go through. I used to be one of those people who would sit and click the refresh button over and over, waiting for a DAG to finish; you don't have to do that anymore, you can just sit and watch. You can see it's now gone through and loaded all of my data into Snowflake. If I look at the Snowflake instance, this is what the data looks like: here's my state-data table, with all of that data in there for the different states, and then I have another table that contains the pivoted data. So again, that ELT framework is pretty easy to set up within Airflow by making use of the provider package.

I'll pause there, since it looks like there might be questions that folks have thrown into the chat.

That was amazing. I've got a couple of questions for you that have been coming in. The first question is: what's the difference between Airflow's UI and Astronomer's, and between the variables in Airflow and the ones in Astronomer? When you say variables, is that environment variables or Airflow variables? Both — environment variables and Airflow variables. Got it — I'll start with that and then come back to the UI. Airflow variables are built-in variables within Airflow for templating your DAGs and pulling information in at runtime. The date macro I used is one example of a built-in one; you can also define your own. Airflow just has these built in so you don't have to worry about them — mostly dates and things like that. That's separate from an environment variable in your Airflow instance, which you can also set; they're used for different things. Environment variables are typically used to control your Airflow configuration, or they can be used to set things like connections — so rather than filling out the Snowflake connection in the UI like I showed, you could also set it using an environment variable. You can access environment variables within your DAGs as well. So they're a couple of different ways of getting at somewhat different things.

In terms of the UI, Airflow versus Astronomer: first, all of this is the same on the Astronomer platform — you can still use Airflow environment variables and templated variables just like I've done here, and the Airflow UI is the same from an Airflow perspective. It's just that Astronomer provides a control plane where you can manage all of your Airflow instances, so you'll have a separate Astronomer UI as well, to look at things across Airflow deployments and manage your infrastructure. I'm not sure if that answered the question — it was a lot of different things — but hopefully I got some of it.
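As a small illustration of the distinction being drawn here — built-in template variables, user-defined Airflow Variables, and environment variables — consider this sketch; the variable names, values, and DAG name are made up:

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_variables",          # hypothetical DAG name
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Built-in template variable: {{ ds }} is rendered at runtime as the
    # logical date of the DAG run, which helps keep reruns idempotent.
    print_date = BashOperator(
        task_id="print_run_date",
        bash_command="echo 'running for {{ ds }}'",
    )

    # User-defined Airflow Variable, stored in the metastore and managed from
    # the UI (Admin -> Variables) or the CLI. "my_bucket" is made up here.
    # (For heavy use, {{ var.value.my_bucket }} in a template defers the
    # lookup to runtime instead of doing it at every DAG parse.)
    bucket = Variable.get("my_bucket", default_var="fallback-bucket")

    # Environment variables are separate again: they live on the Airflow
    # deployment itself and are often used for configuration or connections.
    deployment_env = os.environ.get("DEPLOYMENT_ENV", "local")

    print_config = BashOperator(
        task_id="print_config",
        bash_command=f"echo 'bucket={bucket} env={deployment_env}'",
    )

    print_date >> print_config
```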
I think you got it. Awesome. Another question: can we pass a SQL query through a file?

Yes, absolutely — I actually have an example of that to show you. I'll also say that, as a general best practice, I'd actually recommend passing SQL in through a file rather than inline, like I did in the demo. If you just have one line of SQL it probably doesn't matter much, but your DAG is going to be a lot cleaner if you pass it in through a separate file. As an example, I have this other DAG called param query, and you can see it's also using a SnowflakeOperator; in this case I'm passing in this param_query.sql file. The only difference in the DAG that allows me to do this is that I have to define a search path that tells Airflow where that file lives — where to look for it. If you put everything in your dags folder you wouldn't need this, but I did not, as a sort of organizational preference: I have this param_query.sql file in my include directory, which you can see here, and it's just SQL. You'll notice it's also parameterized in a similar way to my other DAG, so you can still do things like that. The include directory is something we ship with the Astronomer CLI and Astronomer Airflow projects, again just as a way to organize things — it's meant for external files that your DAGs might be pulling in — but you can organize your project in whatever way works best for you. The short answer there is: yes, you definitely can.
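A minimal sketch of the pattern just described — pointing a SnowflakeOperator at a .sql file outside the dags folder via template_searchpath. The search path, connection ID, schema, and DAG name are assumptions for illustration, not the exact demo project:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="example_sql_from_file",       # hypothetical DAG name
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,
    catchup=False,
    # Tell Airflow where to look for .sql files that live outside dags/
    # (this path assumes an Astronomer-style project layout).
    template_searchpath="/usr/local/airflow/include",
) as dag:
    run_query = SnowflakeOperator(
        task_id="run_param_query",
        snowflake_conn_id="snowflake",
        # A .sql file is rendered just like an inline string, so it can still
        # contain Jinja parameters such as {{ ds }}.
        sql="param_query.sql",
        schema="SANDBOX",
    )
```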
Awesome. Wayne is asking: what are some other tools recommended for the transformation piece between S3, Airflow, and Snowflake? Is doing the transforms in Snowflake, like this demo, best practice, or are there other things you can do?

Yeah — I'd say Snowflake is particularly good for that because, again, it's designed for it: you have compute built into your data warehouse, so it works really well for an ELT-type framework. It's certainly not the only tool, though. You could use any other processing framework — I've seen people use Spark, and Databricks is another one that works well for that end step, for doing transformations or models with your data. Those are the ones that come to mind for me; Barrage, I don't know if you've seen any others. I think you hit the nail on the head — the nice part is that you can choose whatever tool best fits your needs. Another pattern we've seen a lot is that once that data is in Snowflake, you push it back out into some other tool using something like Census or any of the other reverse-ETL tools out there, so you can really pick and choose everything you need for your platform.

A few more questions here, so I'm going to keep throwing them at you. Do most of the other operators also support templating? Michael Glaros is asking. Yes, I believe you can do that with most any of them.

A question we also got before the webinar: do you see use cases where Airflow and dbt are combined together? Are there overlaps, or are they separate tools? Yeah, absolutely — we have seen them used together. Again, we talked a lot in this presentation about how Airflow is meant to be an orchestrator; it's designed to play with different tools, sometimes through provider packages and sometimes otherwise. It can be a really good tool paired with dbt, so that you have a first-class orchestrator and scheduler and also get the benefits of doing data modeling and transformations in dbt. We've definitely seen them used together successfully — I think we have a couple of blog posts about doing so and the methods you might use. That's something the community has been pretty active in, making the story for using those tools together stronger, so we expect it will get even better from a user perspective going forward. I'd say a lot of our customers are doing exactly that, where dbt handles things within the warehouse and Airflow helps feed the data to the warehouse itself.

Can you run a JAR file using Airflow? Yes — I'd say the BashOperator
would probably be the best way to do that, off the top of my head: you'd run it as a bash command. There may be better ways, but that's the one that comes to mind — yes, you definitely could. Our customers usually do that, or they'll use something like the KubernetesPodOperator and just run a separate pod for that JAR file.

Let's see, there are a couple of questions to choose from here. There's a question around how a sensor could work here: with the Snowflake provider, is the job triggered as soon as there are files to load, or do you have to do something in between the files arriving and Airflow triggering the job? Molly is asking that one.

That's a great question, and just like you said, Barrage, it's a really good opportunity for sensors — that's exactly what they're designed for. For this particular use case there is an S3 file sensor that will watch an S3 bucket for files to arrive and then trigger the rest of the tasks in your DAG, so that would work really well here. If your files land somewhere else that maybe doesn't have a specific sensor, there are still more generic sensors you could make use of. So yes, you absolutely can — if you have something like this where files will be arriving and you don't know when, I'd definitely recommend using sensors rather than running your DAG on a schedule and having it do nothing a lot of the time. That's what they're built for.
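A sketch of that sensor pattern, using the S3 key sensor from the Amazon provider to gate the Snowflake load until a file lands. The bucket, key pattern, stage, table, and connection IDs are placeholders, and the exact import path can vary between provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3_key import S3KeySensor
from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator

with DAG(
    dag_id="example_s3_sensor",            # hypothetical DAG name
    start_date=datetime(2021, 6, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Wait for a file matching the pattern to land in the bucket before
    # moving on to the load step.
    wait_for_file = S3KeySensor(
        task_id="wait_for_covid_file",
        bucket_name="my-covid-bucket",
        bucket_key="incoming/*.csv",
        wildcard_match=True,
        aws_conn_id="aws_default",
        poke_interval=60,        # check every minute
        timeout=60 * 60 * 6,     # give up after six hours
    )

    load_to_snowflake = S3ToSnowflakeOperator(
        task_id="load_to_snowflake",
        s3_keys=["incoming/latest.csv"],
        stage="covid_stage",
        table="STATE_DATA",
        schema="SANDBOX",
        file_format="(type = 'CSV', skip_header = 1)",
        snowflake_conn_id="snowflake",
    )

    wait_for_file >> load_to_snowflake
```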
On that note, when you're looking to accomplish a use case like that — waiting for data to land in S3 and then pushing it to Snowflake — are you usually writing all the code from scratch, is it auto-generated, or is there somewhere you can pull examples from?

The registry is a great place to pull examples. If I go back to one of these pages — I'm looking at the S3-to-Snowflake operator that I was using — you can see there's an example DAGs section that gives me example code where it's all done for me. That would probably be my recommendation for the best place to start. There are example DAGs for a lot of the provider packages and operators out there, both boilerplate code and some more sophisticated examples for different use cases. So you're likely not going to have to start from scratch — for the DAG I showed here, I didn't write any of that code from scratch; it was all pulled from examples like this and other DAGs that we use. There's a lot out there in the community to make that easier for you. I love all the example DAGs in the registry — it makes any new workflow so much easier.

Do you have any best practices for dealing with staging or intermediary tables if you're doing the processing within the warehouse itself?

That's a good question — that goes back to my engineering days as opposed to Airflow specifically. I think staging tables can be really useful, both for recovery purposes and for helping design idempotent DAGs. Using incremental loads is definitely a best practice we recommend where possible: rather than having each DAG run process all of your data, have each DAG run do a chunk of the data. Using staging tables along with that method can be a really useful way of doing it, because you often need to be able to put the data somewhere in the middle before you load it into the table that has everything. I'm not sure if that's entirely helpful, but I would recommend using them — I think they're really useful in a lot of use cases, and in my past life I used them a lot myself when doing data engineering for clients.

What about CI/CD — do you have any recommendations around CI/CD with Airflow?

Yes — my really quick recommendation is: do it. One of the benefits of having code-based data pipelines is that you can integrate them with the rest of your software development lifecycle, and using CI/CD is a big part of that. At Astronomer we recommend that all of our customers do so, and it's very straightforward to deploy to the Astronomer platform using CI/CD. Obviously, how you use CI/CD with Airflow comes down to where your Airflow is deployed, where your DAGs come from, and how your Airflow infrastructure is set up. There's documentation on how to do it with the Astronomer platform and most of the common CI/CD tools, available on our website. But no matter where you're running Airflow, I'd absolutely recommend using CI/CD to deploy your DAGs, especially to production instances. Yeah — there are so many ways you can do CI/CD; it really depends on the infrastructure setup you have.

A question from the LinkedIn stream: can you use Airflow operators for event-triggered operations the same way you can for cron-scheduled ones? I'm not sure I understand the question — is that similar to wanting your operator to run based on an event as opposed to a schedule? Yeah, exactly right — what's Airflow's event-driven story?

Got it. That also comes back to sensors, which are probably the easiest way of implementing something like that. The way sensors work is that they wait for something to happen before moving on to the rest of the tasks in your DAG, so if you have an operator you only want to run after something else has happened, you might put a sensor task before it to stop it from running if that event hasn't occurred. You could also get creative with branching and things like that within your DAG: you don't have to run your tasks all the way through every time, you can make them conditional through things like branching or short-circuiting. That could be another option if you have other tasks within your DAG that are doing things or checking for external conditions after they've run. It depends on your use case and what those tasks are actually doing, but you can absolutely do something like that.
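As one illustration of the branching idea mentioned above, a BranchPythonOperator can pick which downstream task runs based on a condition evaluated at runtime — a hypothetical sketch, not something shown in the demo:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator


def choose_path(**context):
    # Decide at runtime which downstream task to run; the weekday check here
    # is just a stand-in for "has my external event happened?"
    if context["execution_date"].weekday() < 5:
        return "process_data"
    return "skip_processing"


with DAG(
    dag_id="example_branching",            # hypothetical DAG name
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(
        task_id="branch",
        python_callable=choose_path,
    )

    process_data = DummyOperator(task_id="process_data")
    skip_processing = DummyOperator(task_id="skip_processing")

    # Only the task whose ID is returned by choose_path will run;
    # the other branch is skipped.
    branch >> [process_data, skip_processing]
```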
Awesome. Kenton, I think we could sit here and take questions all day, but we're just about at time. Thanks, everybody, for coming and for the wonderful questions and discussion. I know we couldn't get to all of them, but if something is really pressing, please feel free to reach out to us and we'll definitely get it answered. Like we were saying, all the code and the presentation will be sent out to everyone as a follow-up. And lastly, please register for the Airflow Summit — if you're a fan of this content, there's going to be this and so much more at the Airflow Summit, starting on July 8th. That's all from my side. Kenton, anything from you before we call it a wrap?

I don't think so. Like Barrage said, thanks, everybody, for joining — we love having these conversations and having such a vibrant community out there with lots of engagement, so that's really great to see. I was also going to reiterate: register for the Airflow Summit. I'm really excited — shameless self-plug, I'll be speaking — and there are tons of other good talks; I really wish I could join all of them. They're in lots of different time zones as well, so no matter where you are, there will be something you can tune into.

All right, thanks, everyone — we'll see you next time.
Info
Channel: Astronomer
Views: 4,087
Keywords: etl, etl pipelines, etl pipelines snowflake, etl airflow, etl software, etl tools, etl snowflake, etl tips, airflow snowflake
Id: 3-XGY0bGJ6g
Length: 50min 34sec (3034 seconds)
Published: Tue Jun 29 2021