The Realities Of Airflow - The Mistakes New Data Engineers Make Using Apache Airflow

Captions
Hey there guys, welcome back to another video with me, Ben Rogojan, aka the Seattle Data Guy. Today we're going to talk about Airflow, and more specifically the mistakes I've seen when people deploy Airflow in some form of production state. That is to say, Airflow is a great solution, but it's also deceptively easy. If you go through any tutorial, you think you just spin up a Docker container. Maybe you accidentally leave the wrong executor on. Maybe you don't think through how you actually deploy the various components, like the web server, the scheduler, where the DAGs live, all of these things that will eventually cause your Airflow instance to not scale. So today we're going to talk about some of the real-life cases where I've seen clients make mistakes in how they deployed Airflow. There are just so many ways Airflow can get deployed, and we'll cover some of that by going through articles where companies have done it successfully, or at least tried to.

So yeah, we're going to cover a lot in this video, but really, one of the core things you're going to figure out is that you need to become very familiar with airflow.cfg, the Airflow config: the file that basically holds all of the settings that make Airflow operate in specific ways. If you open up your airflow.cfg, you'll see parameters that let you set the database you're going to use, where the DAGs will live, where logs will live, how parallelism will be handled, and all these other important factors you need to think through as you design a scalable Airflow deployment. It's not just about making Airflow work, because that's arguably pretty easy, but about making sure Airflow works at scale. So with that, let's talk about some key mistakes I've seen my clients make when they deploy Airflow.

All right, the first mistake, which I kind of alluded to in the intro, is having your DAG folder coupled to the rest of your Airflow project. What I mean is, some people will set up their DAG folder, their web server, their whole entire Docker container in one repo. The problem is, when this happens and you have an amazing CI/CD pipeline, every time you make a change to your DAG folder, guess what happens: as soon as you hit push, you push the entire new Docker image you've built to production and replace your old Airflow instance. That means any Airflow jobs that were running are all likely killed, and if you had a three-hour Airflow job running, well, too bad, it's now dead. Maybe occasionally I'd notice the new instance would pick a job back up, but usually they wouldn't find each other again. You've just put up a new Airflow instance; it knows the job failed last time, so it restarts, but it's starting from square one. That's not a great situation to be in, especially if you have long-running jobs.

Instead, your DAG folder should live somewhere else. Generally, the way I see people do this is that the DAG folder lives in something like S3, with a separate repo for it, just your DAG repo. It gets pushed to S3, and then from S3 into some sort of network folder system somewhere else. Depending on how scaled out your system is, it might get pushed to multiple instances of Airflow, or maybe it's just one folder system that gets mounted internally on the Docker container. So the DAGs still live somewhere on the Docker container, but they get pushed separately, rather than shipping the entire Docker container as one giant system every time.
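To make that concrete, here's a minimal sketch of what that separate sync step might look like, assuming a hypothetical bucket called my-dag-bucket and the common /opt/airflow/dags mount point; in practice many teams just run `aws s3 sync` on a schedule or use a git-sync style sidecar instead.

```python
# A sketch of a DAG sync step, run on a schedule (cron, sidecar, etc.)
# on each Airflow node. The bucket name, prefix, and paths are all
# illustrative assumptions, not fixed Airflow conventions.
import os

import boto3


def sync_dags_from_s3(bucket: str = "my-dag-bucket",
                      prefix: str = "dags/",
                      dags_folder: str = "/opt/airflow/dags") -> None:
    """Mirror the .py files under an S3 prefix into the local DAGs folder."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".py"):
                continue
            # Keep the same layout the scheduler will scan.
            local_path = os.path.join(dags_folder, os.path.relpath(key, prefix))
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(bucket, key, local_path)


if __name__ == "__main__":
    sync_dags_from_s3()
```

The point isn't this particular script; it's that DAG code now ships on its own path, so pushing a DAG change never tears down the scheduler, the web server, or any job that's mid-run.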
And actually, there are a few articles that cover this. You'll see images in both the Scribd and Shopify posts; they both have a similar process, or at least kind of cover it in their diagrams. The process behind each is probably different, but you can tell it's a similar idea: you have some sort of storage system, Google Cloud Storage, S3, etc., and from there the DAGs get pushed into some sort of network folder system. Without that separation, you're stuck with the issue that every code change has a massive impact on what's actually running. It seems like a small thing, but it's a big thing, and if you don't get it fixed, it causes a lot of problems down the line, especially once you have hundreds of jobs running. When you're just starting out, you don't even notice. But once you have hundreds of jobs, you don't want to be limited to pushing Airflow changes once a day when none of the jobs are running. And to be fair, that is the other solution besides separating them: if you don't have too many jobs, just pick a specific time of day when you'll always push changes, generally around the times when you're not running DAGs. But then you're taking on this extra step of thinking, okay, I can only push this change during certain times of day because it will break what I'm running. It works if you don't want to spend the time to set things up properly, but it will eventually no longer scale.

Now, the next problem is less a specific problem and more something I've noticed people sometimes forget or miss: Airflow has a ton of functionality that provides value out of the box. For example, hooks and variables; I'm just going to talk about those two for this section. Hooks, for those who don't know, essentially allow you to abstract connections. Think about your Postgres connection, and they even offer custom connections you can build yourself, so you could set up a NetSuite hook (I've seen that one), and I've seen people set up Slack hooks. It's basically a way to create your system object, whatever it might be (BigQuery, Postgres, Slack), and interact with it without having to specify the credentials every time. So one, it abstracts away setting up your whole connection string each time, but two, it also tends to add another layer of security, because you're not referencing the credentials everywhere. And it just makes things so much smoother. The other thing I'd run into is that if you don't have hooks set up, you're having to test connections, go through that whole "does it work, does it not work" dance. Whereas if I put in the NetSuite hook or the GCP hook, whatever the name of the hook is (it's basically a key-value lookup), I know it works, because it works everywhere else, so I know it's going to work here. There's something about that that just speeds up your overall development process. So hooks are great. I find them very nifty. I'm sure there are some downsides; I haven't found them yet, but there are downsides to everything.

The other feature is variables. Variables are nifty because they can help when you need to handle prod versus dev. Maybe you've got prod and dev Airflow environments, and you set up variables to hold the different locations you're pointing things at. Or occasionally you just have some one-off strings you need to keep secure: maybe it's a PGP key, maybe it's the password for a PGP key, whatever it might be. You can use variables and store a protected version of a value, usually by naming the key with _password or _secret, which gets the value treated as sensitive and hidden. It's just a basic key-value pair, so you can call it something like pgp_secret, and then it can be pulled without you ever referencing the actual secure string in the codebase. So again, just two nifty things that really do make a difference. It has less to do with your skill or ability, but if you don't use these, well, I've seen people build a whole other pile of custom code to handle this stuff. Instead of writing a bunch of custom code, you can just get it right out of the box with what Airflow offers.
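To show how little code this ends up being, here's a rough sketch, assuming a Postgres connection with ID my_postgres and a variable named pgp_secret have already been created in the Airflow UI or CLI; both names are made up for illustration, and the Postgres hook requires the apache-airflow-providers-postgres package.

```python
# A sketch of hooks and variables in a task callable. The connection ID
# ("my_postgres") and variable key ("pgp_secret") are illustrative and
# would need to exist in your Airflow metadata database already.
from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook


def export_signed_report():
    # The hook resolves host, port, user, and password from the stored
    # connection, so no credentials ever appear in the DAG code.
    hook = PostgresHook(postgres_conn_id="my_postgres")
    rows = hook.get_records("SELECT id, total FROM daily_report")

    # Variables are plain key-value pairs; because this key contains
    # "secret", Airflow masks its value in the UI and task logs.
    signing_key = Variable.get("pgp_secret")
    return rows, signing_key
```

That's the whole pitch: one string identifies the connection everywhere it's used, so if it works in one DAG you know it works in all of them.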
Now, before we go any further, I want to take a quick pause and say: hey, if you're out there and you're looking to use Airflow for your team or your company, or even some other orchestration tool, our team has definitely had to set up and work with various instances of Airflow, whether with MWAA or a custom solution. So if you need help, our team is a set of expert data infrastructure consultants, as well as an ML consultant, who are all eager to help you on your data projects. Feel free to set up a Calendly meeting with me below if you want to talk more about how we can help your team optimize their data infrastructure and really find value in your data.

All right, back to problems with setting up Airflow. One of the biggest things I think people don't think through is setting it up for scale. Now, maybe you'll never have to deal with this, and truly, you can use things like MWAA or Cloud Composer just out of the box, and they scale decently well. They have their own problems, but if you're trying to deal with scale and you have a lot of basic tasks, they do great. But if you're doing it yourself and you don't plan for scale, you'll run into problems almost immediately. Once you start passing 10, 20, 100 jobs, you're going to start realizing that certain jobs bump into each other. You can only have so many jobs running per worker, and there are limitations based on the size of your workers, how many you have, how many schedulers you have running, whether you have a queue set up, and all of these other components. Which is why, when people put Airflow in production, you'll see diagrams like the two I'm going to put up here, because it does get very complicated. In fact, I love the quote from the Shopify article that said: "There are a lot of possible points of resource contention within Airflow, and it's really easy to end up chasing bottlenecks through a series of experimental configuration changes. Some of these resource conflicts can be handled within Airflow, while others may require some infrastructure changes."

And I think that's one of the important points: yes, some of these changes might be things your data engineering team can fix yourselves. Maybe you can go in and make some configuration changes in airflow.cfg, or use a different executor, or do some basic things in terms of how you offload work to different workers. But eventually you'll likely have to reach out to DevOps or someone else to figure out how to scale Airflow effectively. More than likely, most people watching this video won't get to that point; we're talking about people who have thousands of Airflow jobs running. At Facebook, I think our team alone easily had maybe 10,000 jobs running (that was on Dataswarm, but it's similar to Airflow), and there was a whole team to manage it. In the same way, if you're at that scale, running 1,000, 10,000, 100,000 jobs in Airflow because you're a massive company, that's when you really should just have a team managing Airflow. There's a trade-off where, yes, maybe you're not paying for a managed service, but you do need to pay for a service's worth of a team: you're going to pay three or four people to manage Airflow at a 100,000-task scale, maybe more, maybe less depending on the tasks. These are important things to think about. Again, there are a lot of little things in the airflow.cfg that I'll put up here that will help you manage some of the scale, like priorities and pools and things of that nature (some of that can be set up in the actual Airflow DAGs themselves), but these are just easy fixes; they're not necessarily the whole solution.
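For flavor, here's roughly what those per-task knobs look like. The pool name and numbers are made up, a pool has to be created first in the UI or with something like `airflow pools set warehouse 5 "warehouse connections"`, and instance-wide caps such as parallelism and max_active_tasks_per_dag live in airflow.cfg rather than in the DAG.

```python
# A sketch of per-task scaling controls. The "warehouse" pool is a
# made-up example and must already exist; it caps how many tasks
# assigned to it can run at once across the whole Airflow instance.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_load",
    start_date=datetime(2023, 10, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo extracting",
        pool="warehouse",     # competes for the pool's 5 slots
        priority_weight=10,   # wins a worker slot before lower-weight tasks
    )
```

Priority weights only matter once tasks are actually queuing for the same slots, which is exactly the resource-contention situation that Shopify quote is describing; hence "easy fixes, not the whole solution."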
So, Airflow is a great solution. I love it; it has a dear place in my heart, maybe because I've used it a lot. But it also has its trade-offs, and it is difficult to scale. So if you're out there and you're planning to work with Airflow, hopefully this gives you a few quick fixes if you run into a problem, like realizing you've coupled your DAG folder to the rest of your project, or that you're not using all of the useful features Airflow offers. Hopefully this helps you realize there's so much you can do with Airflow, but it does need to be deployed well, and it's not just as easy as putting together your first DAG or running it locally. So with that, guys, hopefully you found this video helpful. If you have any comments, leave them below, please take a moment to smash that like button, and I will see you in the next video. Thanks all, and goodbye.
Info
Channel: Seattle Data Guy
Views: 13,853
Keywords: airflow, apache airflow, intro to apache airflow, how to deploy apache airflow to production, data engineering, should i become a data engineer, data engineering skills, docker, airflow webserver, airflow scheduler, airflow data, data science, data analytics, high paying data engineering skills
Id: gkKY6Q3GApw
Length: 12min 26sec (746 seconds)
Published: Thu Oct 12 2023