Why Data Engineers LOVE/HATE Airflow (FT. @mehdio, @startdataengineering and more!)

Video Statistics and Information

Captions
What is going on, guys, and welcome back to another video with me, Ben Rogojan, aka the Seattle Data Guy. Today we're going to talk about Airflow and the love-hate relationship I feel a lot of people have with it.

Now, if you started in data engineering about a decade ago like I did, you probably didn't start with Airflow. You likely wrote some form of MapReduce jobs in Hadoop, or maybe did some stuff in SSIS, or, like a lot of us did, put together a lot of scripts using cron, just kind of guessing when you should start each dependency. This would push a lot of us to use Airflow, because it took care of dependency management, it had a scheduler, and it just seemed to have a great UI and everything that you needed if you needed to run some sort of data pipeline.

For this video I want to get more than just my perspective on Airflow and how people feel about it, because at the end of the day I am merely one brain. I've said this before: you could spend a lifetime learning the data world and more than likely only get to about 10% of it. That's why for this video I collaborated with a lot of other data experts to see what their opinions were, how they came across Airflow, and what their feelings are now.

All of these people had varying experiences in terms of where they started with data pipelines. Some started with Airflow, others started with Python, others with SSIS, and still others with solutions that I had never even heard of. Especially when you start going down the custom script route and building things, like at one of the first jobs I worked at, where we did everything in PowerShell, you eventually realize this is not a great way forward. You're having to manually manage dependencies, and so you start thinking: is there a better way? We eventually needed to start googling better ways of managing things like dependencies and schedulers. So for example, again with Joseph, he found the classic argument of Luigi versus Airflow, and like many of us, Joseph
found that, well, the UI was good enough and the scheduler was great, and it really did manage a lot of the problems we as data engineers face.

And here's the thing: Joseph was not the only one who was called by the siren song of Airflow. mehdio of mehdio DataTV, whose channel you can check out in the link below, also found that Airflow's UI and most of its functionality were great. Right, mehdio?

The other thing that I like, I mean, as we mentioned earlier in the discussion, is the UI. It was definitely, and that's before Prefect came in, the best UI.

And there were tons of other features that people really enjoyed about Airflow, especially when they came out with their Kubernetes operator. Honestly, most people I talked to discussed the fact that the Kubernetes operator was great, and there was a whole article about how most of us are using Airflow wrong that talks about this whole concept. But let's just let some other people bring this up.

I think one big thing which was really fun is the Kubernetes operator. I don't know why.

You can kind of see why we data engineers were really drawn to Airflow. It honestly met a lot of our needs. It had a scheduling option, and it provided the ability to do dependency management. I remember how difficult that was at one of my jobs: my manager literally told me I had to write down the dependencies so that we knew exactly which scripts to write and run in what order, and it was crazy that we were doing this basically manually. So having a solution that, you know, essentially made sure you could manage your templated SQL was great.

But like anything else, there are some gotchas that people run into as they're trying to put Airflow into production, and a lot of the conversations that I had with experts in the data field focused on this in particular.

And we'd come out of, like, big IT, so really looking at Airflow from a "how do you scale it" angle. What's interesting is that every big shop that uses Airflow then
explains to you the layer that they built on top of it so that none of their employees actually write DAGs. And you go, okay, that's the answer: it's the most powerful thing you can have, and most people shouldn't use it natively.

As I'm sure you're aware, there's a bunch of different ways that you can handle Airflow. You can create it yourself, or you can use cloud service tools like the managed Airflow that AWS supports. If you want to go for software as a service, you can take the Astronomer route, and they can handle a lot of things for you. So there's a bunch of different options, but if you just try to do it all yourself and you don't know what you're doing, it's going to be a steep hill to overcome.

The fact is that scaling, as you can see, is hard. Scaling Airflow isn't as easy as just clicking a button and spinning up a bunch of instances, or managing just one or two EC2 instances. It often requires multiple components, similar to the way that managing Hadoop in the past required multiple components, which honestly made it arguably unmanageable for companies that didn't have some sort of infrastructure team. Sure, you could do it yourself, but that becomes very hairy, and there's a whole lot of challenges you're going to run into as you're trying to manage things via Docker.

Talking to experts like Sarah Krasnik, she also ran into an issue that I had when I was working with Airflow, which is that passing data between tasks was always challenging. So if you're trying to do something more dynamic, which Airflow does not even suggest you do, it is borderline impossible. I honestly first came up against this in Dataswarm, which is almost version one of Airflow. Whenever I needed to pass data from one task to another, I literally had to write it to a file and reread it in a Python task, and it was very complex and clunky. Obviously Airflow does provide a few options for how you can do that, like XComs, but
even there it's just a little bit clunky, and that's why I do think we're seeing growing popularity in other tool sets. One that was brought up was just using Step Functions, or something similar like Google Workflows, with just Lambda functions, essentially. Additionally, there are other solutions altogether, like Dagster and Prefect.

I tried to do Prefect first, but there were so many different versions that I gave up pretty quickly. I should go back and try it. I remember there was Core, and that was an open source one, and it was confusing, so I gave up. But I really like Dagster. It was really good: so easy to set up, worked really well with Docker, and the UI is really good. I think it was developed by the same person who worked on GraphQL, or led GraphQL, at Facebook, so you can see the UI kind of updating in real time, which is really nice. And the testability there was really good. I was playing around with it, and you can write tests much more easily than in Airflow.

Now, in terms of managed services, Airflow does have some options here. Astronomer, for example, is a great solution, as is Cloud Composer. It's just going to depend on the complexity of the pipelines that you're building, but those are both great options. And I am very excited to see where tools like Dagster and Prefect go, because they are going to challenge the status quo and hopefully make all of the tooling better, so that we can start, as a community, building better practices around data pipelines with these solutions.

Like I wrote about in my recent newsletter, which you should sign up for below if you haven't already, the modern data stack and all of the solutions that have taken hold over the last three to five years have taught us a lot of new lessons. I think one of those is that, with Airflow going from 1 to 2, with Prefect being developed, with Dagster
being developed, all of these solutions are challenging us as data engineers to figure out what is the best way to build a data pipeline.

Thanks, guys, for watching this. I hope you learned a ton, not just from me but from all of these wonderful creators and experts in the data field, and I will see you next time. Thank you and goodbye.
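The captions keep returning to two ideas: letting a tool work out the run order from declared dependencies instead of guessing start times in cron, and handing one task's output to the next without writing it to a file. The sketch below illustrates both in plain Python using only the standard library. It is not Airflow's API; the task names, the `TASKS` table, and `run_pipeline` are all hypothetical, and the `results` dictionary only loosely plays the role that XComs play in Airflow.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Three hypothetical tasks. Each one receives the return values of its
# upstream tasks as positional arguments.
def extract():
    return [3, 1, 2]


def transform(rows):
    return sorted(rows)


def load(rows):
    return f"loaded {len(rows)} rows"


# Hypothetical task table: name -> (callable, list of upstream task names).
TASKS = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_pipeline(tasks):
    """Run tasks in dependency order, passing results downstream in memory."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    order = list(TopologicalSorter(graph).static_order())

    results = {}  # in-memory stand-in for passing data between tasks
    for name in order:
        func, upstream = tasks[name]
        results[name] = func(*(results[u] for u in upstream))
    return order, results


order, results = run_pipeline(TASKS)
print(order)
print(results["load"])
```

A real orchestrator adds scheduling, retries, logging, and persistence on top of this idea, which is where most of the operational complexity discussed in the video comes from.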
Info
Channel: Seattle Data Guy
Views: 37,541
Keywords: data engineering, seattle data guy python, airflow, apache airflow, airflow tutorial, airflow tutorial for beginners, airflow tutorial python, data science, data science skills, data engineering skills, how to become a data engineer, how to scale apache airflow, data engineer skills, data engineering tutorials, data engineer vs data scientist, ben rogojan
Id: h5X3124R61U
Length: 8min 0sec (480 seconds)
Published: Wed Jul 13 2022