Airflow Vs. Dagster: The Full Breakdown!

Captions
Hey y'all, Data Guy here, and today I'm coming at you with a very highly requested video on Dagster versus Airflow. You guys loved the Airflow versus Prefect video, but now it's time for Airflow to square up and take on Dagster.

Airflow and Dagster are both popular open source workflow management systems. They're designed for almost the same purpose, helping developers and data engineers manage complex data pipelines, so they have some similarities, but they also have a lot of key differences that make them better suited for different use cases. Before I get into, piece by piece, what those differences are, I wanted to lay the groundwork for what Airflow and Dagster are and what their main purposes are. I think that'll really help set the stage for some of the differences I'll lay out later.

Airflow is really popular, some may say the number one choice for many companies and developers, due to its flexibility, its ease of use, and its really strong community support. Airflow, as we all know, mainly uses Python to define workflows as directed acyclic graphs (DAGs), and that allows users to easily schedule, monitor, and manage those workflows. Because it's an open source system, it has a massive ecosystem of plugins and integrations that make it really easy to connect with whatever system you're using, and if there isn't a connection already, it's really easy to design your own. It's really cloud native: it's designed to connect to databases, cloud services, and all kinds of web applications. It also has a web-based UI that makes it easy to visualize workflow progress and manage tasks. And it really acts as an orchestration tool, where you're not doing a lot of things within Airflow itself; you're using Airflow to command all of these other tools to do what they're best at.

Now, on the other hand, you have Dagster. Dagster was kind of born out of the Airflow 1.x days, where Airflow kind
of sucked. It wasn't enterprise ready, it had a lot of bugs, and it was annoying to use, so you had Prefect and Dagster come on the scene and say, hey, we have a better way to do workflows. Dagster is gaining a bit of popularity within the data engineering community because it's really focused on data quality and testing. Whereas Airflow doesn't have a ton of data quality and testing frameworks built in out of the box (you can build them modularly on top of it), Dagster is all about checking data quality and testing at every stage of your workflow.

Similarly to Airflow, you're still using DAGs, and you're using a Python-based API. So instead of just straight Python you're using an API, but at the end of the day it's still Python, so obviously you still have a lot of that flexibility. And this is the main key difference, even though they both use DAGs and they both do workflows: with Dagster, users can define data quality checks at each step of the pipeline to ensure that data is correct and consistent throughout. That comes built in with Dagster, versus having to build it out yourself as you do with Airflow. Finally, Dagster also has a pretty powerful built-in testing framework that makes it easy to test and debug workflows. Again, I'm not saying there aren't testing frameworks for Airflow, but you do have to implement them yourself, whereas with Dagster they come out of the box.

So now that we've laid the groundwork, let's get into the nitty-gritty of what the real differences are between Airflow and Dagster so you can decide which is best for you. To do this, I'm going to walk through the two UIs as a backdrop, but I really have five main points I want to talk about: workflow focus, data quality, testing framework, community support, and language of choice. I think those are the main differences in how both of these function. So first,
with workflow focus. If you look within here, we have a pretty typical Airflow DAG that is extracting, transforming, and loading some data into a system. While this is a really simple one, what Airflow excels at is even more advanced DAGs, where you can have tons of different branching capabilities and everything is conditional. For something like this one, it's conducting some kind of action based on which day of the week it is. While this example is based on scheduling for a family, basically, this is really impactful when it comes to business logic. You can say, hey, based on some condition from some data set, trigger some downstream workflow. So you can programmatize and automate all of the workloads that your data engineers may be doing manually, or through a bunch of disparate systems where they have to go into each one and trigger it independently. You can basically write the human logic check into your workflows, so that your Airflow environment kind of takes on the role of the data engineer.

Now with Dagster, on the other hand, if we go into an example workflow, you'll see it's not as complex when it comes to the interrelationships between steps in a pipeline. You can see here we're collecting from an API and then plotting it in a word cloud. So what you're really doing here is mainly centered around taking data from APIs and visualizing it in a workflow. You'll also see that you have metadata plots built in here, plus configs, and it will tell you what type of data source the data is coming from, give you a description of the data source, and tell you what API docs it's using. So their focus, you can kind of see, is really not on workflows and conditionals, having that optionality in your workflows and programming for production grade. This is more about just saying, hey,
taking data, running it through some process, and then visualizing it. So it's very much a data collection and analytics focused tool, and that's why there's such a focus on data quality. Data quality is incredibly important for analytics, because if you have bad data, then you have bad visualizations and you're giving bad reports. So if your main focus is just generating reports, then Dagster can be great. But if you want complex branching logic, saying, hey, I want to look at this table and then decide, based on the contents of that table, what downstream path I want to send that data on, that's when Airflow is a lot more impactful. It gives you those options without you needing to actively manage it.

That leads me to my second point, which is data quality. The way Dagster does this is by giving you the capability of including checks within your actual DAGs. You can see here, as I'm running, you can have data quality checks that run as I'm pulling in data and as I'm transforming this data, to make sure that at no point is that data low quality, to put it at the most basic level. This is really useful for organizations that prioritize data quality and want to ensure that data is accurate and consistent across all their pipelines. While Airflow doesn't have anything built directly into the UI that performs a similar operation, Airflow's modularity takes precedence there. Instead of it being built into Airflow, Airflow leverages tools like Great Expectations and other data quality frameworks that you can bring into your Airflow DAGs and actually run there. So while you don't have a data quality check built in, you have the modularity to bring in whatever data quality checking tool you'd like to use, mainly Great Expectations, and use it within the context of your pipelines. You can still build Airflow out to have the same data quality checking capabilities.

Another thing that is
built into Dagster but isn't really built into Airflow, which takes a similar approach there in terms of modularity, is a testing framework. With Dagster, you have an automated testing framework so that when you're materializing, it's going to test and debug your DAGs, tell you what's wrong, tell you what's right, and give you all that kind of information. If I look at a run and go into this particular run ID, it will tell me whether these steps were successful or not, and it'll tell me, in a slightly less Python-esque log format, what actually happened and what caused a failure.

Now within Airflow, as we all know, the logs aren't the most readable, but they still give you all the information you need. And while there isn't a built-in testing framework within Airflow, because all Airflow DAGs are written as Python scripts, you can just run pytest: include any kind of Python test you want in the test file for your local Airflow development environment and run it there. So with Dagster, you have some testing frameworks that are built in. With Airflow, while you don't have testing frameworks built in, you have the full capability to build out whatever kind of testing framework you want with pytest and then run it on your Airflow environment. So again, while the starting point with Airflow is a little bit lower, the ceiling of the complexity you can get to in your testing frameworks is way higher than with Dagster, whereas with Dagster the floor is higher but the ceiling is lower, because it's a built-in testing framework. It's really a matter of what you want: do you want to build out your own testing framework, or do you just want something that runs out of the box? That'll help inform whether Airflow or Dagster is better for you.

Finally, or not finally, but fourth, so penultimately, we have
community support, and this one's a pretty easy one. Airflow has an absolutely massive community that is only continuing to grow. If we look at the downloads of Apache Airflow, it's up at 10 million a month recently. Obviously a lot of these are people re-downloading as part of CI/CD, but if we also look at PyPI downloads by provider, you're looking at 30 million downloads per month. That shows you a more organic level of growth, in terms of people not just downloading Airflow but installing packages and building with it. And then finally you can see Docker image pulls as well, for how many Airflow Docker images are being served between community members.

Now Dagster, while also open source, is built mainly by a private company and has much lower numbers, obviously. While Dagster hasn't published download numbers, again because it's a private company, we can glean a sense of what their actual numbers are by looking at the total visits to their website, which is 250k a month. You know, it's not that bad, but it's not amazing. We can also see that Dagster has over 3,000 community members across 400 organizations. Just within the Airflow Slack alone there's about 30,000 people, so that's a factor of 10x. And again, I'm not just trying to dunk on Dagster, but community is important, because community is what develops all those integrations, all the providers, and all those packages that everyone needs to do their jobs with that tool. So you have Airflow with thousands of providers, versus Dagster, which isn't really as widely supported. Take from that what you want.

And so finally, code and language of choice. Airflow is purely Python based: you're using Python to define your workflows and define your DAGs. Obviously you can inject SQL, bash commands, all of that with operators, but at the core you have a Python wrapper around them to actually conduct that action.
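Since an Airflow workflow is, at its core, plain Python, the day-of-week branching described earlier can be sketched as an ordinary function. This is a minimal illustration, not code from the video: the task ids `run_weekday_load` and `run_weekend_cleanup` and the function name `choose_branch` are hypothetical. In Airflow, a function like this would typically be handed to `BranchPythonOperator` as its `python_callable`, and the operator follows whichever task id the function returns.

```python
from datetime import date

# Hypothetical task ids; in a real DAG these would have to match
# the ids of tasks defined downstream of the branching task.
WEEKDAY_TASK = "run_weekday_load"
WEEKEND_TASK = "run_weekend_cleanup"


def choose_branch(run_date: date) -> str:
    """Return the id of the downstream task to run for the given date."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    if run_date.weekday() < 5:
        return WEEKDAY_TASK
    return WEEKEND_TASK


if __name__ == "__main__":
    # Print the branch chosen for today's date.
    print(choose_branch(date.today()))
```

Because the branching decision is just a function, it can be unit tested with pytest on its own, without a running Airflow environment.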
Dagster, on the other hand, has a similar approach in that it is all Python based, but the main difference is that when you look at the code, you'll see that it's mainly just calling from an API. With Dagster, you're bringing in all the different API endpoints you want and then just using Python to connect to them under the hood. What this looks like in practice is DAGs that are really built around the data assets you're using. You can see here that instead of an @task decorator you have an @asset decorator, and this is just calling an API, pulling from it, and bringing the result into a pandas DataFrame. You can see the whole Dagster process is built around Python functions that are generating DataFrames, generating data, and doing transformations on it. So it's just orchestrating Python functions, whereas with Airflow you get more granularity in the logic between tasks, in how data is passed between tasks, and in calling out to external applications, rather than just bringing data into Dagster and doing some processing there. It's really a matter of two different focuses. Dagster is all about doing data processing within Dagster, whereas Airflow is really about being the central point of access for your entire data ecosystem and managing data relationships across every part of your data stack.

To wrap it up, when it comes to choosing between Airflow and Dagster, there are a lot of factors to consider, so I'll try to distill it down. If you're looking for a flexible, easy-to-use workflow management system that has a massive community and a massive ecosystem of plugins, and you need to connect to a lot of different services within your stack, Airflow is an amazing choice. It's going to be able to do everything you need it to do, because you can extend it to do whatever you need, even if it's not built out already. You can build it, but most
people have already gone down the same path that you are trying to charge down, so you can leverage all their experience and all of their work through all those providers they've built. If your focus is just solely on, hey, I need to take some data in, do some processing, the data quality needs to be top-notch, and I want to be able to test it every step of the way, Dagster might be the better option, just because Airflow isn't really great at doing the data processing itself. Dagster is designed to bring in Python files and process them, and then you're off to the races. That's pretty much it. Airflow, again, is better for those more complex workflow situations where it's not just taking a pandas DataFrame and doing some processing on it.

I really hope I've been able to leave you with a good impression of both systems so that you can make an informed decision for whatever the requirements of your data pipelines are. If you liked this video, toss it a like and toss me a subscribe, I really appreciate it. I love when you guys comment any video ideas. This video idea came from some comments I got, so I hope everyone who commented enjoyed this and has a really great rest of their day. Adios, amigos.
Info
Channel: The Data Guy
Views: 5,580
Id: 72bu7fBWX7o
Length: 14min 51sec (891 seconds)
Published: Thu Apr 27 2023